SpaCy 2.0 released (github.com/explosion)
231 points by nl on Nov 8, 2017 | 32 comments


Demos: https://demos.explosion.ai/displacy-ent/

The English neural network named entity model is a huge improvement over the v1 model. However, the training data is still all from 2010, so it makes some notable errors. We're working on improved training data, using our annotation tool Prodigy (https://prodi.gy).

The NER for the other languages is trained on "silver standard" data from Wikipedia, so the quality is much less consistent, especially if you're working with social media text or "chat bot"-type inputs.


I've tried a simple sentence

> How are you this fine morning?

and it didn't recognize anything. I was expecting "you" and "(this) morning" to be highlighted, but perhaps I misunderstood the example.

This demo, https://demos.explosion.ai/displacy/, looks really nice though. Only the UX of horizontal scrolling with a mouse is horrible (once you figure it out).


What’s the difference between SpaCy and Spacey? SpaCy recognizes the word “no.”


I can't begin to tell you how excited I am about this. I love v1, and can't wait to finally use v2.

Thanks to you and your team for all the hard work!


Amazing tool! Thanks for the effort!

Out of curiosity, did you train your models on your own dataset or on a public one? I didn't find much information about the specifics in the documentation.


I'm a bit confused as to what SpaCy's revenue model is. Didn't it use to be a free community edition / paid enterprise support model?


The very first release was under a dual AGPL/commercial license. This was bad. It prevented open-source developers from building on top of it, and it discouraged people from getting in touch.

We bootstrapped the company by doing consulting, and now we're releasing products adjacent to spaCy. We've had a great response to our annotation tool Prodigy, which is currently in free beta: https://prodi.gy .

The license model for Prodigy is pretty simple: permanent per-seat licenses, with pricing that compares pretty favourably to other developer tools.

We're looking forward to releasing some other offerings alongside spaCy. We don't like to say too much because timelines are tough: we don't want to release something half-finished just to stick to a schedule. The Explosion AI mailing list is the best way to stay in the loop.


> The license model for Prodigy is pretty simple: permanent per-seat licenses, with pricing that compares pretty favourably to other developer tools.

Can you be more precise? I don't want to invest time in a tool only to discover months later that I can't afford it (having a research student budget and all, i.e. my "budget" is my own money).

:(


The license for an individual developer will be a few hundred dollars. Sorry for the vagueness. We'll be ready to release official pricing soon.

For research students, we think your institution should be covering you! We'll be offering an academic subscription, so research institutions can pay a yearly flat fee to have all staff and students covered.


For me, the most important thing about this version is the reduced memory usage. Previously the smallest English model took 1 GB of RAM, making it troublesome to run on cloud instances. If v2 takes ~200 MB instead, that's a huge improvement.


The thing that always bothered me about v1 was that it was fast, but in many ways not that scalable. I really underestimated the importance of Pickle support, for instance, because I didn't appreciate that that's how multiprocessing works in Python.
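To make the multiprocessing point concrete: Python's multiprocessing module moves arguments and results between processes by pickling them, so anything a worker needs has to survive a pickle round trip. A minimal sketch using only the standard library (the helper name `survives_pickle` is made up for illustration, and nothing here is spaCy code):

```python
import pickle


def survives_pickle(obj):
    """Return True if obj can be serialized and restored with pickle.

    Hypothetical helper for illustration only. multiprocessing performs
    an equivalent round trip when it ships objects to worker processes.
    """
    try:
        restored = pickle.loads(pickle.dumps(obj))
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return restored == obj


# Plain data (for example a small vector table) crosses the boundary fine.
vector_table = {"scientist": [0.9, 0.1], "banana": [0.1, 0.9]}
plain_ok = survives_pickle(vector_table)

# A live generator holds interpreter state and cannot be pickled.
token_stream = (word for word in ["How", "are", "you"])
generator_ok = survives_pickle(token_stream)
```

An object that fails this check cannot be handed to a worker process as an argument, which is why missing Pickle support gets in the way of scaling out.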

You might find this method particularly useful for meeting memory constraints: https://spacy.io/api/vocab#prune_vectors . This lets you reduce a large word vectors table to a small one by remembering the nearest neighbours for the words you prune out. So if you have a rare word like 'biophysicist', you can map it to the vector for a word like 'scientist', and get a close-enough word vector for it.
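Roughly, the idea behind pruning can be sketched in plain Python as follows. This is only an illustration of the nearest-neighbour remapping described above, not spaCy's implementation: the function name `prune_table`, the toy two-dimensional vectors, and the assumption that the table is ordered from most to least frequent word are all made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_table(vectors, keep):
    """Keep the first `keep` rows; remap every other word to its nearest kept word.

    `vectors` maps word -> vector and is assumed to be ordered by frequency
    (an assumption of this sketch). Returns (smaller_table, remap).
    """
    words = list(vectors)
    kept = words[:keep]
    table = {word: vectors[word] for word in kept}
    remap = {}
    for word in words[keep:]:
        # Pick the kept word whose vector is most similar to the pruned one.
        remap[word] = max(kept, key=lambda k: cosine(vectors[word], vectors[k]))
    return table, remap


toy_vectors = {
    "scientist": [0.9, 0.1],
    "banana": [0.1, 0.9],
    "biophysicist": [0.8, 0.2],  # rare word that will be pruned
}
table, remap = prune_table(toy_vectors, keep=2)


def lookup(word):
    """Return the vector for a word, following the remap for pruned words."""
    return table[remap.get(word, word)]
```

With these toy vectors, `biophysicist` is remapped to `scientist`, so `lookup("biophysicist")` returns the `scientist` vector while the table itself holds only two rows.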


Does that mean that it can run on a Raspberry Pi?


Unless RAM usage has increased significantly beyond 200 MB since the alpha, it should run.


This is great. I really really hope they have a stable and big enough source of revenue to keep the development going.


Thank you for providing such a great tool. I'm excited to try version 2.0, and I've also played around with Prodigy. SpaCy was my start in NLP, and I really hope it is going to stay around.

We've developed a great product for our customer with SpaCy; it wouldn't have been possible without it.


Thanks to @syllogism for spacy. It’s one of those tools which make Python the go-to language for NLP.


Congrats! Have been following SpaCy since it was first discussed/argued here on HN. I haven't had much reason/imagination to use NLP in work but I frequently recommend it to students as most of their curriculum is centered around old versions of NLTK.


This is awesome! I've been meaning to get into NLP / computational linguistics for a while now.

Can anybody share what kind of projects you're doing that benefit from SpaCy? Do you use it as-is or do you build on top of it?


Kudos to Matthew, Ines and others making this possible.

I haven't checked it out myself yet, so I wanted to ask: have the performance issues that were haunting the 2.0 alpha version been fixed?


Current discussion: https://github.com/explosion/spaCy/issues/1508

I'm getting around 8k words per second on the smallest Google Cloud instances. You couldn't run spaCy 1 on these instances (or on AWS Lambda) due to memory usage problems, especially problems predicting memory usage for long-running processes. This is why we say spaCy 2 is cheaper to run in a cents-per-word sense than spaCy 1. This is the performance measure that we think is most important.
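As a back-of-the-envelope illustration of the cents-per-word measure, here is the arithmetic as a small function. The 8k words per second figure comes from above; the hourly price is a placeholder chosen so the numbers come out round, not a real price for any cloud instance.

```python
def cents_per_million_words(words_per_second, dollars_per_hour):
    """Cost in cents of processing one million words at a given throughput."""
    words_per_hour = words_per_second * 3600
    cents_per_hour = dollars_per_hour * 100
    return cents_per_hour / (words_per_hour / 1_000_000)


# Placeholder price for illustration only (an assumption, not a quoted rate).
HYPOTHETICAL_DOLLARS_PER_HOUR = 0.0288

cost = cents_per_million_words(8000, HYPOTHETICAL_DOLLARS_PER_HOUR)
```

The measure depends on both throughput and instance price, so a slower model that fits on a much cheaper instance can still come out ahead.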

However, users are still reporting performance problems, so I wouldn't call the issue resolved. spaCy 1 managed to avoid depending on numpy during prediction, making it easy to ensure that performance didn't depend on anyone's environment. spaCy 2 currently does use numpy, introducing these questions around configuration. I'm working to fix this by implementing the forward pass entirely in Cython.


Found the answer myself in the release docs:

> The Language.pipe method allows spaCy to batch documents, which brings a significant performance advantage in v2.0. The new neural networks introduce some overhead per batch, so if you're processing a number of documents in a row, you should use nlp.pipe and process the texts as a stream.

So if you have an event-based system where you can only process a single document at a time, it does not make sense to upgrade yet, because in the single-document case the runtime performance was 10x-100x slower, at least with the 2.0 alpha version.
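To illustrate why the per-batch overhead matters, here is a small sketch that counts how many batches get run in each style of use. `StandInPipeline` is a hypothetical stand-in written for this example so that it runs without spaCy installed; it only mimics the shape of calling `nlp(text)` versus `nlp.pipe(texts)` and does no real NLP.

```python
from itertools import islice


class StandInPipeline:
    """Hypothetical stand-in for an `nlp` object; not spaCy code."""

    def __init__(self):
        self.batches_run = 0

    def _run_batch(self, texts):
        # The fixed per-batch overhead would be paid once per call here.
        self.batches_run += 1
        return [text.split() for text in texts]

    def __call__(self, text):
        # One document at a time: every call is its own batch of one.
        return self._run_batch([text])[0]

    def pipe(self, texts, batch_size=1000):
        # Stream the texts, grouping them so the overhead is shared.
        texts = iter(texts)
        while True:
            batch = list(islice(texts, batch_size))
            if not batch:
                return
            yield from self._run_batch(batch)


texts = ["How are you this fine morning?"] * 2500

one_by_one = StandInPipeline()
docs_single = [one_by_one(text) for text in texts]

streamed = StandInPipeline()
docs_streamed = list(streamed.pipe(texts, batch_size=1000))
```

Both approaches produce the same results, but the one-at-a-time loop pays the overhead 2500 times while the streamed version pays it 3 times. With real spaCy the streamed form would be a loop over `nlp.pipe(texts)`.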


But with a nice caveat: In an event-based system, you can run spaCy 2 with AWS Lambda :). This will be much cheaper than keeping a server warm.


Really nice work. Is there a bitcoin or PayPal donation page for the spaCy project?


We actually don't believe in soliciting donations. Ines explains our thinking here: https://ines.io/blog/spacy-commercial-open-source-nlp#moneti...

Basically: donations can only be made from personal funds, but most of the benefits from the software will go to commercial users. That's a pretty lopsided dynamic.


Aside: Ines appears to be working on something called Prodigy [1] which seems close to what I imagined would be a "Killer App" after playing around with SpaCy. I look forward to hearing more about it as it matures.

[1] https://prodi.gy/


The only question that remains is: does it _really_ matter?


Great work - big fan! :)


Do you have any plans for Korean support?


We're definitely interested in Korean support. I hope we can get some contributions for this in the next few months.

My understanding is that there are actually some very good Python libraries for Korean NLP? It's now much easier to provide annotations via another library. This is how the Chinese and Japanese support is working at the moment. We'll add "native" models for all of these languages, but for now you might want to wrap some of these resources: https://github.com/datanada/Awesome-Korean-NLP


See also:

https://github.com/crownpku/Awesome-Chinese-NLP

Very much looking forward to Chinese support in SpaCy.


Thanks!


Excellent update, great work.



