The English neural network named entity model is a huge improvement over the v1 model. However, the training data is still all from 2010, so it makes some notable errors. We're working on improved training data, using our annotation tool Prodigy (https://prodi.gy).
The NER for the other languages is trained on "silver standard" data from Wikipedia, so the quality is much less consistent, especially if you're working with social media text or "chat bot"-type inputs.
and didn't recognize anything - I was expecting "you" and "(this) morning" to be highlighted - but perhaps I misunderstood the example.
This https://demos.explosion.ai/displacy/ looks really nice though. Only the UX of horizontal scrolling with a mouse is horrible (once you figure it out).
Out of curiosity, did you train your model on your own dataset or on a public dataset? I didn't find much information about the specifics in the documentation.
The very first release was under a dual AGPL/commercial license. This was bad. It prevents open-source developers from building on top of it, and it discourages people from getting in touch.
We bootstrapped the company by doing consulting, and now we're releasing products adjacent to spaCy. We've had a great response to our annotation tool Prodigy, which is currently in free beta: https://prodi.gy .
The license model for Prodigy is pretty simple: permanent per-seat licenses, with pricing that compares pretty favourably to other developer tools.
We're looking forward to releasing some other offerings alongside spaCy. We don't like to say too much because timelines are tough --- we don't want to release something half-finished to stick to a schedule. The Explosion AI mailing list is the best way to stay in the loop.
> The license model for Prodigy is pretty simple: permanent per-seat licenses, with pricing that compares pretty favourably to other developer tools.
Can you be more precise? I don't want to invest time in a tool only to discover months later that I can't afford it (I'm on a research student budget, i.e. my "budget" is my own money).
The license for an individual developer will be a few hundred dollars --- sorry for the vagueness. We'll be ready to release official pricing soon.
For research students, we think your institution should be covering you! We'll be offering an academic subscription, so research institutions can pay a yearly flat fee to have all staff and students covered.
For me, the most important thing about this version is the reduced memory usage. Previously the smallest English model took 1 GB of RAM, which made it troublesome to run on cloud instances. If v2 takes ~200 MB instead, that's a huge improvement.
The thing that always bothered me about v1 was that it was fast, but in many ways not that scalable. I really underestimated the importance of Pickle support, for instance, because I didn't appreciate that that's how multiprocessing works in Python.
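To illustrate the Pickle point: when Python's multiprocessing sends an object to a worker process, it serializes it with pickle, so anything that can't be pickled can't be shared that way. Below is a minimal sketch of what "Pickle support" means in practice. The `Pipeline` class is a hypothetical stand-in, not spaCy code; it just holds one unpicklable member (a lock) and shows the usual `__getstate__` / `__setstate__` pattern.

```python
import pickle
import threading


class Pipeline:
    """Hypothetical stand-in for an object holding an unpicklable resource."""

    def __init__(self, name, weights):
        self.name = name
        self.weights = weights
        self._lock = threading.Lock()  # lock objects cannot be pickled

    def __getstate__(self):
        # Copy the instance dict and drop the member pickle can't handle.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the plain data, then rebuild the lock on the receiving side.
        self.__dict__.update(state)
        self._lock = threading.Lock()


original = Pipeline("demo", [0.1, 0.2, 0.3])

# Roughly what happens when an object is handed to a worker process:
# it is serialized on one side and rebuilt on the other.
restored = pickle.loads(pickle.dumps(original))
```

Without the two hooks, `pickle.dumps(original)` would fail on the lock, and the object could not be passed to a worker.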
You might find this method particularly useful for meeting memory constraints: https://spacy.io/api/vocab#prune_vectors . This lets you reduce a large word vectors table to a small one by remembering the nearest neighbours for the words you prune out. So if you have a rare word like 'biophysicist', you can map it to the vector for a word like 'scientist', and get a close-enough word vector for it.
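The idea behind pruning can be sketched in plain Python. This is a toy illustration of the technique described above, not spaCy's implementation: the function names, the three-dimensional vectors, and the assumption that the dict is ordered from most to least frequent word are all made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_vectors(vectors, keep):
    """Keep the first `keep` words; remap the rest to their nearest kept word.

    Assumes `vectors` is a dict ordered from most to least frequent word.
    Returns (kept, remap) where remap[word] = (kept_word, similarity).
    """
    words = list(vectors)
    kept = {w: vectors[w] for w in words[:keep]}
    remap = {}
    for w in words[keep:]:
        best = max(kept, key=lambda k: cosine(vectors[w], kept[k]))
        remap[w] = (best, cosine(vectors[w], kept[best]))
    return kept, remap


def lookup(word, kept, remap):
    """Return the word's own vector, or its nearest neighbour's if pruned."""
    if word in kept:
        return kept[word]
    return kept[remap[word][0]]


# Toy vectors, invented for illustration.
toy = {
    "scientist": [0.9, 0.1, 0.0],
    "banana": [0.0, 0.2, 0.9],
    "biophysicist": [0.8, 0.3, 0.1],
}
kept, remap = prune_vectors(toy, keep=2)
vector = lookup("biophysicist", kept, remap)  # falls back to a kept word's vector
```

The table shrinks to `keep` rows, and every pruned word costs only a small remap entry instead of a full vector.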
Thank you for providing such a great tool. I'm excited to try version 2.0, and I've also played around with Prodigy. spaCy was my start in NLP, and I really hope it is going to stay around.
We've developed a great product for our customer with spaCy. It wouldn't have been possible without it.
Congrats! I've been following spaCy since it was first discussed/argued here on HN. I haven't had much reason/imagination to use NLP at work, but I frequently recommend it to students, as most of their curriculum is centered around old versions of NLTK.
I'm getting around 8k words per second on the smallest Google Cloud instances. You couldn't run spaCy 1 on these instances (or on AWS Lambda) due to memory usage problems, especially problems predicting memory usage for long-running processes. This is why we say spaCy 2 is cheaper to run in a cents-per-word sense than spaCy 1. This is the performance measure that we think is most important.
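For anyone who wants to work out the cents-per-word figure for their own setup, the arithmetic is simple. In the sketch below, only the 8k words per second comes from the comment above; the hourly price is a placeholder assumption, not a real quote from any cloud provider, so plug in your own numbers.

```python
def dollars_per_million_words(words_per_second, dollars_per_hour):
    """Cost of processing one million words at a given throughput and price."""
    words_per_hour = words_per_second * 3600
    return dollars_per_hour / words_per_hour * 1_000_000


WORDS_PER_SECOND = 8_000            # throughput quoted in the comment above
HYPOTHETICAL_PRICE_PER_HOUR = 0.01  # assumption: placeholder, not a real price

cost = dollars_per_million_words(WORDS_PER_SECOND, HYPOTHETICAL_PRICE_PER_HOUR)
print(f"${cost:.6f} per million words")
```

The point of the measure is that a slower model on a much cheaper instance can still win, as long as it fits in that instance's memory.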
However, users are still reporting performance problems, so I wouldn't call the issue resolved. spaCy 1 managed to avoid depending on numpy during prediction, making it easy to ensure that performance didn't depend on anyone's environment. spaCy 2 currently does use numpy, introducing these questions around configuration. I'm working to fix this by implementing the forward pass entirely in Cython.
Found the answer myself from the release docs:
> The Language.pipe method allows spaCy to batch documents, which brings a significant performance advantage in v2.0. The new neural networks introduce some overhead per batch, so if you're processing a number of documents in a row, you should use nlp.pipe and process the texts as a stream.
So if you have an event-based system where you can only process a single document at a time, it does not make sense to upgrade yet: in the single-document case, runtime performance was 10x-100x slower, at least with the 2.0 alpha version.
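The per-batch overhead is why streaming helps. Here is a minimal, library-free sketch of grouping an incoming stream into batches so a fixed per-call cost is paid once per batch rather than once per document. With spaCy itself you would pass the stream to `nlp.pipe`, which does its own batching; `batched`, `process_stream`, and `fake_process_batch` are hypothetical names for illustration only.

```python
from itertools import islice


def batched(stream, batch_size):
    """Yield lists of up to `batch_size` items from any iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_stream(stream, process_batch, batch_size=3):
    """Run `process_batch` once per batch and yield results one by one."""
    for batch in batched(stream, batch_size):
        yield from process_batch(batch)


calls = []


def fake_process_batch(texts):
    # Hypothetical stand-in for a model call with a fixed per-batch overhead.
    calls.append(len(texts))
    return [text.upper() for text in texts]


results = list(process_stream(["a", "b", "c", "d", "e", "f", "g"], fake_process_batch))
# Seven documents were handled with three calls instead of seven.
```

In an event-based system, the same idea means buffering incoming events briefly and flushing them as a batch, trading a little latency for throughput.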
Basically: donations can only be made from personal funds, but most of the benefits from the software will go to commercial users. That's a pretty lopsided dynamic.
Aside: Ines appears to be working on something called Prodigy [1] which seems close to what I imagined would be a "Killer App" after playing around with SpaCy. I look forward to hearing more about it as it matures.
We're definitely interested in Korean support. I hope we can get some contributions for this in the next few months.
My understanding is that there are actually some very good Python libraries for Korean NLP? It's now much easier to provide annotations via another library. This is how the Chinese and Japanese support is working at the moment. We'll add "native" models for all of these languages, but for now you might want to wrap some of these resources: https://github.com/datanada/Awesome-Korean-NLP