SpaCy 2.0 released

syllogism · on Nov 8, 2017

Demos: https://demos.explosion.ai/displacy-ent/

The English neural network named entity model is a huge improvement over the v1 model. However, the training data is still all from 2010, so it makes some notable errors. We're working on improved training data, using our annotation tool Prodigy (https://prodi.gy).

The NER for the other languages is trained on "silver standard" data from Wikipedia, so the quality is much less consistent, especially if you're working with social media text or "chat bot"-type inputs.

rplnt · on Nov 8, 2017

I've tried a simple sentence

> How are you this fine morning?

and it didn't recognize anything - I was expecting "you" and "(this) morning" to be highlighted - but perhaps I misunderstood the example.

This https://demos.explosion.ai/displacy/ looks really nice though. Just the UX of horizontal scrolling using a mouse is horrible (once you figure it out).

ChristianGeek · on Nov 8, 2017

What’s the difference between SpaCy and Spacey? SpaCy recognizes the word “no.”

JPKab · on Nov 8, 2017

I can't begin to tell you how excited I am about this. I love v1, and can't wait to finally use v2.

Thanks to you and your team for all the hard work!

tanilama · on Nov 8, 2017

Amazing tool! Thanks for the effort!

Out of curiosity, did you guys train your model on your own dataset or a public dataset? I didn't find much information about the specifics in the documentation.

wodenokoto · on Nov 8, 2017

I'm a bit confused as to what spaCy's revenue model is. Didn't it use to be a free community edition / paid enterprise support model?

syllogism · on Nov 8, 2017

The very first release was under a dual AGPL/commercial license. This was bad. It prevents open-source developers from building on top of it, and it discourages people from getting in touch.

We bootstrapped the company by doing consulting, and now we're releasing products adjacent to spaCy. We've had a great response to our annotation tool Prodigy, which is currently in free beta: https://prodi.gy .

The license model for Prodigy is pretty simple: permanent per-seat licenses, with pricing that compares pretty favourably to other developer tools.

We're looking forward to releasing some other offerings alongside spaCy. We don't like to say too much because timelines are tough --- we don't want to release something half-finished to stick to a schedule. The Explosion AI mailing list is the best way to stay in the loop.

JustFinishedBSG · on Nov 8, 2017

> The license model for Prodigy is pretty simple: permanent per-seat licenses, with pricing that compares pretty favourably to other developer tools.

Can you be more precise? I don't want to invest time in a tool only to discover months later that I can't afford it (having a research student budget and all, i.e. my "budget" is my own money).

:(

syllogism · on Nov 8, 2017

The license for an individual developer will be a few hundred dollars --- sorry for the vagueness. We'll be ready to release official pricing soon.

For research students, we think your institution should be covering you! We'll be offering an academic subscription, so research institutions can pay a yearly flat fee to have all staff and students covered.

kamac · on Nov 8, 2017

For me, the most important thing about this version is the reduced memory usage. Previously the smallest English model took 1 GB of RAM, making it troublesome to run on any cloud instances. If v2 takes ~200 MB instead, that's a huge improvement.

syllogism · on Nov 8, 2017

The thing that always bothered me about v1 was that it was fast, but in many ways not that scalable. I really underestimated the importance of Pickle support, for instance, because I didn't appreciate that that's how multiprocessing works in Python.
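For readers who haven't run into this: Python's multiprocessing module uses pickle to move arguments and results between processes, so anything handed to a worker has to survive a pickle round trip. Below is a minimal standard-library sketch of that constraint. The `model_state` and `bad_state` dicts and the `survives_pickle` helper are made-up stand-ins for illustration, not spaCy objects or spaCy API.

```python
import pickle


def survives_pickle(obj):
    """Return True if obj can make the round trip that multiprocessing relies on."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Hypothetical stand-ins for pipeline state (assumptions, not spaCy objects).
model_state = {"labels": ["PERSON", "ORG"], "weights": [0.1, 0.2]}
bad_state = {"tokenize": lambda text: text.split()}  # lambdas are not picklable

print(survives_pickle(model_state))
print(survives_pickle(bad_state))
```

An object that fails this check cannot be shipped to a worker process as-is, which is why pickle support matters for scaling across cores.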

You might find this method particularly useful for meeting memory constraints: https://spacy.io/api/vocab#prune_vectors . This lets you reduce a large word vectors table to a small one by remembering the nearest neighbours for the words you prune out. So if you have a rare word like 'biophysicist', you can map it to the vector for a word like 'scientist', and get a close-enough word vector for it.
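To make the idea concrete, here is a toy pure-Python sketch of pruning by nearest neighbour. It is not spaCy's implementation: the function names `prune_table` and `lookup`, the three-dimensional vectors, and the assumption that the dict is ordered from most to least frequent word are all invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_table(vectors, keep):
    """Keep the first `keep` rows; remap every other word to its nearest kept word.

    `vectors` is a dict of word -> vector, assumed to be ordered from most
    to least frequent (an assumption of this sketch).
    """
    words = list(vectors)
    kept = words[:keep]
    table = {w: vectors[w] for w in kept}
    remap = {}
    for w in words[keep:]:
        best = max(kept, key=lambda k: cosine(vectors[w], vectors[k]))
        remap[w] = (best, cosine(vectors[w], vectors[best]))
    return table, remap


def lookup(word, table, remap):
    """Return the word's own vector if kept, else its nearest neighbour's."""
    if word in table:
        return table[word]
    return table[remap[word][0]]


# Made-up toy vectors, most frequent words first.
toy = {
    "scientist": [0.9, 0.1, 0.0],
    "banana": [0.0, 0.2, 0.9],
    "biophysicist": [0.8, 0.3, 0.1],
    "plantain": [0.1, 0.1, 0.8],
}
table, remap = prune_table(toy, keep=2)
print(remap["biophysicist"][0])
print(lookup("plantain", table, remap))
```

The table shrinks to two rows, while the pruned words still resolve to a close-enough vector through the remapping.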

Vaskivo · on Nov 8, 2017

Does that mean that it can run on a Raspberry Pi?

kamac · on Nov 8, 2017

Unless RAM usage has increased significantly beyond 200 MB since the alpha, it should run.

ashish01 · on Nov 8, 2017

This is great. I really really hope they have a stable and big enough source of revenue to keep the development going.

arrmn · on Nov 8, 2017

Thank you for providing such a great tool, I'm excited to try version 2.0. I've also played around with Prodigy. SpaCy was my start in NLP, I really hope it is going to stay around.

We've developed a great product for our customer with SpaCy; it wouldn't have been possible without it.

nl · on Nov 8, 2017

Thanks to @syllogism for spaCy. It’s one of those tools which make Python the go-to language for NLP.

danso · on Nov 8, 2017

Congrats! Have been following SpaCy since it was first discussed/argued here on HN. I haven't had much reason/imagination to use NLP in work but I frequently recommend it to students as most of their curriculum is centered around old versions of NLTK.

halfdan · on Nov 8, 2017

This is awesome! I've been meaning to get into NLP / Computational Linguistics for a while now.

Can anybody share what kind of projects you're doing that benefit from SpaCy? Do you use it as-is or do you build on top of it?

pqwEfkvjs · on Nov 8, 2017

Kudos to Matthew, Ines and others making this possible.

I haven't checked it out myself yet, so I wanted to ask: have the performance issues that were haunting the 2.0 alpha version been fixed?

syllogism · on Nov 8, 2017

Current discussion: https://github.com/explosion/spaCy/issues/1508

I'm getting around 8k words per second on the smallest Google Cloud instances. You couldn't run spaCy 1 on these instances (or on AWS Lambda) due to memory usage problems, especially problems predicting memory usage for long-running processes. This is why we say spaCy 2 is cheaper to run in a cents-per-word sense than spaCy 1. This is the performance measure that we think is most important.

However, users are still reporting performance problems, so I wouldn't call the issue resolved. spaCy 1 managed to avoid depending on numpy during prediction, making it easy to ensure that performance didn't depend on anyone's environment. spaCy 2 currently does use numpy, introducing these questions around configuration. I'm working to fix this by implementing the forward pass entirely in Cython.

pqwEfkvjs · on Nov 8, 2017

Found the answer myself in the release docs:

> The Language.pipe method allows spaCy to batch documents, which brings a significant performance advantage in v2.0. The new neural networks introduce some overhead per batch, so if you're processing a number of documents in a row, you should use nlp.pipe and process the texts as a stream.

So if you have an event-based system where you can only process a single document at a time, it does not make sense to upgrade yet, because in the single-document case the runtime performance was 10x-100x slower, at least with the 2.0 alpha version.
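One possible middle ground for event-based systems, sketched here under assumptions rather than taken from the spaCy docs, is to buffer incoming texts briefly and hand them over in small batches, so the per-batch overhead is shared. The `MicroBatcher` class and `fake_pipe` function below are hypothetical; in a real setup `process_batch` might be something like `lambda texts: list(nlp.pipe(texts))`.

```python
from typing import Callable, List


class MicroBatcher:
    """Collect single texts and process them in batches of `batch_size`."""

    def __init__(self, process_batch: Callable[[List[str]], list], batch_size: int = 4):
        self.process_batch = process_batch
        self.batch_size = batch_size
        self.pending: List[str] = []
        self.results: list = []

    def submit(self, text: str) -> None:
        """Queue one text; process the queue once it reaches batch_size."""
        self.pending.append(text)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Process whatever is queued, even if the batch is not full."""
        if self.pending:
            self.results.extend(self.process_batch(self.pending))
            self.pending = []


calls = []


def fake_pipe(texts):
    """Stand-in for a batched pipeline call; records each batch size."""
    calls.append(len(texts))
    return [t.split() for t in texts]


batcher = MicroBatcher(fake_pipe, batch_size=4)
for i in range(10):
    batcher.submit(f"event number {i}")
batcher.flush()  # process the leftover partial batch

print(calls)
print(len(batcher.results))
```

The trade-off is latency: each text may wait until its batch fills or is flushed, so this only helps if the system can tolerate a short delay per event.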

syllogism · on Nov 8, 2017

But with a nice caveat: In an event-based system, you can run spaCy 2 with AWS Lambda :). This will be much cheaper than keeping a server warm.

mark_l_watson · on Nov 8, 2017

Really nice work. Is there a bitcoin or PayPal donation page for the spaCy project?

syllogism · on Nov 8, 2017

We actually don't believe in soliciting donations. Ines explains our thinking here: https://ines.io/blog/spacy-commercial-open-source-nlp#moneti...

Basically: donations can only be made from personal funds, but most of the benefits from the software will go to commercial users. That's a pretty lopsided dynamic.

wyldfire · on Nov 8, 2017

Aside: Ines appears to be working on something called Prodigy [1] which seems close to what I imagined would be a "Killer App" after playing around with SpaCy. I look forward to hearing more about it as it matures.

[1] https://prodi.gy/

heavenlyblue · on Nov 8, 2017

The only question that remains is: does it _really_ matter?

alexcnwy · on Nov 8, 2017

Great work - big fan! :)

sho_hn · on Nov 8, 2017

Do you have any plans for Korean support?

syllogism · on Nov 8, 2017

We're definitely interested in Korean support. I hope we can get some contributions for this in the next few months.

My understanding is that there are actually some very good Python libraries for Korean NLP? It's now much easier to provide annotations via another library. This is how the Chinese and Japanese support is working at the moment. We'll add "native" models for all of these languages, but for now you might want to wrap some of these resources: https://github.com/datanada/Awesome-Korean-NLP

est · on Nov 8, 2017