Prodigy: A new tool for radically efficient machine teaching (explosion.ai)
286 points by Young_God on Aug 4, 2017 | 69 comments



I don't think (some) people understand; a slick data annotation tool like this is vastly more useful than the 20th variant of GAN that DeepMind produces :)


Totally, I think people have this weird sense of entitlement when it comes to high-quality datasets without the commensurate respect for how they're created or the level of effort that goes into them.

Fei-Fei Li gives a good sense for this in her history of ImageNet [1][2].

[1] https://qz.com/1034972/the-data-that-changed-the-direction-o...

[2] http://image-net.org/challenges/talks_2017/imagenet_ilsvrc20...


Totally agree. Annotation effort can explode out of control if you don't have good tooling. Well done, Explosion!


Looks promising and definitely a needed tool. I signed up for the beta and I used the demo version and have a couple of thoughts.

1. This seems closer to a reinforcement learning system than a pure annotation system. That seems to be by design; however, based on the demo, I am not able to change or add to the annotations as I go, which is a big limitation. It's just yes, no (no feedback), ignore and undo. This is in contrast to something like the VGG annotation system: http://www.robots.ox.ac.uk/~vgg/software/via/via.html

2. I don't see an actual annotation capability for images in the demo. Not sure if that is just a pretotype page, but IMO image classification/segmentation is where this tool would really benefit the community.

3. It's unclear to me how or if I retrieve my trained model or even just the annotated structure (.csv?, .json?) from this system. Do I get a .pb somehow that I can import into TF or am I locked into an API with my new model served from Prodigy? My guess would be the latter.

I think what this wants to be is a human validation system for training, which also improves the Prodigy nets through crowdsourcing. Definitely a win-win in the short term, but it's constrained by the initial model and by how much the user/client can tweak the system and get the results out.

Matroid is doing something similar here, but I have been unimpressed with their offering so far.


Thanks for the engaging questions! Reading between the lines, I think there's an important point that hasn't come across. Prodigy isn't SaaS --- it's a library you download and run. You can extend and customise every aspect of it, and there's definitely no lock-in. The model (and annotations) never have to leave your servers.

For the specific questions:

1. The built-in web views all have binary annotation interfaces. This is more of a design choice than a fundamental limitation, and the front-end is extensible --- you can add your own web views if you need to.

The binary interface is sort of a position statement. We think this is The Way, so we want you to try it. We'll have more input components in future, but at the start we want to guide people towards the intended workflow.

2. The beta focuses on NLP support, but there's a front-end for image classification, and a workflow page: https://prodi.gy/docs/workflow-image-classification .

3. You can usually get some accuracy improvement by retraining once all the annotations are available. I've not found a streaming SGD algorithm that works as well as the simple iterate-and-shuffle batch process. Batch training also lets you tune the hyper-parameters. You can read more about this here: https://prodi.gy/docs/workflow-named-entity-recognition#trai...

I would suggest writing a Prodigy recipe to do the batch training. That way you can pass in the dataset ID, instead of exporting the annotations. There's no problem with exporting the annotations and running a script, though. Again --- it's all on your computer. You can run it however you like.
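
If you do go the export-and-script route, the iterate-and-shuffle loop is roughly this (an illustrative sketch using spaCy 2's training API, not Prodigy's actual recipe code; the toy annotations and the FRUIT label are made up):

    import random
    import spacy

    # Made-up annotations: (text, {"entities": [(start, end, label)]})
    TRAIN_DATA = [
        ("Apple juice is on sale", {"entities": [(0, 5, "FRUIT")]}),
        ("I bought three bananas", {"entities": [(15, 22, "FRUIT")]}),
    ]

    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    ner.add_label("FRUIT")

    optimizer = nlp.begin_training()
    for epoch in range(10):
        random.shuffle(TRAIN_DATA)  # the "shuffle" part
        losses = {}
        for text, annotations in TRAIN_DATA:
            # full passes over the dataset, rather than streaming SGD
            nlp.update([text], [annotations], sgd=optimizer, losses=losses)
        print(epoch, losses)

    nlp.to_disk("/tmp/fruit_model")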


> I think there's an important point that hasn't come across. Prodigy isn't SaaS --- it's a library you download and run.

You're right, I totally missed that. Re-reading it, that comes through, but it's definitely different from what I would have expected.

Thanks for the response, I'll dig in further.


I love how the SpaCy related websites are always so well designed. Their dependency graph visualizer is just amazing. I know that Ines is behind that one, but don't know about the other stuff.

Now coming back to the topic, I have so far just used Jupyter Notebooks and spreadsheets to do annotations and by golly, it is an extremely boring and tedious process. This looks like a fun tool to try out for my next NLP related project. Might spice things up!

But I hope that like all SpaCy related ideas, it doesn't assume too much about the problem at hand. I usually use NLTK instead of SpaCy because it allows me to be very flexible, except for the sentence tokenizer, where SpaCy's accuracy is hard to beat.


Explosion is just me and Ines -- so yes, the pages are all made by Ines (with the great illustrations by Frederique Matti). Ines also wrote the bulk of the code for Prodigy itself.


Hats off to you Ines and Frederique. You guys do really great work.


So this isn't OSS? Seems atypical in the ML community.

For those looking for alternative OSS solutions: BRAT and labelImg are decent.


Guess the radical efficiency didn't carry over to their web server


> spaCy, the leading open-source NLP tool?

Sounds like marketing BS. What about OpenNLP and Stanford's CoreNLP?


Anecdotally, I got better NER results with spaCy than with OpenNLP and CoreNLP using their respective default models, and spaCy was easier to install (though I'm biased, being more familiar with Python tooling and documentation style). I was eventually implementing in Java, so I did use OpenNLP for sentence splitting, but I retrained the NER with data bootstrapped from spaCy, in a way similar to what the Prodigy tool is aiming to facilitate: first classifying using the default/vanilla model and then manually correcting labels where they were incorrect.


They have some hard comparisons in some of their earlier blog posts on how spaCy compares to the other popular open-source NLP libraries. In my experience it has been much easier to use and faster than things like Stanford's library or NLTK. In general it's aimed at production or commercial use, whereas the other libraries I typically hear mentioned are aimed at a more academic audience.


OpenNLP is... well, I've never heard of anyone using it (except once in an ensemble). I think NLTK is more widely used.

Stanford CoreNLP gives good accuracy and is pretty much the benchmark for accuracy in English. BUT it isn't great software. It falls over if you pass large amounts of text to it, the code is dreadful, it's hard to integrate (even in Java, because of its own wacky config system), various parts aren't integrated (e.g. SUTime), it doesn't have an embedding representation, and it is pretty slow.

Having said all that, I still use it sometimes. But spaCy is much nicer to use, and 99% (probably more) of the time the slightly lower accuracy is offset by things like the easy availability of word embeddings right alongside the word tags.

I think it's pretty fair to say spaCy is the leading open-source NLP tool.


This is a product from the same devs as spaCy, Explosion.

https://github.com/explosion


I have to agree; Explosion/spaCy has a lot of marketing BS. That said, I think spaCy is actually pretty solid, and if you're in the NLP field you should give it a try.


> spaCy, the leading open-source NLP tool?

Agreed, the description is definitely cringe-worthy.

As if whoever wrote that wasn't aware that these are language geeks they're marketing to.


Self-respecting language geeks keep up with the times. What's your case for "leading open-source"? Here's a look at spaCy blowing Stanford CoreNLP out of the water (via GitHub stars; you can look at commits and more with the same tool): https://www.datascience.com/trends?trends=4812,7214,7165&tre...


Actually, I don't follow any of these tools closely enough to know whether they're currently "leading" or not.

It's just that, wording-wise "the leading open source X" exudes marketing-speak, which I find language geeks tend to have robust antibodies against.

This kind of lingo works (sort of) for the market that, say, MongoDB is in. But for the users of these tools, I suspect not so much.


When a new tool is announced, there is a lot of casual interest. Hence the 'explosion' in the beginning. After some time, things reach steady-state, and you can see that Spacy's interest is starting to fall below Core NLP's in the last few weeks.

In other words: Spacy is sinking.


> spaCy, the leading open-source NLP tool?

* only supports 3 languages, though


That's a nice UX but the flurry of initial upvotes on this looks kinda fishy, especially given that it's just annotation software.


It's interesting to see how this comes across to people who are outside our ML/NLP bubble. I can definitely relate to the feeling of being confused at why something that looks sort of basic is supposedly significant. It was actually very difficult to build, especially the active learning component for the named entity recognition system.


I'm a data scientist, and getting annotations for our data is one of our most onerous issues. I upvoted this. If it works well, I could see myself using it all the time. Making a model that gets you most of the way there is the easy part; getting clean, annotated data is the hard part. Uggh.


Yep agreed, we've had to build similar things internally.

Getting labelled data is a pain.


Looks pretty fishy I agree


This is kinda why we need down-votes for stories, or HN needs better detection of these strategies.


Since syllogism is participating in this thread, what kind of active learning are you using? I'm always hesitant to use anything except for IWAL since most of the more common ones aren't actually consistent. Even then, the payoff tends to be kinda disappointing.

(But I'm definitely not an expert)


Yes, it uses importance weighted active learning. You can set the priorities yourself, but the default built-in sorter just uses distance from 0.5. There's a random component to help make sure the model doesn't get stuck asking the wrong questions.
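
Roughly, the default behaviour looks something like this (an illustrative sketch, not the actual Prodigy code; the 0.1 mixing weight is made up):

    import random

    def uncertainty_filter(scored_stream, bias=0.1):
        """Emit the examples the model is least sure about.

        scored_stream: iterable of (score, example) pairs, where score is the
        model's probability for the positive class. Scores near 0.5 get the
        highest priority; the random component keeps the model from getting
        stuck asking the same kind of question over and over.
        """
        for score, example in scored_stream:
            uncertainty = 1.0 - abs(score - 0.5) * 2.0  # 1.0 at 0.5, 0.0 at 0 or 1
            priority = (1.0 - bias) * uncertainty + bias * random.random()
            if random.random() < priority:              # emit with probability = priority
                yield example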


Thanks for the answer :)

Great work here, btw. It's refreshing to see emphasis on "label some damned data" and work towards making that easy.


This looks interesting because it adds the ability to put the user in the loop of fixing/annotating the problematic observations relatively easily. I like the example of Tinder for data.

Are the examples picked those that have the highest objective function error rate, or something similar?

Does this apply only to text classification problems? Are there examples where this could be applied to tabular data?


This could be a headline from 1987 :-) (cue the dialup modem sound)


Looks very nice, although it always takes me a bit to figure out what they're talking about with these sorts of things because I have to remind myself that most ML/DL stuff is supervised. What I research is unsupervised.

They kind of have this weird dissing of unsupervised scenarios, though. It's not like supervised or unsupervised is better or worse; they just address different problems. They can talk up their product without needing to criticize a problem domain.

It's like if you were making motors for boats, and then started talking about "these crazy people who think it's better to fly." ???


I see how that came across as obnoxious, so thanks for the perspective.

I do think there's a pretty common failure mode for teams who don't have much experience with ML, though: they often take "We don't have much data" as a parameter of their problem, and don't see that this is something they can decide to change. This can lead to a lot of time spent experimenting with different unsupervised approaches that are a poor fit for what they're trying to do.


How many languages are supported? I see many more languages in Google's Syntaxnet library. What's keeping you from having the same list of 40 languages for POS tagging?

https://github.com/tensorflow/models/blob/master/syntaxnet/g...


The UD treebanks have made it very easy to offer lots of POS tagging and dependency parsing models under a CC BY-NC license. We'll be putting up more of these for download as spaCy 2 stabilises.

We're mostly worried about saying we "support" a language when we've just trained a tagger on a UD treebank, though. We like at least having the stop words and tokenizer exceptions filled in by a native speaker, so the usual flow has been that someone needs the functionality, and they make a pull request.

If you just need the UD model for say, Bulgarian, you can do:

    python -m spacy train xx /path/to/output_model /path/to/bulgarian-train.conllu /path/to/bulgarian-dev.conllu --no-entities

We don't have a spacy.bg.Bulgarian language class yet, so you can either add one, or use the multi-language class, which usually works OK.
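
Loading the result works like any other spaCy model directory, e.g. (paths and example sentence are illustrative):

    import spacy

    # Load the model trained with the command above from its output directory.
    nlp = spacy.load('/path/to/output_model')
    doc = nlp('Котката спи на дивана.')  # "The cat is sleeping on the sofa."
    for token in doc:
        print(token.text, token.pos_, token.dep_)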


I've been pulling my hair out and losing sleep over a specific problem I need to solve for a client. This tool, along with the linked spaCy lib, has not only reduced the complexity of the task to something manageable, but also drastically reduced the projected completion time. In other words: holy shit, thank you, OP.


I like the simpler annotation UI; you can get more of your team involved in annotation, Mechanical Turk style.


http://mirror.explosion.ai/blog/prodigy-annotation-tool-acti...

Sorry about the poor performance on the site! We got complacent because all of our sites are 100% static.


To me, Matthew and Ines are to NLP as Bernstein & co are to cryptography.



So this is just fluff?


Nope, not just fluff. New ML model architectures get too much hype, while it's relatively simple tools like this that actually make the difference in whether or not ML can be applied to industry problems. The low-hanging fruit in the ML industry is in workflow tools rather than novel model architectures. I have a huge amount of respect for the folks at explosion.ai, largely because their solutions are consistently good in practice rather than good in theory.


You might be interested in Deep Video Analytics, a visual data analytics platform that I am building. [1]

[1] https://github.com/AKSHAYUBHAT/DeepVideoAnalytics


Exactly. I'm working on something related: building a UI on top of declarative ETL pipelines to drive ML models. I think a lot of time (and big data resources) can be saved.


It looks like it was first posted to Reddit 40 minutes ago.

It looks like it runs the model online while you type annotations, trying to predict them.

When it says teaching, it means teaching the AI. When it says "radical" it means ... getting slightly more data input, and in an online manner.


No, I don't think so. People tend to over-emphasize the latest ML techniques when the greatest improvements right now come from better and cleaner data sets. I am interested in this sort of thing because we were just about to try and build something like it ourselves.


I work with data as a neuroscientist, but I haven't used ML. What is an annotation in this context?


Adding mark-up, basically.

Examples for text data: topic tags, marking mentions of companies, finding descriptions of protein interactions. For images people mostly do segmentation (finding object boundaries) and classification.

If the model is generating images or text, you usually need to do annotation to evaluate the model. For instance, you need to know whether the translations are grammatical, whether an image is coloured correctly, etc.
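
For a concrete (made-up) example, a single named-entity annotation for text is basically the raw text plus labelled character spans:

    # One made-up annotation: the raw text plus character offsets and a label
    # for each span a human has marked.
    example = {
        "text": "Explosion AI released Prodigy in August 2017.",
        "spans": [
            {"start": 0, "end": 12, "label": "ORG"},      # "Explosion AI"
            {"start": 22, "end": 29, "label": "PRODUCT"},  # "Prodigy"
        ],
    }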


I have an NLP bot as a hobby, but use old-fashioned statistics instead of ML. It looks like annotations here mean manually training the bot by seeding the learning data with hints.


A utility for labeling training data for supervised learning then?


Yes, with a bit of a twist. As I understand it, it'll keep retraining the model and asking you to label the examples it's least sure about. This is a lot faster and better than randomly labelling your data or trying to do it all.


Wow, been a while since I've touched UX this good, loved the themes, is this open source?


link dead already


Working for me?


working now


Site down?


It's a 100% static site, but Apache is still struggling :(. Should have used a bigger droplet...Sorry!


Should have used nginx, or at least put cloudflare in front of it


You can always just use an S3 bucket.


Turn off KeepAlive, if it's on.


Bad advice. That's likely to make it a lot worse.


I disagree with you based on experience, but you don't have to take my word for it. 'patio11 also has had some experience here: http://www.kalzumeus.com/2010/06/19/running-apache-on-a-memo...


Also disagree with you based on experience ;)

patio11's blog is HTTP-only, but this blog is HTTPS.

HTTPS without keepalive is likely to kill any cheap VPS; establishing an HTTPS connection is intensive.

That being said, the core of the issue is that they should use nginx (or Apache with mpm_event). And they should have Cloudflare in front.


> establishing an HTTPS connection is intensive

More intensive than adding numbers together, sure, but computers are pretty fast. If you're doing 100k connections a second you might have to give some thought to that. Meanwhile, if you have KeepAlive on, 2~5 clients per second will kill you.


I guess we're going to just wait here until OP delivers the answer for why the machine couldn't answer enough requests :)


this brings back dialup/bbs memories...


Very wordy.. not very efficient..



