Extracting Structured Data from Recipes Using Conditional Random Fields (nytimes.com)
79 points by aaronbrethorst on April 9, 2015 | 6 comments



"But there is an ever-increasing appetite from developers and designers for finely structured data to power our digital products and at some point, we will need to develop algorithmic solutions to help with these tasks."

One really cool area in which this sort of algorithm could be applied is identifying location data.

Imagine an algorithm that could scan through a Times story like this one ...

http://query.nytimes.com/gst/fullpage.html?res=9902EFDE1230F...

... and extract from the text all location identifiers, then geocode them:

"Seventh Avenue and 36th Street" --> 40.7522877,-73.9897059

"Bleecker Street between Sullivan and Thompson" --> 40.728887,-73.999566

"Chrystie and Rivington" --> "40.7212581,-73.99224"

I used to work for a metro daily, where I developed a script that let us geocode an address by highlighting it in our CMS and clicking a button. But that still required an editor to highlight the correct portion of the text.

Using an algorithm to perform the task instead of an editor would open up some incredible possibilities.

For instance, imagine a local news alerts service in which you could enter your location and a radius, and receive alerts whenever a news item mentioned a location within that radius. (I once developed a prototype of such a service, but the lack of a fully automated process for identifying locations led me to shelve it.)
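
To make that concrete, here's a rough sketch of such a pipeline in Python. Two stand-ins I'm assuming: NLTK's stock named-entity chunker (which needs the 'punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', and 'words' data packages downloaded) and geopy's Nominatim geocoder. The stock chunker only catches conventional place names, so intersections like "Chrystie and Rivington" would need a custom-trained model, CRF or otherwise, which is exactly where the article's approach comes in:

    # Sketch: extract place-name spans from a story, then geocode them.
    import nltk
    from geopy.geocoders import Nominatim

    def extract_locations(text):
        # Yield spans the chunker tags as geographic/facility entities.
        tokens = nltk.word_tokenize(text)
        tree = nltk.ne_chunk(nltk.pos_tag(tokens))
        for subtree in tree.subtrees():
            if subtree.label() in ("GPE", "LOCATION", "FACILITY"):
                yield " ".join(word for word, tag in subtree.leaves())

    geocoder = Nominatim(user_agent="news-geocoder-sketch")

    story = "The fire spread from Bleecker Street toward Greenwich Village."
    for place in extract_locations(story):
        hit = geocoder.geocode(place + ", New York, NY")
        if hit:
            print(place, "-->", hit.latitude, hit.longitude)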


> One really cool area in which this sort of algorithm could be applied is identifying location data.

You may want to try "PlaceSpotter" from Yahoo: https://developer.yahoo.com/boss/geo/

I haven't tried it myself, but did look at it for a similar idea a while back.


> extract from the text all location identifiers

That's what one of MetaCarta's products does: http://en.wikipedia.org/wiki/MetaCarta


Weird coincidence: I just read this yesterday about the LA Times doing the same thing:

http://datadesk.latimes.com/posts/2013/12/natural-language-p...

The LA Times article includes some generic Python NLTK code. They used a MaxEnt classifier instead of a CRF.
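
For anyone curious about the practical difference: a MaxEnt classifier labels each token independently from its own features, while a CRF also models transitions between adjacent labels (useful when, say, a unit almost always follows a quantity). A toy NLTK sketch of the MaxEnt side, with made-up labels and features (not the LA Times' actual code):

    import nltk

    # Toy training data: each ingredient token gets a field label. In a
    # real system these come from a hand-labeled corpus like the NYT's.
    LABELED = [
        ("1", "QTY"), ("cup", "UNIT"), ("flour", "NAME"),
        ("2", "QTY"), ("tablespoons", "UNIT"), ("butter", "NAME"),
        ("3", "QTY"), ("ounces", "UNIT"), ("salt", "NAME"),
    ]

    def features(token):
        # Per-token features only: MaxEnt scores each token in isolation,
        # whereas a CRF would also score label-to-label transitions.
        return {
            "lower": token.lower(),
            "is_number": token.replace("/", "").isdigit(),
            "suffix2": token[-2:],
        }

    train_set = [(features(tok), label) for tok, label in LABELED]
    classifier = nltk.MaxentClassifier.train(train_set, max_iter=10)

    print(classifier.classify(features("4")))          # expect: QTY
    print(classifier.classify(features("teaspoons")))  # expect: UNIT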


This is really interesting to me, as I've just been solving the same ingredient parsing problem in my iOS app (Zest Recipe Manager) to implement smart shopping lists. Although I was tempted to use a statistical approach, I opted to start with a more direct heuristic approach to see how far I could get (and to make sure I really understood the issues before trying a more generic solution).

The heuristic approach actually works pretty well, though it took a significant amount of effort! A lot of ambiguities can be resolved with a custom algorithm of this kind, and for shopping-list support (where the common cases matter most) the results are excellent. But there are ambiguities I've had to hack around that would probably be better resolved with a probabilistic method, and there are cases where some actual NLP is required to properly detect extraneous descriptive phrases and the like. I'm considering adding a statistical helper to my custom parser to take it to the next level.
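
For anyone wondering what the heuristic route looks like, below is a much-simplified sketch of the core quantity/unit/name split (my own toy version, not Zest's code). A production parser needs many more unit aliases, fraction formats, and special cases:

    import re

    # Match an optional quantity (mixed number, fraction, or integer),
    # then an optional known unit; the remainder is the ingredient name.
    UNITS = {"cup", "cups", "tablespoon", "tablespoons", "teaspoon",
             "teaspoons", "ounce", "ounces", "pound", "pounds",
             "clove", "cloves"}

    LINE_RE = re.compile(r"^\s*(?P<qty>\d+\s+\d/\d|\d/\d|\d+)?\s*(?P<rest>.*)$")

    def parse_ingredient(line):
        m = LINE_RE.match(line)
        tokens = m.group("rest").split()
        unit = None
        if tokens and tokens[0].lower() in UNITS:
            unit = tokens.pop(0)
        return {"quantity": m.group("qty"), "unit": unit,
                "name": " ".join(tokens)}

    print(parse_ingredient("2 cups chopped onions"))
    # {'quantity': '2', 'unit': 'cups', 'name': 'chopped onions'}
    print(parse_ingredient("1 1/2 tablespoons olive oil"))
    # {'quantity': '1 1/2', 'unit': 'tablespoons', 'name': 'olive oil'}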


Funny enough, I'm also working in this space at the moment.

Right now we are training models to identify cuisines and diets in multiple languages.

Anyone interested in this space might also want to check out Yummly (www.yummly.com).



