Announcing Google Refine 2.0, a power tool for data wranglers (google-opensource.blogspot.com)
148 points by Anon84 on Nov 10, 2010 | 21 comments



So I'm on ProPublica's web team -- the organization mentioned in the first video -- and we deal with the types of messy data Refine is made for on a day-to-day basis.

We've been using it pretty much daily for about 5 months now. Cleaning messy government data used to be time-consuming and destructive; with Google Refine it's easy and fast to join, clean up and do rudimentary analysis on that data.

It especially shines when you have to merge many disparate data sets into one. My colleague, Dan Nguyen, did just that for our Dollars for Doctors app:

http://projects.propublica.org/docdollars/

and he scraped the data from reports like this:

http://www.pfizer.com/responsibility/working_with_hcp/paymen...

(one company even put the disclosures up as a Flash movie).

Of course we could write scripts, use grep/awk/sed or import it into a database, but Refine is really it. I encourage you to give it a try if you have questionable data you'll need to clean.


Would you and your team be willing to write up a quick intro or howto or article about how you're using the tool? Some real-life scenarios and examples might be very useful.

Also, you guys do great work - keep it up!


I think we have 2 posts in the hopper about it; keep an eye on the nerd blog:

http://www.propublica.org/nerds


Thanks for your work.

How did you deal with the Flash content? Decompile the source code? Did you encounter tabular PDF data? If so, did you find a good solution?

Also, have you or your colleagues had any contact with the Wolfram Alpha team? It seems like your organizations have similar data curation goals.

http://blog.stephenwolfram.com/2010/10/the-emerging-computat...


I didn't grab the Flash content, but if I remember correctly, it was a Flash movie that wrapped a PDF, which Dan then OCRed and cleaned up with Refine. The coolest part was that the PDF was in grid form, so Dan wrote an ImageMagick script that split it into individual cells and then OCRed each cell (for better results).

EDIT: We haven't had any contact with Wolfram|Alpha but maybe we should reach out.
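
For anyone curious about the grid-splitting step, here is a rough Python sketch of the idea, not Dan's actual ImageMagick script. It only computes the crop rectangle for each cell of a uniform grid; the function name `cell_boxes`, the page size and the row/column counts are all made up for illustration. The cropping itself and the OCR would be done by whatever image and OCR tools you have, and that part is not shown.

```python
def cell_boxes(width, height, rows, cols):
    """Return (left, upper, right, lower) pixel boxes for a uniform grid.

    Boxes are listed row by row. Integer division is applied to the
    cumulative edge positions, so the cells always tile the page exactly,
    even when the size is not evenly divisible.
    """
    if rows <= 0 or cols <= 0:
        raise ValueError("rows and cols must be positive")
    boxes = []
    for r in range(rows):
        upper = r * height // rows
        lower = (r + 1) * height // rows
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            boxes.append((left, upper, right, lower))
    return boxes


# Hypothetical page of 100x60 pixels holding a 2x2 table.
for box in cell_boxes(100, 60, 2, 2):
    print(box)
```

Each box can then be cropped out and OCRed on its own, which is the trick described above: the OCR engine sees one small cell at a time instead of a whole table.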


How do you guys make money?

Edit: I see that you are a nonprofit organization from your "About" page.


Yeah, non-profit, mostly foundation grants.


I can now finally see a glimpse of a bright future where all my ID3 tags are rationalized.

I never thought this day would come.


This is huge for me. I manage an inventory system for several government contractors and you'd be amazed at the thousands and thousands of inconsistencies you can find. Sometimes it takes days, and on one occasion, two weeks to completely sanitize them.

After a quick trial with this, I'm sold. This is truly amazing for people with jobs similar to mine.


This is pretty neat, but it seems like an advanced version of Google Docs Spreadsheets. I wonder if they'll roll these features into that. The OSS project is nice for confidential data, but I think a lot of people would use a hosted version. Anyone going to set one up?

I literally just did the same exercise as demonstrated in the second video, parsing a Wiki document (the list of world religions) from Wikipedia. But it took about 30 lines of PHP.

Maybe they'll add the ability to import a web page as a data source, and export the script that does the transformations as a python script?

Okay. I'm rambling...


I haven't used some of the cooler import features of Google Docs Spreadsheets, but my guess is that Refine will be better if

a) you are not skilled in a language that handles text easily (I have friends who might have a need to aggregate that data but would have no clue how to begin writing a program to parse it).

b) you don't have a Peter Norvig-like facility with algorithms: http://norvig.com/ngrams/ i.e. you could code the transformation given sufficient time, but Refine would be faster, both in development and, for large data sets, run time.


You can grab data from a web API, for example, geocoding:

http://code.google.com/p/google-refine/wiki/Geocoding


Wow, this is amazing! In the real world data can be messy, and this looks like a great tool to transform it without an extensive custom ETL process that requires code.


I think I'm in love!

When I saw this video last week I started wondering whether there was any data munging that I've been putting off. There is, but I need to do some scraping first.

In the meantime, I sent it to a friend who is working on an iPhone app that draws on a government database. He was thrilled to have something more interactive and productive (in his perception, at least) than Python + Excel.


It's nice that they've done this, because it makes powerful data operations available to non-programmers. I'll be sticking with my Unix command line tools, though.


Honest question: how do you do clustering with Unix tools?
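
For context on what "clustering" means here, one simple approach is key collision: normalize every value to a key and group the values whose keys match. Below is a minimal Python sketch of that idea (not shell tools, and not necessarily what Refine does internally). The normalization steps in `fingerprint` and the sample values are assumptions chosen for illustration.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a string to a key: trim, lowercase, strip ASCII
    punctuation, then sort and de-duplicate the whitespace-separated tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group distinct values that share a fingerprint.

    Returns only the groups with more than one distinct spelling,
    each sorted so the output is deterministic.
    """
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Made-up sample values, not real data.
names = ["Acme Corp.", "acme corp", "Corp Acme", "Globex"]
print(cluster(names))
```

The lowercase and punctuation steps are easy to reproduce with tr, sort and uniq; the token reordering is the part that gets awkward in a pipeline.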


Speaking of Google, is it just me or did they just introduce a brand new 'preview' button (with pop-up) beside their search results? That wasn't there before, right?


HN thread on Google previews: http://news.ycombinator.com/item?id=1892152


They did yesterday, if this is what you're talking about: http://techcrunch.com/2010/11/09/google-instant-previews/


Google just went plaid for a lot of people.


Spaceballs reference? Ludicrous speed!

Colonel Sandurz: We can't stop, it's too dangerous! We have to slow down first! http://www.imdb.com/title/tt0094012/quotes

Google and the Spaceballs seem to have a lot in common.



