OpenRefine – free, open source, powerful tool for working with messy data (openrefine.org)
235 points by _zhqs on March 22, 2020 | 23 comments



You can do similar stuff using the visidata command line tool: https://www.visidata.org/

You can use python code for more advanced data manipulations and creating plugins.


Wow, that is a nice little tool. Just installed it and tested it on some random files in my current data analysis project.

By the way, I installed it using pipx https://github.com/pipxproject/pipx by running `pipx install visidata`. To also read HDF and Excel files, I added the necessary packages by running `pipx inject visidata h5py openpyxl`.


Thanks @rasmusei! If you are a data scientist, you might also be interested in how to use it alongside Jupyter. Our community has some documentation about that on our wiki: https://github.com/OpenRefine/OpenRefine/wiki/Jupyter
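For example, here is a minimal sketch (not the wiki's exact code) of listing your local projects from a notebook, assuming a running instance on the default port 3333 and the get-all-project-metadata command endpoint:

    # List projects from a locally running OpenRefine (default port 3333).
    # Sketch only -- endpoint and response shape assumed from the command API.
    import requests

    resp = requests.get("http://127.0.0.1:3333/command/core/get-all-project-metadata")
    resp.raise_for_status()
    for project_id, meta in resp.json().get("projects", {}).items():
        print(project_id, meta.get("name"))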


Formerly known as Google Refine.

The history of the rename can be found in the blog: https://openrefine.org/blog/2013/10/12/openrefine-history.ht...


Formerly known as Freebase Gridworks, as also mentioned in the article you linked.


Anyone used this before?

My own experience: I had a lot of data to process, which I thought was the use case for a tool like this, but it took a long time and seemed to have to process the whole dataset just to ingest it properly.

What underlying storage/tech is used? Is it all just web-stack?


It's been my go-to tool for almost a decade for figuring out what's in strange files, and working out how to ingest them. I learned about it from ProPublica; they used it in their series about payola to doctors from big pharma.

https://www.propublica.org/nerds/using-google-refine-for-dat...

It's Java. You can bump the JVM memory parameters (e.g. the -m flag on the refine launch script) to make it handle more data.

It has a disadvantage in the workplace: it's hard for some managers to believe it's local only. The name Google on it means, to them, that it must upload data to cloud servers owned by Google. Maybe the rebranding has reduced that hurdle.


Yes, the size of the data you can process is limited. I had reasonable results with something like 50 MiB after tweaking JVM parameters.

One limitation I find annoying is that you can't easily separate your transformation from the data and apply it to similar but different data. It's all bundled together as "a project."

So incrementally transforming data works well; applying that transformation elsewhere, not really. That bothered me more than the size limitation, which I think is not limiting most use cases with real “messy” data anyway. Maybe processing large-volume event/log data would still need something else.

I used it to look at a JSON export from a SaaS tool and to convert it to a table structure. Cleaning field contents that followed certain conventions, but which evolved over time, things like that. For such use cases it's powerful.


> So incrementally transforming data works well; applying that transformation elsewhere, not really. That bothered me more than the size limitation, which I think is not limiting most use cases with real “messy” data anyway. Maybe processing large-volume event/log data would still need something else.

Agreed, this would be the one feature I would really need. It would be nice to be able to set up (and refine over time) pipelines to automatically clean up new data.
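One partial workaround: OpenRefine can export an operation history as JSON (Undo/Redo > Extract...), and you can replay it against another project. A rough, untested sketch of scripting that against the local API, assuming the default port 3333, the get-csrf-token and apply-operations endpoints, a hypothetical project id, and an extracted history saved as ops.json:

    # Replay an extracted OpenRefine operation history onto another project.
    # Rough sketch only; endpoints and parameters assumed, untested.
    import requests

    BASE = "http://127.0.0.1:3333/command/core"
    project_id = "1234567890123"  # hypothetical id of the project holding the new data

    token = requests.get(f"{BASE}/get-csrf-token").json()["token"]
    with open("ops.json") as f:   # operation history from Undo/Redo > Extract...
        ops = f.read()

    resp = requests.post(
        f"{BASE}/apply-operations",
        params={"project": project_id, "csrf_token": token},
        data={"operations": ops},
    )
    print(resp.json())  # expect {"code": "ok"} on success

It is still per-project rather than a reusable pipeline, but it gets partway there.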


Our current architecture is here: https://github.com/OpenRefine/OpenRefine/wiki/Architecture

YES! We want users to be able to process much larger datasets too! We have started experimenting with Apache Spark on the backend, in the hope that it will let users work with much larger data. This work is being funded by CZI, and you can read the grant proposal here: http://openrefine.org/blog/2019/11/14/czi-eoss.html


A while back, I built an extension with which you can take the OpenRefine mappings and run them on a Hadoop cluster. The idea is that you run all transformations on a small dataset on your local machine; once you are satisfied with the mappings, you deploy them on the Hadoop cluster. I have tested this on large datasets, and it works. Let me know how I can help. You can check my GitHub: https://github.com/rmalla1/OpenRefine-HD


Before you go down the Spark route, consider that perl/unix tools may do this kind of thing quite well: https://livefreeordichotomize.com/2019/06/04/using_awk_and_r...


That author did not have Spark tuned well for the use case, which is a common issue with Spark. Since OpenRefine is commonly used with strings, we plan to optimize in many areas for that, such as a few mentioned here: https://databricks.com/glossary/spark-tuning

But in general, there are always tradeoffs when trying to provide immediate feedback for interactions. Since OpenRefine has many interactive features, some will need to support batching and advise the user in the interface that things will take longer... do you want to send to batch? Some of the tradeoffs, and the ways we plan to address them, are mentioned in our general OpenRefine on Spark issue here: https://github.com/OpenRefine/OpenRefine/issues/1433
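For the curious, the knobs involved are the usual Spark ones; an illustrative PySpark sketch (example values only, not OpenRefine's actual configuration):

    # Illustrative Spark tuning sketch -- not OpenRefine's configuration.
    # Kryo serialization and the shuffle-partition count are typical knobs
    # worth adjusting for string-heavy workloads.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("refine-on-spark-sketch")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.sql.shuffle.partitions", "64")
        .getOrCreate()
    )

    df = spark.read.csv("messy.csv", header=True)  # hypothetical input file
    df.show(5)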


It's Java, but I don't know what's underneath that.

The web browser connects to a local server. You can increase the RAM allocated to the JVM, and I'm fairly sure I've worked on hundreds of thousands of rows, although I don't remember the operations I was using.


I've used it sparingly but recognize that I should be using it more. It is really powerful for cleaning up messy data. I'm not sure of the stack, but I did reach out to one of the developers about this being on Hacker News. Hopefully they can respond.


Thanks Larry for the ping. Happy to answer questions from the community here!


I have used and loved this since it was a project from MIT CSAIL SIMILE (circa 2006).


Follow-up: looking over the old SIMILE site, I couldn't find the original project. David Huynh also didn't mention it on his own website, but some searching yielded the original project, "Parallax":

https://books.google.com/books?id=Y_FZPtpgntwC&pg=PA36

More from the era: https://blog.jonudell.net/2008/08/25/motivating-people-to-wr...


The SIMILE library is used in OpenRefine for certain faceting features like the timeline, clustering, etc. Parallax was created by David to show how time-series data visualizations could be enhanced. David was one of the original designers of OpenRefine, and I worked closely with him and Stefano in testing it.


Reminded me a bit of the 'data wrangler' tool from Stanford. It was* a fantastic tool for dealing with messy data.

http://vis.stanford.edu/wrangler/

*it's now a commercial product maintained by Trifacta (https://www.trifacta.com/start-wrangling/)



That is really interesting but I don't see what it has to do with OpenRefine.

Thanks for the link though.


Was "really interesting" sarcastic? If you found lnav to be interesting, I don't understand how you'd fail to see how it's relevant.

lnav is a mini ETL tool, which, like OpenRefine, aids in transforming data from various formats to make it more useful. They're in the same space.



