OpenRefine – free, open source, powerful tool for working with messy data (openrefine.org)
235 points by _zhqs on March 22, 2020 | 23 comments



You can do similar stuff using the visidata command line tool: https://www.visidata.org/

You can use python code for more advanced data manipulations and creating plugins.


Wow, that is a nice little tool. Just installed it and tested it on some random files in my current data analysis project.

By the way, I installed it using pipx https://github.com/pipxproject/pipx by running `pipx install visidata`. To also read HDF and Excel files, I added the necessary packages by running `pipx inject visidata h5py openpyxl`.


Thanks @rasmusei! If you are a data scientist, you might also be interested in how to use it alongside Jupyter. Our community has some documentation about that on our wiki: https://github.com/OpenRefine/OpenRefine/wiki/Jupyter
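For example, here is a minimal sketch (not the wiki's exact code) of listing your local projects from a notebook, assuming a running instance on the default port 3333 and the get-all-project-metadata command endpoint:

    # List projects from a locally running OpenRefine (default port 3333).
    # Sketch only -- endpoint and response shape assumed from the command API.
    import requests

    resp = requests.get("http://127.0.0.1:3333/command/core/get-all-project-metadata")
    resp.raise_for_status()
    for project_id, meta in resp.json().get("projects", {}).items():
        print(project_id, meta.get("name"))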


Formerly known as Google Refine.

The history of the rename can be found in the blog: https://openrefine.org/blog/2013/10/12/openrefine-history.ht...


Formerly known as Freebase Gridworks, as also mentioned in the article you linked.


Anyone used this before?

My own experience: I had a lot of data to process, which I thought was the use case for a tool like this, but it took a long time and seemed to have to process the whole dataset just to ingest it properly.

What underlying storage/tech is used? Is it all just web-stack?


It's been my go-to tool for almost a decade for figuring out what's in strange files, and working out how to ingest them. I learned about it from ProPublica; they used it in their series about payola to doctors from big pharma.

https://www.propublica.org/nerds/using-google-refine-for-dat...

It's Java. You can bump the JVM memory parameters (e.g. the -m flag on the refine launch script) to make it handle more data.

It has a disadvantage in the workplace: it's hard for some managers to believe it's local only. The name Google on it means, to them, that it must upload data to cloud servers owned by Google. Maybe the rebranding has reduced that hurdle.


Yes, the size of the data you can process is limited. I had reasonable results with something like 50 MiB after tweaking JVM parameters.

One limitation I find annoying is that you can't easily separate your transformation from the data and apply it to similar but different data. It's all bundled together as "a project."

So incrementally transforming data works well; applying that transformation elsewhere, not really. That bothered me more than the size limitation, which I think is not limiting most use cases with real “messy” data anyway. Maybe processing large-volume event/log data would still need something else.

I used it to look at a JSON export from a SaaS tool and to convert it to a table structure. Cleaning field contents that followed certain conventions, but which evolved over time, things like that. For such use cases it's powerful.


> So incrementally transforming data works well; applying that transformation elsewhere, not really. That bothered me more than the size limitation, which I think is not limiting most use cases with real “messy” data anyway. Maybe processing large-volume event/log data would still need something else.

Agreed, this would be the one feature I would really need. It would be nice to be able to set up (and refine over time) pipelines to automatically clean up new data.
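One partial workaround: OpenRefine can export an operation history as JSON (Undo/Redo > Extract...), and you can replay it against another project. A rough, untested sketch of scripting that against the local API, assuming the default port 3333, the get-csrf-token and apply-operations endpoints, a hypothetical project id, and an extracted history saved as ops.json:

    # Replay an extracted OpenRefine operation history onto another project.
    # Rough sketch only; endpoints and parameters assumed, untested.
    import requests

    BASE = "http://127.0.0.1:3333/command/core"
    project_id = "1234567890123"  # hypothetical id of the project holding the new data

    token = requests.get(f"{BASE}/get-csrf-token").json()["token"]
    with open("ops.json") as f:   # operation history from Undo/Redo > Extract...
        ops = f.read()

    resp = requests.post(
        f"{BASE}/apply-operations",
        params={"project": project_id, "csrf_token": token},
        data={"operations": ops},
    )
    print(resp.json())  # expect {"code": "ok"} on success

It is still per-project rather than a reusable pipeline, but it gets partway there.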


Our current architecture is here: https://github.com/OpenRefine/OpenRefine/wiki/Architecture

YES! We want users to be able to process much larger datasets too! We have started experimenting with Apache Spark on the backend, in the hope that it will let users work with much larger data. This work is being funded by CZI, and you can read the grant proposal here: http://openrefine.org/blog/2019/11/14/czi-eoss.html


A while back, I built an extension with which you can take the OpenRefine mappings and run them on a Hadoop cluster. The idea is that you run all transformations on a small dataset on your local machine; once you are satisfied with the mappings, you deploy them on the Hadoop cluster. I have tested this on large datasets, and it works. Let me know how I can help. You can check my GitHub: https://github.com/rmalla1/OpenRefine-HD


Before you go down the Spark route, consider that perl/unix tools may do this kind of thing quite well: https://livefreeordichotomize.com/2019/06/04/using_awk_and_r...


That author did not have Spark tuned well for the use case, which is a common issue with Spark. Since OpenRefine is commonly used with strings, we plan to optimize in many areas for that, such as a few mentioned here: https://databricks.com/glossary/spark-tuning

But in general, there are always tradeoffs when trying to provide immediate feedback for interactions. Since OpenRefine has many interactive features, some will need to support batching and advise the user in the interface that things will take longer... do you want to send to batch? Some of the tradeoffs, and the ways we plan to address them, are mentioned in our general OpenRefine on Spark issue here: https://github.com/OpenRefine/OpenRefine/issues/1433
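For the curious, the knobs involved are the usual Spark ones; an illustrative PySpark sketch (example values only, not OpenRefine's actual configuration):

    # Illustrative Spark tuning sketch -- not OpenRefine's configuration.
    # Kryo serialization and the shuffle-partition count are typical knobs
    # worth adjusting for string-heavy workloads.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("refine-on-spark-sketch")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.sql.shuffle.partitions", "64")
        .getOrCreate()
    )

    df = spark.read.csv("messy.csv", header=True)  # hypothetical input file
    df.show(5)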


It's Java, but I don't know what's underneath that.

The web browser connects to a local server. You can increase the RAM allocated to the JVM, and I'm fairly sure I've worked on hundreds of thousands of rows, although I don't remember the operations I was using.


I've used it sparingly but recognize that I should be using it more. It is really powerful for cleaning up messy data. I'm not sure of the stack, but I did reach out to one of the developers about this being on Hacker News. Hopefully they can respond.


Thanks Larry for the ping. Happy to answer questions from the community here!


I have used and loved this since it was a project from MIT CSAIL SIMILE (circa 2006).


Follow-up: looking over the old SIMILE site, I couldn't find the original project. David Huynh also didn't mention it on his own website, but some searching yielded the original project, "Parallax":

https://books.google.com/books?id=Y_FZPtpgntwC&pg=PA36

More from the era: https://blog.jonudell.net/2008/08/25/motivating-people-to-wr...


The SIMILE library is used in OpenRefine for certain faceting features like the timeline, clustering, etc. Parallax was created by David to show how time-series data visualizations could be enhanced. David was one of the original designers of OpenRefine, and I worked closely with him and Stefano in testing it.


Reminded me a bit of the 'data wrangler' tool from Stanford. It was* a fantastic tool for dealing with messy data.

http://vis.stanford.edu/wrangler/

*it's now a commercial product maintained by Trifacta (https://www.trifacta.com/start-wrangling/)



That is really interesting but I don't see what it has to do with OpenRefine.

Thanks for the link though.


Was "really interesting" sarcastic? If you found lnav to be interesting, I don't understand how you'd fail to see how it's relevant.

lnav is a mini ETL tool, which, like OpenRefine, aids in transforming data from various formats to make it more useful. They're in the same space.



