Four or five years ago, this was a tool I was using almost every day for work. Doing data consolidation and migrations for small nonprofits, we were faced with so many loosely structured Excel sheets and CSV exports from various mailing programs. OpenRefine was absolutely instrumental in cleaning up lots of disparate data when the data sources were too many and too variable to make a scripted solution valuable. Glad to see it lives on.
I have been working at a nonprofit and have only recently started using this for cleaning up Excel or CSV files that we want to import. I am not as familiar with doing this with code, but I love that this tool gives me the steps I have taken in case I ever want to audit the changes I made to the data. The one disadvantage I see is that it seems like it’s only for a single user and it might be burdensome to collaborate since you have to share the project file.
I’m still excited to learn more about OpenRefine, but I guess maybe something like Google Colab might be better in terms of sharing and having direct access to our G Drives.
I need to take another look at this. I've recently started working with RDF data on a scale of ~50 million URIs, not including properties and statements, and there's a ton of duplicates in there. I loaded a 10k entity subset of this and OpenRefine found all of the duplicates I had manually found, plus others that I guess were 'similar'. Really cool, but it crashed when attempting to merge entities together. I've got a pipeline transforming the original JSON dataset to RDF; maybe it would work better with the looser structure. What scale of CSV data did you have?
Can’t speak for OP but I moved to Exploratory.io. And the beauty of it is, it’s a GUI for R so you can export your transformation steps to R if needed.
I had a look at how to use this. The video I watched is a couple years old, but probably mostly relevant still. https://youtu.be/nORS7STbLyk
The thing that really resonates with me here is the way they use faceting to find bad data.
When I write pipelines on the command line, I sometimes find it necessary to filter and select data in various ways. Because of this I end up rerunning CLI pipelines multiple times. If instead I dump the output to CSV, I could see myself using OpenRefine with its faceting to pick out the relevant data for processing.
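A text facet is basically the distinct values of a column with their counts, which is why rare values (typos, odd categories) jump out. A minimal stdlib sketch of that idea over a CSV dump — the `status` column and its values are invented for illustration:

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical CSV dumped from a CLI pipeline; the "status" column is made up.
dump = """id,status
1,ok
2,ok
3,eror
4,failed
5,ok
"""

# A text facet is just a count of distinct values in one column;
# the rare typo "eror" stands out immediately next to "ok".
rows = list(csv.DictReader(StringIO(dump)))
facet = Counter(row["status"] for row in rows)

for value, count in facet.most_common():
    print(value, count)
```

In OpenRefine you would then click the suspicious facet value to select just those rows; here you would filter `rows` on the same condition.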
You could rewrite it, but that doesn't really solve a problem this thing has. WebAssembly is of course an opportunity to bring in a lot of existing data processing frameworks from e.g. the Python, Julia, R, etc. worlds and run them in a browser. Refine did a lot of its processing with a Java-based server approach. The goal should be to reuse, not to reinvent, if you take on a project like this. ChatGPT integration is a no-brainer for this stuff these days. It excels at cleaning things up and figuring out unstructured/half-structured data.
The spreadsheet UI is super useful and something that non-technical people are much more comfortable dealing with. I've used Google Sheets as an interface to business people over the years. Whether it is categorizations, place descriptions, addresses, etc., just put it in a spreadsheet.
Instead of building complicated UIs and tools, you just build a CSV/TSV importer and let people do their thing in a spreadsheet, export, validate, import. Once you get it into people's heads that the column names are off limits for editing, they kind of get it. The nice thing about this stuff is that it is low tech, easy, and effective. And easy to explain to an intern, product owner, or other person that needs to sit down and do the monkey work.
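The "column names are off limits" rule can also be enforced mechanically at import time. A minimal sketch of such a validating importer — the expected header here is a made-up example schema:

```python
import csv
from io import StringIO

# Made-up schema for illustration; a real importer would load this
# from wherever the export was originally generated.
EXPECTED_HEADER = ["id", "name", "category"]

def import_rows(text):
    """Parse a CSV export, rejecting it outright if the header row was edited."""
    reader = csv.reader(StringIO(text))
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError(f"header was edited: expected {EXPECTED_HEADER}, got {header}")
    return [dict(zip(header, row)) for row in reader]

rows = import_rows("id,name,category\n1,Cafe X,restaurant\n")
```

Failing loudly on a renamed or reordered column is usually kinder than silently importing garbage, since the person doing the editing gets immediate feedback.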
Refine takes this to the next level. You can take any old data in tabular format and cluster it phonetically, by minor spelling differences, or by other criteria, bulk edit some rows, and export it. It's also easy to enrich things via some REST API or run some simple scripts. But even just the bulk editing and grouping is super useful. We used it when it was still Google Refine, more than 12 years ago, to clean up tens of thousands of POIs. Typically we'd be grouping things on e.g. the city name and find that there would be a few spelling variations, like München, Munchen, Muenchen, Munich, etc. Toss in a few UTF-8 encoding issues where the ü got garbled and it's a perfect tool for cleaning that up.
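For anyone curious how that clustering works under the hood: OpenRefine's default "fingerprint" keying method roughly trims, lowercases, strips punctuation, folds accents to ASCII, then sorts and de-duplicates the tokens. A rough approximation (not the exact implementation):

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(value):
    # Approximation of OpenRefine's "fingerprint" keying method:
    # trim, lowercase, fold accents to plain ASCII, drop punctuation,
    # then sort and de-duplicate whitespace-separated tokens.
    value = value.strip().lower()
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    value = re.sub(r"[^\w\s]", "", value)
    return " ".join(sorted(set(value.split())))

cities = ["München", "Munchen", "münchen ", "Muenchen", "Munich"]
clusters = defaultdict(list)
for city in cities:
    clusters[fingerprint(city)].append(city)
```

Note that this key catches "München"/"Munchen" (and stray whitespace) but not "Muenchen" or "Munich"; those variants need one of the phonetic methods, or a human clicking "merge" in the cluster dialog.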
Tens of thousands of records is potentially a lot of work but still tiny data. We had a machine learning team that used machine learning as the hammer for the proverbial nail. Google Refine achieved more in 1 afternoon than that team did trying to machine learn their way out of that mess in half a year.
Cool tool that I've been looking for an excuse to use more. Glad to see that it has continued to be updated and improved after evolving from Freebase Gridworks to Google Refine to OpenRefine.
It is a powerful tool, but like other tools in this space (think e.g. pandas) it needs some serious getting used to; it is not the most intuitive user interface.
What was disappointing last time I used it is arguably not a problem of OpenRefine at all: the connectivity and response of Wikidata queries was very slow. But that combination of local data harmonization with an open and globally available reference is super important. I hope it somehow receives more attention and traction.
Their main blocker is the lack of a high-performance SPARQL backend for complex queries where ordinary text-based search is not effective. https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/... has more information, though it's not fully up to date (QLever especially has had significant updates since).
Tools like this allow for easy transformations and data wrangling while also keeping project history in a way that helps preserve lineage. Very valuable.
Not the most straightforward way to do it, but it does support joining across two datasets. You can reference and add columns from a different “project”.
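Conceptually that cross-project column addition is a lookup join on a key column. A plain-Python sketch of the same operation outside the tool — the dataset names, columns, and values here are all invented:

```python
import csv
from io import StringIO

# Two tiny "projects" as inline CSVs; names and columns are invented.
people = "id,name\n1,Ada\n2,Grace\n"
emails = "id,email\n1,ada@example.org\n2,grace@example.org\n"

# Build a lookup keyed on the join column of the second dataset, then
# add its column to each row of the first. This is roughly what adding
# a column from another project does in OpenRefine.
email_by_id = {r["id"]: r["email"] for r in csv.DictReader(StringIO(emails))}
joined = [
    {**r, "email": email_by_id.get(r["id"], "")}
    for r in csv.DictReader(StringIO(people))
]
```

Rows with no match in the second dataset just get an empty value here, which mirrors the blank cells you'd see in the tool.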