Wow, that is a nice little tool. Just installed it and tested on some random files in my current data analysis project.
By the way, I installed it using pipx (https://github.com/pipxproject/pipx) by running `pipx install visidata`. To also read HDF and Excel files, I added the necessary packages by running `pipx inject visidata h5py openpyxl`.
Thanks @rasmusei! If you are a data scientist, you might also be interested in how to use it alongside Jupyter. Our community has some documentation about that on our wiki here: https://github.com/OpenRefine/OpenRefine/wiki/Jupyter
My own experience: I had a lot of data to process, which I thought was the use case for a tool like this, but it took a long time, and it seemed to have to process all the data just to ingest it properly.
What underlying storage/tech is used? Is it all just web-stack?
It's been my go-to tool for almost a decade for figuring out what's in strange files, and working out how to ingest them. I learned about it from ProPublica.com; they used it in their series about payola to doctors from big pharma.
It's Java. You can adjust the JVM memory parameters to make it handle more data.
It has a disadvantage in the workplace: it's hard for some managers to believe it's local only. The name Google on it means, to them, that it must upload data to cloud servers owned by Google. Maybe the rebranding has reduced that hurdle.
Yes, the size of the data you can process is limited. I had reasonable results with something like 50 MiB after tweaking the JVM parameters.
One limitation I find annoying is that you can't easily strip your transformation from the data and apply it to similar but different data. It's all bundled together as "a project."
So incrementally transforming data works well; applying that transformation elsewhere, not really. That bothered me more than the size limitation, which I don't think limits most use cases with real “messy” data anyway. Maybe processing large-volume event/log data would still need something else.
I used it to look at a JSON export from a SaaS tool and to convert it to a table structure, and to clean field contents that followed certain conventions but evolved over time, things like that. For such use cases it's powerful.
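To give a concrete (made-up) illustration of that kind of cleanup: OpenRefine's expression editor can be switched to "Python / Jython", where `value` is the current cell and whatever you `return` replaces it. A minimal sketch, assuming a hypothetical status field whose convention changed between exports:

```python
# OpenRefine cell transform with the expression language set to "Python / Jython".
# `value` holds the current cell's content; the returned value replaces it.
# Hypothetical conventions: older exports stored "ACTIVE", newer ones "status: Active".
if value is None:
    return None
v = value.strip()
if v.lower().startswith("status:"):   # newer convention with a prefix
    v = v.split(":", 1)[1]
return v.strip().lower()              # normalize both conventions to "active"
```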
> So incrementally transforming data works well; applying that transformation elsewhere, not really. That bothered me more than the size limitation, which I don't think limits most use cases with real “messy” data anyway. Maybe processing large-volume event/log data would still need something else.
Agreed, this would be the one feature I would really need. It would be nice to be able to set up (and refine over time) pipelines to automatically clean up new data.
YES! We want users to be able to process much larger datasets too! We have started experimenting with Apache Spark on the backend, in the hope that it will let users work with much larger data. This work is being funded by CZI, and you can read the grant proposal here: http://openrefine.org/blog/2019/11/14/czi-eoss.html
A while back, I built an extension that lets you take the OpenRefine mappings and run them on a Hadoop cluster. The idea is that you run all transformations on a small dataset on your local machine; once you are satisfied with the mappings, you deploy the same on the Hadoop cluster. I have tested this on large datasets, and it works. Let me know how I can help.
You can check my GitHub: https://github.com/rmalla1/OpenRefine-HD
That author did not have Spark tuned well for the use case; this is a common issue with Spark. Since OpenRefine is commonly used with strings, we plan to optimize many areas for that, such as a few of the ones mentioned here: https://databricks.com/glossary/spark-tuning But in general, there are always tradeoffs when trying to provide immediate feedback for interactions. Since OpenRefine has many interactive features, some will need to support batching and advise the user in the interface that things will take longer... do you want to send it to batch? Some of the tradeoffs and the ways we plan to address them are mentioned in our general OpenRefine-on-Spark issue here: https://github.com/OpenRefine/OpenRefine/issues/1433
It's Java, but I don't know what's underneath that.
The web browser connects to a local server. You can increase the RAM allocated to the JVM, and I'm fairly sure I've worked on hundreds of thousands of rows, although I don't remember which operations I was using.
I've used it sparingly but recognize that I should be using it more. It is really powerful for cleaning up messy data. I'm not sure of the stack, but I did reach out to one of the developers about this being on Hacker News. Hopefully they can respond.
Follow up:
Looking over the old SIMILE site, I couldn't find the original project. David Huynh didn't mention it on his own website either, but some searching turned up the original project, "Parallax".
The SIMILE library is used in OpenRefine for certain faceting features like the timeline, clustering, etc. Parallax was created by David to show how time-series data visualizations could be enhanced. David was one of the original designers of OpenRefine, and I worked closely with him and Stefano in testing it.
You can use Python code for more advanced data manipulations and for creating plugins.
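For instance, here is a minimal sketch of a cross-column transform in the expression editor with the language set to "Python / Jython". The column names ("first_name", "last_name") are invented for the example, and the `cells[...]["value"]` access pattern is from memory, so check the docs:

```python
# OpenRefine expression ("Python / Jython"): `value` is the current cell,
# `cells` gives access to the other cells in the same row.
# Hypothetical columns "first_name" and "last_name" are combined into a display name.
first_cell = cells["first_name"]
last_cell = cells["last_name"]
first = first_cell["value"] if first_cell is not None else None
last = last_cell["value"] if last_cell is not None else None
if first and last:
    return "%s %s" % (first.strip().title(), last.strip().title())
return value  # leave the cell unchanged if either name part is missing
```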