You will already have this program in your Linux distro: I wrote it about a decade ago, and it has been maintained by a small team ever since. It also handles all the quirks of Excel CSV files and is used extensively in production.
I tried out OpenRefine previously on the large CSVs I have to deal with at work, largely on the recommendation of HN comments. On small to medium-sized stuff, it's pretty nice. But for larger sets (1GB+) it starts to slow down, and eventually it will fail to commit changes when the set is big enough.
The only answer I've found that consistently works quickly, with the ability to explore the data, is the killer combo of IPython Notebook + pandas' read_csv[0] + a lot of RAM -- a 10GB CSV on disk becomes ~20GB in memory (don't know why yet). When I say quick, I mean 10GB CSV, un-cached disk to RAM, in <5 min, including fuzzy parsing on dates.
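If it helps make that concrete, the call is roughly the following -- a minimal sketch, with the file name and date column invented for illustration:

    import pandas as pd

    # Hypothetical file/column names; parse_dates is what triggers the
    # flexible, dateutil-backed date parsing mentioned above.
    df = pd.read_csv("big_file.csv", parse_dates=["created"])
    df.info(memory_usage="deep")  # compare the in-memory size to the size on disk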
The nice part is, once you have things figured out, you can enable chunked reading to stay in-core on machines with lesser specs. Further, you can dump the pandas DataFrame to HDF and thereafter have ludicrous-speed IO & 'where' queries.
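A minimal sketch of that workflow, assuming a hypothetical file with a 'state' column to index on (chunk size and names are illustrative, not prescriptive):

    import pandas as pd

    # Read the CSV in chunks so each piece fits in RAM, appending to an HDF5
    # store as we go; data_columns enables on-disk 'where' queries later.
    store = pd.HDFStore("big_file.h5")
    for chunk in pd.read_csv("big_file.csv", parse_dates=["created"],
                             chunksize=1000000):
        store.append("data", chunk, data_columns=["state"])

    # Fast selective reads, without re-parsing the CSV.
    subset = store.select("data", where="state == 'ILLINOIS'")
    store.close()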
Still though, OpenRefine is much more turn key and feature rich.
These libraries should be able to work with data that large, though I can't say whether they meet your requirements yet. I'm not sure what exactly "10GB CSV un-cached disk to RAM in <5 min including fuzzy parsing on dates" means.
Namely, I don't know what you mean by fuzzy date parsing or what your output looks like afterwards. Perhaps I need to open IPython Notebook and import pandas ;)
I've been working to get Haskell approved at my place of employment for 1.5 years, but getting the US Gov't to change is rather hard; I've had to sit the "tech" people down and explain that javascript != java... With that baseline, explaining 'Why Haskell' is non-trivial. Come September, after 1.5 years of effort, I should have 'all the FOSS'. Until then I have to wait.
It's worth noting that IPython Notebook + pandas vs. cassava + conduits (even with IHaskell Notebook) serve very different ends. If I need to explore how to do something, I'd use Haskell. But I'm still in phase 2 (phase 1: collect underpants) and I've yet to find anything as powerful and flexible as the IPython Notebook + pandas + hdf5 stack that also just works. I can just move faster with that stack than with anything else I've ever seen. That being said, I'm knowingly deferring bugs to the runtime -- 'tis the cost of python.
If you're unfamiliar with pandas, the "quick vignette" here[0] is decent enough. The reason pandas is awesome, IMO, isn't actually because pandas is awesome (which it is) but because it's embedded in a full language. Julia, R, etc... can do the same stuff (maybe faster), but I wouldn't also want to program, say, a production web-app in them (though I have high hopes for Julia).
Fuzzy parsing on dates: pandas by default uses dateutil[1] which is both awesome and slow.
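In isolation that fallback looks something like this (dateutil ships as a pandas dependency; the fuzzy=True flag is its extra-lenient mode, shown here just for completeness -- read_csv doesn't pass it by default):

    from dateutil import parser

    parser.parse("2014-06-01")               # ISO-ish input
    parser.parse("June 1st, 2014 4:30 PM")   # prose-ish input
    parser.parse("updated on 06/01/14", fuzzy=True)  # ignores the surrounding words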
10GB CSV... : yeah... it's "fast for python" but pandas is admittedly doing a lot in that time, namely putting it into a data structure that is very friendly to time series analysis.
I'm curious what you mean by cassava + conduits (with the IHaskell Notebook) serving different ends. For instance, where would you use it in place of pandas/python/hdf5?
That depends on what you want to do with your junky input. And if it's junky enough, no tool is going to be able to divine what you want to do with the data until you do some manual cleanup. However, just to put it out there, csvkit has a utility, csvclean, that will split a file based on which lines are valid csv and which are not, allowing you to more quickly dive into what's wrong with the bad lines.
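To give a feel for it, csvclean's split is roughly the following idea, sketched with just the standard library (an illustration of the behaviour, not csvclean's actual implementation; file names are made up):

    import csv

    # Rows whose field count matches the header are treated as valid; the rest
    # go to a separate error file for manual inspection.
    with open("input.csv", newline="") as src, \
         open("input_out.csv", "w", newline="") as good, \
         open("input_err.csv", "w", newline="") as bad:
        reader = csv.reader(src)
        header = next(reader)
        good_rows, bad_rows = csv.writer(good), csv.writer(bad)
        good_rows.writerow(header)
        for row in reader:
            (good_rows if len(row) == len(header) else bad_rows).writerow(row)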
Grep understands text files, not CSV syntax. They're not equivalent. For example, a CSV document like the following has two records, but grep treats it as three lines:
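    name,address
    "Acme, Inc.","123 Main St
    Springfield, IL"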
That's correct, and as you illustrate it's the possibility of newlines and commas inside quoted fields that complicates things for grep/awk/cut/etc.
So instead of making more complex versions of tools like grep, we can make the data simple enough for those tools to understand. That's what https://github.com/dbro/csvquote does. It can be run in a pipeline before the grep stage, allowing grep/cut/awk/... to work with unambiguous field and record delimiters, and then it restores the newlines and commas inside the quoted fields at the end of the pipeline.
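The substitution it performs is roughly this, sketched in Python (not csvquote itself, just the idea; the stand-in characters here are arbitrary non-printing bytes):

    import csv
    import sys

    FIELD_STANDIN, RECORD_STANDIN = "\x1f", "\x1e"  # arbitrary non-printing bytes

    # Replace commas and newlines inside quoted fields with stand-ins so that
    # downstream line-oriented tools see exactly one record per line.
    writer = csv.writer(sys.stdout, lineterminator="\n")
    for row in csv.reader(sys.stdin):
        cleaned = [f.replace(",", FIELD_STANDIN).replace("\n", RECORD_STANDIN)
                   for f in row]
        writer.writerow(cleaned)

The restore pass at the other end of the pipeline is just the inverse replacement.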
Using GNU grep you'd need something like this:
grep -E '^ILLINOIS,' 2009.csv
Oh, wait, that doesn't work because the file was last saved by something which quotes it.
grep -E '^"?ILLINOIS"?,'
except that would produce unexpected results with 'ILLINOIS"' so it really needs to be something like:
grep -E '^(["]{0,1})ILLINOIS\1,'
Bear in mind that this is the simplest possible case and doesn't even touch on issues like quoting in the shell or needing to handle files which have embedded separators as data values (imagine what happens when our would-be grep data-miner needs to check the stats for `WASHINGTON, DISTRICT OF COLUMBIA`…).
In all but the most trivial cases it's safer and easier to use tools designed for the job. csvkit also has the very nice property that it's callable as a Python library, so when you outgrow a simple shell processing pipeline you can migrate your notes / scripts to a full Python program without having to retest everything related to basic file I/O, Unicode, etc., which you would otherwise need to do when switching readers.
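For comparison with the grep attempts above, the CSV-aware version of that filter needs no escaping gymnastics at all. A sketch with just the standard csv module (assuming, as in the example, that the state is the first column of 2009.csv):

    import csv

    # Quoted fields, embedded commas, and embedded newlines are handled by the
    # parser, so the comparison is against the field value, not the raw bytes.
    with open("2009.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        illinois = [row for row in reader if row[0] == "ILLINOIS"]

csvkit's csvgrep gives you the same thing as a command-line one-liner.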
(Bear in mind that the author works in data journalism – the target user is not a grizzled Unix expert but someone who has a bunch of CSV files and a full time job which is not shell scripting)
> And - without testing - I presume csvkit in Python is a bit slower than the GNU coreutils in C?
Perhaps, but it'd be unlikely to be noticeable for n less than millions on a remotely modern system – the Python regular expression engine has quite decent performance, and if that became an issue, PyPy will even JIT it for you. In the very few cases I've seen where there was a noticeable difference, it was always because the Python version was decoding Unicode and the shell tools were running with LC_ALL=C, which meant that corrupt data made it further before being caught and, in some cases, either failed to match all of the records or subtly corrupted things by not extracting all of a combined character, etc.
For the target use-case, however, all of this is likely to be many, many orders of magnitude less than the time most people would spend debugging regex patterns.
Yeah, I'm more sensitive to this class of errors now that I live in DC since the "Washington, District of Columbia" format is common enough to show up sporadically.
My examples anchored it to avoid matching outside of the expected field. That probably wouldn't matter for the easy data used in the csvkit example but, for example, suppose you were looking at business data and your search for California included records from every state with a California Pizza Kitchen, California Tortilla, etc. Or, the reverse, you want your search for Washington to include records from the state but not DC.
This class of error is somewhat treacherous since it's common for people not to notice it before they start working with a full data file. Using tools which don't require constant caution to avoid data errors is simply a basic good habit.
> So I personally just don't see the benefit from a less generalised version of coreutil tools
Others have pointed out specific examples, but the general reason (which I don't think I've seen explicitly articulated) is that CSV isn't strictly line-oriented -- for instance, records can span multiple "physical" lines, because line breaks within quoted strings are part of the field value, not record separators -- so line-oriented tools don't work with CSV in general. (They may work if you happen to have CSV where nothing inconsistent with a line-oriented view is going on, but that's luck, not something that is generally true of CSV.)
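A quick demonstration of the mismatch, with an invented two-record document:

    import csv
    import io

    doc = 'name,address\n"Acme, Inc.","123 Main St\nSpringfield, IL"\n'

    print(len(doc.splitlines()))                    # 3 -- the lines grep sees
    print(len(list(csv.reader(io.StringIO(doc)))))  # 2 -- the actual CSV records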
I'm an experienced Unix admin, and I actually wrote my own versions of these CSV tools long ago instead of using coreutils. I had a lot of different datasets to work with, and it turned out there were so many inconsistencies in the CSV data that I needed a tool that could handle it all.
It also became super simple to extract and munge data with options designed for tabular data. With two tiny commands I can exclude rows, include rows, sed columns, merge tables, and export new tables, all with just csv files. So it's pretty handy.
I'd love to add similar features to textql[1]. I'm currently working on a v2 that does a lot more, but right now it fills a gap that CSVKit doesn't: the direct execution of SQL on flat files.
[1]: https://metacpan.org/pod/distribution/App-fsql/bin/fsql
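For anyone who hasn't seen the SQL-on-flat-files pattern, the underlying idea is roughly this (not textql's implementation, just a standard-library sketch; file and column handling are illustrative):

    import csv
    import sqlite3

    # Load the flat file into an in-memory SQLite table, then run ordinary SQL
    # against it.
    conn = sqlite3.connect(":memory:")
    with open("2009.csv", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(f"CREATE TABLE data ({cols})")
        conn.executemany(f"INSERT INTO data VALUES ({marks})", reader)

    print(conn.execute("SELECT COUNT(*) FROM data").fetchone())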