Hacker News new | past | comments | ask | show | jobs | submit login
CSVKit: CSV utilities that includes csvsql, csvgrep, csvstat, and more (readthedocs.org)
103 points by ConceitedCode on May 27, 2014 | hide | past | favorite | 34 comments



I like fsql[1], as it allows me to combine multiple CSV, TSV, LTSV. JSON and YAML files as SQL tables and query against them. Very useful.

[1]: https://metacpan.org/pod/distribution/App-fsql/bin/fsql


csvtool:

https://forge.ocamlcore.org/plugins/scmgit/cgi-bin/gitweb.cg...

You will already have this program in your Linux distro since I wrote it about a decade ago and it has been maintained by a small team ever since then. It also handles all the quirks of Excel CSV files and is used extensively in production.


> You will already have this program in your Linux distro

Yum disagrees on CentOS 6.4.


It's actually in the package "ocaml-csv.x86_64" on CentOS 6.4, which does come up if you "yum search csvtool" (for me anyway)


Ah, apologies then. I actually saw that but completely discounted it because it looked like an OCaml library rather than a command line tool!


This works with yum for installing programs:

    yum install /usr/bin/csvtool


No, everyone must think they can start by splitting on commas and descending slowly into madness. That is the way of the programmer.


Also openrefine (formally google refine)

  http://openrefine.org/
is like a GUI version of csvkit.

It can do external look ups, fuzzy matching, and has its own programming languages Jython and GREL.


I tried out OpenRefine previously on the large CSV's I have to deal with at work largely on the recommendation of HN comments. On small stuff to medium size stuff, it's pretty nice. But for larger sets (+1GB) it starts to slow down and eventually will fail to commit changes when the set is big enough.

The only answer I've found to consistently work quickly, with the ability to explore the data, is the killer combo of iPython Notebook + pandas' read_csv[0] + a lot of RAM -- 10GB CSV on disk becomes ~20GB in memory (don't know why yet). When I say quick, I mean 10GB CSV un-cached disk to RAM in <5Min including fuzzy parsing on dates.

The nice part is, when you have things figured out, you can enable a chunked reading to get back in-core on machines of lesser specs. Further, you can dump the pandas DataFrame to HDF, thereafter having ludicrous-speed IO & 'where' queries.

Still though, OpenRefine is much more turn key and feature rich.

[0]http://pandas.pydata.org/pandas-docs/version/0.13.1/generate...


http://hackage.haskell.org/package/csv-conduit http://hackage.haskell.org/package/cassava

These libraries should be able to work with data that large, though I can't say whether they meet your requirements yet. I'm not sure what exactly "10GB CSV un-cached disk to RAM in <5Min including fuzzy parsing on dates".

Namely, I don't know what you mean by date fuzzy parsing or what your output looks like after. Perhaps I need to open ipython notebook and import pandas ;)


I've been working to get Haskell approved at my place of employment for 1.5 years but getting the US Gov't to change is rather hard; I've had to sit the "tech" people down and explain that javascript != java... with that baseline, explaining 'Why Haskell' is non-trivial. Come September, after 1.5 years of effort, I should have 'all the FOSS'. Until then I have to wait.

It's worth noting that iPython Notebook + pandas vs. cassava + conduits (even with iHaskell Notebook) serve very different ends. If I need to explore how to do something, I'd use Haskell. But I'm still in phase 2 (phase 1: collect underpants) and I've yet to find anything as powerful and flexible as the iPython Notebook + pandas + hdf5 stack that also just works. I can just move faster with that stack than anything else I've ever seen. That being said, I'm knowingly deferring bugs to the runtime -- 'tis the cost of python.

If you're unfamiliar with pandas, the "quick vignette" here[0] is decent enough. The reason pandas is awesome, IMO, isn't actually because pandas is awesome (which it is) but because it's embedded in a full language. Julia, R, etc... can do the same stuff (maybe faster), but I wouldn't also want to program, say, a production web-app in them (though I have high hopes for Julia).

Fuzzy parsing on dates: pandas by default uses dateutil[1] which is both awesome and slow.

10GB CSV... : yeah... it's "fast for python" but pandas is admittedly doing a lot in that time, namely putting it into a data structure that is very friendly to time series analysis.

[0] http://pandas.pydata.org/ [1] https://labix.org/python-dateutil


Wow! I wasn't expecting such a response! Awesome. I'll definitely have to check this out. About dateutil... actually I used it at work today.

As for Haskell and fuzzy dates:

http://hackage.haskell.org/package/dates

I'm curious what you mean by cassava + conduits with Haskell notebook serve different ends. For instance where would you use it in place of pandas/python/hdf5?


I'm going to continue to throw it out there when anybody mentions csvkit, but I can't believe that more people haven't settled on http://csvfix.byethost5.com/csvfix15/csvfix.html


Why? I know and love csvkit, but never heard of csvfix.


It looks like csvfix has more features.


And no python dependency (although that hardly matters.)

edit: This is a better link http://neilb.bitbucket.org/csvfix/


Do any of y'all have an opinion on which of the tools mentioned here is the most robust to junky input?


That depends on what you want to do with your junky input. And if it's junky enough, no tool is going to be able to divine what you want to do with the data until you do some manual cleanup. However, just to put it out there, csvkit has a utility, csvclean, that will split a file based on which lines are valid csv and which are not, allowing you to more quickly dive into what's wrong with the bad lines.


That alone would be helpful. Thanks!


I'm not totally sure why csvgrep csvsort csvcut csvjoin are needed as replacements for grep sort, cut or join?

The syntax doesn't seem noticeably clearer?

And - without testing - I presume csvkit in Python is a bit slower than the GNU coreutils in C?


Grep understands text files, not CSV syntax. They're not equivalent. For example, the following CSV document has two lines, but grep treats them as three:

  1,2,5,"Hello"
  3,4,5,"How are
  you doing?"


That's correct, and as you illustrate it's the possibility to have newlines and commas inside quoted fields that complicates things for grep/awk/cut/etc.

So instead of making a more complex version of tools like grep, we can make the data simple for these tools to understand. That's what https://github.com/dbro/csvquote does. It can be run in a pipeline before the grep stage, and allow grep/cut/awk/... to work with unambiguous field and record delimiters. Then it can restore the newlines and commas inside the quoted fields at the end of the pipeline.


> The syntax doesn't seem noticeably clearer?

From the very first example:

    csvgrep -c 1 -m ILLINOIS 
using coreutils grep you'd need something like this:

    grep -E '^ILLINOIS,' 2009.csv 
Oh, wait, that doesn't work because the file was last saved by something which quotes it.

    grep -E '^"?ILLINOIS"?,'
except that would produce unexpected results with 'ILLINOIS"' so it really needs to be something like:

    grep -E '^(["]{0,1})ILLINOIS\1,'
Bear in mind that this is the simplest possible case and doesn't even touch on issues like quoting in the shell or needing to handle files which have embedded separators as data values (imagine what happens when our erstwhile grep data-miner needs to check the stats for `WASHINGTON, DISTRICT OF COLUMBIA`…).

In all but the most trivial cases it's safer and easier to use tools designed for the job. csvkit also has the very nice property that it's callable as a Python library so when you outgrow a simple shell processing pipeline you could migrate your notes / scripts to a full Python program without having to retest everything related to basic file I/O, unicode, etc. which you would otherwise need to do when switching readers.

(Bear in mind that the author works in data journalism – the target user is not a grizzled Unix expert but someone who has a bunch of CSV files and a full time job which is not shell scripting)

> And - without testing - I presume csvkit in Python is a bit slower than the GNU coreutils in C?

Perhaps but it'd be unlikely for it to be noticeable for n less than millions on a remotely modern system – the Python regular expression engine has quite decent performance and if that became an issue, PyPy will even JIT them for you. In the very few cases I've seen where there is a noticeable difference it was always because the Python version was decoding Unicode and the shell tools were running with LC_ALL=C, which meant that corrupt data made it further before being caught and, in some cases, either failing to match all of the records or subtly quietly things by not extracting all of a combined character, etc.

For the target use-case, however, this is likely all to be many, many orders of magnitude less than the time most people would spend debugging regex patterns.


I take you're point and those other well made points about line continuations below -- but you're over egging it here as

cut -d, -f1 2010.csv | grep ILLINOIS

works just fine for this data. I emphasise again - I take your overall point though.


Yeah, I'm more sensitive to this class of errors now that I live in DC since the "Washington, District of Columbia" format is common enough to show up sporadically.


I also don't really understand why you want to grep the whole block instead of the single word 'ILLINOIS'.


My examples anchored it to avoid matching outside of the expected field. That probably wouldn't matter for the easy data used in the csvkit example but, for example, suppose you were looking at business data and your search for California included records from every state with a California Pizza Kitchen, California Tortilla, etc. Or, the reverse, you want your search for Washington to include records from the state but not DC.

This class of error is somewhat treacherous since it's common for people not to notice it before they start working with a full data file. Using tools which don't require constant caution to avoid data errors is simply a basic good habit.


They're not replacements for those tools. They're new tools that work with a specific tabular data format.


The data I come across is a mix of comma, tab, and space delimited though predominantly what the commenter above would call "line oriented".

So I personally just don't see the benefit from a less generalised version of coreutil tools

... but fair does I'm sure there are plenty who have use for a more specific tool.


> So I personally just don't see the benefit from a less generalised version of coreutil tools

Others have pointed out specific examples, but the general reason (which I don't think I've seen explicitly articulated) is that CSV isn't strictly line oriented -- for intsance, records can span multiple "physical" lines, because line breaks within quoted strings are part of the field value, not separate records -- so line-oriented tools don't work with CSV generally (they may if you happen to have CSV where nothing inconsistent with a line-oriented view happens to be done, but that's luck not something that is generally true of CSV.)


I'm an experienced Unix admin, and I actually wrote my own versions of these csv tools long ago instead of using coreutils. I had a lot of different datasets to with with, and it turned it there were so many inconsistencies in the CSV data that I needed a tool that could handle it all.

It also became super simple to extract and munge data with options for tabular data. With two tiny commands I can exclude rows, include rows, sed columns, merge tables, and export new tables, all with just csv files. So it's pretty handy.


And don't forget AWK!

I've been playing with these old school tools lately on medium sized tabular data files and I've been impressed by the performance and capabilities.


Would be nice if a tool works on querying markdown/org-mode tables... correlations, averages and sums would be nice too.


I'd love to add similar features to textql[1]. am currently working on a v2 that does a lot more, but right now it fills a gap not in CSVKit, which is the direct execution of SQL on flat files.

[1]: https://github.com/dinedal/textql




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: