
> The syntax doesn't seem noticeably clearer?

From the very first example:

    csvgrep -c 1 -m ILLINOIS

Using coreutils grep, you'd need something like this:

    grep -E '^ILLINOIS,' 2009.csv

Oh, wait, that doesn't work because the file was last saved by something which quotes it.

    grep -E '^"?ILLINOIS"?,'

Except that would produce unexpected results with 'ILLINOIS"', so it really needs to be something like:

    grep -E '^(["]{0,1})ILLINOIS\1,'

Bear in mind that this is the simplest possible case and doesn't even touch on issues like quoting in the shell or needing to handle files which have embedded separators as data values (imagine what happens when our erstwhile grep data-miner needs to check the stats for `WASHINGTON, DISTRICT OF COLUMBIA`…).
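
To make the quoting and embedded-separator problem concrete, here's a rough sketch using Python's standard-library `csv` module (not csvkit itself). The sample data and the `rows_where_first_column_is` helper are made up for illustration; the point is only that a real CSV parser unquotes fields and keeps an embedded comma inside its field, so an exact comparison on the first column does what you meant:

```python
import csv
import io

# Hypothetical sample data; the quoting mimics what a spreadsheet export might produce.
SAMPLE = '''State,Total
"ILLINOIS",100
"WASHINGTON, DISTRICT OF COLUMBIA",7
WASHINGTON,42
'''

def rows_where_first_column_is(text, value):
    """Return rows whose first field equals `value` exactly, after CSV unquoting."""
    reader = csv.reader(io.StringIO(text))
    return [row for row in reader if row and row[0] == value]

# The quoted ILLINOIS row is found without any regex gymnastics.
print(rows_where_first_column_is(SAMPLE, "ILLINOIS"))
# The state row is found; the DC row, with its embedded comma, is not.
print(rows_where_first_column_is(SAMPLE, "WASHINGTON"))
```

No regex, no shell quoting, and the DC row stays a single field.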

In all but the most trivial cases it's safer and easier to use tools designed for the job. csvkit also has the very nice property that it's callable as a Python library so when you outgrow a simple shell processing pipeline you could migrate your notes / scripts to a full Python program without having to retest everything related to basic file I/O, unicode, etc. which you would otherwise need to do when switching readers.

(Bear in mind that the author works in data journalism – the target user is not a grizzled Unix expert but someone who has a bunch of CSV files and a full-time job which is not shell scripting.)

> And - without testing - I presume csvkit in Python is a bit slower than the GNU coreutils in C?

Perhaps, but it's unlikely to be noticeable for n less than millions on a remotely modern system: the Python regular expression engine has quite decent performance and, if that became an issue, PyPy will even JIT them for you. In the very few cases I've seen where there was a noticeable difference, it was always because the Python version was decoding Unicode and the shell tools were running with LC_ALL=C. That meant corrupt data made it further before being caught and, in some cases, the shell tools either failed to match all of the records or subtly broke things by not extracting all of a combined character, etc.
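
A small sketch of the byte-versus-text difference I mean, in plain Python. The record and the `decodes_cleanly` helper are hypothetical, and this only imitates what a byte-oriented tool might do; it isn't a claim about how any particular shell tool or csvkit behaves internally:

```python
# Hypothetical record: the final letter is "e" followed by a combining acute accent (U+0301).
record = "Caf" + "e\u0301"
raw = record.encode("utf-8")  # "Cafe" is 4 bytes, the combining accent is 2 more

def decodes_cleanly(data):
    """Return True if the bytes are valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A byte-oriented cut after 4 bytes is still valid UTF-8, but the accent is silently lost.
accent_dropped = raw[:4].decode("utf-8")

# A cut after 5 bytes lands inside the accent; only decoding notices the damage.
half_character = raw[:5]

print(accent_dropped == record)         # the text no longer matches the original
print(decodes_cleanly(half_character))  # the corruption shows up at decode time
```

A tool that never decodes will happily pass both results downstream, which is the "made it further before being caught" part.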

For the target use-case, however, all of this is likely to be many, many orders of magnitude less than the time most people would spend debugging regex patterns.




I take your point, and those other well-made points about line continuations below, but you're over-egging it here, as

    cut -d, -f1 2010.csv | grep ILLINOIS

works just fine for this data. I emphasise again - I take your overall point though.


Yeah, I'm more sensitive to this class of errors now that I live in DC since the "Washington, District of Columbia" format is common enough to show up sporadically.


I also don't really understand why you want to grep the whole block instead of the single word 'ILLINOIS'.


My examples anchored it to avoid matching outside of the expected field. That probably wouldn't matter for the easy data used in the csvkit example but, for example, suppose you were looking at business data and your search for California included records from every state with a California Pizza Kitchen, California Tortilla, etc. Or, the reverse, you want your search for Washington to include records from the state but not DC.
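
Here's a toy version of the California case, again with Python's standard-library `csv` module rather than csvkit. The business names, the column layout, and both helper functions are invented for the example; it just contrasts an unanchored substring search with a comparison against one named column:

```python
import csv
import io

# Hypothetical business listing; names and states are made up for illustration.
SAMPLE = """Business,State
California Pizza Kitchen,Illinois
Golden Gate Bakery,California
California Tortilla,Maryland
"""

def substring_hits(text, needle):
    """Roughly what an unanchored grep does: keep any line containing the needle."""
    return [line for line in text.splitlines() if needle in line]

def field_hits(text, column, value):
    """Keep only records whose named column equals the value exactly."""
    return [row for row in csv.DictReader(io.StringIO(text)) if row[column] == value]

print(len(substring_hits(SAMPLE, "California")))       # counts every line mentioning California
print(len(field_hits(SAMPLE, "State", "California")))  # counts only records in that state
```

With three rows the mismatch is obvious; with a few hundred thousand it's the sort of thing you find out about after publishing.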

This class of error is somewhat treacherous since it's common for people not to notice it before they start working with a full data file. Using tools which don't require constant caution to avoid data errors is simply a basic good habit.



