Using command line to process CSV files (2022) (muhammadraza.me)
161 points by mr_o47 on June 28, 2023 | 110 comments



Unfortunately "awk -F," (field separator of comma) doesn't work with most real CSV files, because of quoted fields, commas in fields, and (less frequently) multiline fields. My GoAWK implementation has a CSV mode activated with "goawk -i csv" (input mode CSV) and some other CSV features that properly handle quoted and multiline fields: https://benhoyt.com/writings/goawk-csv/

The frawk tool (written in Rust) also supports this.

Interestingly, Brian Kernighan is currently updating the book "The AWK Programming Language" for a second edition (I'm one of the technical reviewers), and Gawk and awk are adding a "--csv" option for this purpose. So real CSV mode is coming to an AWK near you soon!


This is a common theme with classic Unixy text-processing workflows: they work really nicely when the incoming data is nice and simple, but fall apart on edge cases. And because error handling is often incomplete, it can be difficult to know that the pipeline is producing garbage output.

A related example is the recently discussed problem of filenames, and the many pitfalls that scripts can stumble upon when handling them.


> but fall apart on edge cases.

The problem is that edge cases are not really edge cases… in certain locales. ;) Someone could be using commas as the decimal separator since that’s how they do it ’round these parts.[1][2] So it’s not like they’re storing “weird” stuff in tabular format, like HTML blog articles; they’re just storing numbers.

Of course, Excel and other software have tried to be way too clever about it, with different separators depending on locale, which makes the whole pipeline complex and context-dependent.

[1] A surprising number of countries have this convention, including my own… doesn't make sense to me since it becomes harder to list numbers.

[2] And the number one selling point of CSV is that it is “human editable”, so of course someone is going to do that at some point in the pipeline.


> And because often also the error handling is incomplete it might be difficult to know that the pipeline is producing garbage output.

The thing is that from the point of view of each individual tool, there isn't even an error! You ask awk to split on commas and give you the second field, it splits on commas and gives you the second field. It opened the file no problem, it didn't run out of memory writing its output to stdout... There were no errors to speak of from what it could see. The "error" is an error in the overall logic of the code.


That is correct: the data needs to be simple, with the delimiter characters never embedded inside a quoted field. I wrote a simple (and fast) utility to ensure that CSV files are handled properly by all the standard UNIX command-line data tools. If you like using awk, sed, cut, tr, etc., then it may be useful to you.

<https://github.com/dbro/csvquote>

Using it with the first example command from this article would be

  csvquote file.csv | awk -F, '{print $1}' | csvquote -u
By using the "-u" flag in the last step of the pipeline, all of the problematic quoted delimiters get restored.


Another problem is that CSV isn't really a text format. It's a binary format that just happens to look like text, and most implementations will handle it if you edit it as text. But the record separator is meant to be specifically CRLF, and there might be other complications. This can be confusing when using these tools to try to output another CSV. Do the `--csv` modes also support writing back to CSV?


No, unfortunately the Gawk/awk --csv option doesn't support writing back to CSV. FWIW, GoAWK supports CSV input mode with "-i csv" and CSV output mode with "-o csv" for this purpose, as well as special syntax for fetching a @"named-field" rather than a $numbered_field. Docs: https://github.com/benhoyt/goawk/blob/master/docs/csv.md
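Rough sketch of what that looks like (field names are hypothetical; -H reads the header row so the @"name" syntax works):

  goawk -i csv -H -o csv '{ print @"name", @"email" }' people.csv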


2nd edition?! I just purchased a copy of the first and it's fantastically concise. I've always had a great experience with books Kernighan has been a part of, and it's awesome to see that the book's team are domain experts. Looking forward to it!


The first edition has been online for free for some time, in case you want to share it or use it in cases where the hardcopy is less convenient.

https://archive.org/details/pdfy-MgN0H1joIoDVoIC7


A second edition is coming?? Awesome!


I think “most” is doing a lot of work there. I make and process CSVs a lot, and just avoid quotes when generating them.

Sure, technically you can do things like having newlines inside a field, which causes all sorts of grief, but if you control the end-to-end workflow, just avoid it.

I have a tool which monitors the performance of loading various websites via various proxies at various points on our network with Puppeteer. That outputs a CSV which I can then extract data from and make handy graphs to show manglement.


> avoid quotes when generating them

Yeah, I guess that kinda works if you're the one generating them. But you must not have too many generalised strings -- like a name field containing "Smith, John" or a book title like "The Lion, the Witch and the Wardrobe"? When I deal with CSVs, those kinds of things are quite common. I actually prefer TSV because you can (more easily) process them with standard AWK.

What do you output for a string field with ',' in it? Just strip the commas, or replace it with something else?


I deal with IP addresses and numbers, maybe hostnames, and CSVs are perfect for processing those.


Quotes aren't the problem, it's having an in-band separator character. Maybe you're lucky enough to only need naive splitting on commas, but lots of us need to jam in human-produced text strings where the punctuation needs to be preserved without breaking file parsing.


An FPAT declaration is a workaround for ignoring commas inside quotes (example below):

https://stackoverflow.com/questions/43477652/ignoring-comma-...
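For example, the pattern from the Gawk manual (file name hypothetical; as noted in the reply below, it won't help with multiline fields):

  gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print $2 }' file.csv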


That deals with commas but (since AWK is line-oriented) I doubt that deals with multi-line fields?


Also: localized CSV files in a language where the comma is the decimal separator (and the field separator is ";"), column headers, and unexpected character encodings. CSV is truly the worst.


For any kind of tabular data, I use DuckDB (parallel CSV reads, works on small to humongous CSVs). It supports full Postgres-like SQL, so you can do arbitrarily complex manipulations on columns and rows. The CSV essentially becomes a dataframe object that can be manipulated in performant ways (DuckDB vectorizes and parallelizes, so it's way faster than awk on large CSV files). You can even manipulate multiple CSVs in a single statement (JOINs, UNIONs, etc.).

  % duckdb -c "from 'test.csv'"
  ┌───────┬───────┬───────┐
  │   A   │   B   │   C   │
  │ int64 │ int64 │ int64 │
  ├───────┼───────┼───────┤
  │     1 │     2 │     3 │
  │     4 │     5 │     6 │
  │     7 │     8 │     9 │
  └───────┴───────┴───────┘

  % duckdb -c "select sum(C) from 'test.csv'"
  ┌────────┐
  │ sum(C) │
  │ int128 │
  ├────────┤
  │     18 │
  └────────┘

  % duckdb -c "select sum(A + 2*B + C^2) from 'test.csv'"
  ┌────────────────────────────────┐
  │ sum(((A + (2 * B)) + (C ^ 2))) │
  │             double             │
  ├────────────────────────────────┤
  │                          168.0 │
  └────────────────────────────────┘


Yeah, SQLite has been my preferred method for working with CSV files for a long time. I will say that one nice thing about DuckDB over SQLite is that it does a better job of inferring column types than SQLite does.
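For comparison, a minimal SQLite one-liner (file and table names hypothetical; without a pre-created table, .import types every column as TEXT, which is exactly the type-inference gap):

  sqlite3 -cmd '.mode csv' -cmd '.import test.csv t' :memory: 'SELECT count(*) FROM t;'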


I do use SQLite to process CSV files too, but I either rely on csvkit to generate the schema from the CSV, or create the table with raw SQL before inserting data into the SQLite database.


IIRC it's not possible to read CSV in a parallel manner if you're parsing true to spec?


I think you can use a heuristic approach. E.g. pick a comma in the middle of the CSV, assume it is a delimiter (not inside data) and parse from there in a separate thread. If you find it doesn't parse correctly (perhaps because it is inside data) you might have to throw the results of that thread away. Kind of ugly but would probably allow faster parsing in 99% of cases.

CSVs are a bit of a mess. Non-trivial to parse properly, due to escaping, and the standard (such as it is) is poorly followed. E.g. sometimes people use semi-colon or pipe instead of comma.

See also: 'Why isn’t there a decent file format for tabular data?' https://news.ycombinator.com/item?id=31220841


I've found that for tabular data Parquet can't be beat; the only time textual formats are superior is for human readability (which IMO can be solved by tooling).


It is useful to be able to create, view or edit data in a standard text editor though.


Edits might be a little tricky; there's no concept of writing in place in Parquet, AFAICT. But there's absolutely no reason why reads/creation can't be supported in text editors (I'm sure some do).


I often have to create test data files a few rows long. Very easy to do for a CSV file.


Absolutely.

This is also my use case where I have GBs of data in S3 as Parquet and a small CSV file that I have to join them to.

And DuckDB reads CSVs and Parquet and SQLite and others. I can join all these heterogeneous data types in a single SQL statement and have the assurance that it’ll be done correctly.

I believe clickhouse-local can do the same.


My understanding is that the CSV is read in chunks, and then each chunk is parsed in a different thread.

This is the description in the experimental PR (parallel CSV has since merged into the main branch)

https://github.com/duckdb/duckdb/pull/5194


Sounds like a good way to fail to parse all the possible files properly.


How so?


Because you read a \n or a comma but you don't know if there was an open quote before the block you're currently reading.


You can't read the SAME CSV in a parallel manner, but you can still read two different CSVs in two separate threads? I think that's what they meant by "parallel CSV".


I didn't know that, and my Google-fu is lacking; does anyone have a resource I could read to understand that limitation?


I think the fundamental issue is that a single 'cell' (value) can contain newlines, so you can't just assume each line is a 'row' and trivially parallelize.

The problem is perhaps most obvious when you consider that the value of a single cell could itself be a CSV, and this could be recursive


I haven’t worked out if the BNF is unambiguous (RFC 4180) but if you have a CSV in a cell you would escape any ambiguous characters.

An unescaped cell within a CSV will break most CSV parsers in the world even with recursion handling.

https://datatracker.ietf.org/doc/html/rfc4180


It is kinda sad that one has to go through all the database technologies just in order to read a csv file nowadays.


With raw string parsing approaches (awk, perl) you have to handle all kinds of corner cases because CSV is underspecified. CSV is deceptive in that it looks like it parses easily, but the truth is CSV is one of the most difficult formats to parse reliably. I mean, just look at the number of arguments that pandas.read_csv has.

https://pandas.pydata.org/docs/reference/api/pandas.read_csv...

Parsing a CSV by hand is almost always the wrong thing to do unless you also wrote the CSV generating process and can limit the variance.

But if not, instead of working with CSVs at the string level and risk running into issues, I've found that the best practice is to farm the CSV parsing out to a battle-tested engine (I use DuckDB, but sqlite, pandas, polars, clickhouse-local or any number of engines work), and work with the parsed object instead.

Better yet, avoid CSV and use DuckDB + Parquet (paired with visidata for quick vizes) for anything nontrivial.
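For example, a one-time conversion and then querying the Parquet from then on (file names hypothetical):

  duckdb -c "COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT PARQUET)"
  duckdb -c "SELECT count(*) FROM 'data.parquet'"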


With Perl, I wouldn't do raw string parsing; I'd use a Text::CSV (a battle-tested engine). For example, this filters a file foo.csv where `some_header` field is positive (you could also use sort or map or whatever instead of grep):

  perl -MText::CSV=csv -e'my $aoh = csv(in=>"foo.csv",headers=>"auto",keep_headers=>\my @h); @$aoh = grep { $_->{some_header} > 0 } @$aoh; csv(in=>$aoh,out=>*STDOUT,headers=>\@h)'


You don't have to, but you have to admit DuckDB handles CSVs quite a bit more intuitively than awk.


You don't have to, but it makes it easier. New tools that help my work make me happy.


Have a look at visidata, a TUI table viewer/editor with vim keybindings. It's incredibly powerful and fully automatable - everything you do gets recorded as a series of commands which you can save and replay.

Not the right tool for everything, but it shines for quickly glancing at the shape of tabular data and making sense of it. Sorting, filtering, joins across files, column histograms and even column splitting/rejoining are all keystrokes away.

It groks anything even remotely table-shaped, like CSV, JSONL, JSON, even Excel files; it can even directly connect to databases and parse tables out of HTML.

https://www.visidata.org/


VisiData is amazing. It's one of the coolest open source projects I've seen in years. I use it extensively in my research.


An intro to the SQL interface (vdsql) can be seen here: https://www.visidata.org/blog/2022/connect-visidata-to-sql-d...


Also, there is the VisiData Tutorial by Jeremy Singer-Vine. This is a very helpful introduction to VisiData.

https://jsvine.github.io/intro-to-visidata/index.html

I am a big fan of VisiData for working with data in the terminal. I also like John Kerl's miller[1] and Simon Willison's sqlite-utils [2] when I want to do queries. These are useful for working with data in simple shell scripts.

[1] https://github.com/johnkerl/miller [2] https://github.com/simonw/sqlite-utils


Personally, I use xsv and it’s been tremendously helpful, especially when working with larger files. https://github.com/BurntSushi/xsv
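A few of the subcommands I reach for, for anyone new to it (column names hypothetical):

  xsv headers big.csv
  xsv select name,email big.csv | xsv frequency -s email | xsv table
  xsv stats big.csv | xsv table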


xsv is great for quick sanity checks (i.e. number of columns, unique value counts in a given column), but for more serious tasks/giant files I switch to either Polars or DuckDB, converting CSV/TSV files to Parquet or Parquet datasets.

By giant I mean 25G gzipped files with >10^9 rows like these VCFs: https://ftp.ncbi.nlm.nih.gov/snp/latest_release/VCF/


Maybe try the to [1] and sqlp [2] commands from the qsv fork. From the README:

sqlp: Run blazing-fast Polars SQL queries against several CSVs - converting queries to fast LazyFrame expressions, processing larger than memory CSV files.

to: Convert CSV files to PostgreSQL, SQLite, XLSX, Parquet and Data Package.

[1] https://github.com/jqnatividad/qsv/blob/master/src/cmd/sqlp....

[2] https://github.com/jqnatividad/qsv/blob/master/src/cmd/to.rs...


I shared the qsv fork [1] yesterday which is more active.

xsv is more lean while qsv tries to support every action that you might want to perform on CSV files.[2]

[1] https://github.com/jqnatividad/qsv

[2] https://github.com/jqnatividad/qsv/discussions/290#discussio...


Cool stuff! But it's criminal not to call attention to jq's elder sibling csvkit. It's invaluable for playing with CSVs. Much easier to parse out columns, and it lets you generate new CSVs and even merge them. More importantly, it allows SQL on CSVs (via SQLite, IIRC), which empowers all sorts of CSV shenanigans. The bash scripting this enables is incredible (good and bad).

https://csvkit.readthedocs.io/en/latest/
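A hedged example of the kind of pipeline it enables (column and file names hypothetical):

  csvcut -c name,email data.csv | csvgrep -c email -m example.com | csvlook
  csvsql --query "SELECT name, count(*) FROM data GROUP BY name" data.csv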


I agree, it's indeed one of the best tools for processing CSV files. Using basic coreutils commands quickly becomes very limited and brittle for processing CSV files.


As an SRE, I used to install one tool for every data format until I came across clickhouse-local. Now I use it for everything.

- It works with every data format I come across in my daily job. It supports formats like Protobuf, Avro, and Cap'n Proto, which regular tools don't. Funny thing, it can even read MySQL dumps.

- I can read data stored locally or in remote locations like HTTP, S3, GCS, Azure, and whatever other location I can think of.

- I can run SQL queries on the raw data and improve my SQL skills on a daily basis (example below).
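A sketch of the kind of query I mean (file name and format are hypothetical):

  clickhouse-local --query "SELECT count(*) FROM file('data.csv', CSVWithNames)"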


clickhouse.com is down for me, so for those looking for it the github repo[1] may be more useful

[1] https://github.com/ClickHouse/ClickHouse/tree/master#readme


We bought the clickhouse.com domain two years ago, and it was listed in some adblockers due to abuse by the previous owner. We have cleaned up most of the block lists, but there is still a chance it is filtered by your ad-blocker.


Great tip, thank you.


None of this stuff works.

Whenever CSV comes up I always feel a bit sad. ASCII includes a set of four out-of-band delimiters[0] that can be used instead of silly formats like CSV that use in-band delimiters which necessitate complicated quoting rules.

You can't just treat CSV as text. It's not, and woe betide you if you ever use this stuff in a script instead of a proper CSV parsing tool. If we used the ASCII delimiters instead, it would be possible to treat it as text, and stuff like this would work.

[0] https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text
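To illustrate, a sketch using the unit and record separators with plain awk (octal 037 is the unit separator, 036 the record separator; the ".usv" file name is just made up):

  printf 'Smith, John\037NYC\036Doe, Jane\037Paris\036' > people.usv
  awk 'BEGIN { FS = "\037"; RS = "\036" } { print $1 }' people.usv

No quoting rules needed, because the separators can't appear in the data.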


Preach; the POSIX standard also understands this, and partitions control characters into portable and non-portable ones; the ASCII field/record/group delimiters are designated as non-portable, and therefore can be safely used to delimit portable character data. The fact that the whole file is then non-portable is a feature, not a bug; if you need to pack an entire table into a field, encode it in some portable encoding (eg base64) and add a decoder field.

The "arguments" against this are absurd; mainly that the resulting file is not plain text and cannot be read/modified immediately by plain text editors. This ignores the fact that plain text editors already deal with control characters because they have to decide what to do with <tab>, and makes a mountain out of a molehill when the in-band horrors of escaping delimiters create trouble every single day.


> The fact that the whole file is then non-portable is a feature, not a bug; if you need to pack an entire table into a field, encode it in some portable encoding (eg base64) and add a decoder field.

I guess you could:

1. Use unit-separator and record-separator for text

2. Use group-separator for (1)

3. Use file-separator for (2)

> The "arguments" against this are absurd; mainly that the resulting file is not plain text and cannot be read/modified immediately by plain text editors. This ignores the fact that plain text editors already deal with control characters because they have to decide what to do with <tab>, and makes a mountain out of a molehill when the in-band horrors of escaping delimiters create trouble every single day.

Could not agree more.


I agree. And some tools support it, e.g. Miller.[1]

There are a lot of power tools for CSV. All they need to do to support this format is to support it as input and output. It should be possible for example to add support to Csvkit[2] as input/output ("format").

Just based on HN comments alone (these things always come up) there is a decent demand for this kind of format. And there are a few implementations out there. But the implementations often look more like standards or proofs of concepts rather than full-fledged implementations. (A full-fledged implementation should probably provide functions/programs to convert to and from CSV as described in RFC-4180.)

We could make this happen.

[1] https://news.ycombinator.com/item?id=31258316

[2] https://csvkit.readthedocs.io/en/latest/scripts/in2csv.html


Yes, has been discussed before. https://news.ycombinator.com/item?id=31220841

The only thing needed for this to work really well is for text editors to treat the end-of-record character like a carriage return (I'm not aware of any that do).


It's worth noting that a lot of these tricks will break on non-trivial CSV inputs, e.g. ones that contain escaped commas.

I like using the shell (and Awk in particular!) as much as anyone, but for CSV I tend to reach for Python's standard csv module[1].

[1]: https://docs.python.org/3/library/csv.html


A thing Python's csv module enables that many others don't is ragged CSVs (column counts not all the same).


Nushell is also quite powerful for this. For example:

  > open people.csv | where status == 'customer' | unique-by email | select surname forename email | sort-by email | save customers.json
https://www.nushell.sh/


My favorite tool is clickhouse-local: https://clickhouse.com/blog/extracting-converting-querying-l...

It is the most powerful (works with any formats, with remote datasets, and supports all the ClickHouse SQL) and the most performant.


Agreed here, it's impressive how easy it is to use and how performant it is.

Would be nice to add a seamless ability to call executable UDFs from clickhouse-local, last time I checked clickhouse-local required executables to be in a special directory (as in proper clickhouse). Instead it'd be nice to be able to reference any executable in an ad-hoc way.


In my day job, a customer would ask me to reconcile between 2 giant CSVs very often. Think GBs.

When I tell them to do it in Excel by themselves, they would say Excel couldn't open a CSV larger than 1M rows...

At first, I was using sqlite through shell. I hated it so much that I built a desktop app on top of it. It wasn't slick enough with all the typing.

It is quite a joy to use, and I'd love for people to try it out: https://superintendent.app (disclaimer: I'm the creator).


> When I tell them to do it in Excel by themselves, they would say Excel couldn't open a CSV larger than 1M rows...

Sure it can. Just use the "Import from Text/CSV" feature instead of directly opening the CSV. https://support.microsoft.com/en-us/office/what-to-do-if-a-d...


I was a user of superintendent until about a year ago (it doesn't make sense for my current job), and I liked it a lot. Tanin was responsive by email, and the tool was very helpful for standard CSV tasks. Well worth the price for me. If I ever need to do those kinds of tasks again, I'll gladly become a customer again.


Really appreciate your kind words! I'm glad we crossed paths, though based on your username I can't tell who you are. haha.

To be honest, I'm still trying to find more of a product-market fit.

It feels like a travel planning app where people might need it once every few months, so it doesn't quite catch on.

On another hand, I do have a few passionate users who analyze 40+ CSVs at the same time, but I'm struggling to improve UX for these users. Building great UI is hard. But maybe this is what I need to do.


You’re hiding the price quite well (although at least you’re not hiding the fact that it has a price)

It’s $40/year for those wondering.


I wasn't trying to hide its price.

The price is very obvious on the website.

Users can't use it for an extended period of time without paying.

You said it like I was tricking people into paying somehow.


> The price is very obvious on the website.

Hate to agree, but yes right there, 100 pixels above the download link, very nice and clear site, I heartily approve. Not sure how you'll get $1b of funding for something that user friendly though -- I don't see any dark patterns at all!

> Does your workplace reimburse software's cost?

I don't need your app, but my workplace certainly doesn't. I can buy all sorts of hardware from preferred suppliers through our purchasing system, I can with a few hoops generate a one-time credit card to buy a random thing from a random person (how I pay for starlink for example), and for some specific cases I can pay myself and claim back later

However software is blocked from all this.

> Even if Superintendent.app only saves you 30 minutes per month, it's already worth the minimal price tag that we will charge after beta.

To buy software in my company would take far longer than the 6 hours that I'd save each year! This of course is entirely the fault of my company, but I wonder how many other companies have similar policies.

What I can do though is spin up a AWS ami with a software charge and nobody bats an eyelid. I'm not sure how much of a "cut" aws takes, but something to think about.

(If it really helped me I'd just eat the $40 personally and take a day off in lieu, but I've known some people put in receiptless claims to get around the policy)


I hope it didn't take too much of your time to ramble all of this.

One thing I'm interested in is the dark pattern you accused me of.

Users literally cannot use it beyond the trial period if they don't pay, and I don't collect credit card info before the trial.

Are you saying someone might accidentally pay for this?


I think you need to reread the post you’re complaining about


OMG $40/year?!? Most outfits would charge $40/month/user/basic, add $100/month/user/professional, $CALLUS for enterprise.

Almost a breath of fresh air!


Right on the main page under a Pricing subheader...


There's a nice tool called dplyr-cli which lets you use R's dplyr data manipulation language on the command line.

    cat mtcars.csv | group_by cyl | summarise "mpg = mean(mpg)" | kable
    #> | cyl|      mpg|
    #> |---:|--------:|
    #> |   4| 26.66364|
    #> |   6| 19.74286|
    #> |   8| 15.10000|
https://github.com/coolbutuseless/dplyr-cli


These days I just use powershell. It has built-in csv import and then I'm dealing with familiar pwsh/.net objects. I don't miss the days of learning bespoke tools to handle slightly different cases.


Shameless plug: I created prql-query (https://github.com/PRQL/prql-query) in order to scratch my own itch and use PRQL (prql-lang.org) with DataFusion and DuckDB for things like this.

pq is overdue some maintenance but I will update it with the imminent PRQL 0.9 release.

I've used pq in anger at my $dayjob and found it incredibly productive to have the full power of SQL combined with the terse and logical syntax of PRQL.


I have this shell function defined:

  csv_to_json () { python -c 'import csv, json, sys; print(json.dumps([dict(r) for r in csv.DictReader(sys.stdin)]))' | jq . ; }

It converts a csv to a json list of objects, mapping column names to values. I find it way easier to then operate on json by filtering with jq or gron, or just pasting it into other tools for post-processing. The jq at the end isn't necessary but makes for nice formatting!


I'm two weeks late, but you could do something like

   python -c 'import csv, json, sys; [print(json.dumps(r)) for r in csv.DictReader(sys.stdin)]' | jq -s .
to stream the csv rows instead of reading the whole file into ram.


That won’t scale well to files approaching or exceeding system RAM, though.

Also, unless it’s the coloring you’re concerned with, you could replace the pipe to jq with pprint, a native Python library.


pprint won't work here because the output is a json string, not a python object, so pprint'ing it will just print the string as is. And plus, pprint doesn't output valid json (it uses single quotes, for example), so the output isn't usable for piping into other filters and programs.

And it's true that it won't scale well but it still has worked every time I've wanted it to :)


Lots of tools do this sort of thing. An incomplete list is in the vnlog docs: https://github.com/dkogan/vnlog/#description


Very nice … will see what I can snaffle for adding to this list: https://github.com/secretGeek/AwesomeCSV

(With credit of course, if I do so.)


Ruby's CSV module can be handy:

    ruby -rcsv -ne 'CSV($<).each { |r| puts r[0] }' 
I like the seen example the dude has:

    '!seen[$1]++'


As someone who's new to programming and doesn't know Ruby at all, can you (or someone) please explain how it works, what it does, and how would someone use it?

I find it beautiful, tho :)


    -rcsv
Load the (built-in) CSV module in Ruby.

    -e
Eval the following string as Ruby code.

    CSV($<)
Create CSV parser with standard input `$<` as the source.

    .each 
Run the code in the block that follows for each row (automatically skips the CSV header):

    { |r| puts r[0] }
For a row, print the first/0th element. Can be simplified in recent Ruby:

    { puts _1[0] }


Thank you so much! This got me looking at the Ruby documentation about blocks. Very cool feature!


I didn't know you could do that in Ruby, but it's pretty cool.


Nushell can open and write CSV. Those AWK commands look horrible compared to the nu syntax.

https://www.nushell.sh/commands/docs/from_csv.html


Nu is using a data frame abstraction to drive this and it’s really quite powerful and fast.

I was rather skeptical on the value of having this stuff as a shell built in, but it won me over. Very convenient!


If you use PowerShell, you can directly run Import-Csv and don't really have to think about it.


Many command-line CSV parsing tools mentioned here, adding my choice to the list:

http://harelba.github.io/q/


For field extraction (like the first example: `awk -F, '{print $1}'`), you can also use `cut -d, -f1`

GNU datamash (https://www.gnu.org/software/datamash/) provides features like groupby, statistical operations, etc.

See also this free ebook: Data Science at the Command Line (https://jeroenjanssens.com/dsatcl/)
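For example, a group-by with datamash (column positions hypothetical; -t, is only safe when no field contains a quoted comma):

  datamash -t, --header-in groupby 1 mean 3 < sales.csv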


I think the lesson here is don’t use awk for CSV. Instead, use one of the many tools discussed in comments that knows how to handle CSV.

Some are very close to awk in spirit, like my own attempt: `pawk` (1). It will parse your csv just fine. Or tsv. Or JSON or YAML or TOML. Or Parquet, even.

1: https://github.com/jean-philippe-martin/pawk


Many of the tools shared in this thread simplify working with CSV files, but only some allow running proper SQL queries.

SQLite, DuckDB and Clickhouse-local have been mentioned, but another very simple one, a single dependency-free binary, is https://github.com/multiprocessio/dsq

Not affiliated, just a happy user


Just to note, ClickHouse is also a dependency-free single binary, following https://notes.volution.ro/v1/2022/01/notes/fbf3f06c/

You can get it as

    curl https://clickhouse.com/ | sh


Just thought I'd plug Octosql[1] which I've enjoyed using for this. It parses CSV and JSON, which are the file types I parse the most.

[1] https://github.com/cube2222/octosql/


'tr' and 'cut' are also very useful; 'tr' can be used to get rid of extra spaces and to convert commas to spaces and vice versa (and to handle text data with pretty much any character used as a separator).

  cat "$SPACE_SEPARATED_FILE" | tr -s ' ' | tr ' ' ',' > out.csv
Allied with 'cut', it becomes easy to pull particular fields out of a text file:

  cat "$FILE_WITH_COMMAS_AND_SPACES" | tr ',' ' ' | tr -s ' ' | cut -d ' ' -f1,2,17 > out.txt


I recommend everyone check out the very cool csvtk by the amazing bioinformatician Shen Wei:

https://github.com/shenwei356/csvtk
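A taste of it (column names hypothetical):

  csvtk headers file.csv
  csvtk cut -f name,email file.csv | csvtk grep -f email -p 'alice@example.com'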


Presumably this command:

  awk -F, '{print $1}' file.csv

doesn't work if the data contains commas (with escapes)? If so, that might be worth spelling out.


I've defaulted to doing all my data munging in a clojure repl.

The datasets I work with are easily small enough to fit in memory on my machine, and having the full Java ecosystem + Clojure ergonomics is worth more to me than the performance full DB tooling might offer.


I use miller, as it's available on my distro.
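A typical Miller invocation, for anyone curious (field names hypothetical):

  mlr --csv filter '$status == "customer"' then cut -f name,email file.csv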


Python's pandas library has the ability to seamlessly work with CSVs by importing/exporting them as data frames.
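e.g. as a quick one-liner from the shell (file and column names hypothetical):

  python3 -c "import pandas as pd; df = pd.read_csv('data.csv'); print(df.groupby('region')['sales'].sum())"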


I don't understand why this is so high when it's just using awk to naively split on commas. Any programming language could do that! I was expecting an actual command-line tip that can handle CSV files in general; looks like it's still basically impossible without resorting to a full-blown CSV parser (none of which come installed by default).


I use PowerShell for this and it works very well.


Interesting but not very intuitive.


xsv works pretty well for me


Title should have been "Using awk to process CSV files".


Can’t wait for the tutorial on ‘ls’. It’s not like there is a manual.



