Unfortunately "awk -F," (field separator of comma) doesn't work with most real CSV files, because of quoted fields, commas in fields, and (less frequently) multiline fields. My GoAWK implementation has a CSV mode activated with "goawk -i csv" (input mode CSV) and some other CSV features that properly handle quoted and multiline fields: https://benhoyt.com/writings/goawk-csv/
The frawk tool (written in Rust) also supports this.
Interestingly, Brian Kernighan is currently updating the book "The AWK Programming Language" for a second edition (I'm one of the technical reviewers), and Gawk and awk are adding a "--csv" option for this purpose. So real CSV mode is coming to an AWK near you soon!
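For example, with CSV input mode the second field is the second CSV column even when other fields contain quoted commas. A minimal sketch (data.csv is a made-up filename, and the gawk/awk flag may still change before release):

    # GoAWK, CSV input mode
    goawk -i csv '{ print $2 }' data.csv

    # Gawk / awk, once the --csv option ships
    gawk --csv '{ print $2 }' data.csv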
This is a common theme with classic Unixy text-processing workflows: they work really nicely when the incoming data is simple, but they fall apart on edge cases. And because the error handling is often incomplete, it can be difficult to know that the pipeline is producing garbage output.
A related example is the recently discussed problems with filenames, and the many pitfalls that scripts can stumble into when handling them.
The problem is that edge cases are not really edge cases… in certain locales. ;) Someone could be using commas as the decimal separator since that’s how they do it ’round these parts.[1][2] So it’s not like they’re storing “weird” stuff in tabular format, like HTML blog articles; they’re just storing numbers.
Of course, Excel and other software have tried to be way too clever about it, with different separators depending on locale, which makes the whole pipeline complex and context-dependent.
[1] A surprising number of countries have this convention, including my own… doesn’t make sense to me since it makes listing numbers harder.
[2] And the number one selling point of CSV is that it is “human editable”, so of course someone is going to do that at some point in the pipeline.
> And because the error handling is often incomplete, it can be difficult to know that the pipeline is producing garbage output.
The thing is that from the point of view of each individual tool, there isn't even an error! You ask awk to split on commas and give you the second field, it splits on commas and gives you the second field. It opened the file no problem, it didn't run out of memory writing its output to stdout... There were no errors to speak of from what it could see. The "error" is an error in the overall logic of the code.
That is correct: the data needs to be simple, with the delimiter character never embedded inside a quoted field. I wrote a simple (and fast) utility to ensure that CSV files are handled properly by all the standard UNIX command-line data tools. If you like using awk, sed, cut, tr, etc. then it may be useful to you.
Another problem is that CSV isn't really a text format. It's a binary format that just happens to look like text, and most implementations will handle it if you edit it as text. But the record separator is meant to be specifically CRLF, and there might be other complications. This can be confusing when using these tools to try to output another CSV. Do the `--csv` modes also support writing back to CSV?
No, unfortunately the Gawk/awk --csv option doesn't support writing back to CSV. FWIW, GoAWK supports CSV input mode with "-i csv" and CSV output mode with "-o csv" for this purpose, as well as special syntax for fetching a @"named-field" rather than a $numbered_field. Docs: https://github.com/benhoyt/goawk/blob/master/docs/csv.md
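For instance, something along these lines reads a CSV with a header row, picks fields by name, and writes properly quoted CSV back out (roughly; see the docs for the exact flags, and the file/field names here are just examples):

    goawk -i csv -H -o csv '{ print @"name", @"email" }' people.csv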
2nd edition?! I just purchased a copy of the first and it's fantastically concise. I've always had a great experience with books Kernighan has been a part of, and it's awesome to see that the book's team are domain experts. Looking forward to it!
I think “most” is doing a lot of work there. I make and process csvs a lot, and just avoid quotes when generating them.
Sure, technically you can do things like having newlines inside a field, which causes all sorts of grief, but if you control the end-to-end workflow, just avoid it.
I have a tool which monitors performance of loading various websites via various proxies at various points on our network with Puppeteer. That outputs a csv which I can then extract data from and make handy graphs to show manglement.
Yeah, I guess that kinda works if you're the one generating them. But you must not have too many generalised strings -- like a name field containing "Smith, John" or a book title like "The Lion, the Witch and the Wardrobe"? When I deal with CSVs those kinds of things are quite common. I actually prefer TSV because you can (more easily) process them with standard AWK.
What do you output for a string field with ',' in it? Just strip the commas, or replace it with something else?
Quotes aren't the problem, it's having an in-band separator character. Maybe you're lucky enough to only need naive splitting on commas, but lots of us need to jam in human-produced text strings where the punctuation needs to be preserved without breaking file parsing.
Also localized csv files in a language where the comma is the decimal separator (and the field separator is ; ), column headers and unexpected character encodings. CSV is truly the worst.
For any kind of tabular data, I use DuckDB (parallel CSV reads, works on small to humongous CSVs). It supports full Postgres-like SQL, so you can do arbitrarily complex manipulations on columns and rows. The CSV essentially becomes a dataframe object that can be manipulated in performant ways (DuckDB vectorizes and parallelizes, so it's way faster than awk on large CSV files). You can even manipulate multiple CSVs in a single statement (JOINs, UNIONs etc).
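For example (file and column names are hypothetical), you can query a CSV straight from the shell and DuckDB infers the types and handles the quoting:

    duckdb -c "SELECT category, count(*) AS n
               FROM 'orders.csv'
               GROUP BY category
               ORDER BY n DESC"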
Yeah, sqlite has been my preferred method for working with csv files for a long time. I will say that one nice thing about DuckDB over SQLite is that it does a better job inferring column types than SQLite does.
I do use sqlite too to process csv files, but I either rely on csvkit to generate the schema from the csv or create the table with raw sql before inserting the data into the sqlite database.
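For what it's worth, a recent sqlite3 shell can also do the import in one step (it creates an all-TEXT table, which is exactly the type-inference limitation mentioned above); something like this, with made-up file/table names:

    sqlite3 data.db ".import --csv people.csv people"
    sqlite3 data.db "SELECT count(*) FROM people"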
I think you can use a heuristic approach. E.g. pick a comma in the middle of the CSV, assume it is a delimiter (not inside data) and parse from there in a separate thread. If you find it doesn't parse correctly (perhaps because it is inside data) you might have to throw the results of that thread away. Kind of ugly but would probably allow faster parsing in 99% of cases.
CSVs are a bit of a mess. Non-trivial to parse properly, due to escaping, and the standard (such as it is) is poorly followed. E.g. sometimes people use semi-colon or pipe instead of comma.
I've found that for tabular data parquet can't be beat; the only time textual formats are superior is for human readability (which imo can be solved by tooling).
Edits might be a little tricky, there's no concept of writing in place in parquet afaict. But there's absolutely no reason why reads/creation can't be supported in text editors (I'm sure some do).
This is also my use case, where I have GBs of data in S3 as Parquet and a small CSV file that I have to join to them.
And DuckDB reads CSVs and Parquet and SQLite and others. I can join all these heterogeneous data types in a single SQL statement and have the assurance that it’ll be done correctly.
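Something along these lines, for instance (bucket, paths and the join key are invented, and S3 access needs the httpfs extension plus credentials):

    duckdb -c "
      INSTALL httpfs; LOAD httpfs;
      SELECT p.*, c.label
      FROM read_parquet('s3://my-bucket/events/*.parquet') AS p
      JOIN read_csv_auto('labels.csv') AS c USING (id)
    "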
You can't read the SAME csv in a parallel manner, but you can still read two different csvs in two separate threads? I think that's what they meant by "parallel csv".
I think the fundamental issue is that a single 'cell' (value) can contain newlines, so you can't just assume each line is a 'row' and trivially parallelize.
The problem is perhaps most obvious when you consider that the value of a single cell could itself be a CSV, and this could be recursive
With raw string parsing approaches (awk, perl) you have to handle all kinds of corner cases because CSV is an underspecified format. CSV is deceptive in that it looks like it parses easily, but the truth is CSV is one of the most difficult formats to reliably parse. I mean, just look at the number of arguments that pandas.read_csv has.
Parsing a CSV by hand is almost always the wrong thing to do unless you also wrote the CSV generating process and can limit the variance.
But if not, instead of working with CSVs at the string level and risk running into issues, I've found that the best practice is to farm the CSV parsing out to a battle-tested engine (I use DuckDB, but sqlite, pandas, polars, clickhouse-local or any number of engines work), and work with the parsed object instead.
Better yet, avoid CSV and use DuckDB + Parquet (paired with visidata for quick vizes) for anything nontrivial.
With Perl, I wouldn't do raw string parsing; I'd use a Text::CSV (a battle-tested engine). For example, this filters a file foo.csv where `some_header` field is positive (you could also use sort or map or whatever instead of grep):
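Roughly along these lines (an untested sketch using Text::CSV's csv() helper; adjust to taste):

    perl -MText::CSV=csv -e '
        my $rows = csv(in => "foo.csv", headers => "auto");
        csv(in => [grep { $_->{some_header} > 0 } @$rows],
            out => \*STDOUT, headers => "auto");
    '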
Have a look at visidata, a TUI table viewer/editor with vim keybindings. It's incredibly powerful and fully automatable - everything you do gets recorded as a series of commands which you can save and replay.
Not the right tool for everything, but it shines for quickly glancing at the shape of tabular data and making sense of it. Sorting, filtering, joins across files, column histograms and even column splitting/rejoining are all keystrokes away.
It groks anything even remotely table-shaped: CSV, JSONL, JSON, even Excel files. It can also connect directly to databases and parse tables out of HTML.
I am a big fan of VisiData for working with data in the terminal. I also like John Kerl's miller[1] and Simon Willison's sqlite-utils [2] when I want to do queries. These are useful for working with data in simple shell scripts.
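For instance (made-up file and table names; check each tool's docs for the details):

    # Miller: CSV in, pretty-printed table or JSON out
    mlr --icsv --opprint cat data.csv
    mlr --icsv --ojson head -n 5 data.csv

    # sqlite-utils: load a CSV into SQLite, then query it
    sqlite-utils insert data.db rows data.csv --csv
    sqlite-utils data.db "select count(*) from rows"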
xsv is great for quick sanity checks (e.g. number of columns, unique value counts in a given column), but for more serious tasks/giant files I switch to either polars or duckdb, converting CSV/TSV files to parquet or parquet datasets.
Maybe try the to [1] and sqlp [2] commands from the qsv fork. From the README:
sqlp: Run blazing-fast Polars SQL queries against several CSVs - converting queries to fast LazyFrame expressions, processing larger than memory CSV files.
to: Convert CSV files to PostgreSQL, SQLite, XLSX, Parquet and Data Package.
Cool stuff! But it’s criminal to not call attention to JQ’s elder sibling CSVKit. It’s invaluable for playing with csvs. Much easier to parse out columns, and it allows you to generate new csvs and even merge them. More importantly, it allows SQL on csvs (via SQLite iirc), which empowers all sorts of csv shenanigans. The bash scripting this enables is incredible (good and bad).
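For example (column and file names are illustrative):

    csvcut -c name,email data.csv                      # pick columns by name
    csvjoin -c id users.csv orders.csv > merged.csv    # merge two CSVs on a shared key
    csvsql --query "select name, count(*) from data group by name" data.csv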
I agree, it's indeed one of the best tools for processing csv files. Basic coreutils commands quickly become very limited and brittle for processing csv files.
As an SRE, I used to install one tool for every data format until I came across clickhouse-local. Now I use it for everything.
- It works with every data format I come across in my daily job. It supports formats like Protobuf, Avro and Cap'n Proto which regular tools don't support. Funny thing, it can even read mysql dumps.
- I can read data stored locally or in remote locations like http, s3, gcs, azure and whatever location I can think of.
- I can run SQL queries on the raw data and improve my SQL skills on a daily basis. (Quick example below.)
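A minimal sketch of the kind of one-liner this enables (the file name and columns are hypothetical):

    clickhouse-local -q "
      SELECT status, count() AS hits
      FROM file('access_log.csv', 'CSVWithNames')
      GROUP BY status
      ORDER BY hits DESC
    "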
We bought the clickhouse.com domain two years ago, and it was listed in some adblockers due to abuse by the previous owner. We have cleaned up most of the block lists, but there is still a chance it is filtered by your ad-blocker.
Whenever CSV comes up I always feel a bit sad. ASCII includes a set of four out-of-band delimiters[0] that can be used instead of silly formats like CSV that use in-band delimiters which necessitate complicated quoting rules.
You can't just treat CSV as text. It's not, and woe betide you if you ever use this stuff in a script instead of using a proper CSV parsing tool. If we used the ASCII delimiters instead it would be possible to treat it as text, and stuff like this would work.
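For instance, plain awk copes with a hypothetical data.asv file out of the box, because the separators can simply never appear inside field data:

    # US (unit separator, octal 037) between fields, RS (record separator, octal 036) between records
    awk 'BEGIN { FS = "\037"; RS = "\036" } { print $2 }' data.asv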
Preach; the POSIX standard also understands this, and partitions control characters into portable and non-portable ones; the ASCII field/record/group delimiters are designated as non-portable, and therefore can be safely used to delimit portable character data. The fact that the whole file is then non-portable is a feature, not a bug; if you need to pack an entire table into a field, encode it in some portable encoding (eg base64) and add a decoder field.
The "arguments" against this are absurd; mainly that the resulting file is not plain text and cannot be read/modified immediately by plain text editors. This ignores the fact that plain text editors already deal with control characters because they have to decide what to do with <tab>, and makes a mountain out of a molehill when the in-band horrors of escaping delimiters create trouble every single day.
> The fact that the whole file is then non-portable is a feature, not a bug; if you need to pack an entire table into a field, encode it in some portable encoding (eg base64) and add a decoder field.
I guess you could:
1. Use unit-separator and record-separator for text
2. Use group-separator for (1)
3. Use file-separator for (2)
> The "arguments" against this are absurd; mainly that the resulting file is not plain text and cannot be read/modified immediately by plain text editors. This ignores the fact that plain text editors already deal with control characters because they have to decide what to do with <tab>, and makes a mountain out of a molehill when the in-band horrors of escaping delimiters create trouble every single day.
I agree. And some tools support it, e.g. Miller.[1]
There are a lot of power tools for CSV. All they need to do to support this format is to support it as input and output. It should be possible for example to add support to Csvkit[2] as input/output ("format").
Just based on HN comments alone (these things always come up) there is a decent demand for this kind of format. And there are a few implementations out there. But the implementations often look more like standards or proofs of concepts rather than full-fledged implementations. (A full-fledged implementation should probably provide functions/programs to convert to and from CSV as described in RFC-4180.)
The only thing needed for this to work really well is for text editors to treat the end-of-record character like a carriage return (I'm not aware of any that do).
Agreed here, it's impressive how easy it is to use and how performant it is.
It would be nice to add a seamless ability to call executable UDFs from clickhouse-local; last time I checked, clickhouse-local required executables to be in a special directory (as in proper clickhouse). Instead it'd be nice to be able to reference any executable in an ad-hoc way.
I was a user of superintendent until about a year ago (it doesn't make sense for my current job), and I liked it a lot. Tanin was responsive by email, and the tool was very helpful for standard CSV tasks. Well worth the price for me. If I ever need to do those kinds of tasks again, I'll gladly become a customer again.
Really appreciate your kind words! I'm glad we crossed paths, though, based on your username, I can't tell who you are. haha.
To be honest, I'm still trying to find more of a product-market fit.
It feels like a travel planning app where people might need it once every few months, so it doesn't quite catch on.
On the other hand, I do have a few passionate users who analyze 40+ CSVs at the same time, but I'm struggling to improve the UX for these users. Building great UI is hard. But maybe this is what I need to do.
Hate to agree, but yes right there, 100 pixels above the download link, very nice and clear site, I heartily approve. Not sure how you'll get $1b of funding for something that user friendly though -- I don't see any dark patterns at all!
> Does your workplace reimburse software's cost?
I don't need your app, but my workplace certainly doesn't reimburse software. I can buy all sorts of hardware from preferred suppliers through our purchasing system; I can, with a few hoops, generate a one-time credit card to buy a random thing from a random person (how I pay for Starlink, for example); and in some specific cases I can pay myself and claim it back later.
However software is blocked from all this.
> Even if Superintendent.app only saves you 30 minutes per month, it's already worth the minimal price tag that we will charge after beta.
To buy software in my company would take far longer than the 6 hours that I'd save each year! This of course is entirely the fault of my company, but I wonder how many other companies have similar policies.
What I can do, though, is spin up an AWS AMI with a software charge and nobody bats an eyelid. I'm not sure how much of a "cut" AWS takes, but something to think about.
(If it really helped me I'd just eat the $40 personally and take a day off in lieu, but I've known some people put in receiptless claims to get around the policy)
These days I just use powershell. It has built-in csv import and then I'm dealing with familiar pwsh/.net objects. I don't miss the days of learning bespoke tools to handle slightly different cases.
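For instance, a pipeline like this (column names invented) parses the quoting correctly and hands you objects rather than raw strings:

    Import-Csv .\orders.csv |
      Where-Object { [double]$_.Total -gt 100 } |
      Select-Object Name, Total |
      Export-Csv .\big-orders.csv -NoTypeInformation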
Shameless plug: I created prql-query (https://github.com/PRQL/prql-query) in order to scratch my own itch and use PRQL (prql-lang.org) with DataFusion and DuckDB for things like this.
pq is overdue some maintenance but I will update it with the imminent PRQL 0.9 release.
I've used pq in anger at my $dayjob and found it incredibly productive to have the full power of SQL combined with the terse and logical syntax of PRQL.
csv_to_json () {
python -c 'import csv, json, sys; print(json.dumps([dict(r) for r in csv.DictReader(sys.stdin)]))' | jq .
}
It converts a csv to a json list of objects, mapping column names to values. I find it way easier to then operate on json by filtering with jq or gron, or just pasting it into other tools for post-processing. The jq at the end isn't necessary but makes for nice formatting!
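Usage is then just, say (hypothetical file and field):

    csv_to_json < people.csv | jq -r '.[].email'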
pprint won't work here because the output is a json string, not a python object, so pprint'ing it will just print the string as is. And plus, pprint doesn't output valid json (it uses single quotes, for example), so the output isn't usable for piping into other filters and programs.
And it's true that it won't scale well but it still has worked every time I've wanted it to :)
As someone who's new to programming and doesn't know Ruby at all, can you (or someone) please explain how it works, what it does, and how would someone use it?
I think the lesson here is don’t use awk for CSV. Instead, use one of the many tools discussed in comments that knows how to handle CSV.
Some are very close to awk in spirit, like my own attempt: `pawk` (1). It will parse your csv just fine. Or tsv. Or JSON or YAML or TOML. Or Parquet, even.
Many of the tools shared in this thread simplify working with CSV files, but only some allow running proper SQL queries.
SQLite, DuckDB and Clickhouse-local have been mentioned, but another very simple one, a single dependency-free binary, is https://github.com/multiprocessio/dsq
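For example (file and columns invented; {} is dsq's placeholder for the input file):

    dsq data.csv "SELECT name, count(*) AS n FROM {} GROUP BY name ORDER BY n DESC"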
'tr' and 'cut' are also very useful; 'tr' can be used to get rid of extra spaces and to convert commas to spaces and vice versa (and to handle text data with pretty much any character used as a separator).
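For example (only safe when the data contains no quoted or embedded separators):

    tr -s ' ' '\t' < report.txt | cut -f2    # squeeze runs of spaces into tabs, take column 2
    tr ',' '\t' < simple.csv | cut -f1,3     # commas to tabs, then pick fields 1 and 3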
I've defaulted to doing all my data munging in a clojure repl.
The datasets that I work with are easily small enough to fit in memory on my machine, and having the full Java ecosystem plus Clojure ergonomics is worth more to me than the performance that full db tooling might offer.
I don't understand why this is so high when it's just using awk to naively split on commas. Any programming language could do that! I was expecting an actual command line tip that can handle CSV files in general, looks like it's still basically impossible without resorting to a full blown CSV parser (none of which come installed by default).