Question: why the hell do people use CSVs? Use tabs, goddammit. Documents are littered with commas. It's trivial to parse a TSV with awk - not so a CSV. There is literally no reason to prefer CSV to TSV, and yet it seems the former is the default for many people. Why? As a guy who parses text files all day, I never ever want to see a CSV again. End rant.
Pretty much every programming language in existence has a utility for parsing CSVs into tabular data while handling commas sanely. (using awk for processing is not a typical pipeline)
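For instance, Python's standard library handles the commas-inside-fields case out of the box. A minimal sketch:

```python
# Minimal sketch: the stdlib csv module parses quoted fields containing
# commas (and embedded quotes) without any custom handling.
import csv
import io

raw = 'name,comment\nAlice,"Loves CSVs, hates TSVs"\nBob,"Said ""no thanks"""\n'

reader = csv.DictReader(io.StringIO(raw))
for row in reader:
    print(row["name"], "->", row["comment"])
# Alice -> Loves CSVs, hates TSVs
# Bob -> Said "no thanks"
```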
More importantly, CSVs are handled natively by statistical programs such as Excel and R (you can change the expected delimiter while importing, but in my experience that leads to unexpected results)
Microsoft often fails with CSVs, actually. I've found I can't copy/paste from SQL Server Management Studio into Excel if there's a mismatched double-quote in a cell - it fails silently and garbles your data.
That was embarrassing.
The CSV format is so messy that even big corporate implementations will have subtle bugs like that. Do not trust it.
Microsoft Excel has the worst CSV defaults. The separating character is even locale-specific! Many locales use the semicolon ";" instead of "," as a separator. Great stuff when creating web app exporters.
Network effect: Database upload/export, spreadsheets, etc... I used to hate them too until I realized how ubiquitous and well supported they are.
Comma vs. tab: it doesn't matter. In the end, you have to use some character to separate fields, and if you use a "normal" character, there's a chance it will appear in your text. So you either escape that character or quote your text, which CSV does in a fairly standard way. Comma vs. tab doesn't really matter.
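For what it's worth, here's a small sketch of that standard quoting behaviour using Python's csv writer - it quotes only the fields that actually need it:

```python
# Sketch: the csv writer applies standard quoting rules automatically, so a
# field containing the delimiter or a quote character is quoted/escaped
# instead of corrupting the row.
import csv
import sys

writer = csv.writer(sys.stdout)  # QUOTE_MINIMAL is the default
writer.writerow(["id", "text"])
writer.writerow([1, "commas, inside, a field"])
writer.writerow([2, 'embedded "quotes" too'])
# id,text
# 1,"commas, inside, a field"
# 2,"embedded ""quotes"" too"
```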
I was going to add a sentence to my comment... They were easy to input via control-caret, etc., until various programs repurposed those combinations. I think a bigger problem is that there is no standard way to display them.
(BTW, if inputtability mattered at all, Unicode wouldn't exist. ;-)
I'm not sure what you mean by your last sentence. Sure, you might have to work a bit to be able to change your locale to Klingon and input that -- but for something Unicode is actually useful for, like Japanese, there have been input solutions for a long time (predating Unicode, since you need a wide character set for such languages).
And Unicode is of course easy to use with "less exotic" languages like Norwegian that can't be represented in basic 7-bit ASCII -- via a simple keyboard map.
I can think of at least one reason. Copy-pasting from a terminal fails to preserve tabs, while commas are fine.
Tabs might be somewhat less frequent than commas in data, but you're still gonna have just as bad a time when they do turn up. And don't forget quoting and escaping. Without proper parsing you can neither read CSV nor TSV safely. My go-to solution is pandas.read_table().
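A quick sketch of that pandas approach (read_table defaults to tabs, read_csv to commas; both apply the same quoting rules):

```python
# Sketch: pandas handles quoting for both tab- and comma-separated input.
import io
import pandas as pd

tsv_data = 'name\tnote\nAlice\t"has a , comma and ""quotes"""\n'
print(pd.read_table(io.StringIO(tsv_data)))   # tab-separated by default

csv_data = 'name,note\nBob,"commas, inside a field"\n'
print(pd.read_csv(io.StringIO(csv_data)))
```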
Yeah, this is the deciding factor for me, too. The point of delimited text files is to find some compromise between machine efficiency and human readability. CSVs are generally much less readable than TSVs, but they have less ambiguity, too. With TSVs, it's hard to tell whether what looks like a tab is actually a tab or just multiple spaces.
And with developers (especially Python devs) setting their text editors to translate tabs into spaces, copy-pasting TSVs results in data corruption. So does simply opening a TSV in your text editor with default settings.
Why use a character like tab, which can also appear in common text, when there are control characters designed exactly for this purpose? ASCII 0x1C - 0x1F are there specifically for machine parsing of tabular data.
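A minimal sketch of what that could look like in Python - 0x1F as the unit (field) separator and 0x1E as the record separator, neither of which shows up in ordinary text:

```python
# Sketch: ASCII unit separator (0x1F) between fields, record separator (0x1E)
# between rows. Commas, tabs, and pipes can then appear freely in the data.
US = "\x1f"   # unit separator: between fields
RS = "\x1e"   # record separator: between rows

rows = [["id", "text"], ["1", "commas, tabs\t and pipes | are all fine here"]]

encoded = RS.join(US.join(fields) for fields in rows)
decoded = [record.split(US) for record in encoded.split(RS)]
assert decoded == rows
```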
The company I work at uses the pipe character ('|') everywhere. It's fairly painless, and you can edit the files in standard text editors. It's also pretty easy for clients to provide or consume pipes. CSVs are terrible - people always have trouble providing or ingesting them with the correct quoting: probably half the time CSV quoting is suggested, there's strong resistance to the idea. TSVs are okay - you can copy-paste from Excel into your terminal, so they've got that going for them.
There was one time where a client provided a couple hundred gigabytes of data in which all the reasonable characters one might use appeared in the fields themselves, including their own separator character (the comma). So I made up some rules for guessing which commas were delimiters and which were part of a field value, and stored the result using an ASCII control character as the delimiter. The only other use of that character at my company is in jokes, like "We should ask the client to provide or ingest files separated using the ASCII control codes," which always gets a laugh.
It's trivial to change all those commas to something else before making the CSV file and importing into the database.
They can then be converted back to commas by processing the query result.
This type of conversion before storage is routinely done with other characters, such as newlines, quotes, etc. For example, look at the JSON for the 10mHNComments data dump.
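A rough sketch of that substitution approach in Python - the placeholder character here is just an arbitrary choice for illustration:

```python
# Sketch: swap commas for a placeholder before writing the delimited file,
# and swap them back when processing query results. The placeholder (ASCII
# group separator, 0x1D) is only an example choice.
PLACEHOLDER = "\x1d"

def encode_field(value: str) -> str:
    return value.replace(",", PLACEHOLDER)

def decode_field(value: str) -> str:
    return value.replace(PLACEHOLDER, ",")

row = ["42", encode_field("Portland, OR")]
line = ",".join(row)                 # safe: no literal commas left in fields
restored = [decode_field(f) for f in line.split(",")]
assert restored == ["42", "Portland, OR"]
```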
Personally, from a readability standpoint, tabs drive me nuts. I am glad TSV is not the default.
CSV is a garbage format, but at least it's a standard, one with a lot of tool support, and one that can handle embedded commas in data if processed correctly. The idea that tab-separated data is a superior format is, frankly, absurd. I could understand a call for, say, JSON or even XML, but an ad hoc, standardless format whose solitary advantage is being incrementally easier to parse with awk? No thanks.
Use tildes. Tilde Separated Values are relatively robust and nobody uses tildes by themselves for anything I've ever seen. Most tools that let you set a delimiter will accept tildes also.
And there are other possible irregularities: zero or N spaces after the comma separators; unquoted values when quoting isn't strictly needed; backslash-escaped special characters; escaped newlines.
To figure out what's the best time to post to reddit you want the number of successful submissions divided by the total number of submissions in that time period, not just the count of successful submissions. Otherwise you may just be measuring which time slot has the most submissions.
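Something like this, sketched with pandas and hypothetical `hour`/`successful` columns:

```python
# Sketch of the normalization suggested above: success *rate* per time slot,
# not just the raw count of successes. Column names are assumptions.
import pandas as pd

df = pd.DataFrame({
    "hour": [9, 9, 9, 21, 21],
    "successful": [True, False, False, True, True],
})

by_hour = df.groupby("hour")["successful"]
print(by_hour.sum())    # raw count of successful submissions (misleading)
print(by_hour.mean())   # success rate: successes / total submissions per hour
```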
Really cool analysis. It's especially fascinating to see how simple BigQuery makes it to parse and query the data, easily giving you access to powerful information like "What's the best time to post this on reddit?"
Man those charts are beautiful, and this coming from someone who knows R graphics and ggplot pretty well. Please may I ask a) what platform (OS X / Windows / Linux?) and b) what are the axis and title fonts?
You've managed to make even ggplot's already excellent default outputs look even better. Love the darker grey gridlines, the unified grey panel background which includes the headline, the alpha < 1 so the grid shows through, and especially the choice of axis fonts. Nice touch on the footnote too. This is what professional charts are supposed to look like.
It should now be possible to get anti-aliasing (AA) on both Windows and Linux via the Cairo library.
I'll be playing a bit with R notebooks for my stats intro class, if I find the time I'll try to see if a) I get AA out of the box, or b) I can get AA easily by setting some parameters.
To get back to the point of the article, is 130MM rows past the point where you should start reaching for "big data" tools? I recently did an ad-hoc analysis of a 50MM row, 3GB dataset and saw similar performance to the OP with a local postgres database. Is it worth using specialized analysis tools for exploring these types of datasets? I ended up writing a script to slice the data, but with `wc -l` taking 2 minutes on my SSD, it wasn't a great interactive experience. What do other people use for this stuff?
Slightly different domain, but I recently had to work on a system that retained email addresses that were known targets of aggressive spamming (450MM rows).
In a nutshell, the idea was to compare lists of emails and test to see how many accounts in the given list were also in the database. We'd typically get 10k item lists to test against the database.
Initially, I was creating a temporary table and doing a unique intersection on it. That took about 45 minutes on a modest machine. Next, I hacked up a bloom filter search (since all we cared about was whether or not the items in the sample were in the database); that ran in a little under a second for a set of 10k, but got a little unwieldy with sets approaching 1MM+.
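Not the actual implementation, but a rough Python sketch of the bloom-filter idea - a bit array plus k hash functions, with no false negatives and a tunable false-positive rate:

```python
# Rough sketch of a bloom filter for fast probabilistic membership tests.
# Sizes and hash count here are arbitrary illustration values.
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 8 * 1024 * 1024, num_hashes: int = 7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Build once from the big database, then test 10k-item sample lists quickly.
bf = BloomFilter()
bf.add("spammer@example.com")
print(bf.might_contain("spammer@example.com"))   # True
print(bf.might_contain("clean@example.com"))     # almost certainly False
```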
We talked about it, and decided to just use Elasticsearch, and to just split the database into a bunch of buckets. Search time went up a small bit for small samples, but went waaay down for larger samples.
Also of note, going from spinning platters to SSD did make a pretty huge improvement in things when we were using a relational database. Now that it's all basically in memory, it doesn't matter so much.
Even old DDR-200 (aka PC-1600, 1600 MB/s)[1] is faster than the theoretical maximum of SATA III (600 MB/s). Wikipedia says SATA 3.2 can reach 1,969 MB/s, but the more recent DDR3-1600 is 12,800 MB/s and some DDR4 reach 21,300 MB/s [2] ... SSDs aren't even close.
DDR4 also theoretically supports 512 GB sticks [3]. Can't wait.
Note that part of the beauty of BigQuery is how it allows you to go deeper into any dimension without pre-planning (AKA defining partitions or indexes).
Case in point: many comments on this page point at the "best time to post" chart. But what if that time varies by subreddit?
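For example, a hedged sketch using the Python BigQuery client - the table and column names below are placeholders, not the real dataset schema:

```python
# Sketch: extending a "best time to post" query to group by subreddit as well,
# with no partitions or indexes defined up front. Table/column names
# (my-project.reddit.posts, subreddit, created_utc, score) are assumptions.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT
  subreddit,
  EXTRACT(HOUR FROM TIMESTAMP_SECONDS(created_utc)) AS hour_utc,
  COUNTIF(score >= 10) / COUNT(*) AS success_rate
FROM `my-project.reddit.posts`          -- placeholder table
GROUP BY subreddit, hour_utc
HAVING COUNT(*) > 1000                  -- ignore thinly populated slots
ORDER BY subreddit, hour_utc
"""
df = client.query(sql).to_dataframe()
print(df.head())
```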