
  sqlite> .mode csv
  sqlite> .header on
  sqlite> .import weight.csv weight
  sqlite> .import person.csv person
  sqlite> select * from person, weight where person.ID = weight.ID;
  ID,sex,ID,weight
  123,M,123,200
  789,F,789,155
  sqlite>



PostgreSQL can do it on live data without importing anything. Set up the "file_fdw" extension:

  CREATE EXTENSION file_fdw;
  CREATE SERVER fdw FOREIGN DATA WRAPPER file_fdw;
Declare tables:

  CREATE FOREIGN TABLE person (id int, sex text)
    SERVER fdw
    OPTIONS (filename '/tmp/person.csv', format 'csv');
  CREATE FOREIGN TABLE weight (id int, weight int)
    SERVER fdw
    OPTIONS (filename '/tmp/weight.csv', format 'csv');
Now:

  SELECT * FROM person JOIN weight ON weight.id = person.id;


In R this is rather straightforward, and you can also achieve it without additional libraries:

  weight <- read.csv("weight.csv")
  person <- read.csv("person.csv")
  merge(weight, person, all=TRUE)
Of course, nowadays you would use data.table, but the merging logic would be exactly the same.


That loads all the data into memory, though. Fine if you're joining all of it, but with the Postgres example you can throw the whole SQL language at the data, and it will be streamed and pipelined efficiently just like with normal Postgres tables.
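
For example (just a sketch), you can run an aggregate directly over the live CSVs using the foreign tables declared above:

  SELECT person.sex, avg(weight.weight)
  FROM person JOIN weight ON weight.id = person.id
  GROUP BY person.sex;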


I never got deep into R, as we had Python or R as options in school for data projects, so forgive me, but where is the database join in this scenario?


Original article doesn't say anything about "database join". It's about joining two datasets by some common ID.

R fits the bill in this case and even allows for some relational algebra (e.g. an INNER JOIN would be merge(X, Y, all=FALSE), a LEFT OUTER JOIN merge(X, Y, all.x=TRUE), etc.)


merge() joins the two datasets on identically named columns. all=TRUE makes it a full outer join.


This isn't "without importing anything". This is "importing all the data again every time you run a query".


No, "import" means "create state from data". It's not creating any persistent state at all.

But it's an important nuance for another reason: you can operate on the original files in real time. You can update the underlying files and your queries will reflect the changes.

The downside is that since the data is streamed at query time, it's less efficient, and indexes can't be created on the foreign tables.
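
If you do end up needing indexes, one workaround (a rough sketch; the local table name and auto-generated index are made up for illustration) is to materialize the foreign table into a regular table once and index that:

  CREATE TABLE person_local AS SELECT * FROM person;
  CREATE INDEX ON person_local (id);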


That's reasonable.


I can't speak for XSV, but when I had to do similar analysis of logfiles in the range of hundreds of thousands of records, SQLite would consistently seize up on my machine and make the analysis impossible. Had I known about XSV I might have tried it, but GNU join saved my bacon that week and would do my analysis in minutes.


That's why you can `EXPLAIN` your query, and maybe even add indices, to speed things up!

Plus, in the last few years SQLite has gotten a lot more powerful: it has added lots of JSON support and improved index usage. I'd avoid working with lots of data as flat ASCII text, because of all the I/O and wasted cycles spent reading the data, inserting it into some data structure, and writing it back out as flat ASCII text, but it can be super debuggable.
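
For the person/weight tables from the sqlite session at the top of the thread, a minimal sketch (the index name is made up) would be:

  CREATE INDEX weight_id_idx ON weight(ID);
  EXPLAIN QUERY PLAN
  SELECT * FROM person, weight WHERE person.ID = weight.ID;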


> GNU Join

The join utility is actually a part of POSIX[1], so every UNIX should have one. Here's one from OpenBSD[2], for example. The GNU version probably has more flags though.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/j...

[2] https://man.openbsd.org/join.1


Plus POSIX join requires pre-sorted inputs.

BTW, there's never a bad time to mention http://johnkerl.org/miller/doc/ ...


Yeah, my colleague went down the rabbit hole and swore never again. Good luck with carriage-return bugs!


Did you create indices on the columns you wanted to join?


I was trying to join two 300M-line CSV files using SQLite and Postgres, and it took days to merge them on a c4.xlarge.

Later, I tried Miller and it took less time: around 4-6 hours.

It also takes some time to tune Postgres to make it faster for this particular task.


Seems like you were doing something naive. It probably would have been much faster if you'd written a simple conversion script to do batch inserts.

Perhaps something pathological like committing (or flushing to disk) after every row?
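
On the Postgres side, the fast path is usually one COPY per file instead of per-row INSERTs and commits; a minimal sketch (the path follows the example above, and COPY from a server-side file needs the right privileges):

  COPY weight FROM '/tmp/weight.csv' WITH (FORMAT csv, HEADER true);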


4-6 hours still seems like a long time for this. Did you have sorted indices on the join columns? How high was the cardinality (distinct count) of the join columns? Traditional RDBMSs are pretty fast at joining when you have the right indices in place, especially when the cardinality of the join columns isn't very high.

And how much of that time was just getting the data into the tables? There are fast and slow ways to do that, too...


IIRC, it took me 4-7 hours to load 600GB of uncompressed data into PostgreSQL after disabling WAL and tuning some config variables. I didn't have time to test all possible speed improvements.

After the import, it took an additional 3-6 hours to create indexes for those tables.

I think the main problem was that I had text indices. I'm not an expert in RDBMSs and used a simple join that took ages to start producing the actual data. There are definitely a lot of ways to tune such queries and PostgreSQL configs, but I wanted a simple and universal solution.


On average each line was 1GB of data!? Can you give a rough description of the content of the two files?


Not GP and nothing to do with them, but I have an example of a CSV format which yields absurdly long lines.

We have a CSV for timeseries data, where for some reason someone decided to force each line to represent one data point. However, in some cases one data point may contain around 30 different statistics for 4 million different IPs, which get represented as 120 million columns in the CSV (think a CSV header like 'timestamp,10.0.0.1-Throughput,10.0.0.1-DroppedBytes,[...],10.215.188.251-Throughput,[...]'). With large numbers represented as text in a CSV, this can sometimes reach more than 1GB per line.


600G / 300M = 2K


Woof, yeah, bad arithmetic by me. Thanks!


When I've had to do something similar, the easiest and fastest way has always been: install Spark -> create an external Spark SQL table from the CSVs -> use Spark SQL to join and aggregate/write. If you haven't used Spark before it sounds complicated, but it's actually really quick and easy.

In general the operation can get up to 200 MB/s on my laptop; it's basically I/O-limited by the machine's SSD.
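
Roughly, the Spark SQL part of that workflow looks like the sketch below (the file paths and the header option are assumptions):

  CREATE TABLE person USING csv OPTIONS (path '/tmp/person.csv', header 'true');
  CREATE TABLE weight USING csv OPTIONS (path '/tmp/weight.csv', header 'true');
  SELECT * FROM person JOIN weight ON person.id = weight.id;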


Would be nice if you could write a small article about that.


I've used PrestoSQL for this as well. It's similar to Spark and Hive.


This is a perfect candidate for XSV. You might want to try it. I'd suggest doing one run after creating an index with XSV and a second one without.

I feel you will be impressed with the performance.


This would take seconds to minutes with BigQuery or Athena.
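
A rough Athena sketch (the S3 bucket/prefix and the header-skip property are made up for illustration; weight would get a similar table):

  CREATE EXTERNAL TABLE person (id int, sex string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 's3://some-bucket/person/'
  TBLPROPERTIES ('skip.header.line.count'='1');

  SELECT * FROM person JOIN weight ON person.id = weight.id;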


It would still take a long time to actually send all that data to BigQuery.


Since the OP noted that they were trying to accomplish this task on a c4.xlarge machine, the data is already "in the cloud" and so transfer should be blazingly fast.

Using the zipped CSV at [1] as a size estimate, I'd ballpark 300 million CSV rows as somewhere between 6.7-10.6 GB:

    (300000000/500000) * 11.10 MB ≈ 6.7 GB
    (300000000/500000) * 17.68 MB ≈ 10.6 GB
That could take about one minute to transfer. (Obviously if the rows contain a ton of data, these size estimates could be off, in an unbounded way.)

[1] http://eforexcel.com/wp/downloads-18-sample-csv-files-data-s...


What is Miller?


In PostgreSQL we could do that in parallel:

https://www.postgresql.org/message-id/CA+hUKGKzn3mjCAp=TDrji...


You might want to save the query result to a file too:

    .once out.csv
    select * from person, weight where person.ID = weight.ID;
    .mode column


This imports every column as "text", though, which messes up many queries.


Cool - you could create a bash alias called xsv for that.


You should probably create an index on the tables first and save the result into a temporary table.
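
In the sqlite session from the top of the thread, that would look roughly like this (the index and temp table names are made up):

    CREATE INDEX weight_id_idx ON weight(ID);
    CREATE TEMP TABLE joined AS
      SELECT * FROM person, weight WHERE person.ID = weight.ID;
    .once out.csv
    SELECT * FROM joined;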



