
  sqlite> .mode csv
  sqlite> .header on
  sqlite> .import weight.csv weight
  sqlite> .import person.csv person
  sqlite> select * from person, weight where person.ID = weight.ID;
  ID,sex,ID,weight
  123,M,123,200
  789,F,789,155
  sqlite>



PostgreSQL can do it on live data without importing anything. Set up the "file_fdw" extension:

  CREATE EXTENSION file_fdw;
  CREATE SERVER fdw FOREIGN DATA WRAPPER file_fdw;
Declare tables:

  CREATE FOREIGN TABLE person (id int, sex text)
    SERVER fdw
    OPTIONS (filename '/tmp/person.csv', format 'csv');
  CREATE FOREIGN TABLE weight (id int, weight int)
    SERVER fdw
    OPTIONS (filename '/tmp/weight.csv', format 'csv');
Now:

  SELECT * FROM person JOIN weight ON weight.id = person.id;


In R this is rather straightforward, and you can also achieve it without additional libraries:

  weight <- read.csv("weight.csv")
  person <- read.csv("person.csv")
  merge(weight, person, all=TRUE)
Of course, nowadays you would use data.table, but the merging logic would be exactly the same.


That loads all the data into memory, though. Fine if you're joining all of it, but with the Postgres example you can throw the whole SQL language at the data, and it will be streamed and pipelined efficiently just like with normal Postgres tables.
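
For example (just a sketch), you can run an aggregate directly over the live CSVs using the foreign tables declared above:

  SELECT person.sex, avg(weight.weight)
  FROM person JOIN weight ON weight.id = person.id
  GROUP BY person.sex;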


I never got deep into R, as we had Python or R as options in school for data projects, so forgive me, but where is the database join in this scenario?


Original article doesn't say anything about "database join". It's about joining two datasets by some common ID.

R fits the bill in this case and even allows for some relational algebra (e.g. an INNER JOIN would be merge(X, Y, all=FALSE), a LEFT OUTER JOIN merge(X, Y, all.x=TRUE), etc.)


merge() joins the two datasets on identically named columns. all=TRUE makes it a full outer join.


This isn't "without importing anything". This is "importing all the data again every time you run a query".


No, "import" means "create state from data". It's not creating any persistent state at all.

But it's an important nuance for another reason: you can operate on the original files in real time. You can update the underlying files and your queries will reflect the changes.

The downside is that since the data is streamed at query time, it's less efficient, and indexes can't be created on the foreign tables.
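
If you do end up needing indexes, one workaround (a rough sketch; the local table name and auto-generated index are made up for illustration) is to materialize the foreign table into a regular table once and index that:

  CREATE TABLE person_local AS SELECT * FROM person;
  CREATE INDEX ON person_local (id);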


That's reasonable.


I can't speak for XSV, but when I had to do similar analysis of logfiles in the range of hundreds of thousands of records, SQLite would consistently seize up on my machine and make the analysis impossible. Had I known about XSV I might have tried it, but GNU join saved my bacon that week and would do my analysis in minutes.


That's why you can `EXPLAIN` your query, and maybe even add indices, to speed things up!

Plus, in the last few years SQLite has gotten a lot more powerful: it has added lots of JSON support and improved index usage. I'd avoid working with lots of data as flat ASCII text, because of all the I/O and wasted cycles spent reading the data, inserting it into some data structure, and writing it back out as flat ASCII text, but it can be super debuggable.
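
For the person/weight tables from the sqlite session at the top of the thread, a minimal sketch (the index name is made up) would be:

  CREATE INDEX weight_id_idx ON weight(ID);
  EXPLAIN QUERY PLAN
  SELECT * FROM person, weight WHERE person.ID = weight.ID;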


> GNU Join

The join utility is actually a part of POSIX[1], so every UNIX should have one. Here's one from OpenBSD[2], for example. The GNU version probably has more flags though.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/j...

[2] https://man.openbsd.org/join.1


Plus POSIX join requires pre-sorted inputs.

BTW, there's never a bad time to mention http://johnkerl.org/miller/doc/ ...


Yeah, my colleague went down the rabbit hole and swore never again. Good luck with carriage-return bugs!


Did you create indices on the columns you wanted to join?


I was trying to join two 300M-line CSV files using SQLite and Postgres, and it took days to merge them on a c4.xlarge.

Later, I tried Miller and it took less time: around 4-6 hours.

It also takes some time to tune Postgres to make it faster for this particular task.


Seems like you were doing something naive. It probably would have been much faster if you'd written a simple conversion script to do batch inserts.

Perhaps something pathological like committing (or flushing to disk) after every row?
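
On the Postgres side, the fast path is usually one COPY per file instead of per-row INSERTs and commits; a minimal sketch (the path follows the example above, and COPY from a server-side file needs the right privileges):

  COPY weight FROM '/tmp/weight.csv' WITH (FORMAT csv, HEADER true);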


4-6 hours still seems like a long time for this. Did you have sorted indices on the join columns? How high was the cardinality (distinct count) of the join columns? Traditional RDBMSs are pretty fast at joining when you have the right indices in place, especially when the cardinality of the join columns isn't very high.

And how much of that time was just getting the data into the tables? There are fast and slow ways to do that, too...


IIRC, it took me 4-7 hours to load 600GB of uncompressed data into PostgreSQL after disabling WAL and tuning some config variables. I didn't have time to test all possible speed improvements.

After the import, it took an additional 3-6 hours to create indexes for those tables.

I think the main problem was that I had text indices. I'm not an expert in RDBMSs and used a simple join that took ages to start producing the actual data. There are definitely a lot of ways to tune such queries and PostgreSQL configs, but I wanted a simple and universal solution.


On average each line was 1GB of data!? Can you give a rough description of the content of the two files?


Not GP and nothing to do with them, but I have an example of a CSV format which yields absurdly long lines.

We have a CSV for timeseries data, where for some reason someone decided to force each line to represent one data point. However, in some cases one data point may contain around 30 different statistics for 4 million different IPs, which get represented as 120 million columns in the CSV (think a CSV header like 'timestamp,10.0.0.1-Throughput,10.0.0.1-DroppedBytes,[...],10.215.188.251-Throughput,[...]'). With large numbers represented as text in a CSV, this can sometimes reach more than 1GB per line.


600G / 300M = 2K


Woof, yeah, bad arithmetic by me. Thanks!


When I've had to do something similar, the easiest and fastest way has always been: install Spark -> create an external Spark SQL table from the CSVs -> use Spark SQL to join and aggregate/write. If you haven't used Spark before it sounds complicated, but it's actually really quick and easy.

In general the operation can get up to 200 MB/s on my laptop; it's basically I/O-limited by the machine's SSD.
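
Roughly, the Spark SQL part of that workflow looks like the sketch below (the file paths and the header option are assumptions):

  CREATE TABLE person USING csv OPTIONS (path '/tmp/person.csv', header 'true');
  CREATE TABLE weight USING csv OPTIONS (path '/tmp/weight.csv', header 'true');
  SELECT * FROM person JOIN weight ON person.id = weight.id;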


Would be nice if you could write a small article about that.


I've used PrestoSQL for this as well. It's similar to Spark and Hive.


This is a perfect candidate for XSV. You might want to try it. I'd suggest doing one run after creating an index with XSV and a second one without.

I feel you will be impressed with the performance.


This would take seconds to minutes with BigQuery or Athena.
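
A rough Athena sketch (the S3 bucket/prefix and the header-skip property are made up for illustration; weight would get a similar table):

  CREATE EXTERNAL TABLE person (id int, sex string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 's3://some-bucket/person/'
  TBLPROPERTIES ('skip.header.line.count'='1');

  SELECT * FROM person JOIN weight ON person.id = weight.id;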


It would still take a long time to actually send all that data to BigQuery.


Since the OP noted that they were trying to accomplish this task on a c4.xlarge machine, the data is already "in the cloud" and so transfer should be blazingly fast.

Using the zipped CSV at [1] as a size estimate, I'd ballpark 300 million CSV rows as somewhere between 6.7-10.6 GB:

    (300000000/500000) * 11.10 MB ≈ 6.7 GB
    (300000000/500000) * 17.68 MB ≈ 10.6 GB
That could take about one minute to transfer. (Obviously if the rows contain a ton of data, these size estimates could be off, in an unbounded way.)

[1] http://eforexcel.com/wp/downloads-18-sample-csv-files-data-s...


What is Miller?


In PostgreSQL we could do that in parallel:

https://www.postgresql.org/message-id/CA+hUKGKzn3mjCAp=TDrji...


You might want to save the query result to a file too:

    .once out.csv
    select * from person, weight where person.ID = weight.ID;
    .mode column


This imports every column as "text", though, which messes up many queries.


Cool - you could create a bash alias called xsv for that.


You should probably create an index on the tables first and save the result into a temporary table.
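
In the sqlite session from the top of the thread, that would look roughly like this (the index and temp table names are made up):

    CREATE INDEX weight_id_idx ON weight(ID);
    CREATE TEMP TABLE joined AS
      SELECT * FROM person, weight WHERE person.ID = weight.ID;
    .once out.csv
    SELECT * FROM joined;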



