PostgreSQL can do it on live data without importing anything. Set up the "file_fdw" extension:
CREATE EXTENSION file_fdw;
CREATE SERVER fdw FOREIGN DATA WRAPPER file_fdw;
Declare tables:
CREATE FOREIGN TABLE person (id int, sex text)
SERVER fdw
OPTIONS (filename '/tmp/person.csv', format 'csv');
CREATE FOREIGN TABLE weight (id int, weight int)
SERVER fdw
OPTIONS (filename '/tmp/weight.csv', format 'csv');
Now:
SELECT * FROM person JOIN weight ON weight.id = person.id;
That loads all the data into memory, though. Fine if you're joining all of it, but with the Postgres example you can throw the whole SQL language at the data, and it will be streamed and pipelined efficiently just like with normal Postgres tables.
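For example, reusing the person/weight foreign tables declared above, an aggregate runs straight off the CSV files with nothing materialized:
-- Average weight per sex, computed directly from the CSV-backed tables;
-- Postgres streams both files through the join and the aggregate.
SELECT person.sex, avg(weight.weight) AS avg_weight, count(*) AS n
FROM person
JOIN weight ON weight.id = person.id
GROUP BY person.sex
ORDER BY avg_weight DESC;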
Original article doesn't say anything about "database join". It's about joining two datasets by some common ID.
R fits the bill in this case and even allows for some relational algebra (e.g. an INNER JOIN would be merge(X, Y, all=FALSE), a LEFT OUTER JOIN merge(X, Y, all.x=TRUE), etc.).
No, "import" means "create state from data". It's not creating any persistent state at all.
But it's an important nuance for another reason: you can operate on the original files in real time. You can update the underlying files and your queries will reflect the changes.
The downside is that, since the data is streamed at query time, it's less efficient and doesn't allow indexes to be created.
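If that becomes a problem, one workaround (just a sketch, reusing the foreign tables declared above) is to materialize the CSV-backed table into a regular table once and index that:
-- Copy the foreign table into a normal, indexable table.
CREATE TABLE weight_local AS SELECT * FROM weight;
CREATE INDEX weight_local_id_idx ON weight_local (id);
-- Refresh planner statistics so subsequent joins can use the index well.
ANALYZE weight_local;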
I can't speak for XSV, but when I had to do similar analysis of log files in the range of hundreds of thousands of records, SQLite would consistently seize up on my machine and make it impossible to do the analysis. Had I known about XSV I might have tried it, but GNU join saved my bacon that week and would do my analysis in minutes.
That's why you can `EXPLAIN` your query, and maybe even add indices, to speed things up!
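In Postgres that's just the following (EXPLAIN ANALYZE also executes the query and reports actual timings; SQLite's rough equivalent is EXPLAIN QUERY PLAN):
-- Show how the planner intends to run the join, plus real row counts and timings.
EXPLAIN ANALYZE
SELECT * FROM person JOIN weight ON weight.id = person.id;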
Plus, in the last few years SQLite has gotten a lot more powerful: it has added extensive JSON support and improved its index usage. I'd avoid working with lots of data as flat ASCII text, because of all the I/O and wasted cycles spent reading the data, inserting it into some data structure, and writing it back out as flat ASCII text, though it can be super debuggable.
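A rough sketch of that in the sqlite3 shell, reusing the CSV paths from the Postgres example and assuming the files have a header row (otherwise .import treats the first data row as column names); .mode and .import are shell dot-commands rather than SQL:
.mode csv
.import /tmp/person.csv person
.import /tmp/weight.csv weight
-- These are real tables now, so indexes work:
CREATE INDEX weight_id_idx ON weight(id);
SELECT p.sex, avg(w.weight) FROM person p JOIN weight w ON w.id = p.id GROUP BY p.sex;
-- And the JSON1 functions are available if a column holds JSON text, e.g.
-- SELECT json_extract(doc, '$.bytes') FROM some_table;  (hypothetical table/column)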
The join utility is actually a part of POSIX[1], so every UNIX should have one. Here's one from OpenBSD[2], for example. The GNU version probably has more flags though.
4-6 hours still seems like a long time for this. Did you have sorted indices on the join columns? How high was the cardinality (distinct count) of the join columns? Traditional RDBMSes are pretty fast at joining when you have the right indices in place, especially when the cardinality of the join columns isn't very high.
And how much of that time was just getting the data into the tables? There are fast and slow ways to do that, too...
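For reference, the fast path in Postgres is usually COPY (or psql's \copy) straight from the CSV, with indexes built after the load rather than before; a minimal sketch using the earlier weight.csv layout:
-- Bulk-load the CSV; far faster than row-by-row INSERTs.
CREATE TABLE weight (id int, weight int);
COPY weight FROM '/tmp/weight.csv' WITH (FORMAT csv);
-- Build indexes after the load so every inserted row isn't paying index maintenance.
CREATE INDEX weight_id_idx ON weight (id);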
IIRC, it took me 4-7 hours to load 600GB of uncompressed data into PostgreSQL after disabling WAL and tuning some config variables. I didn't have time to test all possible speed improvements.
After the import, it took an additional 3-6 hours to create indexes for those tables.
I think the main problem was that I had text indices. I'm not an RDBMS expert and used a simple join that was taking ages to start producing the actual data. There are definitely a lot of ways to tune such queries and PostgreSQL configs, but I wanted a simple and universal solution.
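For completeness, the kind of tuning I mean is things like this (the values are arbitrary placeholders, not recommendations for any particular hardware):
-- Give index builds and the join's hashes/sorts more memory for this session.
SET maintenance_work_mem = '2GB';  -- speeds up CREATE INDEX
SET work_mem = '512MB';            -- per-operation memory for hash joins and sorts
-- After bulk loading, refresh statistics so the planner picks a sane join strategy.
ANALYZE;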
Not GP and nothing to do with them, but I have an example of a CSV format which yields absurdly long lines.
We have a CSV for timeseries data, where for some reason someone decided to force each line to represent one data point. However, in some cases one data point may contain around 30 different statistics for 4 million different IPs, which get represented as 120 million columns in the CSV (think a CSV header like 'timestamp,10.0.0.1-Throughput,10.0.0.1-DroppedBytes,[...],10.215.188.251-Throughput,[...]'). With large numbers represented as text in a CSV, a single line can sometimes exceed 1GB.
When I've had to do something similar, the easiest and fastest way has always been: install Spark -> create an external Spark SQL table from the CSVs -> use Spark SQL to join and aggregate/write. If you haven't used Spark before it sounds complicated, but it's actually really quick and easy.
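Roughly, in Spark SQL it might look something like this (paths, table and column names are made up for illustration):
-- External tables defined straight over the CSV files; nothing is imported.
CREATE TABLE person USING csv
  OPTIONS (path '/data/person.csv', header 'true', inferSchema 'true');
CREATE TABLE weight USING csv
  OPTIONS (path '/data/weight.csv', header 'true', inferSchema 'true');
-- Then it's plain SQL; Spark parallelizes the scan, the join and the write.
CREATE TABLE joined USING parquet AS
SELECT p.*, w.weight
FROM person p JOIN weight w ON w.id = p.id;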
In general the operation can reach up to 200 MB/s on my laptop; it's basically I/O-limited by your machine's SSD.
Since the OP noted that they were trying to accomplish this task on a c4.xlarge machine, the data is already "in the cloud" and so transfer should be blazingly fast.
Using the zipped CSV at [1] as a size estimate, I'd ballpark 300 million CSV rows at somewhere between 6.7 and 10.6 GB: