I once joked with a colleague that my sqlite3 install is faster than his Hadoop cluster for running a report across a multi-gig file.

We benchmarked it; it was much, much faster.

Technically, though, once that multi-gig file becomes many hundreds of gigs, my computer would lose by a huge margin.




"Command-line Tools can be 235x Faster than your Hadoop Cluster"

https://web.archive.org/web/20200414235857/https://adamdrake...


I recently did some data processing on a single (albeit beefy) node that someone had been using a cluster for. In a day I composed and ran an ETL job that had taken them weeks in their infrastructure (they were actually still in the process of fixing it).


At that point you can just get a bigger computer though.


No you cannot; you cannot infinitely scale SQLite, and you can't load 100 GB of data into a single SQLite file in any meaningful amount of time. Then try creating an index on it and cry.

I have tried this. I literally wanted to create a simple web app powered by the cheapest solution possible, but it had to serve from a database that cannot be smaller than 150 GB. SQLite failed. Even Postgres by itself was very hard! In the end I now launch Redshift for a couple of days, process all the data, then pipe it to Postgres running on a Lightsail VPS via dblink. Haven't found a better solution.


My rule of thumb is that a single processor core can handle about 100 MB/s, if using the right software (and using the software right). For simple tasks, this can be 200+ MB/s; if there is a lot of random access (both against memory and against storage), one can assume about 10k-100k IOPS per core.

For a 32-core processor, that means it can process a 100 GB data set on the order of 30 seconds. For some types of tasks it can be slower, and if the processing is either light or something that lets you leverage specialized hardware (such as a GPU), it can be much faster. But if you start to take hours to process a dataset of this size (and you are not doing some kind of heavy math), you may want to look at your software stack before starting to scale out. Not only to save on hardware resources, but also because it may require less of your time to optimize a single node than to manage a cluster.
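A quick back-of-envelope version of that estimate (the 100 MB/s-per-core figure is the assumption from the rule of thumb above):

    # Back-of-envelope estimate based on the rule of thumb above.
    PER_CORE_MB_S = 100        # assumed sustained throughput per core
    CORES = 32                 # cores in the hypothetical machine
    DATASET_GB = 100           # dataset size

    aggregate_mb_s = PER_CORE_MB_S * CORES           # ~3200 MB/s across the machine
    seconds = DATASET_GB * 1000 / aggregate_mb_s     # ~31 seconds end to end
    print(f"~{aggregate_mb_s} MB/s aggregate, ~{seconds:.0f} s for {DATASET_GB} GB")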


> "using the right software (and using the software right)"

This is a great phrase that I'm going to use more.


This is a great rule of thumb, and it helps build the kind of intuition around performance that I always try to get my engineers to contextualize. The "lazy and good" way (which has worked, I'd say, at least 9 times out of 10 in my career when I run into these problems) is to find a way to reduce data cardinality ahead of intense computation. It's 100% for the reason you describe in your last sentence: it doesn't just save on hardware resources, it also keeps time/space-complexity bottlenecks from becoming your pain point.


> No you cannot; you cannot infinitely scale SQLite, and you can't load 100 GB of data into a single SQLite file in any meaningful amount of time. Then try creating an index on it and cry.

Yes, you can. Without indexes to slow you down (you can create them afterwards), it isn't even much different than any other DB, if not faster.
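For what it's worth, that looks roughly like this (a minimal sketch using Python's sqlite3; the table, file name, and pragma choices are just assumptions for illustration):

    # Bulk-load into SQLite first, create indexes afterwards.
    # Table layout, file names, and pragma values are hypothetical.
    import csv
    import sqlite3

    conn = sqlite3.connect("big.db")
    conn.execute("PRAGMA journal_mode = OFF")   # skip the rollback journal during the load
    conn.execute("PRAGMA synchronous = OFF")    # don't fsync on every commit
    conn.execute("CREATE TABLE papers (id INTEGER, title TEXT, year INTEGER)")

    with open("papers.csv", newline="") as f:
        # One big transaction with executemany instead of a commit per INSERT.
        with conn:
            conn.executemany("INSERT INTO papers VALUES (?, ?, ?)", csv.reader(f))

    # Build the index only after the data is in place.
    conn.execute("CREATE INDEX idx_papers_year ON papers(year)")
    conn.close()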

>Even Postgres by itself was very hard!

Probably depends on your setup. I've worked with multi-TB single Postgres databases (heck, we had 100 GB in a single table without partitions). Then again, the machine had TB-scale RAM.


> but it had to serve from a database that cannot be smaller than 150GB. SQLite failed. Even Postgres by itself was very hard!

The PostgreSQL database for a CMS project I work on weighs about 250 GB (all assets are stored as binaries in the database), and we have no problem at all serving a boatload of requests (with the replicated database and the serving CMS running on each live server, with 8 GB of RAM).

To me, it smells like you were missing some indices or running on a Raspberry Pi?


It sounds like the OP is trying to provision and load 150 GB in a reasonably fast manner. Once loaded, presumably any of the usual suspects will be fast enough. It's the up-front loading cost that's the problem.

Anyway, I'm curious what kind of data the OP is trying to process.


I am trying to load and serve the Microsoft Academic Graph to produce author profile pages for all academic authors! Microsoft and Google already do this, but IMO they leave a lot to be desired.

But this means there are a hundred million entities, publishing roughly 3x that number of papers, plus a bunch of associated metadata. On Redshift I can get all of this loaded in minutes and it takes like 100 GB, but Postgres loads are pathetic by comparison.

And I have no intention of spending more than 30 bucks a month! So hard problem for sure! Suggestions welcome!


There are settings in Postgres that allow for bulk loading.

By default you get a commit after each INSERT, which slows things down by a lot.
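For example (a sketch, not the OP's actual setup: psycopg2, the connection string, table, and file are assumptions, and COPY is the usual fast path instead of row-by-row INSERTs):

    # Bulk-load into Postgres with COPY and a single commit,
    # instead of autocommitting after every INSERT.
    import psycopg2

    conn = psycopg2.connect("dbname=mag user=app")
    with conn, conn.cursor() as cur:
        with open("papers.tsv") as f:
            # COPY streams the whole file; one commit happens when the block exits.
            cur.copy_from(f, "papers", sep="\t")
    conn.close()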


How many rows are we talking about? In the end, once I started using dblink to load via Redshift after some preprocessing, the loads were reasonable, and the indexing too. But I'm looking at full data refreshes every two weeks and a tight budget (30 bucks a month), so I'm constrained on solutions. Suggestions welcome!


Try DuckDB! I've been getting 20x SQLite performance on one thread, and it usually scales linearly with threads!
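If anyone wants to try it, a minimal sketch with the DuckDB Python API looks like this (the file names and query are assumptions, not from the thread):

    # DuckDB can ingest a CSV directly and parallelizes scans across threads.
    import duckdb

    con = duckdb.connect("mag.duckdb")
    con.execute("CREATE TABLE papers AS SELECT * FROM read_csv_auto('papers.csv')")
    print(con.execute("SELECT year, count(*) FROM papers GROUP BY year").fetchall())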


Maybe I'm misunderstanding, but this seems very strange. Are you suggesting that Postgres can't handle a 150GB database with acceptable performance?


I’m trying to run a Postgres instance on a basic VPS with a single vCPU and 8 GB of RAM! And I’ll need to erase and reload all 150 GB every two weeks...


Had a similar problem recently. Ended up creating a custom system using a file-based index (append to files named by the first 5 chars of the SHA-1 of the key). Took 10 hours to parse my terabyte. Uploaded it to Azure Blob Storage; now I can query my 10B rows in 50 ms for ~$10^-7 per query. It's hard to evolve, but 10x faster and cheaper than other solutions.
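If I'm reading the scheme right, the bucketing works roughly like this (a sketch; the directory layout and record format are assumptions):

    # Append each record to a bucket file named after the first 5 hex chars of
    # the SHA-1 of its key; a lookup only has to scan that one small bucket.
    import hashlib
    import os

    BUCKET_DIR = "buckets"

    def bucket_path(key: str) -> str:
        prefix = hashlib.sha1(key.encode()).hexdigest()[:5]
        return os.path.join(BUCKET_DIR, prefix)

    def append_record(key: str, value: str) -> None:
        os.makedirs(BUCKET_DIR, exist_ok=True)
        with open(bucket_path(key), "a") as f:
            f.write(f"{key}\t{value}\n")

    def lookup(key: str) -> list[str]:
        try:
            with open(bucket_path(key)) as f:
                return [line.rstrip("\n").split("\t", 1)[1]
                        for line in f if line.split("\t", 1)[0] == key]
        except FileNotFoundError:
            return []

With 16^5 ≈ 1M possible prefixes, 10B rows works out to roughly 10k rows per bucket, so fetching and scanning a single small blob per lookup stays cheap.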


My original plan was to do a similar S3 idea, but I forgot about its charge per 1,000 GETs and PUTs and had a 700-dollar bill I had to bargain with them to waive! Does Azure's model not have that expense?


I recall Azure was much cheaper and 30ms faster on average.


Curious if you tried this on an EC2 instance in AWS? The IOPS for EBS volumes are notoriously low, which is possibly why a lot of self-hosted DB instances feel very slow vs. similarly priced AWS services. Personal anecdote, but moving from EC2 to a dedicated server increased the max throughput by a factor of 80 for us.


Did you try to compare that to EC2 instances with ephemeral NVMe drives? I'm seeing HDFS throughput of up to several GB/s per node using such instances.


You can use instances with locally attached SSDs. Then you're responsible for their reliability, so you're not getting all the 'cloud' benefits. I used them for provisioning our own CI cluster with RAID-0 btrfs running PostgreSQL, and only backed up the provisioning and CI scripts.


Got burned there for sure! Speed is one thing, but the cost is outrageous for I/O-heavy apps! Anyway, I moved to Lightsail, which paradoxically doesn't have I/O costs, so while I/O is slow, at least the cost is predictable!


A couple of 64 GB RAM sticks and you can fit your data in RAM.


It's all about the right tool for the job('s scale.)


You can skip Hadoop and go from SQLite to something like S3 + Presto, which scales to extremely high volumes with low latency and better-than-linear financial scaling.


Do hundreds of gigs introduce a general performance hit, or could it still be optimized further with some smart indexing strategy?


Everything is fast if it fits in memory and caches.


Unless it's accidentally quadratic. Then all the RAM in the world isn't going to help you.


And in 2021, almost everything* does.

*Obviously not really. But very very many things do, even doing useful jobs in production, as long as you have high enough specs.



