I once joked with a colleague that my sqlite3 install is faster than his Hadoop cluster for running a report across a multi-gig file.

We benchmarked it; it was much, much faster.

Technically, though, once that multi-gig file becomes many hundreds of gigs, my computer would lose by a huge margin.




"Command-line Tools can be 235x Faster than your Hadoop Cluster"

https://web.archive.org/web/20200414235857/https://adamdrake...


I recently did some data processing on a single (albeit beefy) node that someone had been using a cluster for. In a day I composed and ran an ETL job that had taken them weeks in their infrastructure (they were actually still in the process of fixing it).


At that point you can just get a bigger computer though.


No you cannot; you cannot infinitely scale SQLite, and you can't load 100 GB of data into a single SQLite file in any meaningful amount of time. Then try creating an index on it and cry.

I have tried this. I literally wanted to create a simple web app powered by the cheapest solution possible, but it had to serve from a database that cannot be smaller than 150 GB. SQLite failed. Even Postgres by itself was very hard! In the end I now launch Redshift for a couple of days, process all the data, then pipe it to Postgres running on a Lightsail VPS via dblink. Haven't found a better solution.


My rule of thumb is that a single processor core can handle about 100 MB/s, if using the right software (and using the software right). For simple tasks, this can be 200+ MB/s; if there is a lot of random access (both against memory and against storage), one can assume about 10k-100k IOPS per core.

For a 32-core processor, that means it can process a 100 GB data set on the order of 30 seconds. For some types of tasks it can be slower, and if the processing is either light or something that lets you leverage specialized hardware (such as a GPU), it can be much faster. But if you start to take hours to process a dataset of this size (and you are not doing some kind of heavy math), you may want to look at your software stack before starting to scale out. Not only to save on hardware resources, but also because it may require less of your time to optimize a single node than to manage a cluster.
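A quick back-of-envelope version of that estimate (the 100 MB/s-per-core figure is the assumption from the rule of thumb above):

    # Back-of-envelope estimate based on the rule of thumb above.
    PER_CORE_MB_S = 100        # assumed sustained throughput per core
    CORES = 32                 # cores in the hypothetical machine
    DATASET_GB = 100           # dataset size

    aggregate_mb_s = PER_CORE_MB_S * CORES           # ~3200 MB/s across the machine
    seconds = DATASET_GB * 1000 / aggregate_mb_s     # ~31 seconds end to end
    print(f"~{aggregate_mb_s} MB/s aggregate, ~{seconds:.0f} s for {DATASET_GB} GB")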


> "using the right software (and using the software right)"

This is a great phrase that I'm going to use more.


This is a great rule of thumb, and it helps build the kind of intuition around performance that I always try to get my engineers to contextualize. The "lazy and good" way (which has worked, I'd say, at least 9 times out of 10 in my career when I run into these problems) is to find a way to reduce data cardinality ahead of intense computation. It's 100% for the reason you describe in your last sentence: it doesn't just save on hardware resources, it also keeps time/space-complexity bottlenecks from becoming your pain point.


> No you cannot; you cannot infinitely scale SQLite, and you can't load 100 GB of data into a single SQLite file in any meaningful amount of time. Then try creating an index on it and cry.

Yes, you can. Without indexes to slow you down (you can create them afterwards), it isn't even much different than any other DB, if not faster.
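For what it's worth, that looks roughly like this (a minimal sketch using Python's sqlite3; the table, file name, and pragma choices are just assumptions for illustration):

    # Bulk-load into SQLite first, create indexes afterwards.
    # Table layout, file names, and pragma values are hypothetical.
    import csv
    import sqlite3

    conn = sqlite3.connect("big.db")
    conn.execute("PRAGMA journal_mode = OFF")   # skip the rollback journal during the load
    conn.execute("PRAGMA synchronous = OFF")    # don't fsync on every commit
    conn.execute("CREATE TABLE papers (id INTEGER, title TEXT, year INTEGER)")

    with open("papers.csv", newline="") as f:
        # One big transaction with executemany instead of a commit per INSERT.
        with conn:
            conn.executemany("INSERT INTO papers VALUES (?, ?, ?)", csv.reader(f))

    # Build the index only after the data is in place.
    conn.execute("CREATE INDEX idx_papers_year ON papers(year)")
    conn.close()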

>Even Postgres by itself was very hard!

Probably depends on your setup. I've worked with multi-TB single Postgres databases (heck, we had 100 GB in a single table without partitions). Then again, the machine had TB-scale RAM.


> but it had to serve from a database that cannot be smaller than 150GB. SQLite failed. Even Postgres by itself was very hard!

The PostgreSQL database for a CMS project I work on weighs about 250 GB (all assets are stored as binaries in the database), and we have no problem at all serving a boatload of requests (with the replicated database and the serving CMS running on each live server, with 8 GB of RAM).

To me, it smells like you were missing some indices or running on a Raspberry Pi?


It sounds like the OP is trying to provision and load 150 GB in a reasonably fast manner. Once loaded, presumably any of the usual suspects will be fast enough. It's the up-front loading cost that's the problem.

Anyway, I'm curious what kind of data the OP is trying to process.


I am trying to load and serve the Microsoft Academic Graph to produce author profile pages for all academic authors! Microsoft and Google already do this, but IMO they leave a lot to be desired.

But this means there are a hundred million entities, publishing roughly 3x that number of papers, plus a bunch of associated metadata. On Redshift I can get all of this loaded in minutes and it takes like 100 GB, but Postgres loads are pathetic by comparison.

And I have no intention of spending more than 30 bucks a month! So hard problem for sure! Suggestions welcome!


There are settings in Postgres that allow for bulk loading.

By default you get a commit after each INSERT, which slows things down by a lot.
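For example (a sketch, not the OP's actual setup: psycopg2, the connection string, table, and file are assumptions, and COPY is the usual fast path instead of row-by-row INSERTs):

    # Bulk-load into Postgres with COPY and a single commit,
    # instead of autocommitting after every INSERT.
    import psycopg2

    conn = psycopg2.connect("dbname=mag user=app")
    with conn, conn.cursor() as cur:
        with open("papers.tsv") as f:
            # COPY streams the whole file; one commit happens when the block exits.
            cur.copy_from(f, "papers", sep="\t")
    conn.close()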


How many rows are we talking about? In the end, once I started using dblink to load via Redshift after some preprocessing, the loads were reasonable, and the indexing too. But I'm looking at full data refreshes every two weeks and a tight budget (30 bucks a month), so I'm constrained on solutions. Suggestions welcome!


Try DuckDB! I've been getting 20x SQLite performance on one thread, and it usually scales linearly with threads!
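If anyone wants to try it, a minimal sketch with the DuckDB Python API looks like this (the file names and query are assumptions, not from the thread):

    # DuckDB can ingest a CSV directly and parallelizes scans across threads.
    import duckdb

    con = duckdb.connect("mag.duckdb")
    con.execute("CREATE TABLE papers AS SELECT * FROM read_csv_auto('papers.csv')")
    print(con.execute("SELECT year, count(*) FROM papers GROUP BY year").fetchall())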


Maybe I'm misunderstanding, but this seems very strange. Are you suggesting that Postgres can't handle a 150GB database with acceptable performance?


I’m trying to run a Postgres instance on a basic VPS with a single vCPU and 8 GB of RAM! And I’ll need to erase and reload all 150 GB every two weeks...


Had a similar problem recently. Ended up creating a custom system using a file-based index (append to files named by the first 5 chars of the SHA-1 of the key). Took 10 hours to parse my terabyte. Uploaded it to Azure Blob Storage; now I can query my 10B rows in 50 ms for ~$10^-7 per query. It's hard to evolve, but 10x faster and cheaper than other solutions.
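If I'm reading the scheme right, the bucketing works roughly like this (a sketch; the directory layout and record format are assumptions):

    # Append each record to a bucket file named after the first 5 hex chars of
    # the SHA-1 of its key; a lookup only has to scan that one small bucket.
    import hashlib
    import os

    BUCKET_DIR = "buckets"

    def bucket_path(key: str) -> str:
        prefix = hashlib.sha1(key.encode()).hexdigest()[:5]
        return os.path.join(BUCKET_DIR, prefix)

    def append_record(key: str, value: str) -> None:
        os.makedirs(BUCKET_DIR, exist_ok=True)
        with open(bucket_path(key), "a") as f:
            f.write(f"{key}\t{value}\n")

    def lookup(key: str) -> list[str]:
        try:
            with open(bucket_path(key)) as f:
                return [line.rstrip("\n").split("\t", 1)[1]
                        for line in f if line.split("\t", 1)[0] == key]
        except FileNotFoundError:
            return []

With 16^5 ≈ 1M possible prefixes, 10B rows works out to roughly 10k rows per bucket, so fetching and scanning a single small blob per lookup stays cheap.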


My original plan was to do a similar S3 idea, but I forgot about its charge per 1,000 GETs and PUTs and had a 700-dollar bill I had to bargain with them to waive! Does Azure's model not have that expense?


I recall Azure was much cheaper and 30ms faster on average.


Curious if you tried this on an EC2 instance in AWS? The IOPS for EBS volumes are notoriously low, which is possibly why a lot of self-hosted DB instances feel very slow vs. similarly priced AWS services. Personal anecdote, but moving from EC2 to a dedicated server increased the max throughput by a factor of 80 for us.


Did you try to compare that to EC2 instances with ephemeral NVMe drives? I'm seeing HDFS throughput of up to several GB/s per node using such instances.


You can use instances with locally attached SSDs. Then you're responsible for their reliability, so you're not getting all the 'cloud' benefits. I used them for provisioning our own CI cluster with RAID-0 btrfs running PostgreSQL, and only backed up the provisioning and CI scripts.


Got burned there for sure! Speed is one thing, but the cost is outrageous for I/O-heavy apps! Anyway, I moved to Lightsail, which paradoxically doesn't have I/O costs, so while I/O is slow, at least the cost is predictable!


A couple of 64 GB RAM sticks and you can fit your data in RAM.


It's all about the right tool for the job('s scale.)


You can skip Hadoop and go from SQLite to something like S3 + Presto, which scales to extremely high volumes with low latency and better-than-linear financial scaling.


Do hundreds of gigs introduce a general performance hit, or could it still be optimized further with some smart indexing strategy?


Everything is fast if it fits in memory and caches.


Unless it's accidentally quadratic. Then all the RAM in the world isn't going to help you.


And in 2021, almost everything* does.

*Obviously not really. But very very many things do, even doing useful jobs in production, as long as you have high enough specs.



