Massively faster DB with async batching

steder · on Jan 26, 2012

It's a great idea except: You're now eventually consistent. You've reimplemented a ton of database logic in his application. The database already reorders inserts and checks for conflicts. The database is also already doing its best to be durable (write-ahead logs for example). You can't run more than one server due to the custom batching and reordering logic and in memory caching and a hardware failure will result in loss of data.

You've put all your data in one basket. Anything happening to the application server results in data loss. Of course hardware failures happen and all the batching behavior here seems like it exacerbates the problem. A single server handling all writes and reads also makes it particularly difficult to update that server at any point in time. When is it safe to restart this server to apply an update or fix a bug?

Basically the author has avoided using a NoSQL database by reinvented a bunch of things to provide all of the downsides of a NoSQL database (eventual consistency, no ACID semantics) without any of the upsides (high availability, replication).

Alternative solutions to the "handle many requests" on top of a SQL server problem?

Have more application servers. You don't need to service all user requests from a single machine. You don't need to have user 10 wait on user 1s database write if they're independent. Multiple servers have other benefits like rolling updates...

Use memcache.

Use a (durable) message queue to offload work from a the web servers and do non-essential updates/inserts offline.

And of course: Just use a NoSQL database and design your application appropriately.

willvarfar · on Jan 26, 2012

Fair enough viewpoint. I think its more nuanced than that.

The DB queue may be a buffer but the callbacks are not triggered until success/failure (and when caching like I did for the SELECTs that doesn't happen until then too). This is just the same as the twisted/tornado db connection pool approach to getting the DB access async, but is getting much better performance out of it by the merging client-side logic; its still the same schematics and there's no risk of a request succeeding but the DB not.

willvarfar · on Jan 26, 2012

I have put some sketches into my post to describe when things hit disk and when the client making the request continues. I hope this clarifies all the ACID questions.

steder · on Jan 26, 2012

Ah, you don't respond to the user until the DB interactions for a given user have been processed as a batch. I was under the impression that on a write you were: updating the in memory dataset so that the user could immediately read their data after the write, queuing a write to MySQL, and then responding to the user (200 OK) even though the data had not been committed.

So my concern regarding consistency was that during this period after responding to the user but before your queue committed data that a bug or hardware failure would result in some users thinking that there requests were successful but none of their work would be saved.

As long as the user doesn't receive a response until data has been committed to your DB this makes sense.

Now that I understand the flow a bit better I agree that it makes sense. This is similar to what folks were trying to do with MySQL-Proxy. I still think that you're asking for trouble by coupling your components together in the same process and would really encourage you to take it to the next level by factoring out caching and proxying/batching of db access into their own services.

The more I think about this the more it seems like it would be reasonable to build as a separate twisted process and exposed through a api compatible with twisted.enterprise.adbapi... Figuring out reasonable (and general) ways to determine which statements can be safely batched could be quite challenging though.

willvarfar · on Jan 26, 2012

I happen to be using Python's built-in new multiprocessing module. It has a suitable queue. I am using the same logic to talk to rabbitmq. I would avoid tcp.

steder · on Jan 27, 2012

Where's rabbitmq in your architecture? What's your concern with TCP?

willvarfar · on Jan 27, 2012

I meant I I'd rather use Queue.Queue or multiprocessing.Queue to talk to a worker than TCP. Its just my preference.

Its generally the principle of least-power is that you'd use a thread (nice for the callback handling, and for 'interactions') or a process (if you worry about the GIL, but it complicates callbacks and interactions), and only jump to a TCP server if you really really could justify it.

I also meant that the approach of workers that do coalescing is useful for non-DB things too; we're using a worker to buffer rabbitmq, for example.

jrockway · on Jan 26, 2012

It's worth noting that many "NoSQL" databases provide ACID semantics along with high availability and replication. (Example: BerkeleyDB)

metabrew · on Jan 26, 2012

an equally amusing solution is to just turn off fsync - DB calls should speed up, and data is buffered by the kernel before being written to disk, instead of in the app.

skrebbel · on Jan 26, 2012

This appears to me as a ridiculously good idea. I've no clue whether it's really novel, but to me it is.

Anyone got a clue about the implications? Also, how difficult to do? (besides than the author's own impressions) Would it be possible make a more generic solution?

julochrobak · on Jan 26, 2012

What do you find so rediculously good about this? I find it complicated. If I understood it properly, it's just making inserts asynchrounous (removing durability) and batches data into single statement (removing atomicity). Or did I miss something?

willvarfar · on Jan 29, 2012

You missed something; the only part of ACID its relaxing is isolation: http://en.wikipedia.org/wiki/ACID#Isolation

"This property of ACID is often partly relaxed due to the huge speed decrease this type of concurrency management entails."

Indeed.

If you have the viewpoint that DB validation is for catching programming errors, not invalid data (i.e. you validate everything before it hits DB anyway), this relaxation can be reasoned about client-side anyway.

willvarfar · on Jan 26, 2012

Its still ACID. Its the same callee code as you use in, for example, twisted already to have DB async. It just gets way better performance because the DB speed is measured in requests/sec and not how big those requests are. So merging many requests into a single one where possible means you get more throughput.

julochrobak · on Jan 26, 2012

I'm sure the statements to the database comply to ACID properties. Nevertheless, from the client perspective the batching simply means that request from client one is batched together with request of client two - i.e. the atomicity is broken. If the operation fails due to problems with data from client one it impacts client two. I guess my point is, why use a database which comply to all ACID properties if you don't need them.

marshray · on Jan 26, 2012

But it's not ACID. For example, if the server or process running his database worker queue were to crash, records would be lost.

Of course, it's quite likely an acceptable trade-off for for these records. Relational DBs sold everyone on ACID in the 90's, but a lot of data doesn't really deserve its cost. He hasn't found a way around Brewer's Conjecture (and I wish I had a nickel for every time it's rediscovered :-).

willvarfar · on Jan 26, 2012

But then the callback to the request would timeout and the callee code waiting on the web request would be in no misunderstanding that the thing was committed or not.

Things in the queue get acknowledged - and sometimes resultsets or IDs returned, in the case of SELECT and INSERTs that return auto keys.

marshray · on Jan 26, 2012

OK, I see. So your batching related inserts to cut down on the per-transaction overhead, but not actually returning anything to the client until the commit is confirmed.

I think this is a good method, as long as you're careful about these things. (I assume you've taken care of them, just talking about this method in general).

1) If there were multiple transactions in the web request, previously the second ones likely wouldn't have been run if the first were to fail. This would likely change that and it could even necessitate having rollback for the second transaction.

2) Now there is a potential "thundering herd" problem. When the transaction batch commit notification is received, all 1000 web requests made over that 100 ms period become ready to execute and start competing for network/disk/database resources simultaneously. A disk that was just sitting idle now has 1000 different things to do at once. If everybody hits the network at once it can cause temporary packet loss. When packets are lost, connections can be dropped or delayed by retransmits. Sometimes client retries and retransmits can happen synchronously. In the worst case, such a system can end up oscillating wildly between excessive load and excessive idle.

willvarfar · on Jan 26, 2012

Yes this is what twisted's "interactions" are and how I talk about throwing rollbacking-exceptions and such in the article.

Yes, you do have to be careful about saturating links generally. The thundering herd thing is a rich man's problem :)

_3u10 · on Jan 26, 2012

It's really simple. I did it back in 2001, when I did it I used XML and a stored proc to read the XML and write directly to a cursor. It probably takes an hour to do it the first time and about 15 to 20 minutes once you're used to the idea of it. (on SQL Server 2000)

With something like SQL Server 2005 it's super simple (< 5 mins) you can basically write an XML file as a stream to your DB server and it eats it and writes directly to the table. (SQL Server supports transforming XML into a table using XPath). You can basically treat the connection like a filehandle and dump XML directly to the stream. Just make sure you open/close the connection often enough not to cause the lock optimizer to grab a table lock. (Or just grab a table lock if you really want it to scream)

All you need to ensure is that you don't hold the steam open too long. If you don't mind a little logic in your DB server you can even consume complex object graphs. If you go the XML route you can max the CPU on your DB server which is generally a bad thing for the most part on modern systems it won't matter unless your DB server has 200+ disks.

You can use this type of system to stream 200 or 300 MB worth of data to your server in a few seconds. This assumes your DB server isn't a piece of crap. (If it can't keep up with a gig ethernet card on random IO or is virtualized it's most likely a piece of crap).

The more generic version is to take advantage of the INSERT syntax and use multiples sets of values.

eg. INSERT INTO foo (id) VALUES (1),(2),(3); instead of INSERT INTO foo (id) VALUES (1); INSERT INTO foo (id) VALUES (2); INSERT INTO foo (id) VALUES (3);

Now because MySQL is one of the most retarded (I mean that in the literal sense of the word) databases in the world it doesn't support prepared statements so you get huge overhead when you issue SQL to it, compared to a prepared statement. Also, make sure you explicitly start a transaction before you issue a bunch of inserts, else each statement becomes its own transaction.

Also on the readside to increase performance use something like MARS and issue multiple SELECT statements.

eg. SELECT * from parent where id = 5; SELECT * from child where parent_id = 5;

If you have to do more than a few levels consider a SHAPE query. Or use map reduce and 'fold' your join back into something sane.

dkoch · on Jan 26, 2012

MySQL has had server-side prepared statements since 5.0.

rkalla · on Jan 26, 2012

Anyone have a good idea of what threshold should be introduced into the Coalescing Queue?

For example, on a slow server you could imagine a command coming in every second; you likely don't want to hold onto those commands for a second artificially to find out there are no new methods to try and combine it with; so you just want to answer right away.

Now your server is getting busier... 4 commands a second... do you want to artificially hold commands for 250ms+ or longer to see if there are new commands to combine with it?

I imagine not...

In the scenario where you are getting 500+ req/sec it makes a lot of sense to coalesce them, but at some point your queue needs to decide when to make a cut in the queue, take that slice of commands, process/combine them THEN send them to the server and it is that heuristic I'm curious about.

willvarfar · on Jan 26, 2012

I put in the batching point at: 1) over a certain time-limit, e.g. 100ms 2) over a certain number of commands, e.g. 2000 3) encountering a specific 'flush' command, that I injected sometimes to simply the logic

Those were the numbers I, by trial and error, determined for my server. But remember, I didn't do SELECTs; I put everything I might SELECT into memory. So my numbers are rather application-specific perhaps.

Its important to measure the average and worst case response times, not the best-case. Under light load, best-case times are from blocking calls, which is why conventional ORM looks good enough in prototyping ;)

rkalla · on Jan 26, 2012

Thanks Will; I missed all that for some reason. Appreciate you sharing your actual threshold numbers (would have been my second question).

dkhenry · on Jan 26, 2012

I think the problem being described here is he is using a database where he doesn't need to. This is the problem things like cassandra were made to solve. Effectivally he is not storing things to the DB he is logging them to the DB in case of a program crash. To speed things up he stores his entire data set in RAM client side. Writes are buffered and written to the DB, after they have modified the client side data. I assume the point is when the program crashes or dies it just rebuilds the client side data from the DB and then proceed to continue to log to the DB. He can get rid of a lot of his code and add in features but using something like MongoDB, Casandra , or Riak

jamii · on Jan 26, 2012

> Effectivally he is not storing things to the DB he is logging them to the DB in case of a program crash. To speed things up he stores his entire data set in RAM client side.

At this point its not worth even using a database. I'm currently working on an experimental framework at Smarkets where we just write events directly to disk and do everything else in memory (with the occasional snapshot to speed recovery). If all goes well it will be open-sourced in a few weeks.

LMAX have taken this idea to extremes and have some mind-blowing benchmarks - http://martinfowler.com/articles/lmax.html

willvarfar · on Jan 26, 2012

I think the merging and reordering stands alone as a solution; the extension where, in my particular project, I cache to avoid some common SELECTs is a distinct other subject. They complement each other, and are in the same post, is all.

willvarfar · on Jan 26, 2012

I might trust Riak's reputation, but it also has a reputation for being slow.

MongoDB and Cassandra both seem immature and unstable at the moment; all you hear is horror stories.

rkalla · on Jan 26, 2012

I don't think "all you hear" is accurate. Mongo and Cassandra have some truly huge deployments and the "horror stories" are a function of the number of deployments.

Search "MySQL sucks" or "Oracle sucks" and you'll find much the same thing. It is a result of being (successfully) deployed so many places.

As an aside, if you want confidence in your NoSQL solution, reading either the Mongo or Cassandra mailing lists is an excellent way to realize how focused, efficient and powerful those companies and communities are. Of all the NoSQL db lists I read, Cassandra is one of the lowest incident levels and when there is, no matter how esoteric, it seems like one of the Datastax guys or someone from the community is able to help them out.

As for Mongo, the fact that Eliot and team still replies 100s of times a day to all the questions, even complex reports, is beyond impressive to me.

I am just trying to frame your comment a bit; these projects have left the "cool, large-scale toy" category a long time ago and have many companies better their existence on them. Netflix currently has one of the largest global Cassandra deployments (US east-west and UK) and continue to move out of SimpleDB and move more and more into their own custom deployment.

willvarfar · on Jan 26, 2012

Fair enough, they may have come along way.

I worked on that project with a chap who happend to also be a cassandra contributor, but we still didn't dare go with cassandra back then.

And mongodb feels more squid than varnish; its just an uneasy feeling I get when I follow their crazy technical journeys ;)

rkalla · on Jan 26, 2012

Just to clarify; I am not saying you are wrong and trying to talk you into it, just didn't want other readers to come across that comment and throw them out the window especially if they are a good fit.

That being said, I am a big believer in gut feelings and if yours is telling you the answer is elsewhere then it is probably right.

rarrrrrr · on Jan 26, 2012

If you're hitting the database concurrently with many sessions, you can get much of the same benefit by just adjusting the PostgreSQL commit_delay and commit_siblings paramaters, so the database batches things into fewer flushes and fsyncs for you.

willvarfar · on Jan 26, 2012

don't you want to know that the statement hit disk before continuing? If you're a web request thing that is, say, submitting an order, don't you want that order to hit disk before the web request continues, retrieving the order#?

rarrrrrr · on Jan 26, 2012

It will hit the disk before continuing. It just causes server side batching of more operations into fewer flushes and fsyncs, often improving overall throughput in the case that you have several connections concurrently hitting the DB.

Note also that if you do want to avoid atomic commit for particular transactions (less important stuff), you can do that by issuing SET LOCAL synchronous_commit TO OFF within the transaction. But that's not what commit_delay does.

http://www.postgresql.org/docs/9.1/static/wal-configuration....

http://www.postgresql.org/docs/9.1/static/runtime-config-wal...

willvarfar · on Jan 26, 2012

And doing the same client-side is often improving overall throughput in the a few connections multiplexed between a large number of requests in the same manner.

b0b0b0b · on Jan 26, 2012

Do you have a cluster of application servers? What transaction isolation level do you use? Will User A mind that his update is no longer a single transaction and that it may be commingled with txn's of Users B&C?

kermitthehermit · on Jan 26, 2012

This is pretty cool.

Do you plan to continue working on hellepoll? I liked the idea and would love to see it developed further.

willvarfar · on Jan 26, 2012

Most of what's built on hellepoll has been proprietary.

If hellepoll had a DB layer, it might well come with an merging reordering worker queue though ;)

kermitthehermit · on Jan 26, 2012

I see. It would be really nice to have a community which develops hellepoll.

After all, having a maintained alternative to node.js written in C++ would be very nice.

I'm not saying node.js isn't OK or that those who use it are doing something wrong. It's good to have many tools for different scenarios and different preferences.

Would you like to start such a community for hellepoll? I'd be the first one to join you.

Thank you for your reply and for publishing hellepoll.

willvarfar · on Jan 26, 2012

thank you for the kind words.

two does not a crowd make, sadly

willvarfar · on Feb 6, 2012

http://williamedwardscoder.tumblr.com/post/17166532605/helle... if you want to talk this through

blibble · on Jan 26, 2012

wouldn't it be easier (and have the same effect) if you just wrapped 100 statements (or say 60 seconds) of delay-able INSERTS into a transaction with the correct isolation value set?

or use INSERT DELAYED and let the database do it itself.

necro · on Jan 26, 2012

Author states that he is using innodb and insert delayed is a myisam mechanism. Of course in innodb the buffer pool / insert buffers can be thought of the same way. The delay the author is noting is probably due to flush of the log file data.

If you configure innodb correctly it comes down how much IO you can do on your log fs, to make sure you can sustain some update rate to this sequential log. Then nest is the IO to the db fs, so you can sustain the rate of the flushes. You can tune some of this much better with some percona mysql patches. Innodb does insert batching for you to minimize IO, and if you are doing updates to the same data it may even buffer the changes in buffer pool.

thesorrow · on Jan 26, 2012

tldr: Group your inserts ! "Together we are stronger" – Spartacus

rkalla · on Jan 26, 2012

A bit more than that: "Group, convert and collapse your DB commands!"

  insert 1, insert 2, insert 3 == insert [1,2,3]
  insert 1, delete 1 == [no-op]

As the author pointed out you need to know a bit more about your domain to make this really efficient, for example:

  insert user1.dob, delete user2.dob = [2 discrete ops]

while

  update user1.dob, delete user1.dob = [1 op - delete user1.dob]

Cool idea; lots of opportunities for enhancement. The one thing you would have to look out for is effectively re-implementing a query optimizer inside of your application code and eating up too much CPU time on a busy system prepping calls.

If you are not I/O bound, then this strategy isn't going to help you that much. If you are I/O bound, it is a great way to help get yourself out of that hole.

est · on Jan 26, 2012

offtopic: are there any direct sql to binlog async translator?