It's really hard to keep moving all the way up and down the stack.
The main advantage, to me, is being able to collapse layers of abstraction. We've done that at one of the startups I CTO for.
On the backend, we eliminated the entire multi-tier stack we've come to know and love from the late 90s. The database, application server, caching server, and authentication server were collapsed down into a single executable.
We dropped TCP, and went to UDP + NaCl-based crypto. This in turn changed how we did session management and logins (we only target mobile devices), and allowed us to gain further performance by bypassing the Linux kernel entirely, and talking directly to the Ethernet hardware. Our wire-to-wire latency for a UDP packet (without any processing, just heading through the LuaJIT app framework) is ~26 nanoseconds. We measure the entire request/response time in less than 10,000 nanoseconds for updates, and frequently, less than 3000 nanoseconds for reads.
Nanoseconds! Today's hardware is crazy efficient, but yesterday's software architectures waste it all.
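To make the TCP-free transport concrete, here is a minimal sketch of a request/response exchange over raw UDP datagrams, using plain Python sockets on loopback. Everything here (the echo "protocol", the buffer sizes) is invented for illustration; the real system described above layers NaCl crypto on top and bypasses the kernel entirely, neither of which is shown.

```python
# Sketch: a request/response round trip over UDP, no TCP connection setup.
import socket

def udp_roundtrip(payload: bytes) -> bytes:
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))          # OS picks a free port
    addr = server.getsockname()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.sendto(payload, addr)           # one datagram out, no handshake

    data, peer = server.recvfrom(2048)
    server.sendto(data.upper(), peer)      # "process" and answer in one datagram

    reply, _ = client.recvfrom(2048)
    client.close()
    server.close()
    return reply
```

The point of the sketch is what's absent: no three-way handshake, no connection state, so a request can be a single packet each way.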
For caching, we employ a database library (LMDB) that uses the file system and memory mapping for all data, a lot like Varnish Cache does. We can service reads without a single malloc() call. No more memcached.
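The "reads without a malloc" idea can be sketched with a plain memory-mapped file: the page cache backs the mapping, so serving a read is just slicing bytes out of the map. The flat offset/length layout below is invented for illustration; LMDB's actual B+tree on-disk format is far more involved.

```python
# Sketch: serve a read straight out of a memory-mapped file,
# LMDB-style, with no per-request heap allocation for the data itself.
import mmap, os

def mmap_read(path: str, offset: int, length: int) -> bytes:
    fd = os.open(path, os.O_RDONLY)
    try:
        mm = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
        try:
            # The slice copies only when materialized; the mapping itself
            # is shared with the OS page cache.
            return bytes(mm[offset:offset + length])
        finally:
            mm.close()
    finally:
        os.close(fd)
```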
Finally, the entire system is built around an event streaming approach (see Disruptor/LMAX). For that to work, we wrote our own cross-platform Core Data alternative on the client, and for this particular application (a social network), the end result is that the user experiences ZERO network latency for everything but search.
Everything is local by the time the user is made aware of it. For updates like making a post, we make it happen instantly on the client, speculatively, and eventually the update makes its way to the server and back, and then on to everyone else. The user doesn't even need a live network connection to post, and "offline" usage works as expected.
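The speculative, offline-tolerant update flow above can be sketched as a local store plus a pending-operations log. The class and method names here are invented for illustration; the real client is a custom Core Data alternative, not this.

```python
# Sketch: apply updates locally first, sync to the server whenever
# a connection happens to be available ("local-first" writes).
class LocalFirstStore:
    def __init__(self):
        self.posts = []        # what the UI renders; always current
        self.pending = []      # operations not yet acknowledged by the server

    def make_post(self, text):
        self.posts.append(text)              # user sees it instantly
        self.pending.append(("post", text))  # queued for eventual delivery

    def sync(self, server):
        """Drain the pending log; call whenever the network is up."""
        while self.pending:
            op = self.pending.pop(0)
            server.apply(op)   # server then fans the update out to other devices
```

The user-visible state never waits on the network; the network only catches up in the background.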
And also (since this is a speciality of mine), our object graph syncing is multi-device from day one. Lots of projects (e.g. Vesper) struggle with that one. We didn't bolt it on, it's a fundamental property of our data model and overall system.
After collapsing the crypto, database, and authentication, we were able to introduce object capabilities that require zero memory lookups, instead of the usual RBAC, which in my experience is difficult to implement efficiently and hard for people to manage.
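One common way a capability can be checked with zero lookups is to make the token self-authenticating: it carries a MAC over the object id and the rights it grants, so verification is pure computation. The scheme below (HMAC-SHA256, a "|"-delimited token layout) is my own illustrative stand-in, not the author's actual design, which presumably derives keys from its NaCl machinery.

```python
# Sketch: a self-authenticating object capability; the server verifies
# it by recomputing a MAC, with no database or memory lookup at all.
import hmac, hashlib

SECRET = b"server-side-signing-key"      # known only to the server

def mint_capability(object_id: str, rights: str) -> bytes:
    msg = f"{object_id}|{rights}".encode()
    tag = hmac.new(SECRET, msg, hashlib.sha256).hexdigest().encode()
    return msg + b"|" + tag

def check_capability(token: bytes, object_id: str, rights: str) -> bool:
    msg = f"{object_id}|{rights}".encode()
    tag = hmac.new(SECRET, msg, hashlib.sha256).hexdigest().encode()
    return hmac.compare_digest(token, msg + b"|" + tag)
```

Contrast with RBAC, where every check typically means resolving the user's roles and the role's permissions, i.e. at least one lookup per request.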
And because we have our own network protocol, and are targeting native clients we control, we can easily do client-side load balancing. We use consistent hashing to contact read-only servers (that also handle crypto), mostly so we can spread out the bandwidth without having a crazy expensive load balancer. Thanks to our custom database/app server, we can service all writes on a single machine, although like LMAX, it's designed to run three in parallel, discarding the results of two of them at any given time.
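Client-side consistent hashing can be sketched as a hash ring the client itself consults: hash the key, walk clockwise to the first server, and no central load balancer is needed. The virtual-node count and SHA-256 ring hash below are arbitrary illustrative choices, not details from the author's system.

```python
# Sketch: a consistent-hash ring that a native client can use to pick
# which read-only server to contact, spreading bandwidth client-side.
import bisect, hashlib

class HashRing:
    def __init__(self, servers, vnodes=64):
        # Each server gets `vnodes` points on the ring for smoother balance.
        self.ring = sorted(
            (self._h(f"{s}#{i}"), s)
            for s in servers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def server_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash (wrapping around).
        i = bisect.bisect(self.keys, self._h(key)) % len(self.ring)
        return self.ring[i][1]
```

Because every client computes the same mapping, adding or removing a read server only remaps the keys adjacent to its ring points.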
Point is, there are HUGE gains to be made when you (a) understand the whole stack, (b) have the skills and understanding to rewrite any part as you need to, and (c) have the guts to ignore 15 years of received wisdom on how to scale an app and are willing to collapse layers down in pursuit of speed and simplicity.
It may be harder, but the payoff is enormous. I did all of the above myself in just over four months, including building a cross-platform app framework for iOS (porting to Android in March), and the app itself.
I don't want to put words in your mouth, but I want to clarify something. Those 90s-style architectures with database servers, application servers, caching servers, etc. (N-tier) were not about scaling the number of connections you could support on a given set of hardware, maximizing message throughput, or minimizing message latency. Those architectures came about as a response to a common problem of the time: lots of different legacy systems needing to talk to each other.
Think accounting systems on a mainframe talking to inventory systems written in C talking to HR systems provided by the consultantware flavor of the month. In those kinds of environments it made a ton of sense to break up the various functional tiers so that you could attach to the appropriate one with minimal translation layers.
Having worked on systems like that in the past, you'd never go in thinking that it is the lowest latency way to do it. Latency wasn't typically a major design driver.
That said, I work on low latency message systems now and the architecture you're describing is very typical in that space.
the architecture you're describing is very typical in that space
Totally, and whatever innovation I actually produce is mainly putting together cool stuff other people have done in a novel way (which at times, hardly seems innovative). I'm hugely indebted to all of the research and pioneers in production in our industry.
One of the things I love about CS is how easy it is to take cool shit from one subfield and apply it to another. There are just so many smart people figuring stuff out that, to a guy like me, it's like Christmas every week. :)
This is also my way of thinking and the way we are doing things in Snabb Switch. I'm influenced by my brief time working with Mitch Bradley on firmware hacking at OLPC. Linux would never get the same performance out of hardware as his seemingly toy little Forth drivers.
luke@snabb.co if you fancy a chat some time :-)
P.S. Thanks for the generous link-plug that landed on highscalability.com. I look forward to hearing more about your systems in the future.
I'm doing two right now. It's not as exciting as it seems.
The first I worked on for a year straight, starting fall 2012. It launched, and our first customer alone exceeded our year-one revenue goal by 2-3x; that account was too big for us to take on anyone else without raising funds. We're replacing a Lotus Notes system for a $100 million/year company with 800+ employees.
My dev partner in Australia has been handling that account ever since (it's an Australian company, and I'm in the US), and as we've been working together for almost 8 years, there's not a lot of back and forth needed at this point. Mostly just weekly meetings and the occasional Skype call. My plan is to pick development back up again on that startup mid-Summer in preparation for expanding beyond that first company.
The other startup I began immediately after the first launched, and I've been designing and coding basically non-stop for 4 months. So there hasn't been much overlap for me, and so far, it's worked out okay.
In my experience, I'm most useful launching projects. I can do the architecture, code up the initial design, and then at that point, other (better) programmers can see what it is I'm trying to do, and buckle in for the long haul.
(BTW I'm contemplating CTOing a third company, if anyone has one they'd like to pitch me (erich.ocean@me.com). I've got 3-4 months of time available to work on something new. If your project can be launched in that amount of time, I might be able to help.)
Although that sounds really cool, are you really sure you're not the person this article is aimed at? How much do your users care about near-zero latency compared to, say, having more of their friends on the network?
(Of course, I don't presume to know your situation.)
Yeah, getting users is much more important. Fortunately, that part has already had two and a half years of development and a successful app (especially given the quality: it made Friendster feel fast), so that part of the business was well taken care of.
That's actually my favorite part about this particular company. Normally, it's a bunch of tech and business guys trying to "get traction". This business already had traction, and just needed a better app and infrastructure.
So I took four months and solved that problem for good. :)
I would love to hear more about this. I was looking for a way to send UDP that is encrypted and authenticated, with private contents and no replay attacks. The best option I could find is DTLS, but that has spotty support on Windows and no API in common scripting languages. Can you say more about your approach? If it is roll-your-own, what makes you confident it is secure?
I can't speak for erichocean, but I've played around with both UDP and NaCl (http://nacl.cr.yp.to/). NaCl is very easy to use, and quite fast. It's not easy to install, but once it's there, it's nice to use.
And because of my playing around with NaCl, I've come to the conclusion that encryption isn't the hard problem: key management is the hard problem.
I was looking for a way to send UDP that is encrypted and authenticated, with private contents and no replay attacks.
I use crypto_box()/crypto_box_open() as is, although we do use the crypto_box_beforenm()/crypto_box_afternm() variants, as (a) our clients are known, and (b) it's ridiculously faster. The NaCl docs mention it, sort of, but it's much, much faster. If you can do it, you always should.
As for replay attacks, we use what I'll call "hash chaining" (if someone knows the real name, I'd like to know). We have two kinds of messages, ordered and unordered. Unordered messages can be replayed at any time, as they are read-only or some other kind of status message, such as a heartbeat.
For ordered messages, we hash the previous message and use that hash as input when the next message is hashed. Both hashes are sent with each message (we use SipHash-2-4, highly recommended BTW). (One analogy would be how git works, where the previous commit's hash is hashed as part of the new commit's hash.)
The chain of messages from the device to the server is called the "device stream", and we have a second chain going back to the device, the "update stream". The system supports multiple device streams for a single user (e.g. an iPhone and an iPad), but there's only one update stream for all devices, so they stay in sync.
Both ends of the connection test the hashes before applying, which trivially prevents replay attacks, since every ordered message can only be "played" once. We store the current device stream hash when the database is updated, in the same transaction, on both the client and the server. This provides resiliency in the face of crashes.
We also use the hash chain to detect out-of-order messages. In our application, they're pretty rare, so I simply drop out-of-order messages and request the correct one from the client.
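The ordered-message hash chain described above can be sketched in a few lines. I've used SHA-256 here purely as a stand-in for the SipHash-2-4 the author actually uses (SipHash isn't in the Python standard library), and the 32-byte zero genesis value is an invented convention.

```python
# Sketch: hash chaining for ordered messages. Each message's hash covers
# the previous message's hash, so a replayed or reordered message no
# longer matches the receiver's expected chain tip and is rejected.
import hashlib

def chain_hash(prev_hash: bytes, payload: bytes) -> bytes:
    return hashlib.sha256(prev_hash + payload).digest()

class OrderedReceiver:
    def __init__(self):
        self.tip = b"\x00" * 32           # genesis hash, agreed by both ends

    def accept(self, payload: bytes, claimed_hash: bytes) -> bool:
        expected = chain_hash(self.tip, payload)
        if claimed_hash != expected:      # replay or out-of-order: reject
            return False
        self.tip = expected               # advance the chain
        return True
```

In the real system the current tip would also be stored in the same database transaction as the applied update, so a crash can't desynchronize the chain.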
This could be bad, because playing back old messages would elicit a response. For that reason, and also because I don't want to store update streams in the database permanently, the messages include their offset in the stream. The in-memory, wire, and on-disk formats of each message are identical, so we can lay them end-to-end on disk. When a message comes in, we can trivially determine whether it is old, and drop it that way, too. If we need to write an update stream to disk, because the client hasn't connected in a while and we want to free up space, then sending everything since the client last signed on is, conceptually, a single sendfile() call at a known offset. That way, we don't need a separate index for streams on disk.
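A minimal sketch of that offset-stamped, identical-everywhere message layout: a fixed binary header carries the message's position in its stream, so a stale message is dropped with one comparison, and an on-disk stream can be resumed from any known byte offset. The header fields and little-endian layout here are invented for illustration.

```python
# Sketch: one message format for memory, wire, and disk. The embedded
# stream offset lets the receiver reject old (possibly replayed)
# messages without consulting any per-message index.
import struct

HEADER = struct.Struct("<QI")            # stream offset (u64), payload length (u32)

def pack_message(offset: int, payload: bytes) -> bytes:
    return HEADER.pack(offset, len(payload)) + payload

def accept_message(raw: bytes, next_offset: int):
    """Return the payload if the message is current, else None (drop it)."""
    offset, length = HEADER.unpack_from(raw)
    if offset < next_offset:             # older than what we expect: stale, drop
        return None
    return raw[HEADER.size:HEADER.size + length]
```

Because messages are laid end-to-end in this same format on disk, "replay the stream since offset N" is just a sequential read starting at a known byte position.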
Not sure if serious... this reads like one of those over-the-top joke comments bragging tongue-in-cheek about how 10x the author is.
I'm going to go out on a limb and say that:
1. Ditching TCP for a social network app is insane.
2. Bypassing standard POSIX sockets for a social network app is insane.
3. Writing a custom database server for a social network app is insane.
4. There's no way you did that in four months, especially if you were CTO for multiple startups.
Even if you're the greatest developer alive on Earth today and you really did all that, think of how many amazing features you could have developed in that time to differentiate your product, instead of obsessing over insignificant nanosecond latency gains.