Software resilience: The 500 test

Animats · on April 14, 2018

I've been struggling with transactional consistency across the network in, of all things, Second Life. Yes, Second Life, the virtual world.

The classical web is mostly stateless, although that's changing with fancier web sites. Second Life has always been very much a persistent state system. Technically, this makes it an alternate universe to the classical web. So it hit these problems long ago.

Second Life's world is divided into regions 256 meters on a side. The viewer displays a seamless world, but internally, the seams are very real. Each region is maintained by a separate process, a "sim", loosely coupled to neighboring sims and the viewer.

Avatars and vehicles can cross region boundaries. Often badly. For over a dozen years, region crossing behavior in SL has been fragile. Objects sink through the ground or fly off into space at region crossings. Vehicles and avatars become separated. Avatars with elaborate clothing and attachments can even be damaged so badly that the user has to do extensive repair work. The Second Life community, and the developers of Second Life, were convinced this was un-fixable.

I became interested in this problem as a SL user and started to work on it. The viewer is open source, and the message formats are known. Within the Second Life world, objects are scriptable. So, even without access to the internals of the server, much can be done from the outside.

My goal was to make fast vehicles cross regions properly in Second Life. The first step was to fix some problems in the viewer. The viewer tries to hide the delay at a region crossing handoff with extrapolation from the last position and velocity. The extrapolation amplifies noise from the physics simulator, and can result in errors so bad that vehicles appear to roll over, fly into the air, or sink into the ground. I managed to limit extrapolation enough to restore movement sanity.

Once that was under control, it was clear there were several different remaining problems. They now looked different visually, and could be attacked separately. There were several race conditions. I couldn't fix them completely from the outside, but I was able to detect them and prevent most of them from the scripting language code which controls vehicles. This got the mean time between failures up to about 30-40 minutes for a user driving a fast vehicle. When I started, that number was around 5-10.

The remaining problems were intermittent. I discovered that overloading the network connection to the viewer tended to induce this failure. So I used network test features in Linux to introduce errors, delays, and packet reordering. It turned out that adding 1 second of network delay, with no errors or reordering, would consistently break region crossings. This provided, for the first time, a repeatable test case for the bug.

I couldn't fix it, but I could provide a repeatable test case to the vendor, Linden Labs. With some publicity, upper management was made aware of the bug, and effort is now being applied to solving it. It turns out that the network retransmission code has problems. (Second Life uses its own UDP-based protocol.) Fixing that may help, but it's not clear that it's the entire problem.

The underlying problem is that, during a region crossing, both region manager programs ("sims") and the viewer all have state machines which must make certain state transitions in a coordinated way. The error cases do not seem to have been thoroughly worked out. It's possible to get stuck out of sync. Second Life is supposed to be a consistent-eventually system, but that wasn't achieved.

This is roughly the same problem as the "500 test" in the parent article. If you have communication problems, both ends must automatically resolved to a valid consistent state when communication resumes. Distributed database systems have to do this. It's not easy.

Network-connected state machines are a pain to analyze. If the number of states at each end is not too large, examining all possible combinations by hand is feasible. That's what the 2 volumes of "TCP/IP Illustrated" do for TCP. If you create your own network connected state machines, you face a similar task. If you don't do it, your system will break.

zzzcpan · on April 14, 2018

Right. The article is actually quite bad and doesn't show much understanding of the problem. Fault tolerance and consistency are very well known and very well researched problems in distributed systems.

Without proper distributed algorithms a fault can leave both ends in an inconsistent state. One end might think the operation was successful, but the other end never saw it succeed and timed out. It doesn't know whether it was successful or not. If there are users most will assume failure in this case and try again, leading to two state mutations where there should be only one.

To solve this you can try to achieve consensus between both ends and if there are users force them to wait an unbounded time, which is unrealistic of course, so only helps somewhat. The other choice is eventual consistency, preferably strong eventual consistency, where write operations never fail, never force anyone to wait, just get delayed and resynced later sometimes.

Animats · on April 14, 2018

For webcrap, it's probably best to avoid this problem entirely. Don't design systems that need tight consistency between client and server. You'll get it wrong.

Maybe only one side needs to be stateful. Consider a map display application, where the map is downloaded as tiles of various resolutions. The user can pan and scroll. The goal is to give the user a seamless experience. This may require loading low-rez tiles first, loading ahead in the direction of movement, and canceling requests for tiles that are not in yet and are no longer needed. It needs to be eventually consistent - whatever the user does, the map needs to settle into validity after a while.

But the server side can just blindly serve tiles as individual files. You can do all the stateful stuff on the client. No need for synchronization.

So don't go there unless you absolutely have to.

arbie · on April 16, 2018

> For webcrap, it's probably best to avoid this problem entirely. Don't design systems that need tight consistency between client and server. You'll get it wrong.

> So don't go there unless you absolutely have to.

Couldn't agree more. It is possible to build feature-rich clients with few syncing calls as long as you are willing to replicate business logic on the client-side and always transfer state unidirectionally.

saltcured · on April 13, 2018

We have faced this in a perverse form with some of our Python web apps behind Apache HTTPD and mod_wsgi. Apache itself will generate a 500 sporadically due to its internal TSL or HTTP/2 proxy code tripping over itself. If we capture one of these responses, it has the default Apache error page HTML structure, instead of the error page our own web framework would generate for a 500 if an uncaught exception leaked out.

This 500 response to the client can occur before or after our service has been invoked, and our service logic doesn't know anything went wrong. Our service might even log successful processing, meaning that it made its own ACID commit and started to generate a success response. For very small responses like "204 No Content", we might never know there was a problem. For larger responses, the WSGI layer may produce errors when we cannot generate a whole response before the connection is lost.

In our AJAX apps in front of this service, we have had to resort to treating 500 the same as a lost response, doing backoff and retry without assuming we know whether the first request succeeded or failed.

stephengillie · on April 13, 2018

From time supporting a real estate website host - I call it 'debugging by error' - the error code tells you which part of the stack is broken.

400 - URL/client side.

401 - Permissions. If you have IIS, maybe the web agent service account password is expired.

402 - Pay your bill.

403 - Permissions, or a deeper error.

404 - Missing file.

500 - Code error.

502 - Load balancer has no good hosts to service request. Or, if you have IIS, this is the Windows kernel saying no worker processes are working.

503 - Database overloaded or down.

504 - Server or load balancer not responding.

vvanders · on April 14, 2018

Lots of wisdom here but much of this is easier said than done.

That said a lot of the Erlang ethos covers this(things will fail at scale so do something reasonable).

mattsfrey · on April 13, 2018

My point may seem facetious, but wouldn't ensuring (through proper catches and guards) that a 500 doesn't happen at all inside your own systems be the best resiliency? And when you experience one, it's a red alert and you harden that component to whatever failed immediately? I guess that's just the school of thought I've operated under.

RSZC · on April 14, 2018

Those approaches don't need to be mutually exclusive. However, the main assumption of the post is that these events will occur regardless of how hard you try to prevent them:

> ...there will always be some number off in the tail lost to an accidental bug, a bad deploy, an external service that fails to respond, or a database failure or timeout.

It's just advocating the same approach that's the norm for distributed systems: operate under the assumption that some components of the system will fail. This is assumed to be the case in an area such as modern database design, but is rarely considered when it comes to behavior of a generic web service.

That said, IMO there's tons of cases where if your data is inconsistent for a few records you honestly just don't need to care, and so the idea of devoting attention and engineering resources to something that's both rare and low impact just doesn't make sense. There is a disclaimer in the post that this is only advised for critical services, but I think that's being ignored in the discussion.

arbie · on April 16, 2018

> there's tons of cases where if your data is inconsistent for a few records you honestly just don't need to care

Enterprise applications can be data-sparse, but every incorrect record field has disproportionately high downstream impact.

c0nfused · on April 14, 2018

The problem comes once you leave and that 500 red alert goes. You want to give the person after you enough time to figure out what is up and fix it before your boss's boss's boss rolls in with a head of steam.

It is gonna fail eventually so make it happen gracefully.

ssl232 · on April 13, 2018

ACIDity is an annoyingly missing feature in WordPress. For software that's used on something like 20% of all websites, it doesn't use the InnoDB MySQL table format by default, and therefore doesn't support MySQL transactions out of the box.

amelius · on April 14, 2018

> In the event of any internal fault (500), your service should be left in a state that’s consistent, ...

What if the software enters a non-terminating loop?

baconomatic · on April 13, 2018

This is interesting. Have you implemented this in services before? How have you handled continued failure for specific requests?

rwz · on April 13, 2018

I recommend reading his pieces on transactional state machines that is linked in the article. I believe that a lot of it is adopted at stripe currently.

baconomatic · on April 14, 2018

I'll check it out, thanks!

dpritchett · on April 13, 2018

OP has been in the API business for quite a few years at Heroku and Stripe. You can find him on Twitter sometimes if you have specific questions - not sure he'll read this.