
Search is (mostly) stateless. They can copy the search database to hundreds of DCs around the world, and each can operate independently - so if one of them has a problem, it's not likely to affect the others (unless it's bad data, but that's why you shouldn't deploy index updates to all DCs simultaneously). There's some personalization, sure, but you can just fall back to non-personalized search if that breaks.
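To make the staged-deployment point concrete, here is a minimal sketch of rolling an index update out to a few DCs at a time and stopping as soon as a batch looks unhealthy. Everything in it is hypothetical: `staged_rollout`, the `deploy` and `is_healthy` callbacks, and the stage sizes are illustrative names and numbers, not anything a real search operator is known to use.

```python
def staged_rollout(datacenters, deploy, is_healthy, stage_sizes=(1, 4)):
    """Deploy to progressively larger batches of DCs; halt on the first bad batch.

    `deploy` and `is_healthy` are hypothetical callbacks supplied by the caller.
    Returns (list of DCs that received the update, whether the rollout completed).
    """
    remaining = list(datacenters)
    updated = []
    # After the configured stages, one final stage takes everything left.
    sizes = list(stage_sizes) + [len(remaining)]
    for size in sizes:
        batch, remaining = remaining[:size], remaining[size:]
        if not batch:
            break
        for dc in batch:
            deploy(dc)
            updated.append(dc)
        # Bad data shows up here while most DCs still serve the old index.
        if not all(is_healthy(dc) for dc in batch):
            return updated, False
    return updated, True


if __name__ == "__main__":
    dcs = [f"dc{i}" for i in range(10)]
    # Simulate a bad index: every DC that receives it becomes unhealthy.
    touched, ok = staged_rollout(dcs, lambda dc: None, lambda dc: False)
    print(touched, ok)
```

With a bad update, only the first canary DC is touched and the other nine keep serving the old index, which is exactly the isolation that stateless replicas buy you.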

Things like Gmail, on the other hand, are inherently stateful. When you log into Gmail, you eventually have to connect to the one system that maintains your mailbox. Sure, there might be replication - but the replicas are all talking to each other. It's surprisingly easy to have a cascading failure in a system like this, where one replica going down triggers (directly or indirectly) the failure of all the others. Or you can have some bad data that gets replicated out and then confuses everything that reads it - unlike search, you have to replicate that data immediately, so you don't get the benefits of a staged deployment.
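One way a cascade like that can happen is through load redistribution: when a replica dies, its traffic lands on the survivors, and if that pushes them past capacity they die too. The toy simulation below assumes that mechanism, with load split evenly among live replicas. The function name, the even split, and the numbers are all made up for illustration, and real systems can fail in other ways.

```python
def simulate_cascade(total_load, capacities, first_failure):
    """Toy model: load is split evenly across live replicas; any replica whose
    share exceeds its capacity fails, and the load is then split again.

    Returns (replica indices in the order they failed, indices still alive).
    """
    alive = {i for i in range(len(capacities)) if i != first_failure}
    failed_order = [first_failure]
    while alive:
        per_replica = total_load / len(alive)
        overloaded = sorted(i for i in alive if per_replica > capacities[i])
        if not overloaded:
            break  # the survivors can absorb the extra load
        alive -= set(overloaded)
        failed_order.extend(overloaded)
    return failed_order, sorted(alive)


if __name__ == "__main__":
    # Three replicas serving 90 units of load, 30 each. Lose one and the
    # other two must take 45 each.
    print(simulate_cascade(90, [40, 40, 40], first_failure=0))  # no headroom
    print(simulate_cascade(90, [50, 50, 50], first_failure=0))  # enough headroom
```

In the first run the two survivors are each asked for 45 units against a capacity of 40, so everything goes down. In the second run they have enough headroom and the failure stays contained.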

This also explains why not all users were affected - I'd guess that their system is divided into some number of shards, and users are assigned to a particular shard. That 0.07% of users affected probably represents a single unhealthy shard.
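For what it's worth, here is a rough sketch of what I mean by sharding, assuming users are assigned by hashing their ID. That scheme is purely my guess; I have no idea how the real system assigns users, and `shard_for` and `affected_fraction` are hypothetical names. If the shards were equal-sized, a single bad shard hitting 0.07% of users would imply roughly 1 / 0.0007, or about 1,400 shards, but that is just arithmetic on an assumption.

```python
import hashlib


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user ID to a shard via a stable hash (an assumed scheme)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def affected_fraction(user_ids, num_shards, unhealthy_shard):
    """Fraction of the given users whose mailbox lives on the unhealthy shard."""
    hit = sum(1 for u in user_ids if shard_for(u, num_shards) == unhealthy_shard)
    return hit / len(user_ids)


if __name__ == "__main__":
    users = [f"user{i}@example.com" for i in range(200_000)]
    num_shards = 1429  # roughly 1 / 0.0007, assuming equal-sized shards
    print(f"{affected_fraction(users, num_shards, unhealthy_shard=0):.4%}")
```

Users on every other shard never touch the broken one, so they see no outage at all.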



