It's no mystery that the kinds of systems with massive reach--Google search, Facebook, Twitter--are really not data-consistency-critical applications.
Who'll ever know if a search result isn't perfectly up to date or perfectly accurate?
Who'll ever know if you missed a Facebook feed entry because it "wasn't relevant" or simply wasn't seen due to DB vagaries?
And who'll ever know about a few tweets going astray here or there?
In each case, they're likely "eventually consistent" (or close to it), but it's no accident that it doesn't ultimately matter in those massive scale examples.
And maybe that's the secret to massive scale--it can't ultimately matter.
One of the startups I worked on was a classifieds aggregation engine pulling data from external feeds.
We basically queued all retrieved items for processing with no attempt whatsoever to avoid data loss, including using in-memory queues for lots of things.
Our reasoning was that if a machine crashed, the worst case was that a few listings would take up to 24 hours to update, and generally much less. We adapted the crawl rate for each source based on its change frequency, so large sources of listings got re-indexed far more frequently; if a feed didn't update for 24 hours, it'd be because it wasn't a source of much data. And we could always force a refresh of the data.
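To make the adaptive crawl rate concrete, here is a minimal sketch of one way it could work. It is not the code we ran: the halve/double rule, the 5-minute floor and the name `next_crawl_interval` are all assumptions for illustration. Only the 24-hour ceiling comes from the description above.

```python
def next_crawl_interval(current_s, changed, min_s=300.0, max_s=86400.0):
    """Return the next recrawl interval in seconds for one source.

    Assumed policy: halve the interval when the last crawl found changes,
    double it when it did not, and clamp to [min_s, max_s].
    max_s defaults to 24 hours, the worst case mentioned above;
    min_s (5 minutes) is an arbitrary illustrative floor.
    """
    proposed = current_s / 2 if changed else current_s * 2
    return max(min_s, min(max_s, proposed))


# A busy feed converges towards the floor, a quiet one towards 24 hours.
interval = 3600.0
for changed in (True, True, False):
    interval = next_crawl_interval(interval, changed)
```

With a rule like this, a crashed machine only costs you one interval for the sources it was handling, and the sources that matter most have the shortest intervals.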
Some people were horrified at the approach because the idea of ensuring consistency and not losing data is so ingrained. But the reality is that you need to weigh the cost of consistency against the value it provides. And often it's not very valuable, especially when there is an authoritative source of the data to recover from and when the data will be outdated quickly anyway.
A lot of the time any notion of consistency is an illusion anyway - by the time the page has returned, the results are outdated - and what matters is maintaining the illusion (e.g. ensuring that if a user makes an update, it's reflected in the page that's returned).
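One common way to maintain that illusion is usually called "read your own writes": everyone reads from a possibly stale view, but each user's own pending updates are overlaid on top of it. A rough sketch, where `StaleStore`, the session ids and the `replicate` step are all hypothetical names made up for this example:

```python
class StaleStore:
    """Toy store: a shared, possibly stale view plus per-session pending writes."""

    def __init__(self):
        self.replica = {}   # what everyone sees; may lag behind
        self.pending = {}   # session_id -> {key: value} not yet replicated

    def write(self, session_id, key, value):
        # Record the write against the session; the shared view is untouched.
        self.pending.setdefault(session_id, {})[key] = value

    def read(self, session_id, key):
        # The writer sees their own update immediately; others see the old value.
        own = self.pending.get(session_id, {})
        if key in own:
            return own[key]
        return self.replica.get(key)

    def replicate(self):
        # Stand-in for asynchronous propagation catching up.
        for writes in self.pending.values():
            self.replica.update(writes)
        self.pending.clear()


store = StaleStore()
store.replica["title"] = "old"
store.write("alice", "title", "new")
alice_view = store.read("alice", "title")   # alice sees her own edit
bob_view = store.read("bob", "title")       # bob still sees the stale value
```

Nobody else can tell the data is behind, and the one person who could tell (the writer) never sees it.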
The key is that you need to know the tradeoffs and apply them consciously, rather than get caught out by tradeoffs that the components you rely on make without telling you.
Absolutely. I wrote a system to provide real-time signature updates to Symantec's (then MessageLabs) global anti-spam infrastructure. It used UDP to send them. It worked amazingly well and distributed signatures across thousands of servers worldwide in milliseconds, and if a packet got dropped, oh well. It's about choosing the right tool for the job.
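For anyone who hasn't done fire-and-forget over UDP, the sending side is about as simple as networking gets. This is only an illustration of the idea, not the actual system: `broadcast_update`, the payload and the target list are invented for the example, and the only library calls used are the standard `socket` ones.

```python
import socket


def broadcast_update(payload: bytes, targets):
    """Send payload once to each (host, port) in targets over UDP.

    No acknowledgements and no retries: a datagram that gets lost stays lost.
    Returns how many sends the local stack accepted, which says nothing
    about how many datagrams actually arrived.
    """
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for addr in targets:
            try:
                sock.sendto(payload, addr)
                sent += 1
            except OSError:
                # Deliberately ignored: losing one update is acceptable here.
                pass
    return sent
```

This only makes sense when a missed update is harmless or will be superseded by the next one, which is exactly the tradeoff being discussed.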
People used to purely transactional systems, where every piece of data they received was important, were not used to thinking about data as potentially disposable... Nothing they didn't pick up quickly enough, but it was a bit of a mental adjustment (for my part, my job right before that one was running a billing system; I was very happy not to have to worry about that any more...)