
It's not difficult, it's just different. It's the difference between predicting that a truck might crash into a data center and building a concrete wall around it, and designing a system in such a way that users only ever resolve to servers that are currently available, regardless of what happened to some of them in a data center that a truck crashed into.


... and after you've solved for the truck problem, you have a potentially infinite list of other things to plan for, some of which you will not foresee. And of course, there's probably an upper bound on the time you can spend preparing for such things.

Famous to the point of being a cliche, the Titanic was thought to be unsinkable, and I would have a similarly hard time convincing the engineers behind the ship's design to believe otherwise.

The level of confidence you're displaying in predicting the unforeseeable is something you may want to take a deeper look at.


You are missing the point. Solving the truck problem is exactly what you shouldn't do, well, at least until your system is resilient. Because it could be something entirely different: it could be law enforcement raiding a data center, and your wall around it won't protect it from them. So instead you approach the system in terms of what it has to rely on and all possible states of the things it has to rely on. That maps to a very small number of decisions, like whether a server is available or not. If it's not available, it really doesn't matter which of the infinite things that could happen to it, or to the data center it is in, actually did. You simply don't return it to users, and you have enough independent servers in enough independent data centers to achieve a specific availability target. It's really not difficult.
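To make that concrete, here is a minimal sketch of the idea in Python. Everything in it is an assumption for illustration: the `Server` type, the `resolve` and `combined_availability` names, and the premise that each server's `available` flag comes from some health check and that data centers fail independently. It is not a description of any particular production setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    address: str
    data_center: str
    available: bool  # latest health-check result; the cause of a failure is irrelevant


def resolve(servers):
    """Return addresses of servers that are currently available, and nothing else."""
    return [s.address for s in servers if s.available]


def combined_availability(per_site, sites):
    """Chance that at least one of `sites` sites is up, assuming independent failures."""
    return 1 - (1 - per_site) ** sites


servers = [
    Server("192.0.2.10", "dc-a", available=False),  # truck, raid, power cut: same outcome
    Server("198.51.100.20", "dc-b", available=True),
    Server("203.0.113.30", "dc-c", available=True),
]
print(resolve(servers))
print(combined_availability(0.99, 3))
```

The resolver never asks why a server is down, only whether it is. The second function shows why independence matters: under that assumption, three sites that are each up 99% of the time are all down together only about one time in a million.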

I understand that most of those leetcode corporations don't care much about resilience, are likely even incapable of producing highly reliable systems, and may give you a false impression that reliability is something of an unachievable fantasy. But it's not. It's something we have done enough research on and can do really well today if needed. We are not in the Titanic era anymore.

I have high confidence in these things (not in "predicting the unforeseeable"), because I've done them myself. My edge infrastructure had like half an hour of downtime total in many years, almost a decade already.
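For a sense of scale, the figure above can be turned into an availability percentage. This is a rough sketch only: it assumes exactly 30 minutes of downtime over exactly 10 years, while the comment says "like half an hour" and "almost a decade", and the `availability` helper is a hypothetical name.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def availability(downtime_minutes, years):
    """Fraction of time the service was up over the given period."""
    return 1 - downtime_minutes / (years * MINUTES_PER_YEAR)


print(f"{availability(30, 10):.5%}")  # prints 99.99943%
```

Under those assumptions that is a little better than "five nines".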



