Among experienced teams, most failures aren't caused by single-node/single-service errors. They've already designed & tested for those cases, and the ability to handle them is baked into the architecture.

The interesting failures are caused by a cascade of errors - someone writes an innocent bug, which causes a single-node fault, which exercises some pathway in the fault recovery code that has unintended side-effects, which results in an unexpected condition elsewhere in the system.



