
I recently attended a talk by Netflix on building fault-tolerant systems. From what I could tell, there wasn't an architecture as such: everything could talk to everything else, with a thin API layer on top. They achieve fault tolerance by not having an ops team; instead they have a resiliency team that monitors everything and uses various monkeys to try to break the application in different ways.

They also put all developers on call, which forces them to write code that can recover from faults, since the monkeys keep trying different ways to break it.
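
To make the idea concrete, here is a rough sketch of what "code that can recover from faults" might look like when something is deliberately breaking it. This is not Netflix's code or the SimianArmy API; `chaotic`, `ChaosError`, `call_with_fallback` and the two recommendation functions are names I made up for illustration, and the 50% failure rate is arbitrary.

```python
import functools
import random


class ChaosError(Exception):
    """Raised when the hypothetical fault injector breaks a call on purpose."""


def chaotic(failure_rate, rng):
    """Wrap a function so that roughly `failure_rate` of its calls fail."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Decide per call whether to inject a failure.
            if rng.random() < failure_rate:
                raise ChaosError(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_fallback(primary, fallback, retries=2):
    """Try `primary` up to retries + 1 times, then use `fallback`."""
    for _ in range(retries + 1):
        try:
            return primary()
        except ChaosError:
            continue  # injected fault: try again
    return fallback()


# Seeded so a run is repeatable; a real injector would not be seeded.
rng = random.Random(42)


@chaotic(failure_rate=0.5, rng=rng)
def fetch_recommendations():
    # Stand-in for a call to some downstream service (assumed).
    return ["personalized list"]


def cached_recommendations():
    # Degraded but safe answer when the service keeps failing.
    return ["popular titles"]


results = [
    call_with_fallback(fetch_recommendations, cached_recommendations)
    for _ in range(100)
]
```

The point is that the caller always gets an answer: either the real one after a retry or a degraded one from the fallback. If the injector runs all the time, code that lacks this kind of path gets noticed quickly by whoever is on call.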

tl;dr Netflix's systems are chaotic, because chaotic systems tend to tolerate failures.



Yes, building your approach around chaos monkeys (https://github.com/Netflix/SimianArmy) has some real advantages.



