Really outstanding article, OP. Well done. Slacked it to the rest of our engineering team. We're using Google Kubernetes to solve some of the fail-over/self-healing challenges you noted, along with some of the other ideas (proxies, data store fit-to-function, etc.), but there are a couple of things in here that we either hadn't thought of (Nagle's algorithm) or needed to be reminded of (timestamps at every layer, the cost of logging/eventing). Thanks for the good lunchtime read.