When I was in Google SRE we had monitoring and enforcement of permitted and forbidden RPC peers, such that a system that attempted to call a peer it was not permitted to use would either fail or trigger alerts. This was useful at the top of the stack to keep track of dependencies silently added by library authors, and at the low levels to ensure that the things at the bottom of the stack really were at the bottom. We also did virtual automated cluster turn-up and turn-down, to make sure our documented procedures did not get out of date, and in my 6 years in SRE I saw the time that procedure took fall from 90 days to under an hour. We also regularly exercised from-scratch restarts of things like global encryption key management, which involves a physical object. The annual DiRT exercise also tried to make sure that no single person, team, or office was necessary to the continued functioning of systems.
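
To make the permitted/forbidden peer idea concrete, here is a minimal sketch in Python. It is not how Google's system worked; the names (`PeerPolicy`, `Verdict`, and the service names) are all hypothetical, and the choice to alert on unknown peers rather than fail them is just one possible policy.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    ALERT = "alert"  # call proceeds, but the new dependency is reported
    DENY = "deny"    # call fails outright


@dataclass
class PeerPolicy:
    """Hypothetical policy mapping each caller to the callees it may or may not use."""
    permitted: dict[str, set[str]] = field(default_factory=dict)
    forbidden: dict[str, set[str]] = field(default_factory=dict)

    def check(self, caller: str, callee: str) -> Verdict:
        # Explicitly forbidden edges fail, e.g. a low-level service calling upward.
        if callee in self.forbidden.get(caller, set()):
            return Verdict.DENY
        # Explicitly permitted edges pass silently.
        if callee in self.permitted.get(caller, set()):
            return Verdict.ALLOW
        # Anything else is an undeclared dependency: surface it instead of
        # silently accepting it.
        return Verdict.ALERT


# Illustrative (made-up) services: "lockservice" sits at the bottom of the stack.
policy = PeerPolicy(
    permitted={"frontend": {"auth", "storage"}, "storage": {"lockservice"}},
    forbidden={"lockservice": {"frontend", "storage"}},
)
```

With this policy, `policy.check("frontend", "auth")` is allowed, `policy.check("lockservice", "storage")` is denied because the bottom of the stack must not depend on things above it, and `policy.check("frontend", "billing")` raises an alert because nobody declared that dependency.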