I'm not aware of any specific reading... but I'll share a few more thoughts:
The key is to not make it a blame game. Every bug that is discovered on staging or production should (in theory) trigger a root cause analysis of some kind... because it means that one of the earlier processes failed.
If you have manual QA people, it's mostly just improving the QA plan and adding edge cases and domain knowledge, but it can also sometimes be that there are data bugs or integrations on production that don't exist on staging, or that behave slightly differently there, etc.
Every bug comes with an action plan for how to fix it, but that is separate from the knowledge we gain about our system because the bug occurred. Maybe the fix is to get someone to manually clean up some dirty data that wound up on production due to someone forgetting to validate/clean input data ... that's great, but from the QA perspective you might have learned that several of your steps failed.
So I think the main thing is treating bugs as opportunities to learn about the system as a whole. The philosophy behind Chaos Monkey at Netflix is that even a well-tested, solid system needs to be resilient, so any opportunity to make your system stronger (regardless of the cause) is a good thing. In particular, any bug found before it hits production is a win overall.
I'd also add that it's important to let the knowledge flow back out of QA and into the product team, etc. QA people often end up becoming internal domain experts who catch lots of issues, but that quickly exceeds what one person can remember/understand as a system scales, so organizational learning and shared practices pay off big.
I really appreciate your comment. One of the more exciting parts of expanding has been getting the opportunity to think about systems and processes at a higher level now that I'm not the only one fighting fires. I'll take this advice to heart.