It's starting to seem like they use some sort of message bus to tie everything together. If the servers actually processing messages can't handle the volume, the entire show comes grinding to a halt. I've seen similar things in other industries, where the entire system looks like it's perfectly healthy, and then before you know it 100% of your systems are down because you simply can't push peak message volume or because one participant is looping messages endlessly.
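To put rough numbers on the "looks healthy, then falls over" part, here's a toy sketch. The function name and the rates are made up for illustration and have nothing to do with their actual setup. It just tracks queue depth when messages arrive faster than the consumers can drain them.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick, ticks):
    """Return the queue depth after each tick (hypothetical toy model)."""
    depth = 0
    history = []
    for _ in range(ticks):
        depth += arrivals_per_tick              # new messages land on the bus
        depth -= min(depth, capacity_per_tick)  # consumers drain what they can
        history.append(depth)
    return history


# Below capacity the queue stays empty, so everything looks fine.
healthy = simulate_backlog(arrivals_per_tick=80, capacity_per_tick=100, ticks=5)
# Slightly above capacity the backlog grows every tick and never recovers.
overloaded = simulate_backlog(arrivals_per_tick=120, capacity_per_tick=100, ticks=5)

print(healthy)     # [0, 0, 0, 0, 0]
print(overloaded)  # [20, 40, 60, 80, 100]
```

A participant that re-sends messages in a loop has the same effect as bumping arrivals_per_tick: once arrivals exceed capacity, the backlog only grows.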
Their director of infra tried to recruit me and basically said that they wanted to replace every relational data store with message queues... that struck me as a bit weird and overzealous.
Seems like infra problems.