
You know the one thing that has helped me out the most? An error reporting service AND then addressing _every_ error.

That is to say, my service should emit zero 500 errors.

Then my reporting is easy to interpret and consistently meaningful. I don't have to worry about bullshit noise like "oh, that's just X, it does that sometimes."
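For what it's worth, here's a minimal sketch of the "report everything, then return a clean 500" end of this. It assumes a Flask app; report_error() is a hypothetical placeholder for whatever reporting service you actually use, the rest is standard Flask/werkzeug.

```python
# Minimal sketch: report every unhandled exception, then return a clean 500.
import traceback

from flask import Flask, jsonify
from werkzeug.exceptions import HTTPException

app = Flask(__name__)


def report_error(exc: Exception) -> None:
    # Placeholder: forward the exception to your actual error-reporting service here.
    print("ERROR:", "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)))


@app.errorhandler(Exception)
def handle_unexpected_error(exc):
    if isinstance(exc, HTTPException):
        return exc.get_response()  # 4xx and friends pass through untouched
    report_error(exc)              # every 500 shows up in the report, no exceptions to the rule
    return jsonify({"error": "internal server error"}), 500
```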

Sleeping at night is a lot easier when you have less keeping you awake.




This. I have a really hard time measuring it, but ever since we really worked on error reporting our weekend sleep factor has greatly improved.

For a complex system though, don't underestimate how hard this is to do:

- Every cloud service needs to be routed to a common service.

- All of your software, in every language, even that cool Go experiment.

- All of the third-party software.

- Logs all have to agree on a format, and JSON is not always an option (a rough sketch of that normalization step is below).
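To make that last point concrete, here's a sketch of the normalization step. The envelope fields are made up, not any standard; the idea is just that every source, structured or not, gets coerced into one shape before it hits the central service.

```python
# Sketch: wrap whatever each source emits (JSON or plain text) in one common
# envelope before shipping it to the central service. Field names are illustrative.
import json
from datetime import datetime, timezone


def normalize(raw_line: str, source: str) -> dict:
    """Coerce one log line from any source into a common envelope."""
    try:
        payload = json.loads(raw_line)          # structured sources (JSON logs)
        level = str(payload.get("level", "INFO")).upper()
        message = payload.get("message", raw_line)
    except (ValueError, AttributeError):
        payload = {"raw": raw_line}             # plain-text sources (third-party, syslog, ...)
        level = "ERROR" if "error" in raw_line.lower() else "INFO"
        message = raw_line.strip()

    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "level": level,
        "message": message,
        "payload": payload,
    }
```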

Finally ... justifying the time spent fixing things that have no observable side effects. Most cloud stuff is reliable against first-order failures and so tolerates a lot; it's designed that way. But once the wheels come off, and they will come off, buckle up if you haven't been fixing those errors. If you aren't clean on second-order failures, you're in for a rough ride.


We use AWS, and one benefit of their hosted Elasticsearch is that they can build you a Lambda that syncs CloudWatch logs to ES, handling a variety of different formats. So we have our Beanstalk web requests + some Lambda infra + our main web backend etc. all synced to ES with very little effort.
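If you ever end up rolling your own instead, this is roughly what such a CloudWatch-to-ES streamer does. The endpoint, index naming, and auth are hand-waved here (the AWS-generated function also signs requests with SigV4 and handles retries/batching), but the subscription payload format is the real one.

```python
# Rough sketch of a CloudWatch Logs -> Elasticsearch streaming Lambda.
import base64
import gzip
import json

import requests  # assumed to be bundled with the deployment package

ES_ENDPOINT = "https://search-mydomain.us-east-1.es.amazonaws.com"  # hypothetical


def handler(event, context):
    # CloudWatch Logs subscriptions deliver base64-encoded, gzipped payloads.
    data = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(data))

    # One index per log group, e.g. /aws/lambda/foo -> cwl-aws-lambda-foo
    index = "cwl-" + payload["logGroup"].strip("/").replace("/", "-").lower()

    bulk_lines = []
    for log_event in payload["logEvents"]:
        bulk_lines.append(json.dumps({"index": {"_index": index}}))
        bulk_lines.append(json.dumps({
            "@timestamp": log_event["timestamp"],
            "logGroup": payload["logGroup"],
            "logStream": payload["logStream"],
            "message": log_event["message"],
        }))

    resp = requests.post(
        f"{ES_ENDPOINT}/_bulk",
        data="\n".join(bulk_lines) + "\n",
        headers={"Content-Type": "application/x-ndjson"},
        timeout=10,
    )
    resp.raise_for_status()
```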

You do have the downside that the logs don't have, e.g., a nicely unified structure, but that also has the upside that the structure is closer to what the dev is used to, so nobody ever needs to go back to CloudWatch or any other logs to get more details or a less processed message. The other downside is that you have to write a different monitor for each index, though that in turn lets you have different triggers per index. In our small team we just message different Slack channels, which makes for a nice lightweight opt-in/out for each error type.
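A rough sketch of what one of those per-index monitors can look like, assuming ES 7+ and standard Slack incoming webhooks; the index patterns, the query fields, and the webhook URLs are all made up.

```python
# Sketch: one monitor per index pattern, one Slack channel per error type.
import requests

ES_ENDPOINT = "https://search-mydomain.us-east-1.es.amazonaws.com"  # hypothetical

MONITORS = {
    # index pattern     -> Slack incoming webhook (hypothetical URLs)
    "web-backend-*":    "https://hooks.slack.com/services/T000/B000/web",
    "lambda-infra-*":   "https://hooks.slack.com/services/T000/B000/lambda",
}

QUERY = {
    "query": {
        "bool": {
            "must": [
                {"range": {"@timestamp": {"gte": "now-5m"}}},
                {"match": {"level": "ERROR"}},
            ]
        }
    },
    "size": 1,
    "sort": [{"@timestamp": "desc"}],
}


def run_monitors():
    for index, webhook in MONITORS.items():
        resp = requests.post(f"{ES_ENDPOINT}/{index}/_search", json=QUERY, timeout=10)
        resp.raise_for_status()
        hits = resp.json()["hits"]
        total = hits["total"]["value"]
        if total == 0:
            continue
        sample = hits["hits"][0]["_source"].get("message", "<no message>")
        requests.post(webhook, json={
            "text": f":rotating_light: {total} errors in `{index}` (last 5m)\n> {sample}",
        }, timeout=10)


if __name__ == "__main__":
    run_monitors()
```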

It’d definitely be tricky to get everything aligned in, e.g., the same JSON format, but this sort of middle ground isn’t too hard and still has benefits - you just need to already be syncing in some format to CloudWatch, which if you’re in AWS you probably are.


Totally agree. In my experience, "that's just X, it does that sometimes" errors have been symptoms of some of the scariest bugs in the system we've been working on. A couple of examples:

- a caching issue that was a "just X" on a single server, but took product search (and by extension most of the business) offline if two servers happened to encounter the same problem at the same time.

- a "just X" on user logins, which turned out to be a non-thread-safe piece of code that resulted in complete outage of all authn/authz-related actions once demand hit a critical point.

On top of that, having a culture where it's okay to leave some errors unfixed is tremendously damaging to team values. I've not seen a team with this attitude where the number of "just X" errors wasn't steadily increasing, with many of the newer ones being quite obvious and customer-affecting problems.


I like to keep a Slack channel (and a saved Kibana search) for 500s for this exact reason. System-wide we should have no 500s, and when they happen I like to tackle them immediately. I also have daily reports for various other errors, like caught exceptions, invalid auths, etc., just so I can see where things aren't going quite right, in case it's indicative of something weird going on.
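Something like this works for the daily-report side, as a sketch; the index pattern and the error_type field are assumptions, but the terms aggregation is standard ES.

```python
# Sketch of a daily "where are things not quite right" report: count the last
# 24h of non-500 errors by type with a terms aggregation and print a summary.
import requests

ES_ENDPOINT = "https://search-mydomain.us-east-1.es.amazonaws.com"  # hypothetical

REPORT_QUERY = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-24h"}}},
    "aggs": {
        "by_type": {"terms": {"field": "error_type.keyword", "size": 20}},
    },
}


def daily_report() -> str:
    resp = requests.post(f"{ES_ENDPOINT}/app-errors-*/_search", json=REPORT_QUERY, timeout=10)
    resp.raise_for_status()
    buckets = resp.json()["aggregations"]["by_type"]["buckets"]
    if not buckets:
        return "No non-500 errors in the last 24h."
    lines = [f"{b['doc_count']:>6}  {b['key']}" for b in buckets]
    return "Errors by type (last 24h):\n" + "\n".join(lines)


if __name__ == "__main__":
    print(daily_report())
```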



