> 1) tests take too long to run 2) difficult to root cause breakages.

I have gotten a ton of value out of targeting these problems specifically. Some low-cost, high-value test infra changes:

1. Passing tests are not allowed to emit error-level logs unless the test expects them, and if an expected error does not occur, the test fails. This is good in general, but it also makes root-causing test failures easier: if you encounter an unexpected error log, you can attach it to the failure message, and often that pinpoints the root cause.

2. Have lots of off-by-default instrumentation. For example, if a test fails, auto-rerun it with sanitizers on.

3. In tests, gather backtraces when moving tasks between threads. This makes identifying what triggered an error straightforward.

4. Collect data during test runs, like line coverage, in a sequential order that can be diffed between passing and failing runs.
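To make item 1 concrete, here is a rough sketch using Python's standard `logging` module. The names `ErrorLogCollector` and `error_log_guard` are made up for illustration, and matching expected errors by substring is just one assumption about how a test might declare them; a real harness would likely hook this into its fixture or runner layer.

```python
import logging
from contextlib import contextmanager


class ErrorLogCollector(logging.Handler):
    """Hypothetical handler that records every log record at ERROR or above."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.records = []

    def emit(self, record):
        self.records.append(record)


@contextmanager
def error_log_guard(expected=()):
    """Fail the enclosed test body on unexpected or missing error logs.

    `expected` is a collection of substrings (an assumption of this sketch);
    each one must appear in at least one error message logged by the body.
    """
    collector = ErrorLogCollector()
    root = logging.getLogger()
    root.addHandler(collector)
    try:
        yield collector
    finally:
        root.removeHandler(collector)

    # Only reached when the body itself did not raise.
    messages = [r.getMessage() for r in collector.records]
    unexpected = [m for m in messages if not any(e in m for e in expected)]
    missing = [e for e in expected if not any(e in m for m in messages)]
    if unexpected:
        raise AssertionError(f"unexpected error logs: {unexpected}")
    if missing:
        raise AssertionError(f"expected error logs did not occur: {missing}")
```

Usage would look like `with error_log_guard(expected=["disk full"]): run_the_test()`. If the body raises on its own, that failure propagates unchanged; a fuller version could append `collector.records` to the failure message, which is where the root-causing benefit comes from.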

As for making tests fast, the solution is aggressive optimization, just like prod code: profile the test and fix what makes it slow. Cut corners where it makes sense (should you ever fsync in a test?). VM snapshotting can be very useful here for things that are slow to start up.
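For the "profile the test" part, a minimal sketch with Python's built-in `cProfile` and `pstats` is below. `profile_test` is a hypothetical helper name, and sorting by cumulative time is just one reasonable default for spotting slow setup code.

```python
import cProfile
import io
import pstats


def profile_test(test_fn, top=10):
    """Run `test_fn` under cProfile and return a text report.

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        test_fn()
    finally:
        # Always stop profiling, even if the test fails.
        profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()
```

Calling `print(profile_test(some_slow_test))` gives a quick view of where the time goes. On the fsync question, one way to cut that corner in Python tests would be wrapping the test in `unittest.mock.patch("os.fsync")`, assuming the code under test calls `os.fsync` directly.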