I've recently been implementing a language based on a conformance test instead of a spec, and it's really hard. I often have two tests that do different things with seemingly similar inputs, and I struggle to work out the underlying reason for the difference in output.