You'd probably separate example tests from validation tests. Test descriptions should also be fed into the prompt to help guide generation, like BDD-style tests.
On test failure, the failure data is fed back into the prompt for another iteration, producing a new generation. Holding the validation tests out of the prompt helps avoid over-fitting to the examples. You can't guarantee correctness this way, but you could probably get pretty close. Humans have the same problem.
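The loop above can be sketched roughly as follows. This is a minimal illustration, not a real system: `fake_llm` is a hypothetical stand-in for an actual model call (it returns a buggy implementation until failure feedback appears in the prompt), and the test names and helper functions are invented for the example. The key structural points are that example tests go into the prompt, failure descriptions are appended on each retry, and validation tests are held out to check for over-fitting.

```python
# Example tests: fed into the prompt, BDD-style (description + check).
EXAMPLE_TESTS = [
    ("adds two positives", lambda f: f(2, 3) == 5),
    ("handles zero", lambda f: f(0, 7) == 7),
]

# Validation tests: held out of the prompt to detect over-fitting.
VALIDATION_TESTS = [
    ("handles negatives", lambda f: f(-2, -3) == -5),
]

def fake_llm(prompt: str) -> str:
    """Hypothetical model stub: buggy first attempt, fixed once the
    prompt contains failure feedback from a previous iteration."""
    if "FAILED" in prompt:
        return "def add(a, b):\n    return a + b"
    return "def add(a, b):\n    return a - b"  # deliberate bug

def run_tests(code: str, tests):
    """Execute generated code and return descriptions of failing tests."""
    ns = {}
    exec(code, ns)
    fn = ns["add"]
    failures = []
    for desc, check in tests:
        try:
            ok = check(fn)
        except Exception:
            ok = False
        if not ok:
            failures.append(desc)
    return failures

def generate_with_feedback(spec: str, max_iters: int = 3) -> str:
    # Test descriptions guide the model, BDD-style.
    prompt = spec + "\nTests:\n" + "\n".join(d for d, _ in EXAMPLE_TESTS)
    code = ""
    for _ in range(max_iters):
        code = fake_llm(prompt)
        failures = run_tests(code, EXAMPLE_TESTS)
        if not failures:
            break
        # Feed failure data back into the prompt for the next iteration.
        prompt += "\nFAILED: " + ", ".join(failures)
    return code

code = generate_with_feedback("Write add(a, b).")
assert run_tests(code, EXAMPLE_TESTS) == []
# The held-out validation tests are the over-fitting check.
assert run_tests(code, VALIDATION_TESTS) == []
```

In a real setup the validation failures would trigger a fresh generation rather than more prompt feedback, since feeding them back would defeat their purpose as a hold-out set.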