Eval is crucial, as that's the only way to agree with the computer that what it is doing matches what you want it to do.
I wouldn't worry about speed. We should expect compute and inference to get faster in the future, to the point where we can easily sample 10k programs in a second and check all of them against test cases.
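To make "sample many programs, check all of them against test cases" concrete, here is a rough sketch. Everything in it is an assumption for illustration: candidates are plain source strings that define a function called `solve`, test cases are (arguments, expected result) pairs, and `passes_all` / `filter_candidates` are made-up helper names. It also runs candidates with a bare `exec`, which you would want to sandbox for real generated code.

```python
from typing import Any

# Hypothetical test-case shape: (arguments, expected result).
TestCase = tuple[tuple[Any, ...], Any]


def passes_all(source: str, tests: list[TestCase], entry: str = "solve") -> bool:
    """Return True if the candidate defines `entry` and it passes every test."""
    namespace: dict[str, Any] = {}
    try:
        # NOTE: exec on generated code is unsafe; a real setup would sandbox this.
        exec(source, namespace)
        fn = namespace[entry]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing entry point, and runtime errors all count as failures.
        return False


def filter_candidates(candidates: list[str], tests: list[TestCase]) -> list[str]:
    """Keep only the candidates that pass every test case."""
    return [src for src in candidates if passes_all(src, tests)]


# Stand-ins for sampled programs; a real sampler would produce thousands.
candidates = [
    "def solve(a, b): return a - b",   # wrong logic
    "def solve(a, b): return a + b",   # correct
    "def solve(a, b) return a + b",    # syntax error
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

survivors = filter_candidates(candidates, tests)
print(survivors)
```

The filter is only as good as the tests: anything the test cases don't pin down, a surviving candidate is free to get wrong.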
I'd worry more about communicating precisely with computers when test cases are awkward to write.
For example, we could cache and reuse the compiler's work on previous codegen candidates to speed up compilation of the next candidate snippet.
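A minimal sketch of that caching idea, under some loud assumptions: each candidate is a list of source snippets, candidates often share snippets (a common helper, say), and Python's built-in `compile` stands in for "the compiler". `CompileCache` and `build` are hypothetical names. This only reuses work at whole-snippet granularity; a real incremental compiler would cache at a finer level.

```python
import hashlib
from types import CodeType
from typing import Any


class CompileCache:
    """Cache compiled code objects, keyed by a hash of the snippet source."""

    def __init__(self) -> None:
        self._cache: dict[str, CodeType] = {}
        self.hits = 0
        self.misses = 0

    def compile(self, source: str) -> CodeType:
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        code = self._cache.get(key)
        if code is None:
            # First time we see this snippet: pay the compilation cost once.
            self.misses += 1
            code = compile(source, "<candidate>", "exec")
            self._cache[key] = code
        else:
            self.hits += 1
        return code


def build(cache: CompileCache, snippets: list[str]) -> dict[str, Any]:
    """Assemble one candidate by executing its (possibly cached) snippets in order."""
    namespace: dict[str, Any] = {}
    for snippet in snippets:
        exec(cache.compile(snippet), namespace)
    return namespace


cache = CompileCache()
helper = "def square(x): return x * x"  # shared by both candidates
cand_a = [helper, "def solve(x): return square(x) + 1"]
cand_b = [helper, "def solve(x): return square(x) - 1"]

ns_a = build(cache, cand_a)
ns_b = build(cache, cand_b)
print(cache.hits, cache.misses)
```

The second candidate only compiles the snippet that actually changed; the shared helper comes straight from the cache.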