Tests have dependencies. Crawling all of those dependencies to check for malicious code could require inspecting millions of lines of code, if you could even obtain the code.
It's also beginning to sound like needing to solve the halting problem.
Look, I know you have a lot invested in this project but I don't see why you think it is somehow unreasonable to expect an AI agent to run tests in a repository. You don't need super intelligence for that.
It's also beginning to sound like needing to solve the halting problem.