Feels like we're about a year away from local LLMs that can debug code reliably (hooked into console error output as well), which will be quite an exciting day.
Have you tried Code Llama? How do you know it can't do it already?
In my applications, GPT-4 connected to a VM or SQL engine can and does debug code when given error messages. "Reliably" is very subjective. The main problem I have seen is that it can be stubborn about using outdated APIs, and it's not easy to give it a search result with the correct API. But with a good web search and up-to-date APIs, it can do it.
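For what it's worth, the loop being described is pretty simple to sketch: run the code, capture stderr, and feed the error plus the source back to the model. Here's a minimal, hypothetical Python version; ask_llm() is a stand-in for whatever model call you use (GPT-4, Code Llama, etc.), not any particular library's API:

    import subprocess

    def ask_llm(prompt: str) -> str:
        """Hypothetical: send the prompt to your model and return its reply."""
        raise NotImplementedError

    def debug_loop(path: str, max_attempts: int = 3) -> bool:
        for attempt in range(max_attempts):
            # Run the script and capture its console/error output.
            result = subprocess.run(
                ["python", path], capture_output=True, text=True
            )
            if result.returncode == 0:
                return True  # script ran cleanly; nothing left to fix

            # Feed the source plus the error message back to the model
            # and ask for a corrected version of the whole file.
            source = open(path).read()
            prompt = (
                "This Python script fails with the error below. "
                "Return a corrected version of the whole file.\n\n"
                f"--- script ---\n{source}\n\n--- stderr ---\n{result.stderr}"
            )
            with open(path, "w") as f:
                f.write(ask_llm(prompt))
        return False

The stubborn-about-outdated-APIs problem shows up exactly here: if the model's fix uses a deprecated call, the loop just keeps failing, which is why injecting current docs or search results into the prompt helps so much.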
I'm interested to see general coding benchmarks for Code Llama versus GPT-4.