99.99% seems off by orders of magnitude to me. I don't have an exact number, but I routinely see GPT-3.5 hallucinate, which is inconsistent with that level of reliability.
I've noticed this discussion tends to get too theoretical too quickly. I'm uninterested in perfection; 99.99% would be good enough, and 70% wouldn't. The actual number is something specific, knowable, and hopefully improving.
It's definitely not that good if we share a definition of poor data/prompts.
This afternoon I tried to use Codium to autocomplete some capnproto Rust code. Everything it generated was wrong. For example, it called member functions on non-existent structs instead of the correct free functions.
But I'll give it some credit: that's an obscure library in a less popular language.