This tells me you haven't really stress-tested the model. GPT is currently at the stage of "person who is at the meeting, but not really paying attention, so you have to call them out". Once GPT is pushed, it scrambles and falls over for most applications. The failure modes range from contradicting itself and making things up in applications that shouldn't allow it, to ignoring prompts, to simply being unable to perform the task at all.
We have given it extensions, and really the extensions do a lot of the work. The tool that judges the style and correctness of the text from its embedding is doing much of the heavy lifting; GPT essentially just generates the text and a dense representation of it.
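To make that split concrete, here's a minimal sketch of the pattern, not our actual tool: GPT produces the text and the embedding, and a separate judge accepts or rejects the output by comparing its embedding against known-good references. It assumes the pre-1.0 openai Python client; the threshold and reference texts are made-up placeholders.

```python
import numpy as np
import openai  # assumes OPENAI_API_KEY is set in the environment

def generate(prompt: str) -> str:
    # GPT's job: just produce text.
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]

def embed(text: str) -> np.ndarray:
    # GPT's other job: a dense representation of the text.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

def judge(candidate: str, reference_texts: list[str], threshold: float = 0.85) -> bool:
    # The external tool doing the heavy lifting: compare the candidate's
    # embedding against embeddings of known-good text and accept/reject.
    # Threshold is an arbitrary placeholder, not a tuned value.
    cand = embed(candidate)
    refs = [embed(t) for t in reference_texts]
    sims = [float(cand @ r / (np.linalg.norm(cand) * np.linalg.norm(r))) for r in refs]
    return max(sims) >= threshold

draft = generate("Summarize our style guide's rules on passive voice.")
ok = judge(draft, reference_texts=["Use active voice. Keep sentences short."])
print("accepted" if ok else "rejected")
```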
Still waiting to see those plugins rolled out and actual vector DB integration with GPT-4; then we'll see what it can really do. It seems like the more context you give it, the better it does, but the current UI makes it hard to provide that. A rough sketch of what that integration presumably looks like is below.
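This is my guess at the shape of it, not whatever OpenAI actually ships: embed your documents once, retrieve the closest ones per question, and pack them into the prompt. A plain numpy list stands in for a real vector DB, and it reuses the generate()/embed() helpers from the sketch above.

```python
import numpy as np

class ToyVectorStore:
    # In-memory stand-in for a vector database.
    def __init__(self):
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, text: str) -> None:
        self.texts.append(text)
        self.vectors.append(embed(text))

    def search(self, query: str, k: int = 3) -> list[str]:
        # Rank stored texts by cosine similarity to the query embedding.
        q = embed(query)
        sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v in self.vectors]
        top = np.argsort(sims)[::-1][:k]
        return [self.texts[i] for i in top]

store = ToyVectorStore()
for doc in ["Release notes for v2.3 ...", "Internal API reference ...", "Onboarding guide ..."]:
    store.add(doc)

question = "What changed in v2.3?"
context = "\n\n".join(store.search(question))
answer = generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer)
```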
Plus the recursive self-prompting to improve accuracy.
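By which I mean something like the loop below: feed the model its own draft back, ask it to find the mistakes, and have it rewrite for a few rounds. Again just a sketch reusing generate() from above; the round count and prompt wording are arbitrary.

```python
def refine(task: str, rounds: int = 2) -> str:
    # Draft, self-critique, rewrite; repeat a fixed number of rounds.
    draft = generate(task)
    for _ in range(rounds):
        critique = generate(
            f"List any factual errors or contradictions in this answer to '{task}':\n{draft}"
        )
        draft = generate(
            f"Task: {task}\nPrevious answer:\n{draft}\n"
            f"Critique:\n{critique}\nRewrite the answer, fixing every issue raised."
        )
    return draft

print(refine("Explain what a vector database does, in two sentences."))
```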