How is this "proven"? Can you point to some benchmarks that demonstrate this? Everything I've seen thus far has been fairly mediocre/terrible outside of a few cherry-picked prompt/response combinations.
They're awesome technical achievements and will likely improve of course but you're making some very grand statements.
There are benchmarks in the original LLaMA paper [1]. Specifically, on page 4, LLaMA 13B seems to beat GPT-3 on the BoolQ, HellaSwag, WinoGrande, ARC-e and ARC-c benchmarks (not by much, though). The examples you've seen are likely based on some form of quantisation or a poor prompt, either of which degrades the output. My understanding is that the only quantisation that doesn't seem to hurt the output is LLM.int8() by Tim Dettmers. You should be able to run LLaMA 13B (8-bit quantised) on a 3090 or 4090 consumer-grade GPU as of now. Also, you'd need a prompt preset such as "LLaMA precise" [2] in order to get ChatGPT-like output.
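For anyone wanting to try this themselves, here's a rough sketch of what that setup looks like with Hugging Face transformers + bitsandbytes. The model path is a placeholder for wherever your converted 13B weights live, and the sampling values are my recollection of the text-generation-webui "LLaMA-Precise" preset, so treat them as a starting point rather than gospel:

```python
# Sketch: LLaMA 13B in 8-bit (LLM.int8() via bitsandbytes) on a single 24 GB GPU.
# Assumes transformers, accelerate and bitsandbytes are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/llama-13b-hf"  # placeholder: converted LLaMA 13B weights

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,   # LLM.int8() quantisation via bitsandbytes
    device_map="auto",   # should fit on a 3090/4090
)

prompt = "Explain why the sky is blue in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# "LLaMA precise"-style sampling: the very low top_p keeps output focused
output = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.1,
    top_k=40,
    repetition_penalty=1.18,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

With full fp16 weights the 13B model wants ~26 GB, which is why the 8-bit path matters here: it roughly halves that, just squeezing onto a single consumer card.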
I had a similar impression from what I saw. Maybe it does perform as well as GPT-3 on the narrow tasks it was explicitly tuned for, but that parity seems to collapse as soon as you go off the beaten track and give it harder tasks involving significant reasoning. Consistent with that, I've seen a few different sources claim that a small model fine-tuned on the outputs of a large one would likely struggle with unfamiliar tasks or contexts that require transfer learning or abstraction.
Having seen how it actually performs in practice, I find it hard to trust these benchmarks as reliable measures of model quality.