What test cases do folks here recommend for measuring this new model's ability to reason? and, specifically, if it can reason about code with similar (or better!) performance to ChatGPT4? Has anyone managed to get it running locally?
OpenAI has been collecting a ton of evals here https://github.com/openai/evals with many of them including some comments about how well GPT-4 does vs GPT-3.5.
You could clone that repo, adapt the oaieval script to run against different APIs, then run the evals against both and compare the results.
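If you don't want to adapt the full evals harness, the core comparison loop is simple enough to sketch yourself. Here's a minimal, self-contained sketch: `complete_a` / `complete_b` are hypothetical stand-ins for real completion calls (e.g. an OpenAI client vs. a local RWKV endpoint) and the two-item eval set is made up for illustration — swap in real clients and a real eval file.

```python
# Minimal sketch of scoring two models on the same eval set.
# complete_a / complete_b are hypothetical stand-ins for real API calls;
# a real version would call the respective model endpoints.

EVAL_SET = [
    {"prompt": "2 + 2 =", "expected": "4"},
    {"prompt": "capital of France?", "expected": "Paris"},
]

def complete_a(prompt: str) -> str:
    # stand-in for model A (pretend it answers both correctly)
    return {"2 + 2 =": "4", "capital of France?": "Paris"}.get(prompt, "")

def complete_b(prompt: str) -> str:
    # stand-in for model B (pretend it misses one)
    return {"2 + 2 =": "4", "capital of France?": "Lyon"}.get(prompt, "")

def accuracy(complete, eval_set) -> float:
    # exact-match scoring, the simplest of the strategies the evals repo uses
    hits = sum(1 for case in eval_set
               if complete(case["prompt"]).strip() == case["expected"])
    return hits / len(eval_set)

if __name__ == "__main__":
    print("model A:", accuracy(complete_a, EVAL_SET))  # 1.0
    print("model B:", accuracy(complete_b, EVAL_SET))  # 0.5
```

Exact-match scoring only works for short factual answers; for longer completions the evals repo also supports fuzzy-match and model-graded evals, which you'd want for code-reasoning tasks.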
"you can fine-tune RWKV into a non-parallelizable RNN (then you can use outputs of later layers of the previous token) if you want extra performance."
Is that 61% using the non-parallelizable RNN mode or the standard mode? I wonder if it's the latter.
This new model may be a viable alternative to ChatGPT, which is not only closed source but can also be shut down in the future, just as OpenAI did with the older text-davinci models.
Plus, the alignment and safety tuning have rendered ChatGPT useless for areas such as critical analysis of social issues, or any critical thinking that goes against the aligned views of those who own and program it. This could be a viable free (as in freedom) alternative.
I hope not, but day by day it seems more likely. If text-generating LLMs can reach superhuman cognition, they will do so in a matter of a few years. At that point a Waluigi prompt will be like arming a virtual nuclear missile.
Nuance: computers have been accumulating superhuman cognitive abilities for half a century. But most people are bad at recognizing intelligence they don't relate to.