What test cases do folks here recommend for measuring this new model's ability to reason? and, specifically, if it can reason about code with similar (or better!) performance to ChatGPT4? Has anyone managed to get it running locally?
OpenAI has been collecting a ton of evals here https://github.com/openai/evals with many of them including some comments about how well GPT-4 does vs GPT-3.5.
You could clone that repo, adapt the oaieval script to run against different APIs, then run the evals against both and compare the results.
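If you don't want to adapt the full evals harness, the core comparison loop is simple enough to sketch yourself. Here's a minimal, self-contained sketch: `complete_a` / `complete_b` are hypothetical stand-ins for real completion calls (e.g. an OpenAI client vs. a local RWKV endpoint) and the two-item eval set is made up for illustration — swap in real clients and a real eval file.

```python
# Minimal sketch of scoring two models on the same eval set.
# complete_a / complete_b are hypothetical stand-ins for real API calls;
# a real version would call the respective model endpoints.

EVAL_SET = [
    {"prompt": "2 + 2 =", "expected": "4"},
    {"prompt": "capital of France?", "expected": "Paris"},
]

def complete_a(prompt: str) -> str:
    # stand-in for model A (pretend it answers both correctly)
    return {"2 + 2 =": "4", "capital of France?": "Paris"}.get(prompt, "")

def complete_b(prompt: str) -> str:
    # stand-in for model B (pretend it misses one)
    return {"2 + 2 =": "4", "capital of France?": "Lyon"}.get(prompt, "")

def accuracy(complete, eval_set) -> float:
    # exact-match scoring, the simplest of the strategies the evals repo uses
    hits = sum(1 for case in eval_set
               if complete(case["prompt"]).strip() == case["expected"])
    return hits / len(eval_set)

if __name__ == "__main__":
    print("model A:", accuracy(complete_a, EVAL_SET))  # 1.0
    print("model B:", accuracy(complete_b, EVAL_SET))  # 0.5
```

Exact-match scoring only works for short factual answers; for longer completions the evals repo also supports fuzzy-match and model-graded evals, which you'd want for code-reasoning tasks.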
"you can fine-tune RWKV into a non-parallelizable RNN (then you can use outputs of later layers of the previous token) if you want extra performance."
Is that 61% using the non-parallelizable RNN mode or the standard mode? I wonder if it's the latter.
This new model may be a viable alternative to ChatGPT, which is not only closed source but can also be shut down in the future, just as OpenAI did with the older text-davinci models.
Plus, the alignment and safety tuning have rendered ChatGPT useless for areas such as critical analysis of social issues, or any critical thinking that goes against the aligned views of those who own and program it. This could be a viable free (as in freedom) alternative.
I hope not, but day by day it seems more likely. If text-generating LLMs can reach superhuman cognition, they will do so in a matter of a few years. At that point a Waluigi prompt will be like arming a virtual nuclear missile.
Nuance: computers have been accumulating superhuman cognitive abilities for half a century. But most people are bad at recognizing intelligence they don't relate to.