
How do people determine the "current rate of progress"? There is absolutely no empirical standard to evaluate the performance of AI systems. How is this anything other than a gut feeling? And how is that feeling different from the one in any other period? Minsky et al famously declared that AGI was only a few months of hard work away, and they did so for the same reason: they lived through a period of dynamism in computer science. People definitely said it after Deep Blue beat Kasparov.

Progress in AI doesn't imply that we're dangerously close to AGI just because people at any given time are amazed by the individual breakthroughs they witness.




> There is absolutely no empirical standard to evaluate the performance of AI systems. How is this anything other than a gut feeling?

Why do you think this?

There are loads of tests of their performance. A common one right now is to give LLMs the same exams we put humans through, leading to e.g. the graph on page 6 of the GPT-4 technical report: https://arxiv.org/pdf/2303.08774.pdf

Are they the best tests? Probably not! But they are definitely empirical.
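
To make "empirical" concrete, here is a minimal sketch of such an exam-style evaluation. The ask_model function is a hypothetical placeholder for whatever LLM API you call, and the two questions are invented for illustration:

    # Minimal sketch of an exam-style LLM evaluation.
    # ask_model is a hypothetical stand-in for whatever LLM API you use;
    # the questions are invented examples, not a real exam.
    exam = [
        {"q": "What is the derivative of x^2?",
         "choices": ["x", "2x", "x^2/2", "2"], "answer": "2x"},
        {"q": "Which planet is closest to the Sun?",
         "choices": ["Venus", "Mercury", "Earth", "Mars"], "answer": "Mercury"},
    ]

    def ask_model(prompt: str) -> str:
        """Placeholder: send prompt to your model and return its reply."""
        raise NotImplementedError

    def exam_score(exam) -> float:
        correct = 0
        for item in exam:
            prompt = (item["q"] + "\nChoices: " + ", ".join(item["choices"])
                      + "\nReply with exactly one choice.")
            correct += ask_model(prompt).strip() == item["answer"]
        return correct / len(exam)  # fraction answered correctly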


But LLMs are good at those tests because they've seen (some version of) the answers on the internet.

Give students access to the internet during the exam and I'm sure they could pass all sorts of tests.


An irrelevant counterargument, IMO.

First, students only get good after studying — education is not some magic spell cast by the teacher that only operates on a human's immortal soul. As we should not dismiss what students learn just because we could look it up, it is strange to dismiss what GPT has learned just because it could be looked up.

Second, the GPT-3 (and presumably also GPT-4) training set is about 500e9 tokens, which is what? Something like just a few terabytes?

We've been able to carry that much data in a pocket for years without being able to do almost any of the things GPT can do (arbitrary natural-language synthesis, let alone arbitrary natural-language queries), even when we programmed the rules ourselves; in this case, the program learned the rules from the content.
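
Back-of-envelope, assuming roughly 4 bytes of raw text per token (a common rule of thumb; the true ratio depends on the tokenizer and the corpus):

    # How big is a 500-billion-token training set on disk?
    # Assumes ~4 bytes of raw text per token, a rough rule of thumb.
    tokens = 500e9
    bytes_per_token = 4
    print(tokens * bytes_per_token / 1e12, "TB")  # -> 2.0 TB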

Even just a few years ago, SOTA NLP was basically "count up how many good words and bad words are in the text; the sentiment score is total good minus total bad."
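
That lexicon-counting approach fits in a few lines. A toy version, with invented word lists standing in for a real lexicon such as AFINN:

    # Toy lexicon-based sentiment scorer of the kind described above:
    # score = (# good words) - (# bad words). The word lists are invented;
    # real lexicons such as AFINN are far larger and weighted.
    GOOD = {"good", "great", "excellent", "love", "wonderful"}
    BAD = {"bad", "terrible", "awful", "hate", "poor"}

    def sentiment(text: str) -> int:
        words = text.lower().split()
        return sum(w in GOOD for w in words) - sum(w in BAD for w in words)

    print(sentiment("the food was great but the service was awful"))  # -> 0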

That difference is what these test scores are showing.


> How do people determine the "current rate of progress"? There is absolutely no empirical standard to evaluate the performance of AI systems.

I would measure using something similar to Yudkowsky's challenge: "What is the *least* impressive feat that you would bet big money at 9-1 odds *cannot possibly* be done in 2 years?" [1]

Pay a panel of experts to list their predictions each year, with an incentive to get them right, and then measure the percentage of those predictions that fail anyway.
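
Scoring the panel is then simple arithmetic: the rate of progress is the fraction of "cannot possibly be done" bets that fail. A sketch, with invented records:

    # Sketch of scoring the expert panel: each record says whether a feat
    # an expert bet "cannot possibly be done in 2 years" was done anyway.
    # All records are invented for illustration.
    bets = [
        {"expert": "A", "feat": "pass a graduate physics exam", "done": True},
        {"expert": "B", "feat": "win a Codeforces division 1 round", "done": False},
        {"expert": "C", "feat": "drive coast to coast unassisted", "done": False},
    ]
    failure_rate = sum(b["done"] for b in bets) / len(bets)
    print(f"{failure_rate:.0%} of 'impossible' feats happened anyway")  # -> 33%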

[1] https://twitter.com/ESYudkowsky/status/910566159249899520


Why wouldn't we be able to evaluate their performance and compare them to humans? The purpose of test datasets is to do just that, and new ones are created every day. By combining several of them, we can create a decent benchmark. We could even include robotic abilities, but I don't think that's necessary.

Let's say: adversarial Turing test + MMLU + coding competence (e.g. APPS or Leetcode) + ARC (IQ-type test) + Montezuma's Revenge and other games like Stratego or Diplomacy + USMLE (medical exam) + IMO (math) + self driving + ...
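
One simple way to roll those up, as a sketch: normalize each benchmark score against a human-expert baseline and average. All numbers below are invented placeholders:

    # Sketch: normalize each benchmark score against a human-expert
    # baseline, then average into one index. All numbers are invented.
    results = {
        "MMLU":  {"model": 0.86, "human": 0.90},
        "APPS":  {"model": 0.40, "human": 0.95},
        "ARC":   {"model": 0.30, "human": 0.85},
        "USMLE": {"model": 0.80, "human": 0.88},
    }
    index = sum(r["model"] / r["human"] for r in results.values()) / len(results)
    print(f"composite score relative to human experts: {index:.2f}")  # -> 0.66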

You can even make it harder: have human judges blindly evaluate new scientific papers in math or theoretical physics for acceptance, see if AI can create highly-rated new apps, write a highly-rated book, compose a hit song...



