
How do people determine the "current rate of progress"? There is absolutely no empirical standard to evaluate the performance of AI systems. How is this anything other than a gut feeling? And how is that feeling different from the one in any other period? Minsky et al famously declared that AGI was only a few months of hard work away, and they did so for the same reason: they lived through a period of dynamism in computer science. People definitely said it after Deep Blue beat Kasparov.

Progress in AI doesn't imply that we're dangerously close to AGI just because people at any given time are amazed by the individual breakthroughs they witness.




> There is absolutely no empirical standard to evaluate the performance of AI systems. How is this anything other than a gut feeling?

Why do you think this?

There are loads of tests of their performance. A common one right now is to give LLMs the same exams we put humans through, leading to e.g. the graph on page 6 of the GPT-4 technical report: https://arxiv.org/pdf/2303.08774.pdf

Are they the best tests? Probably not! But they are definitely empirical.
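
To make "empirical" concrete, here is a minimal sketch of such an exam-style evaluation. The ask_model function is a hypothetical placeholder for whatever LLM API you call, and the two questions are invented for illustration:

    # Minimal sketch of an exam-style LLM evaluation.
    # ask_model is a hypothetical stand-in for whatever LLM API you use;
    # the questions are invented examples, not a real exam.
    exam = [
        {"q": "What is the derivative of x^2?",
         "choices": ["x", "2x", "x^2/2", "2"], "answer": "2x"},
        {"q": "Which planet is closest to the Sun?",
         "choices": ["Venus", "Mercury", "Earth", "Mars"], "answer": "Mercury"},
    ]

    def ask_model(prompt: str) -> str:
        """Placeholder: send prompt to your model and return its reply."""
        raise NotImplementedError

    def exam_score(exam) -> float:
        correct = 0
        for item in exam:
            prompt = (item["q"] + "\nChoices: " + ", ".join(item["choices"])
                      + "\nReply with exactly one choice.")
            correct += ask_model(prompt).strip() == item["answer"]
        return correct / len(exam)  # fraction answered correctly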


But LLMs are good at those tests because they've seen (some version of) the answers on the internet.

Give students access to the internet during the exam and I'm sure they could pass all sorts of tests.


An irrelevant counterargument, IMO.

First, students only get good after studying — education is not some magic spell cast by the teacher that only operates on a human's immortal soul. As we should not dismiss what students learn just because we could look it up, it is strange to dismiss what GPT has learned just because it could be looked up.

Second, the GPT-3 (and presumably also GPT-4) training set is about 500e9 tokens, which is what? Something like just a few terabytes?

We've been able to carry that much data in a pocket for years without being able to do almost any of the things GPT can do (arbitrary natural-language synthesis, let alone arbitrary natural-language queries), even when we programmed the rules ourselves; in this case, the program learned the rules from the content.
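
Back-of-envelope, assuming roughly 4 bytes of raw text per token (a common rule of thumb; the true ratio depends on the tokenizer and the corpus):

    # How big is a 500-billion-token training set on disk?
    # Assumes ~4 bytes of raw text per token, a rough rule of thumb.
    tokens = 500e9
    bytes_per_token = 4
    print(tokens * bytes_per_token / 1e12, "TB")  # -> 2.0 TB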

Even just a few years ago, SOTA NLP was basically "count up how many good words and bad words are in the text; the sentiment score is total good minus total bad."
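
That lexicon-counting approach fits in a few lines. A toy version, with invented word lists standing in for a real lexicon such as AFINN:

    # Toy lexicon-based sentiment scorer of the kind described above:
    # score = (# good words) - (# bad words). The word lists are invented;
    # real lexicons such as AFINN are far larger and weighted.
    GOOD = {"good", "great", "excellent", "love", "wonderful"}
    BAD = {"bad", "terrible", "awful", "hate", "poor"}

    def sentiment(text: str) -> int:
        words = text.lower().split()
        return sum(w in GOOD for w in words) - sum(w in BAD for w in words)

    print(sentiment("the food was great but the service was awful"))  # -> 0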

That difference is what these test scores are showing.


> How do people determine the "current rate of progress"? There is absolutely no empirical standard to evaluate the performance of AI systems.

I would measure using something similar to Yudkowsky's challenge: "What is the *least* impressive feat that you would bet big money at 9-1 odds *cannot possibly* be done in 2 years?" [1]

Pay a panel of experts to list their predictions each year, with an incentive to get them right, and then measure the percentage of those predictions that fail anyway.
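
Scoring the panel is then simple arithmetic: the rate of progress is the fraction of "cannot possibly be done" bets that fail. A sketch, with invented records:

    # Sketch of scoring the expert panel: each record says whether a feat
    # an expert bet "cannot possibly be done in 2 years" was done anyway.
    # All records are invented for illustration.
    bets = [
        {"expert": "A", "feat": "pass a graduate physics exam", "done": True},
        {"expert": "B", "feat": "win a Codeforces division 1 round", "done": False},
        {"expert": "C", "feat": "drive coast to coast unassisted", "done": False},
    ]
    failure_rate = sum(b["done"] for b in bets) / len(bets)
    print(f"{failure_rate:.0%} of 'impossible' feats happened anyway")  # -> 33%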

[1] https://twitter.com/ESYudkowsky/status/910566159249899520


Why wouldn't we be able to evaluate their performance and compare them to humans? The purpose of test datasets is to do just that, and new ones are created every day. By combining several of them, we can create a decent benchmark. We could even include robotic abilities, but I don't think that's necessary.

Let's say: adversarial Turing test + MMLU + coding competence (e.g. APPS or Leetcode) + ARC (IQ-type test) + Montezuma's Revenge and other games like Stratego or Diplomacy + USMLE (medical exam) + IMO (math) + self driving + ...
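
One simple way to roll those up, as a sketch: normalize each benchmark score against a human-expert baseline and average. All numbers below are invented placeholders:

    # Sketch: normalize each benchmark score against a human-expert
    # baseline, then average into one index. All numbers are invented.
    results = {
        "MMLU":  {"model": 0.86, "human": 0.90},
        "APPS":  {"model": 0.40, "human": 0.95},
        "ARC":   {"model": 0.30, "human": 0.85},
        "USMLE": {"model": 0.80, "human": 0.88},
    }
    index = sum(r["model"] / r["human"] for r in results.values()) / len(results)
    print(f"composite score relative to human experts: {index:.2f}")  # -> 0.66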

You can even make it harder: have human judges blindly evaluate new scientific papers in math or theoretical physics for acceptance, see if AI can create highly-rated new apps, write a highly-rated book, compose a hit song...



