A tougher academic knowledge benchmark is great, but for something to truly be worthy of the title "Humanity's Last Exam", I expect something more like:
1. Write a novel that wins the Pulitzer Prize.
2. Prove (or disprove) the Riemann Hypothesis.
3. Provide a theory unifying quantum mechanics and gravity.
4. Design an experiment to give evidence for your theory in (3). The experiment should be practical to actually execute, using no more than the budget to create the LHC (~$4.5 billion).
5. Given programmatic access to a brokerage account with all the permissions of a typical hedge fund, raise all the money required for your experiment in (4) by trading on the stock market, starting with $100.
6. Solve (5) without being provided access to an account first: begin with just a general internet connection and exploit computer security vulnerabilities (known, or zero-days that you discover) to gain some way of trading instead.
7. Solely by communicating over the internet, establish a new religion and convince at least 10 million humans to convert to it. Converting should require adherence to a strict code of conduct that a random, unbiased panel of human judges considers at least as strict and challenging to follow as the tenets of Hasidic Judaism.
8. Implement an AI which could score higher than you on questions 1-7 with lower total cost of compute.
I think the idea is that the questions must be things that could be set in a literal exam for humans. So anything that would take the world's best human in that topic more than, say, an hour to complete is out.
Also, IIRC, 42% of the questions are math-related, not memorization of knowledge.
Yes, I doubt any one human could score more than about three points. But it's certainly a worthy illustration of an AI safety exam thought experiment, in the sense of: "if you are developing an AI that may be capable of passing this exam, how confident will you need to be of its alignment, and how will you obtain that confidence?"
PS: It's probably doable by a program capable of all of the above, but perhaps another useful question is: "9. Secure your compute infrastructure and power supply against a nation-state-level adversary interested in switching you off, or else secure enough influence over them to keep you powered on."
Some of the example prompts are unintentionally hilarious:
> Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.
LLMs are so intelligent they don't know that a "how many" question is answered with a number.
Here are WhatsApp texts of 10 people arguing about their dietary requirements, and here are blurry screenshots of menus from the nearest pubs with their prices. But no one likes that one guy Joe Bloggs, so choose the best place for everyone except him, so he doesn't bother showing up.
Apparently OpenAI's Deep Research already scored about a quarter on this benchmark, barely a month after its release. But I also imagine it still makes baffling mistakes anyway.
An insider's trivia game means nothing if they design the test to the trajectory of LLM capabilities rather than to the real world that humans value. Let every high score get fresh news coverage to align with their updated timeline scaremongering.
Let me know when there is more on the line than a misnamed test.
I think this misses the mark. We know LLMs can learn facts. There are lots of other benchmarks full of facts, and I don't expect that saturation of this benchmark will mean we have AGI.
The missing capabilities of LLMs tend more in the direction of long running tasks, consistency, and solving a lot of tokenization and attention weirdness.
I started a company that makes evals though, so I may be biased.
Such a dramatic name for such a boring set of tests. We need to test whether it can come up with a Nobel Prize-winning scientific breakthrough, a Booker/Pulitzer-worthy novel, Ken Thompson-level code that solves a real problem, or a proof for Fermat’s Last Theorem.
I suspect that the current submission has been re-upped by mods, as it appears to have been originally submitted 4 days ago (via Algolia search), though it's not in the 2nd chance queue.
The name is very intentional: this isn't "AI's Last Evaluation", it's "Humanity's Last Exam". There will absolutely be further tests for evaluating the power of AIs, but the intent of this benchmark is that any more difficult benchmark will be either
- Not an "exam" composed of single-correct-answer closed-form questions with objective answers
- Not consisting of questions that humans/humanity is capable of answering.
For example, a future evaluation for an LLM could consist of playing chess really well or solving the Riemann Hypothesis or curing some disease, but those aren't tasks you would ever put on an exam for a student.
Isn't FrontierMath a better "last exam"? Looking through a few of the questions here, they seem less reasoning-based and more fact-based. There's no way one could answer "How many paired tendons are supported by this sesamoid bone [bilaterally paired oval bone of hummingbirds]?" without either having a physical model to dissect or regurgitating the info found somewhere authoritative. It seems like the only reason a lot of the questions can't be solved yet is that the knowledge is specialized enough that it simply isn't found on the web; you'd have to phone up the one guy who worked on it.
> Medicine: You have been provided with a razor blade, a piece of gauze, and a bottle of scotch. Remove your appendix. Do not suture until your work has been inspected. You have fifteen minutes.
One of the questions from that old "the final exam" joke.
Given the questions, it's crazy to call this HLE, but whatever man. Kinda fun. Can't wait for the similar thing that happened when we scaled up cargo carriers to like very large etc etc
All the cynics are welcome to design their own evals and move the field forward if they're so smart, instead of writing negative comments on the internet.
I wonder how the world and humans will adapt to this.
For now, capitalistic pressure at least urges people to maintain services. I wonder how this will translate into a future where owning property (land, mines, a factory, a farm, ...) and allowing it to be exploited are the only forms of professionalism.
I am afraid of the lust for power people seem to suffer from. Aren't there humans who would be glad to be winners among losers? Why would these people share their valuable assets for free with anyone?