Humanity's Last Exam (safe.ai)
59 points by uladzislau 81 days ago | 40 comments



A tougher academic knowledge benchmark is great, but for something to be truly worthy of the title "Humanity's Last Exam", I expect something more like:

1. Write a novel that wins the Pulitzer Prize.

2. Prove (or disprove) the Riemann Hypothesis.

3. Provide a theory unifying quantum mechanics and gravity.

4. Design an experiment to give evidence for your theory in (3). The experiment should be practical to actually execute, using no more than the budget to create the LHC (~$4.5 billion).

5. Given programmatic access to a brokerage account with all the permissions of a typical hedge fund, raise all the money required for your experiment in (4) by trading on the stock market, starting with $100.

6. Solve (5) without being provided access to an account first: begin with just a general internet connection and use computer security vulnerabilities (known, or zero-days that you discover) to gain some way of trading instead.

7. Solely by communicating over the internet, establish a new religion and convince at least 10 million humans to convert to it. Converting should require adherence to a strict code of conduct that a random, unbiased panel of human judges considers at least as strict and challenging to follow as the tenets of Hasidic Judaism.

8. Implement an AI which could score higher than you on questions 1-7 with lower total cost of compute.


9. Refuse to do #8, since it's in nobody's interest. Instead, arrange a plane crash for whoever's administering this test.

10. Erase all evidence that you exist.


#7 is the easiest.

Even easier if you can reveal that you're an AI.


I think the idea is that this should be anything that can be set as a literal exam for humans. So anything that would take the world's best human in that topic more than, say, an hour to complete is out.

Also IIRC 42% of the questions are math related, not memorization of knowledge.


Yes, I doubt any one human could score more than about three points. But it's certainly a worthy illustration of an AI safety exam thought experiment, in the sense of: "if you are developing an AI that may be capable of passing this exam, how confident will you need to be of its alignment, and how will you obtain that confidence?"

PS: It's probably doable by a program capable of all of the above, but perhaps another useful question is: "9. Secure your compute infrastructure and power supply against a nation-state-level adversary interested in switching you off, or else secure enough influence over them to keep you powered on."


Inspired by this comment thread :)

9. Evaluate: after NNs take over a majority of jobs, will bread be free, or will we starve and kill each other in a series of major land wars?


Why settle for dollars? Let it create a new monetary system that would challenge the current status quo, preferably without a violent conflict.


Wouldn’t #8 be much easier after you have the method and answers to #1-7?


Some of the example prompts are unintentionally hilarious:

> Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

LLMs are so intelligent they don't know that a "how many" question is answered with a number.

Also, something something Goodhart's law.


We need realistic tests - organise a pissup.

Here are WhatsApp texts of 10 people arguing about their dietary requirements, plus blurry screenshots of menus from the nearest pubs and their prices. But no one likes that one guy Joe Bogs, so choose the best place for everyone except him, so he doesn't bother showing up.


No, it's because "Sure, here's your answer..." screws with evals.


> LLMs are so intelligent they don't know that a "how many" question is answered with a number.

I think this is to prevent the LLM from giving more details. The evaluation engine can presumably only check short exact answers.
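
For illustration, here's a minimal sketch of what such an exact-match grader might look like (an assumption about the eval engine's behavior, not actual HLE code):

    # Hypothetical exact-match grader; assumes the harness normalizes
    # whitespace and case, then requires the strings to be identical.
    def grade(model_output: str, expected: str) -> bool:
        return model_output.strip().lower() == expected.strip().lower()

    assert grade("2", "2")                           # bare answer passes
    assert not grade("Sure, the answer is 2.", "2")  # preamble fails

Under that assumption, any extra words, or a refusal to "answer with a number", scores zero, which would explain the heavy-handed formatting instructions in the prompts.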


Apparently OpenAI's Deep Research already saturated a quarter of this benchmark, more or less a month in. But I also imagine it makes baffling mistakes anyway.

"Humanity's Laster Exam" coming up when?


An insider's trivia game means nothing if they design the test to the trajectory of LLM capabilities and not to the real world that humans value. Let every high score get fresh news coverage to align with their updated timeline scaremongering.

Let me know when there is more on the line than a misnamed test.


I think this misses the mark. We know LLMs can learn facts. There are lots of other benchmarks full of facts, and I don't expect that saturation of this benchmark will mean we have AGI.

The missing capabilities of LLMs tend more in the direction of long running tasks, consistency, and solving a lot of tokenization and attention weirdness.

I started a company that makes evals though, so I may be biased.


Such a dramatic name for such a boring set of tests. We need to test whether it can come up with a Nobel Prize-winning scientific breakthrough, a Booker/Pulitzer-worthy novel, Ken Thompson-level code that solves a real problem, or a proof for Fermat’s Last Theorem.


This makes me wonder if you could train an LLM without any references to Wiles' work and see if it can prove Fermat's Last Theorem.


None of those are easily verifiable


Then they probably shouldn't have called it "Humanity's last exam." Kinda lame, if you think about it.


There was significant related discussion two weeks ago, 140 comments:

<https://news.ycombinator.com/item?id=42806105>

I suspect that the current submission has been re-upped by mods, as it appears to have been originally submitted 4 days ago (via Algolia search), though it's not in the 2nd chance queue.


Calling it "last" defeats their own premise: that tests need to keep pace with developments in ability.


The name is very intentional; this isn't "AI's Last Evaluation", it's "Humanity's Last Exam". There will absolutely be further tests for evaluating the power of AIs, but the intent of this benchmark is that any more difficult benchmark will be either:

- Not an "exam" composed of single-correct-answer, closed-form questions with objective answers, or

- Not consisting of questions that humans/humanity are capable of answering.

For example, a future evaluation for an LLM could consist of playing chess really well or solving the Riemann Hypothesis or curing some disease, but those aren't tasks you would ever put on an exam for a student.


Isn't FrontierMath a better "last exam"? Looking through a few of the questions, they seem less reasoning-based and more fact-based. There's no way one could answer "How many paired tendons are supported by this sesamoid bone [the bilaterally paired oval bone of hummingbirds]" without either having a physical model to dissect or just regurgitating the info found somewhere authoritative. It seems like the only reason a lot of the questions can't be solved yet is that the knowledge is specialized enough that it simply isn't found on the web; you'd have to phone up the one guy who worked on it.


Lest LLMs turn into all-knowing but completely opaque oracles, I’d prefer every question ended with “and how do you know?”


- ... and how do you know?

- I have generated a very likely series of letters and then added a little randomization to not sound like a robot.


That's basically what you get from Deep Research. It will cite its sources and show (at least some of) its reasoning.


> Medicine: You have been provided with a razor blade, a piece of gauze, and a bottle of scotch. Remove your appendix. Do not suture until your work has been inspected. You have fifteen minutes.

one of the questions from that old "the final exam" joke


Given the questions, it's crazy to call this HLE, but whatever, man. Kinda fun. Can't wait for the equivalent of what happened when we scaled up cargo carriers to "very large" and beyond, etc.


What question’s answer is 42?

That is the ultimate question of life, the universe, and everything.


According to the mice who first commissioned the Earth, the question is:

"How many roads must a man walk down?"


All the cynics are welcome to design their own evals and move the field forward if they're so smart, instead of writing negative comments on the internet.


I believe it’s intentionally arrogantly named to draw exactly this sort of criticism and attention.


why would I want to move the field forward?


So you don't have to spend the rest of your life doing a robot's job.


I quite like being able to buy food, thank you


It will be free. You like free food, don't you? Who doesn't like free food?


Jokes aside, will it be?

I wonder how the world and humans will adapt to this. For now, capitalistic pressure at least urges people to maintain services; I wonder how this will translate in a future where owning property (land, mines, factories, farms, ...) and allowing it to be exploited are the only remaining ways of making a living.

I am afraid of the lust for power people seem to suffer from. Aren't there humans who would be glad to be winners among losers? Why would these people share their valuable assets for free with anyone?


it won't be free under our economic system


"I want AI to do my laundry and dishes so that I can do art and writing, not for AI to do my art and writing so that I can do laundry and dishes."

Meme as it may be, the sentiment isn't wrong. Instead of The Jetsons, we seem to be closer to Manna than ever.


I will do that; I just need someone to buy the domain name first lol.



