
I did some experimenting with this a little while back and was disappointed in how poorly LLMs played games.

I made some AI tools (https://github.com/DougHaber/lair) and added a tmux tool so that LLMs could interact with terminals. First, I tried NetHack. As expected, the model isn't good at understanding text "screenshots" of the screen and failed miserably.

https://x.com/LeshyLabs/status/1895842345376944454

After that I tried a bunch of the "bsdgames" text games.

Here is a video of it playing a few minutes of Colossal Cave Adventure:

https://www.youtube.com/watch?v=7BMxkWUON70

With this, it could play, but not very well; it gets confused a lot. I was using gpt-4o-mini, and the smaller models I can run at home do much worse. It would be interesting to try one of the bigger state-of-the-art models to see how much that helps.
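The driving loop itself is pretty simple. Roughly something like this (a simplified sketch, not the actual lair code; it assumes tmux, the bsdgames "adventure" binary, and the OpenAI Python client are available):

  import subprocess
  import time
  from openai import OpenAI

  client = OpenAI()

  def tmux(*args):
      # thin wrapper around the tmux CLI
      return subprocess.run(["tmux", *args], capture_output=True, text=True).stdout

  # run the game in a detached tmux session
  tmux("new-session", "-d", "-s", "game", "adventure")

  history = []
  for _ in range(100):
      time.sleep(1.0)                                    # let the game redraw
      screen = tmux("capture-pane", "-p", "-t", "game")  # text "screenshot" of the pane
      resp = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[
              {"role": "system", "content": "You are playing a text adventure. "
               "Reply with exactly one game command and nothing else."},
              *history,
              {"role": "user", "content": screen},
          ],
      )
      command = resp.choices[0].message.content.strip()
      history += [{"role": "user", "content": screen},
                  {"role": "assistant", "content": command}]
      tmux("send-keys", "-t", "game", "-l", command)     # type the command literally
      tmux("send-keys", "-t", "game", "Enter")

In practice, most of the tuning ends up being in the prompt and in how much of the screen and history gets fed back each turn.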

To give it an easier one, I also had it hunt the Wumpus:

https://x.com/LeshyLabs/status/1896443294005317701

I didn't try improving this much, so there may be some low-hanging fruit even in providing better instructions and tuning what gets sent to the LLM. For these, I was hoping I could simply hand it a terminal with a game in it and have it play decently. We'll probably get there, but so far it's not that simple.



Try the game 9:05 by Adam Cadre [0]. It's one of the easiest (and best) non-trivial text adventures. Some models are able to reach the first or even second ending.

[0] https://en.wikipedia.org/wiki/9:05


What do you suppose would happen if you tried it on a game that doesn't have 25 years of walkthroughs written for it?


That’s a good point. For 9:05, I expect it would work just as well, since the game helps the user in many ways. The puzzles are of the type “The door is closed”, and you solve them with “open door.”

My suggestion concerns the poor performance DougHaber mentioned: if 9:05 can’t be solved, something else must be wrong with his experiments.

I’ve tried three dozen games, and it’s still hard to find ones suitable for LLM benchmarks. With complex, non-linear text-adventure games, my guess is that the models get stuck in an endless loop at some point. Hence, I just test progress over the first hundred steps.



