What kind of bug would make machine learning suddenly 40% worse at NetHack? (arstechnica.com)
113 points by Brajeshwar on June 5, 2024 | 74 comments


It's not technically a bug, but a "political compromise" by a self-professed "not a technical manager" who had an aversion to including version numbers in API endpoints. He asked us to come up with a way of enabling tests of a new feature in the API (a new response payload schema). We submitted two proposals: v1/v2 endpoints returning payloads using the old and new schemas, or a URL param ?v=1 or ?v=2. He was not happy and asked one of the senior technical architects on the project to weigh in, and the guy said we should use the URL param. That was not good enough, and we were told to go away and wait for a decision. Three months later we were told to implement a new cron job to deploy v2 into the integration test environment. Another cron job would then deploy v1 into that environment two hours after deploying v2. Yes, you read that right: v2 would be available for two hours per week, because a non-technical manager would not trust software devs to do the right thing for their colleagues testing their implementations of the API. I learned recently that, almost a decade later, those cron jobs are still there. If your tests only work on day X of the week within a two-hour window, that's why. Sorry, cannot name names due to NDA.
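For the curious, the two rejected proposals are both standard API versioning patterns. Here's a minimal sketch of what they amount to, with a hypothetical endpoint and payloads (nothing here reflects the actual NDA'd project):

```python
# Hypothetical sketch of the two rejected versioning proposals.
# Endpoint names and schemas are invented for illustration.

def payload_v1():
    # old response schema: price as a flat string
    return {"name": "widget", "price": "9.99"}

def payload_v2():
    # new response schema: price as a structured object
    return {"name": "widget", "price": {"amount": 9.99, "currency": "USD"}}

def handle(path, params):
    # Proposal 1: version in the path (/v1/widget vs /v2/widget)
    if path.startswith("/v2/"):
        return payload_v2()
    if path.startswith("/v1/"):
        return payload_v1()
    # Proposal 2: version as a query parameter (?v=2), defaulting to v1
    return payload_v2() if params.get("v") == "2" else payload_v1()

print(handle("/v2/widget", {}))       # new schema, path-versioned
print(handle("/widget", {"v": "1"}))  # old schema, param-versioned
```

Either approach lets both schemas coexist permanently, which is exactly what the weekly cron-job swap made impossible.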


> It's not technically a bug,

It is absolutely an anti-feature. Most things in life function on some level of predictability. Computers are pretty high on the predictability scale: do the least surprising thing and all that.

I sincerely hope there is some kind of feedback mechanism for the "not a technical manager" to either learn or get less decision making authority.


No, the manager is still there having skilfully removed anyone who has any clue how to build, test, or deploy software from the organisation. And he does not accept points of view other than his own.


What exactly is his aversion? Did he ever say? It seems like an inane hill for him to choose to die on.


He was never able to explain his reasons nor did he want to be told it made no sense. I saw similar behaviour in other orgs.


Ain't dead yet.


So, there was a single critical path, that for 2 hours per week was overwritten with the v2 code.

This is pretty common. They wanted to test v2 in the regular environment, so they can see how it reacts to real user traffic. If it reacted well, they would replace v1 with v2 completely under the hood! See, they didn’t want v1 and v2, they wanted the existing endpoint updated while maintaining backwards compatibility!! Their system was too old or had poor test coverage, so in order to minimize risk of disruption to the service, they came up with the idea to try it out for 2 off-peak hours.

Why do that instead of better test coverage? They were confident enough, or the risk of breaking things for a little bit was understood and justified; maybe they had some super old machines or configs they couldn't accurately model and really wanted to get proper user traffic in.

This is how many a/b tests happen too. Turn on feature for a segment, and then turn it back off. Or rolling deployments - have code run on 5% of servers then increase the percentage. Rollback if problem.

This is just that, except time-based. Kinda hacky, but everything in the story sounds reasonable to me.


It was a new system that had never gone live in prod at the time. It was deployed to the integration environment where the API client implementations written by third parties would be tested against this new system. All that happened was a schema change. It made no sense to make it so complicated for the people the system was built for.


That’s insane


>Members of the Legendary Computer Bugs Tribunal, honored guests, if I may have your attention? I would, humbly, submit a new contender for your esteemed judgment. You may or may not find it novel, you may even deign to call it a "bug," but I assure you, you will find it entertaining.

It gets a pass but my top 1 remains "I can't send emails further than 500 miles"

http://web.mit.edu/jemorris/humor/500-miles



At least it's not allergic to vanilla ice cream:

https://news.ycombinator.com/item?id=21779857


Hadn't seen that one before. Imagine trying to troubleshoot that "can't print" bug with someone the next day. I guess once you had somehow narrowed it down to Tuesdays it would be reproducible.


More like that: https://500mile.email/



There is an older story out there about WiFi failing during a storm, even though all the cables and antennas are fine. I'll see if I can find it.

Edit: it was the other way around.

https://predr.ag/blog/wifi-only-works-when-its-raining/


My favorite has always been "I can't log in when I stand up" https://www.reddit.com/r/talesfromtechsupport/comments/3v52p...


I hadn't seen this one, that's some awesome debugging by both parties.


I'm not a machine learning person - so I'm confused about this.

As someone who doesn't understand ML, I have always assumed the whole point of ML is to try different things in the game, almost randomly, and over (long) periods of time the AI gets better and better at the game.

If a single unexpected event causes such a large swing in outcome, and the AI can't "explain" what is different to cause the swing, then what exactly is the ML doing for it to fail on such a seemingly simple change? Doesn't that defeat the whole purpose of this?

I'm obviously missing something obvious - because I would assume the real goal of ML is that it can teach itself the game, even if that involves unexpected situations, as a human does?


The article doesn't describe it in detail. One imaginable scenario would be that they ran their model, trained on non-full-moon data, for an evaluation on a full-moon day. That would mean the model simply applies its learned "optimal" action policy in a different environment, where the previously learned policy no longer leads to good scores.
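The scenario above is just distribution shift. A toy sketch (invented rewards, not the actual NetHack setup or training method) of how a policy learned in one regime can be exactly wrong in another:

```python
import random

# Toy illustration: a two-action "game" whose payoffs flip
# when full_moon is True. Rewards are made up for the example.

def reward(action, full_moon):
    if not full_moon:
        return 1.0 if action == "fight" else 0.2   # fighting normally pays off
    return -1.0 if action == "fight" else 0.2      # fighting is deadly on a full moon

def train_greedy_policy(full_moon, episodes=1000):
    # Sample actions at random, then go greedy on the best average reward.
    random.seed(0)
    totals = {"fight": 0.0, "flee": 0.0}
    counts = {"fight": 0, "flee": 0}
    for _ in range(episodes):
        a = random.choice(["fight", "flee"])
        totals[a] += reward(a, full_moon)
        counts[a] += 1
    return max(totals, key=lambda a: totals[a] / counts[a])

policy = train_greedy_policy(full_moon=False)  # trained only off the full moon
print(policy)                                  # "fight"
print(reward(policy, full_moon=False))         # fine in training conditions
print(reward(policy, full_moon=True))          # backfires in the unseen regime
```

The policy is "optimal" for the environment it saw; nothing in training told it the environment could change.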


So does this mean if they allowed the game to run on "full moon days", it would be expected to eventually get a higher score (if the full moon day allowed that through the actual game mechanism)?


Yes, the full moon day can help you get a higher score due to the bonus luck you get that day. On the other hand, the full moon day makes werewolves (and wererats and werejackals) a lot more dangerous because they'll always be in animal form.

When you try to fight a werecreature in animal form it can summon large numbers of animals of its kind to attack you. This can be extremely deadly for a player who is unaware of this ability. An experienced player knows to attack werecreatures only at range or avoid fighting them altogether. However, encountering the werecreature in its human form is much less dangerous unless it's carrying a powerful weapon.


It's not a single event, it's more like a new general game state that was never seen during training. Imagine learning to play the violin really well and then someone changes the room's acoustics. It doesn't matter if you're a human or an ML algorithm, you're going to have a hard time playing like before.


But something is wrong in the learning, because as a human NetHack player who has ascended, I can say that we don't play radically different on full moons. Yes, the random numbers go your way slightly more, but that's about it.

This tells me the algo is trying too hard to predict the game or learn a decent static strategy, rather than make situational decisions.


The issue is with werecreatures on a full moon. Most humans exposed to Western culture (likely all NetHack players) have heard of werewolves. I think it’s safe to say that everyone who has heard of werewolves knows they are most dangerous on a full moon. Even if you are a total NetHack beginner you know to avoid these monsters on the full moon. The game even helpfully reminds you of this information both by telling you about the full moon and by having a werewolf howl incessantly when it’s on the same level. However the game does not explicitly fill in the gap for you. It expects culture to do that.

The advantage of human common sense over machine learning models — at least when it comes to role-playing games — is that we carry around a ton of this cultural information. A model trained only on NetHack — not on broader culture or folklore/fairytales/mythology/fantasy — is simply not going to be aware of this link between full moons and specific monsters becoming more dangerous. So if it’s developed a fairly naïve strategy of just fighting or avoiding everything in its path based on a model of relative strength then it’s going to be tripped up when an outside event (the phase of the moon) upends that model.


I disagree. There are a lot of idiosyncrasies about monsters in NetHack, such that getting good at NetHack is 99.9% about learning the NetHack world, not the real world. The werecreature game mechanic is no different than the POI or HALLU effect, so I don't think the AI needs any special knowledge.

I bet it comes down to how much memory the algorithm has, since the transformation might occur way later than being bitten, while most poison kills are fairly quick. The problem is NetHack requires you to have at minimum 1000 turns of memory to know when to pray. Even more if you want to keep track of where stuff was.


TFA states that the agent was trained for points, and another user states that some critters are a lot more dangerous during full moons.

Wouldn’t be very surprising if the agent hyper-optimised farming those critters for points. It would not be able to change strategy when the cost/benefit of that farming changed massively, so it would now perform significantly worse.


Humans train simultaneously as they operate, and humans can see the message about the full moon.

If nobody includes the full moon message as input to the ML model, and tries to operate the ML model with the training it has achieved running in non-full-moon mode, its operating score in full-moon-mode may be lower.

Even if it had proportional training time in full-moon-mode to incorporate that into the model, if you don't tell it when full-moon-mode is active, wouldn't the optimal behavior be to optimize the score for the 27/28 ordinary days rather than the 1/28 full-moon days of the month?

If full-moon-mode is an input to the model, then it can be trained to optimize for both scenarios.
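The parent's point about observable inputs can be made concrete with a toy sketch (invented rewards and action names, purely illustrative): with the flag in the observation, a policy can condition on it; without it, one policy must cover both regimes and will favor the 27-in-28 common case.

```python
# Toy sketch: the same made-up payoffs as a two-action game,
# comparing a flag-aware policy with a "blind" one.

def reward(action, full_moon):
    if full_moon:
        return -1.0 if action == "fight" else 0.2
    return 1.0 if action == "fight" else 0.2

def best_action_given_flag(full_moon):
    # Policy WITH the flag as input: one best action per observed regime.
    return max(["fight", "flee"], key=lambda a: reward(a, full_moon))

def best_action_blind(p_full_moon=1 / 28):
    # Policy WITHOUT the flag: maximize expected reward over the
    # moon's distribution (27 ordinary days per full-moon day).
    def expected(a):
        return (1 - p_full_moon) * reward(a, False) + p_full_moon * reward(a, True)
    return max(["fight", "flee"], key=expected)

print(best_action_given_flag(False))  # "fight"
print(best_action_given_flag(True))   # "flee"
print(best_action_blind())            # "fight": optimal on average, bad on full moons
```

The blind policy isn't broken; it's optimizing exactly the objective it was given, which is the parent's 27/28 vs 1/28 argument.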


https://nethackwiki.com/wiki/Time

I predict the next "annoying non-bug" will be Friday, June 13th of 2025.


So for ML to work, it has to know all permutations of a game? Does that mean ML is useless for non-deterministic games with random outcomes or procedural generation?


The thing it has learned previously needs to apply to the next run.

If you train an ML model on thousands of attempts at going around some racetracks where touching the walls slows you down, and the score is achieved by executing a fast lap, and the inputs to the model include where the car is and where the walls are, it should optimize towards avoiding touching the wall.

This behavior would likely still work even on new procedurally generated tracks that the model had not previously seen, as long as the relationship of inputs (car, walls) to desired behavior (fast lap) still applied.

If every N number of runs for a large value of N the game changes so that the walls are actually speed boosts and the center of the track slows you down, and there is no input to the ML model to tell it that the situation is different, it will initially try the previous strategy and perform worse, and it will be difficult to train it to handle both versions of the game without some discriminating input value to train on.
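The racetrack argument above can be sketched in a few lines (a deliberately silly simulation with made-up numbers, not a real RL setup): a wall-avoiding behavior generalizes to unseen tracks, but a rule flip that rewards wall-touching leaves it merely mediocre while a different behavior now dominates.

```python
# Toy racetrack: positions 0..width-1; positions 0 and width-1 are "walls".
# Normally touching a wall slows you down; in the flipped game it speeds you up.

def lap_time(policy, track_width, walls_boost=False):
    t = 0.0
    pos = track_width // 2
    for _ in range(100):
        pos = policy(pos, track_width)
        on_wall = pos in (0, track_width - 1)
        # lower lap time is better
        t += (0.5 if on_wall else 1.0) if walls_boost else (2.0 if on_wall else 1.0)
    return t

def stay_centered(pos, width):
    # "learned" behavior: drift toward the middle of the track
    mid = width // 2
    return pos + (1 if pos < mid else -1 if pos > mid else 0)

def hug_wall(pos, width):
    return 0  # always ride the wall

# Generalizes: good on track widths it never "saw"
print(lap_time(stay_centered, 7))                    # 100.0
print(lap_time(stay_centered, 31))                   # 100.0
# Rule flip: wall-hugging is now strictly faster
print(lap_time(hug_wall, 7, walls_boost=True))       # 50.0
print(lap_time(stay_centered, 7, walls_boost=True))  # 100.0
```

The centered policy transfers across procedurally different tracks because the input-to-reward relationship is unchanged; it has no way to exploit the flipped rule, and nothing in its inputs even signals that the rule flipped.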


Well that is the whole trick. ML models ideally generalize from the training inputs to whatever new inputs show up during inference. For example, a vision model should recognize an image of a dog as a dog even if that exact image was not trained on. But that generalization always has limits. Usually score will decrease substantially the further "out-of-domain" the inputs are. So this model works fine when running a randomly generated dungeon it has never seen, but not when running a set of game rules it has never seen.


It still worked, just not as well. The program was trained and built its intuition for the game. It had a set script and assumptions.

Some of those assumptions were different, and since it's no longer learning/training, it couldn't adjust to the new conditions, so it didn't do as well.

If you, a human, were forced to follow a set script/assumptions, the same would happen to you.


Without the AI veil this is: my 'expect' implementation broke because it didn't see what I thought it was supposed to

Or, something that has been going on regularly for decades.

This isn't even close to the quality of the 500 mile email IMO, yet they seem to be doing everything possible to ride those coattails


Interesting story, if maybe a bit oversold. Since the game apparently announces this (didn’t know that), shouldn’t the model have detected a significant difference during the gameplay?


The article mentions that “It simply doesn't have data about full moon variables in its training data, so a branching series of decisions likely leads to lesser outcomes, or just confusion.”


It does say that, but I'd like to know more about the mechanisms involved. The problem is that every action you take generates better results than you'd expect. I can see why prediction would be worse while that's going on, but why would performance be reduced?


NetHack is weird. It's literally a game made up of a huge number of special cases all welded together. A lot of the crucial info for playing the game is not really discoverable through the interface, so humans used to mostly learn the game by reading the source code, and now they mostly learn it by reading guides.

The bug here is not in NetHack but in the training: some of the special cases (full moons, Friday the 13ths, etc.) weren't in the training data. They should have been running the training in VMs with the clock set to include these cases.

Honestly a lot of the reporting of this "bug" seems wildly overblown.


I can imagine it's a bit like if gravity suddenly changed to 0.9g. Everything(?) is easier but a lot of people would probably stumble around a bit before muscle memory, coordination etc. get used to the change.


I believe games like NetHack have monkey paw situations where fortune may not actually be that fortunate

My memory has not lasted well


Well if it was only trained on non-full moon days, even if the model did detect a difference it would have no idea how to adapt its play style.


As another commenter said, it’s quite obvious about it. For such a key difference, I’d think they’d have the model's runs record a log for later inspection. Or they’d be watching the gameplay. Or something similar.


If the model was trained over (say) a week during which there was no full moon, how would the model know what this unexpected message means? It would probably just ignore it and continue playing as normal. NetHack is full of messages that can be safely ignored, so just another one would not be unusual.

I don't agree that the creators would be watching the gameplay either. Usually during such training phases you'd run as many copies of the game as the available hardware can manage. I wouldn't be surprised if they had at least hundreds of runs going in parallel, and the researcher is definitely not going to watch them all. If anything, they'll go to bed and let the model train overnight as much as possible.


> how would the model know what this unexpected message means? It would probably just ignore it and continue playing as normal.

Then that’s a poor model. Significant anomalies should be flagged for manual review; otherwise you corrupt datasets unintentionally.


According to the article, the effect is that actions have better outcomes than they otherwise would.

Assuming you don't adapt your playstyle in any way, how would that lead to worse overall performance?


Not really a nethack player, but in the comments on the article a player states this

>edit: aha, I think this is it - attacked werecreatures are much more likely to summon help on full moons. Poor bot probably got overrun.

It may have better outcomes in most situations, but if you're depending heavily on a particular strategy and that changes then you're in trouble.


I'm surprised that the researchers had played so little NetHack that they didn't know about this.

By the way, this story is several weeks old, ars is late in covering it.


I've played quite a lot of NetHack, but I can imagine not thinking about this, when in the middle of debugging.

NetHack does warn you when you (re)-start a session that it's full moon (or new moon, which also has effects).


I have no knowledge of the game. In one tweet @CupiaBart writes:

> Maximizing the score means that you will just farm monsters. Finding items required for ascention or even Just doing a quest is too much for pure RL agent.

So what is the stop condition then? Elapsed time? Does it run out of monsters sooner because the full moon makes "werecreatures mostly kept to their animal forms" and there are simply fewer easily farmable points in early levels?


The stop condition is that the bots just die miserably after sitting around, because they run out of items, or the monsters scale faster than them, or they run into one of the thousands of funny/stupid ways you can die in NetHack.

There is a LOT of knowledge and strategy that is VERY FAR from obvious in this game. "Unspoiled" players who haven't read the wiki only have a very faint chance of winning the game.

If you sit around in early levels without trying to make progress, you eventually run out of food, your equipment will not improve and may even degrade, and worst of all you level up, which means monsters start scaling faster than you. You have to rely on prayers to survive, but prayers have a random cooldown, and if you pray too early, your god will make sure you regret it.


The literal whole point of the game (as described by the opening text) is to retrieve the Amulet of Yendor and offer it to your chosen "god" (called ascension).

If score is not tied to progress in the game, I'd say the agent's scoring system is, by definition, incorrect.


Interesting case, but the article itself really feels like a submarine ad for Singularity, whatever that is.


If anything it gave them a false sense of confidence:

> our whole environment is in a single, self-contained file

Except the parts of the environment which aren't in that file, like the current date.


One of the best pieces of advice I ever got was "look at your data".

Sometimes it's hard, sometimes you need to come up with novel visualizations, sometimes it can only give partial insight or just be noise, but I always strive, where I can, to have some type of method to look at data, in as close of a form as it was generated as possible.

I think in this instance, looking at an actual run might have caught this issue earlier. They might have seen the "it's a full moon" message and thought that's odd, or seen werecreatures keeping their form, or agents being extra lucky, or whatever, but running it headless and just looking at the score means they're cutting off a huge signal vector (and, admittedly, a huge noise vector).


A quick note, junethack has started, join the fun:

https://junethack.net/

Just started a game, I got:

"Be careful! New moon tonight."

I wonder if that will affect this learning test?


The title here is simply wrong, there's no nuance that could possibly make it correct.

Nethack, by design, changes the gameplay (slightly for some characters, greatly for others) based on the phase of the moon.

It tells you this, the same way the game tells you everything else.

There is no sane justification to call this a bug on either side; it's just a poorly trained model responding poorly to a feature it hadn't seen in training.


RL agents can't ascend? That's something other NetHack bots solved a decade ago. They seem to have a really bad grasp of the game. In any case, a badly fitted model is not a bug.


Unfortunately the pithy response (in the abstract, not specific to this case) boils down to: because they fired the people who knew how the system worked.


For some more information on solving Nethack with AI, check out the (outstanding) TalkRL podcast:

https://www.talkrl.com/episodes/pierluca-doro-and-martin-kli...


> What a terrible night to have a learning model.

They blew it: It’s “What a horrible night…”


Question for ML/AI people: is it still common to use the term “overfitting” for these cases, where a model is overtrained on one thing, to the detriment of another? Or is that term only used for literal curve fitting?
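To the question above: "overfitting" usually means fitting noise or idiosyncrasies of the training set; a model that fits its training regime well but fails on data from a changed regime is usually described as out-of-distribution failure or distribution shift. A tiny numeric sketch of the distinction (all numbers invented):

```python
# Distinguish the two failure modes: a trivially simple "model"
# (predict the training mean) fits its training data fine, so it is
# not overfit; it fails on shifted data because the regime changed.

def fit_mean(xs):
    return sum(xs) / len(xs)

def mse(pred, ys):
    return sum((pred - y) ** 2 for y in ys) / len(ys)

train = [1.0, 1.1, 0.9, 1.0]   # e.g. "no full moon" observations
shifted = [3.0, 3.2, 2.8]      # same quantity under a new regime

model = fit_mean(train)

print(round(mse(model, train), 3))    # small: good fit to training data
print(round(mse(model, shifted), 3))  # large: distribution shift, not overfitting
```

Overfitting would show up the other way around: near-zero error on the training points themselves but poor error on fresh samples from the *same* regime.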


Here's the previous discussion of the original Twitter thread: https://news.ycombinator.com/item?id=40472226.



That seems to be an entirely unrelated full moon story?




It trained itself on its own gameplay? :)


That's how people get better at video games too.


From TFA:

Of course, "score" is not a real metric for success in NetHack, as Cupiał himself noted. Ask a model to get the best score, and it will farm the heck out of early-stage monsters because it never gets bored. "Finding items required for [ascension] or even [just] doing a quest is too much for pure RL agent," Cupiał wrote. Another neural network, AutoAscend, does a better job of progressing through the game, but "even it can only solve sokoban and reach mines end," Cupiał notes.

The NN seems to be good at grinding. They should make some for those free-to-play games.


This is why idle games are the wave of the future. They play themselves with minimal input, so you can spend your time doing other things, like playing better games.


Meh, RL really lacks robustness against changes in the environment if they weren't in the training.


diff (40% worse code, pre-40% worse code)


There were no relevant code changes. The important diff here was 40% worse input data vs pre-40% worse input data.


Read the article.



