
The tinygrad folks talk about this a lot.

Not that I understand much of what they say, but it appears there are a lot of correctness bugs in pytorch that are flying under the radar, probably having a measurable impact on the results of model quality.

It would be interesting to see a comparison of the weights of the same model trained with each of the two, to see whether they exhibit meaningfully different behavior.



When we update Torch versions, we're required to run a test where the only change is the library change and compare the outputs. We saw a measurable improvement in accuracy by upgrading from torch 2.4.x to 2.7.x.
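A rough sketch of that kind of check, in plain Python rather than our actual harness: run the same model on the same inputs under each library version, save the outputs, and flag every value that differs beyond a tolerance. The function name `compare_outputs`, the tolerances, and the sample numbers are all made up for illustration. For real tensors, `torch.testing.assert_close` does a similar job.

```python
import math


def compare_outputs(baseline, candidate, rel_tol=1e-5, abs_tol=1e-8):
    """Return (index, baseline, candidate) for every pair outside tolerance."""
    if len(baseline) != len(candidate):
        raise ValueError("output lengths differ")
    return [
        (i, b, c)
        for i, (b, c) in enumerate(zip(baseline, candidate))
        if not math.isclose(b, c, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Hypothetical outputs from the same model and inputs under two library versions.
old_version_outputs = [0.12345678, 0.5, 0.9]
new_version_outputs = [0.12345679, 0.5, 0.91]

mismatches = compare_outputs(old_version_outputs, new_version_outputs)
print(mismatches)  # [(2, 0.9, 0.91)]
```

The first pair differs only in the eighth decimal place and passes. The third differs by 0.01 and is reported, which is the kind of change you would then want to explain before shipping the upgrade.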


> we're required to run a test

What do you mean by "we're required to"? Isn't that something you do with all libraries, and something you as an engineer want to do, at the very least to prove correctness? Personally, I couldn't imagine using a third-party library without at least having some basic tests to confirm correctness. Even when I use PyTorch, I do the same.


> Not that I understand much of what they say, but it appears there are a lot of correctness bugs in pytorch that are flying under the radar, probably having a measurable impact on the results of model quality.

Do you have any links to public discussion of this? If it were true, a lot of research could be invalidated, so it would obviously make huge news.

Also feels like something that would be relatively easy to make reproducible test cases from, so easy to prove if that's true or not.

And finally if something is easy to validate, and would make huge news, I feel like someone would already have attempted to prove this, and if it was true, would have published something a long time ago.


Could this really invalidate research? Managing to produce a model that works (assuming you check all of the myriad modeling correctness checkboxes) is sufficient on its own. The fact that the modeling process itself was broken in some way — but not the assumptions made of the model inputs, or data leakage assumptions, or anything that fundamentally undermines any model produced — has no bearing on the outcome, which is the fact that you got a model that evidently did make accurate predictions.


> Could this really invalidate research? Managing to produce a model that works (assuming you check all of the myriad modeling correctness checkboxes) is sufficient on its own.

In the academic sense, a model that happens to work isn't research; the product of research should be a technique or insight that generalizes.

"Standard technique X doesn't work in domain Y, so we developed modified technique X' that does better" is the fundamental storyline of many machine learning papers, and that could be 'invalidated' if the poor performance of X was caused by a hidden correctness bug avoided by X'.


> a lot of research could be invalidated, so obviously would make huge news.

A lot of research is unreproducible crap. That’s not news to anyone. Plus, bugs usually make results worse, not better.


There are many more ways to degrade model performance than to enhance it, so I would expect the vast majority of bugs to lead to artificially reduced accuracy, not artificially increased accuracy.

So if PyTorch is full of numerical flaws, that would likely mean many models with mediocre/borderline performance were discarded (never published) because they just failed to meet the threshold where the authors felt it was worth their time to package it up for a mid-tier conference. A finding that many would-be mediocre papers are actually slightly less mediocre than believed would be an utterly unremarkable conclusion and I believe that's why we haven't seen a bombshell analysis of PyTorch flaws and reproducibility at NeurIPS.

A software error in, say, a stats routine or a data preprocessing routine would be a different story, because the degrees of freedom are fewer, leaving a greater probability of an error hitting a path that pushes a result to look artificially better as opposed to artificially worse.


Check their Twitter, I saw something either yesterday or earlier today iirc


That's why projects like nanochat are really cool: you can get around the limitations of such gigantic libraries while also understanding the underlying architecture.


Nanochat is using PyTorch under the hood. I don’t understand your point.


They might be referring to Karpathy's earlier micrograd tutorial, where the whole thing is built from scratch. That was how I learned the basics myself.


https://moyix.blogspot.com/2022/09/someones-been-messing-wit...

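TLDR: Python gevent compiled with -Ofast (which implies -ffast-math) sets the flush-to-zero/denormals-are-zero bits in the CPU's floating-point control register (MXCSR) when the library is loaded, so subnormal floats get flushed to zero for everything else running in that process. Bad for PyTorch.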


I thought that the effect of these compiler flags was widely known in numerical computing. They allow, e.g., reordering of floating-point computations, and in general disregard IEEE 754. As such, these results are expected, I'd think.
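For anyone who hasn't seen why reordering matters, here is a minimal illustration in plain Python. No compiler flags are involved; it only shows that floating-point addition is not associative, which is the property fast-math lets the compiler ignore.

```python
# Floating-point addition is not associative, so a compiler that is
# allowed to regroup a sum can change the result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # evaluated left to right
right = a + (b + c)  # same terms, regrouped

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

The difference here is one unit in the last place, but across the millions of accumulations in a training step, the grouping the compiler picks can shift results in ways that are hard to trace back.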


The unexpected thing in that particular case is that even if you were well aware and avoided the flag when building your numeric code, the way some other non-numeric-computing person compiled some unrelated non-numeric module like "gevent" could result in the fast-math behaviour being applied to your code too. (Happily gcc has now fixed this.)


Widely known among very niche groups, most of whom have either been burnt by the issue or heard about someone who has, and have it ingrained in their minds out of fear of debugging such a thing.

I’d bet the majority of ML people are unaware, including those doing lower level stuff.


I see another commenter highlighted this:

> The exact same float32 code updates weights on CPU but fails on MPS

It's MPS... Exactly zero research is being impacted. Why doesn't the $3.9T corporation contribute more to torch?


As noted near the end of the article, an Apple employee had already contributed a fix to the bug:

> Checking the latest version revealed the bug was already fixed in v2.4, patched by an ML engineer at Apple last year using almost the exact same approach I’d used.


I mean, some researchers clearly use Apple Silicon for their "cheap and cheerful" runs.


The tinygrad folks talk too much



