AVX register corruption from signal delivery (kernel.org)
159 points by est31 on Nov 26, 2019 | 37 comments




The issue that led to this bug report: https://github.com/golang/go/issues/35326


It's awful strange. Usually if your program crashes, it's your fault, not the kernel's.


When you support a compiler you get a massive stream of users complaining that the compiler has a bug, though in most cases it’s the user’s bug.

Of course there are also good reports where the user actually has found a bug in the compiler.

Though of course most are weird corner cases or bugs in new features, there are a surprising number that make you think “wow, how could any program be successfully compiled with this bug in the tree?”


I found a bug in the .NET compiler back when I worked at Microsoft, and at first no one wanted to believe me :)

It manifested when you had a single static member of a specific generic type in a class. The program crashed complaining about invalid CLR instructions. If you added a second static member of the same type to the class, or changed the type of the generic parameter, it didn't reproduce.

Turns out it was related to how the compiler used AVX intrinsics on CPUs that supported those instructions.

Pretty fun but took some convincing for people to believe it was a compiler bug.


> and at first no one wanted to believe me :)

I don't know why people think compilers are so infallible, or are more likely to be better written than your application. If people write bugs in applications, guess what: the same people write similar bugs in compilers too.


Probably because everyone still has the numerous starting experiences in the back of their head where they thought their code must be correct and it couldn't possibly be their fault, only to realize the next day that it was absolutely their fault.

It definitely took me a while to realize that VS was just always executing the default statement of a switch (instead of an applying case) if the default was at the top of the switch instead of the bottom, and only when the switch was being executed at compile time in a constexpr context.


> I don't know why people think compilers are so infallible, or are more likely to be better written than your application.

Since they are rare and written by relatively small groups of knowledgeable people, most of the users (incl. me) think that they're meticulously tested and developed with utmost care.

I just remembered that one of my applications was hanging in a hot code path if I didn't write a small debug directive in the middle of it. I always thought the problem was with me, but it can be anywhere between me and the silicon. When you're in the middle of the development heat, you always blame your code or the libraries you use. The compiler and the lower levels are not put on the suspect list until the bug proves stubborn enough to persist.


How many bugs did you find in (production versions of) your applications? How many in compilers?


I maintain compilers for a living for a safety-critical embedded OS. I've found dozens of bugs in the compiler, dozens of bugs in the OS kernel, and dozens of bugs in the third-party validation test suites we use to qualify the compiler.

I also live in a log cabin in the back woods and can go off-grid. I've seen shit you people would not believe. It's just a matter of time now. Dominoes.


Attack ships on fire off the shoulder of Orion yet?


I write compilers so I find tons of bugs in them, and in other people’s compilers I’m looking at, all the time.


There was code in IE 5 that used the original SSE instructions if the chip had them, and regular code as a fallback

We found a stepping of a 486 chip where this code crashed about 25% of the time

Since it was only that stepping, and we already had a fallback path, we just skipped the optimization for that chip version and didn’t investigate exactly what was broken though

Having a lot of customers with various CPU versions helped to track this down pretty quickly
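
For anyone curious what "skip the optimization for that chip version" might look like, here is a minimal sketch in C. It is not the IE code. The names decode_signature and use_fast_path are made up, the blocked family/model/stepping values are placeholders rather than the chip from the story, and the decoding ignores the extended family and model bits of the CPUID leaf 1 signature.

```c
#include <stdbool.h>
#include <stdint.h>

/* Basic fields of the CPUID leaf 1 EAX signature.
   Assumption: the extended family/model bits are ignored for brevity. */
struct cpu_sig {
    unsigned family;
    unsigned model;
    unsigned stepping;
};

/* Split a raw signature into its basic fields. */
struct cpu_sig decode_signature(uint32_t eax)
{
    struct cpu_sig s;
    s.stepping = eax & 0xFu;        /* bits 3:0  */
    s.model    = (eax >> 4) & 0xFu; /* bits 7:4  */
    s.family   = (eax >> 8) & 0xFu; /* bits 11:8 */
    return s;
}

/* Placeholder blocklist entry. NOT the real affected chip. */
static const struct cpu_sig blocked = { 6, 7, 3 };

/* Take the optimized path only if the feature is present and the
   chip is not the one known-bad stepping; otherwise fall back. */
bool use_fast_path(uint32_t eax, bool has_feature)
{
    struct cpu_sig s = decode_signature(eax);

    if (!has_feature)
        return false;
    if (s.family == blocked.family &&
        s.model == blocked.model &&
        s.stepping == blocked.stepping)
        return false;
    return true;
}
```

On x86 with GCC or Clang, the signature could come from __get_cpuid(1, ...) in <cpuid.h> and the feature flag from __builtin_cpu_supports("sse"). Keeping the decision in a pure function like this makes it easy to test against signatures reported by customers.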


Surely you meant x87 FP instructions? The 486 didn't have SSE instructions.


Probably not 486 - this was 1999-2000 timeframe, so some sort of Pentium thing.

It was definitely the new instructions though, so whatever Intel chip had the new features had this 1 stepping that had this bug (and it wasn’t the first version of the chip either)


MMX then instead of SSE?


486 didn't necessarily have FP, either. I started using Linux on a 486SX. FP instructions were trapped and emulated by the kernel.


Additionally, when you work on something that is used by a lot of people or machines, you get to see that memory and disk problems actually happen a lot in terms of raw numbers. Multiply a tiny, minuscule percentage by a lot of runs and they surface.

It is tough to not get overconfident in this diagnosis. If your code happens to see hardware problems on a routine basis, and a real bug surfaces, it is very challenging to not dismiss the latter for the former. It likely took impressive work to diagnose this as far as they have.


I remember an old game developer mentioning that they put a test for flaky cache memory in their installer and hacked up an error if it failed. Which significantly cut down the number of support calls.


I misremembered, not support calls but bogus crash reports from flaky cache. Bogus crash reports like this are terrible because they aren't caused by bugs. And it's impossible to prove a negative.
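
I don't know what that installer actually did, but a check of this kind can be sketched as a simple write-then-read-back pattern test. The function name pattern_test and the choice of pattern are my own assumptions, and a real test aimed at cache would need to be much more careful about buffer sizes and access order.

```c
#include <stddef.h>
#include <stdint.h>

/* Write a position-dependent pattern into the buffer, read it back,
   and return how many words came back wrong. The volatile qualifier
   keeps the compiler from optimizing the stores and loads away. */
size_t pattern_test(volatile uint32_t *buf, size_t words, uint32_t pattern)
{
    size_t errors = 0;
    size_t i;

    for (i = 0; i < words; i++)
        buf[i] = pattern ^ (uint32_t)i;

    for (i = 0; i < words; i++) {
        if (buf[i] != (pattern ^ (uint32_t)i))
            errors++;
    }
    return errors;
}
```

On healthy hardware this returns 0. Running it with a few different patterns (for example 0xAAAAAAAA and 0x55555555) exercises both states of every bit.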


It's just one of those gcc-9 regressions. I've blacklisted it everywhere I can. I had too many of those.


Usually it's the program that is ill-formed, not the compiler (optimizer) that is buggy.

Because the software is so commonly used, the odds favor user error over tool error. Otherwise it wouldn't be so popular.


Compiler bugs do happen - it's especially pronounced with JITs and optimize/deoptimize/OSR, etc.

In most cases the application is not meant to crash with any fault.


Usually but not always. You may get SIGILL if your machine has error detection and correction and your program is executing from a bad page.


Reminds me of one time when I spent a week debugging random but quite consistent kernel crashes, which turned out to be gcc miscompiling kernel driver code to decrement the stack pointer before ceasing to use some values in that stack area. There were one or two instructions where a re-entrant IRQ, if it happened, would reuse that part of the stack and corrupt the data there.


Sounds like the AMD64 red zone, which can't be used in the kernel context.


Aarch64. It was just a minor gcc bug.


> To reproduce, build the attached program with "gcc -pthread test.c" and run it on a 5.2 or later kernel compiled with GCC 9 (if the kernel is compiled with GCC 8, it does not reproduce).

I wonder if this is a compiler bug or a new optimization that broke the code.


From the link, it seems like GCC 8 does not cache a read from a variable, and has more memory access to read it, while GCC 9 reads that variable from a register every time. (Maybe from a corrupted register?)


From what I can tell, the issue is that GCC 9 stores the result of a pointer dereference in a register to reuse on each loop iteration, but the loop needs to dereference the pointer each time to work correctly.


Sounds like it needs to be volatile then, or rather whatever atomic memory read the kernel has.


Probably volatile. Likely doesn't need to be atomic, just needs to prevent the compiler from optimizing the code into bugs.


No, it sounds like it needs the kernel not to corrupt the value. There is nothing wrong with leaving the value in the register.


I believe the aforementioned pointer caching is in the kernel--it's the pointer to the FP register state which is cached across preemption points in the kernel.


Ok I see, that's a good point. Was not spelled out this well in the bug tracker comment. (If it is now it wasn't when I read it.)


Probably a mix of both; when developing a kernel, you constantly fight against the compiler trying to be smart and optimizing things out of your code, inlining things, caching data, etc.

In this case, it might be necessary to do a volatile read in case that flag has changed, forcing the compiler to reload it.
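
A rough sketch of what such a forced reload looks like in C. This only illustrates the mechanism, not the actual kernel fix. LOAD_ONCE is a simplified, hypothetical stand-in for the kernel's READ_ONCE(), the struct and field names are made up, and __typeof__ is a GCC/Clang extension.

```c
/* Hypothetical stand-in for the kernel's READ_ONCE(): reading through
   a volatile-qualified pointer forces a real load on every use, so the
   compiler cannot keep the value cached in a register. */
#define LOAD_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Made-up structure: a flag that another context (an interrupt,
   a signal handler, another thread) may change behind our back. */
struct fake_task {
    int need_reload;
};

/* Spin while the flag is set, re-reading it from memory on each
   iteration, and give up after max_spins. Returns the spin count. */
int spins_until_clear(struct fake_task *t, int max_spins)
{
    int spins = 0;

    while (LOAD_ONCE(t->need_reload) && spins < max_spins)
        spins++;
    return spins;
}
```

Without the volatile access, the compiler would be allowed to read need_reload once before the loop and reuse that value for every iteration. Note that volatile only constrains the compiler; it gives no atomicity or ordering guarantees between CPUs.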


How about another register? (XMM, FPU, etc.)



