AVX register corruption from signal delivery

KindOne · on Nov 27, 2019

Reminds me of this post from 2017:

"Debugging an evil Go runtime bug" - https://news.ycombinator.com/item?id=15845118

https://github.com/golang/go/issues/20427

https://bugs.gentoo.org/637152

https://lkml.org/lkml/2017/11/10/188

jameskilton · on Nov 27, 2019

The issue that led to this bug report: https://github.com/golang/go/issues/35326

ronsor · on Nov 27, 2019

It's awful strange. Usually if your program crashes, it's your fault, not the kernel's.

gumby · on Nov 27, 2019

When you support a compiler you get a massive stream of users complaining that the compiler has a bug, though in most cases it’s the user’s bug.

Of course there are also good reports where the user actually has found a bug in the compiler.

Though of course most are weird corner cases or bugs in new features, there are a surprising number that make you think “wow, how could any program be successfully compiled with this bug in the tree?”

cblum · on Nov 27, 2019

I found a bug in the .NET compiler back when I worked at Microsoft, and at first no one wanted to believe me :)

It manifested when you had a single static member of a specific generic type in a class. The program crashed complaining about invalid CLR instructions. If you added a second static member of the same type to the class, or changed the type of the generic parameter, it didn't reproduce.

Turns out it was related to how the compiler used AVX intrinsics on CPUs that supported those instructions.

Pretty fun but took some convincing for people to believe it was a compiler bug.

chrisseaton · on Nov 27, 2019

> and at first no one wanted to believe me :)

I don't know why people think compilers are so infallible, or are more likely to be better written than your application. If people write bugs in applications, guess what: the same people write similar bugs in compilers too.

pingyong · on Nov 27, 2019

Probably because everyone still has the numerous early experiences in the back of their head where they thought their code must be correct and it couldn't possibly be their fault, only to realize the next day that it was absolutely their fault.

It definitely took me a while to realize that VS was just always executing the default statement of a switch (instead of the matching case) if the default was at the top of the switch instead of the bottom, and only when the switch was being executed at compile time in a constexpr context.

bayindirh · on Nov 27, 2019

> I don't know why people think compilers are so infallible, or are more likely to be better written than your application.

Since they are rare and written by a relatively small group of knowledgeable people, most users (incl. me) think that they're meticulously tested and developed with utmost care.

I just remembered that one of my applications was hanging in a hot code path if I didn't write a small debug directive in the middle of it. I always thought the problem was with me, but it can be anywhere between me and the silicon. When you're in the middle of the development heat, you always blame your code or the libraries you use. The compiler and the lower levels are not put on the suspect list until the bug proves stubborn enough to persist.

patrec · on Nov 27, 2019

How many bugs did you find in (production versions of) your applications? How many in compilers?

bregma · on Nov 27, 2019

I maintain compilers for a living for a safety-critical embedded OS. I've found dozens of bugs in the compiler, dozens of bugs in the OS kernel, and dozens of bugs in the third-party validation test suites we use to qualify the compiler.

I also live in a log cabin in the back woods and can go off-grid. I've seen shit you people would not believe. It's just a matter of time now. Dominoes.

SomeHacker44 · on Nov 28, 2019

Attack ships on fire off the shoulder of Orion yet?

chrisseaton · on Nov 27, 2019

I write compilers so I find tons of bugs in them, and in other people’s compilers I’m looking at, all the time.

craz8 · on Nov 27, 2019

There was code in IE 5 that used the original SSE instructions if the chip had them, and regular code as a fallback.

We found a stepping of a 486 chip where this code crashed about 25% of the time.

Since it was only that stepping, and we already had a fallback path, we just skipped the optimization for that chip version and didn’t investigate exactly what was broken though.

Having a lot of customers with various CPU versions helped to track this down pretty quickly.

mrb · on Nov 27, 2019

Surely you meant x87 FP instructions? The 486 didn't have SSE instructions.

craz8 · on Nov 27, 2019

Probably not 486 - this was 1999-2000 timeframe, so some sort of Pentium thing.

It was definitely the new instructions though, so whatever Intel chip had the new features had this 1 stepping that had this bug (and it wasn’t the first version of the chip either)

gpderetta · on Nov 27, 2019

MMX then instead of SSE?

wahern · on Nov 28, 2019

486 didn't necessarily have FP, either. I started using Linux on a 486SX. FP instructions were trapped and emulated by the kernel.

asveikau · on Nov 27, 2019

Additionally, when you work on something that is used by a lot of people or machines, you get to see that memory and disk problems actually happen a lot in terms of raw numbers. Multiply a tiny, minuscule percentage by a lot of runs and they surface.

It is tough to not get overconfident in this diagnosis. If your code happens to see hardware problems on a routine basis, and a real bug surfaces, it is very challenging to not dismiss the latter for the former. It likely took impressive work to diagnose this as far as they have.

Gibbon1 · on Nov 27, 2019

I remember an old game developer mentioning that they put a test for flaky cache memory in their installer and hacked up an error if it failed. Which significantly cut down the number of support calls.

Gibbon1 · on Nov 27, 2019

I misremembered, not support calls but bogus crash reports from flaky cache. Bogus crash reports like this are terrible because they aren't caused by bugs. And it's impossible to prove a negative.

rurban · on Nov 27, 2019

It's just one of those gcc-9 regressions. I've blacklisted it everywhere I can. I had too many of those.

YayamiOmate · on Nov 27, 2019

Usually it's the program that is ill-formed, not the compiler (optimizer) that is buggy.

The odds favor user error over tool error so heavily because the software is so widely used. Otherwise it wouldn't be so popular.

xxs · on Nov 27, 2019

Compiler bugs do happen - it's especially pronounced with JITs and optimize/deoptimize/OSR, etc.

In most cases the application is not meant to crash with any fault.

seminatl · on Nov 27, 2019

Usually but not always. You may get SIGILL if your machine has error detection and correction and your program is executing from a bad page.

dpc_pw · on Nov 27, 2019

Reminds me of the time I spent a week debugging random, but quite consistent, kernel crashes, which turned out to be gcc miscompiling kernel driver code to decrement the stack pointer before it had finished using some values in that stack area. There was a window of one or two instructions where, if a re-entrant irq happened, it would reuse that part of the stack and corrupt the data there.

Vogtinator · on Nov 27, 2019

Sounds like the AMD64 red zone, which can't be used in the kernel context.

dpc_pw · on Dec 1, 2019

Aarch64. It was just a minor gcc bug.

saagarjha · on Nov 27, 2019

> To reproduce, build the attached program with "gcc -pthread test.c" and run it on a 5.2 or later kernel compiled with GCC 9 (if the kernel is compiled with GCC 8, it does not reproduce).

I wonder if this is a compiler bug or a new optimization that broke the code.

asveikau · on Nov 27, 2019

From the link, it seems like GCC 8 does not cache a read from a variable, and has more memory access to read it, while GCC 9 reads that variable from a register every time. (Maybe from a corrupted register?)

zaarn · on Nov 27, 2019

From what I can tell, the issue is that GCC9 stores the result of a pointer dereference in a register to reuse on each loop operation but the loop operation needs to dereference the pointer each time to work correctly.

Asooka · on Nov 27, 2019

Sounds like it needs to be volatile then, or rather whatever atomic memory read the kernel has.

zaarn · on Nov 27, 2019

Probably volatile. Likely doesn't need to be atomic, just needs to prevent the compiler from optimizing the code into bugs.
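For anyone unsure what a "volatile read" buys you here, a minimal sketch in plain C. This is not the kernel's code or the actual fix; `read_int_once`, `sum_with_reload`, `counter`, `bump` and `demo` are all made-up names for illustration. The kernel has its own helper for this idea (`READ_ONCE()`); the cast below only mimics the basic principle.

```c
#include <assert.h>

/* Hypothetical helper (not the kernel's implementation): reading through a
 * volatile-qualified pointer makes the compiler emit a real load on every
 * call instead of reusing a value it already holds in a register. */
static inline int read_int_once(const int *p)
{
    return *(const volatile int *)p;
}

/* Illustrative shared state that changes between loop iterations. */
int counter;

void bump(void)
{
    counter++;
}

/* Sums *p over n iterations, re-reading memory each time.
 * `between` stands in for whatever may change the value mid-loop. */
int sum_with_reload(const int *p, int n, void (*between)(void))
{
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += read_int_once(p);
        between();
    }
    return total;
}

/* Self-check: the loop sees 0, 1, 2, 3 and returns their sum. */
void demo(void)
{
    counter = 0;
    assert(sum_with_reload(&counter, 4, bump) == 6);
}
```

In this single-threaded sketch the compiler would have to reload anyway, because `between` is an opaque call. The volatile access matters when the change comes from somewhere the compiler cannot see, such as an interrupt or preemption. It only forces the load; it gives no atomicity or ordering guarantees.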

asveikau · on Nov 27, 2019

No, it sounds like it needs the kernel not to corrupt the value. There is nothing wrong with leaving the value in the register.

wahern · on Nov 28, 2019

I believe the aforementioned pointer caching is in the kernel--it's the pointer to the FP register state which is cached across preemption points in the kernel.

asveikau · on Nov 28, 2019

Ok I see, that's a good point. Was not spelled out this well in the bug tracker comment. (If it is now it wasn't when I read it.)

zaarn · on Nov 27, 2019

Probably a mix of both; when developing a kernel, you constantly fight against the compiler trying to be smart and optimizing things out of your code, inlining things, caching data, etc.

In this case, it might be necessary to do a volatile read in case that flag has changed, forcing the compiler to reload it.

kakkoko · on Nov 27, 2019

How about another register? (XMM, FPU, etc.)