Hacker News
Ryzen CPU HyperThreads break if 100% busy and interrupted to top of Memory (freebsd.org)
169 points by sounds on Aug 4, 2017 | 66 comments



It's like we have all collectively forgotten that the first release of a hardware or software project is just an expensive beta. I don't buy 1st-gen microarches because I'm not interested in paying top dollar to be a tester. What I find odd is that apparently both Intel and AMD have also forgotten this: Intel seems to be moving toward making their enterprise customers the beta testers, while AMD seems so desperate for market share that they released Zen as a volume product before releasing it as a high-end one. That means if they have to do a recall, they both lose their entire margin and have to replace a large number of devices.


I got used to the idea of staying away from the cutting edge in the early 2000s. The original Athlon and Thunderbird processors were groundbreaking in performance, but consumer cooling and PSU design had not caught up with their power and heat output demands - the result was uncomfortably noisy and hot at best. The high-end parts really are like Ferraris in that they do take more of your time and care to deal with issues like that.

Now when I look at new machines and components I look for something slightly middle-of-the-road, mainstream-leaning, with low power output but quality components, and later revisions of an architecture, since system performance is a more-than-the-sum-of-its-parts game. I actually priced out a Ryzen 3 build today, but I might be better served with the last gen APUs.


What do you do with it?

I'm not really all that big a fan of APUs; their niche is decent integrated graphics on the cheap, but for low-end serveresque or office-box usage I'd go with the Pentium G4560. It has ECC too if that's your thing. Rumors are that 4C8T i3s may be coming with Coffee Lake too - obviously no efficiency step up but it'll be price competitive with Ryzen (right now the i3s and locked i5s usually aren't good value).

You can sometimes still find X99 stuff on clearance too - the 5820K and 6800K are essentially power- and performance-competitive with Ryzen 5 in all respects, plus you also get quad-channel RAM and more lanes, the caveat being they're usually more expensive than Ryzen 5. But, they're also mature at this point, and v3 Xeons should even be starting to come out from server farms within a year or two.

The bigger Coffee Lake parts will be tinkerer's specials. You'll need to delid for sure, I totally bet Intel isn't going to do shit about the TIM problem on SKL-X. It's probably going to be nearly as big a die as X99 was anyway, if they could possibly manage to solder it they really need to do it. Either way it's going to eat a lot of power, but they should be fast. Kaby with more cores is a pretty potent formula.

Of course there's no guarantees, SKL-X turned out to be shit for gaming.

Normally Ryzen 5 and 7 dominate the value charts here of course. Intel's products are really only compelling for people who game at the moment. X99 is good, the 7700K is good, the 7600K is good (really guys, hyperthreading only gets you roughly a 10% average speedup in games). The 7350K can be OK for ricer low-end builds - if your game of choice is single-threaded (virtually any Source game) it really cooks, and Microcenter has it for $130 (minus a $30 bundle discount if you want a mobo). The G4560 (2C4T) was a value changer for Intel in the low-end market, it's a good office box and it has ECC support for light server usage too. There are rumors Intel is choking down on production because it's eating up i3 sales, which is hilarious, because a price cut is exactly what 2C4T needs.

But really, if you count out Ryzen and Threadripper that leaves Bulldozer, Fusion processors, and Intel products. Bulldozer is mature but slow and hot and really only works well on things like video encoding. Fusion is a laptop processor, and the Pentium beats it. Intel has nothing worthwhile in the "boring office box" category between the G4560 and the X99 range, their consumer stack is almost only relevant in gaming otherwise.

Yeah, I was super bummed when I heard Threadripper was going to be on the older stepping. Otherwise I'd be really interested in it (especially with official ECC support). That would be a sick little server box. I'm thinking about doing a NAS build on X99 instead - I can build something basically the size of a DS1817+ but with 2x16 lane single slots free for NVMe sleds/infiniband.

https://pcpartpicker.com/list/dBPWqk <-- this but with a used processor

The other thing is I'd love to see more done with Kabini. There definitely is a place for very cheap shitbox computers - you used to be able to get a AM1 mobo + Athlon 5350 CPU for $40 together, you could do a full low-end build for $200. I was really disappointed to see it go. The CPU supports ECC but none of the motherboards do, give me an ECC mITX server board with an onboard SAS controller! It would be very competitive with the Avoton C2550/C2750 line (which are randomly shitting themselves nowadays unfortunately...) and could be used for NAS appliances. DDR3 is dead now... but Kabini can support DDR3L.


Someone was chatting to me today about picking up X5670 CPUs for $50-$60. Compared to my 6600K they seem quite competitive: http://ark.intel.com/compare/83352,88191,47920


If you don't need good single-thread performance and don't care that much about performance per watt, you can get used servers for very little money with lots of DDR3 memory (128GB+) and these X5670s as dual-CPU systems. If you don't do anything fancy you get 80% of the performance for 20% of the price of a new system.


Or PCIE performance since the PCIE is still on the chipset on those CPUs.


You can also buy used Xeon workstations and put in a new GPU card. The main problem is that the cooling isn't especially quiet, and they have proprietary motherboards and power supplies.


The PSUs are still essentially ATX, so that's usually not a big problem.

Form factor on the other hand... well. Sometimes they really go off the reservation there.


Not necessarily. I have a HP Z400 workstation with what appears to be an ATX PSU... but some of the pins are swapped and it won't actually boot with a non-OEM PSU. You can make an adapter, but in general it's not a safe assumption that this class of systems are compatible with off-the-shelf PSUs. These are OEM systems and the only thing that is guaranteed to work is OEM parts.

https://www.badcaps.net/forum/showthread.php?t=47754


> Of course there's no guarantees, SKL-X turned out to be shit for gaming.

Only relatively. The performance is good, it just does not beat the i7-7700K, which is no surprise given current games' reliance on fast cores. But you can absolutely game with Skylake-X and get high FPS.

I'd also stay away from last-generation APUs. They are slow, as it's basically Bulldozer with a GPU on top. Zen APUs coming out in early 2018 might be a different story, and maybe they will have fixed some bugs by then.

It really is a pity the Pentium G4560 is getting so expensive. Whether that's down to a production cut or miners buying it for their boxes, at the higher price it is not really competitive against the Ryzen 3 1200.


AMD's not doing a recall, it works decently enough for most applications. Their response is going to be "if it crashes your application, turn SMT off".

Consider they didn't even do a recall when Phenom had a showstopping TLB bug, they shipped a BIOS patch that disabled TLB entirely.

And remember, Epyc is on a new stepping of the silicon, it's possible this is already fixed on it. (Threadripper is not, however)


> Their response is going to be "if it crashes your application, turn SMT off".

It happens even when SMT is disabled.

>Epyc is on a new stepping of the silicon, it's possible this is already fixed on it. (Threadripper is not, however)

That's assuming they caught this bug, which I doubt is the case, because it was only discovered now rather than being documented in the errata.


I dunno, there were several issues in the SMT implementation publicized earlier, it is entirely possible that the root cause is the same or related.


It's possible they overlooked mentioning it in the errata but still fixed it with the new stepping. Don't give up hope just yet :)


Epyc crashes quite often, see the segfault screenshot.

https://www.reddit.com/r/Amd/comments/6rmq6q/epyc_7551_minin...


TLB, as in Translation Lookaside Buffer? Won't memory access slow down dramatically if that is disabled?



The Phenom bug was about the L2 TLB; the L1 TLB worked just fine. It still decreased performance significantly, by about 5-20% depending on workload.


As of now, Threadripper is Epyc silicon with 4 CCXs disabled.


No! I was going to buy threadripper for this very reason.


I had a similar scenario occur on my Ryzen 1700 w/ Gigabyte B350 Motherboard. Nothing in the BIOS seemed to help, and updating to the latest version of the firmware didn't seem to help much.

Eventually I just looked into some kernel docs and decided that setting the IOMMU mode to pt during bootup might work. Specifically, I added the following to my GRUB config.

GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"

Not sure if this will help any of you, but it did completely eliminate the problems I had.
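
For anyone trying the same thing: the option only takes effect after the GRUB config has been regenerated and the machine rebooted, so it is worth checking that it actually reached the running kernel. A minimal Python sketch, assuming simple space-separated tokens with no quoting (the helper names `has_kernel_param` and `running_kernel_cmdline` are made up for illustration):

```python
def has_kernel_param(cmdline: str, key: str, value: str) -> bool:
    """Return True if the token `key=value` appears on a kernel command line."""
    for token in cmdline.split():
        name, sep, val = token.partition("=")
        # Require an exact key match so that e.g. amd_iommu=... does not count.
        if sep and name == key and val == value:
            return True
    return False


def running_kernel_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running Linux kernel was booted with."""
    with open(path) as f:
        return f.read().strip()
```

On a Linux box, `has_kernel_param(running_kernel_cmdline(), "iommu", "pt")` should then come back `True` once the change is live.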

Shameless plug, I wrote a blog post about my investigations into it:

https://ibiscybernetics.com/blog/2017-05-24.html


I bought a custom-built Ryzen-based PC around Easter, and I have experienced some issues with it; I am not sure, though, where to put the blame (CPU, motherboard/firmware, or the operating system, which is openSUSE Tumbleweed).

Under heavy load, the machine has performed most gracefully. However, the machine does freeze (almost) completely when left idle for a while (usually > 1 hour). When it happens, it sometimes still responds to pings, but nothing more; if I try to ssh into it from my laptop, I do not even receive a TCP ACK.

Unfortunately, I guess, there is no Kernel Panic, so no memory dump I could inspect or send to somebody who actually knows how to make sense of it.

On the upside, I have gotten into the habit of putting the machine into standby when I leave it alone for more than a couple of minutes, and I was pleasantly surprised that Suspend-to-Disk is a very acceptable option with an M.2 SSD. ;-P

Asus (who built my mainboard) releases firmware updates (which include microcode updates) on a fairly regular basis, so I hope this problem will be fixed eventually. I knew there was a risk of something like this happening when I got this machine, and overall, I am not disappointed. Otherwise, I am _very_ happy with the machine.


> When it happens, it sometimes still responds to pings, but nothing more; if I try to ssh into it from my laptop, I do not even receive a TCP ACK.

Interesting. Pinging is handled entirely by the kernel. I wonder if the ACK code path has to enter and leave userspace before the response comes out the network card?

If disabling C6 (as per other comments) doesn't fix it, one possible place to start would be

1. Something like "while true; do date > file; sync; sleep 1; done" to track when userspace dies. (Okay, a proper C program that does fdatasync() just for that file would probably be better.)

2. Get inspired by http://elixir.free-electrons.com/linux/latest/source/drivers..., which borrows your system's RTC as non-volatile storage with simple write semantics - maybe you can borrow the year/month fields to store the current HH:MM. If on reboot the stored value matches wall clock time from another source (synced with the PC beforehand) at the moment you pull the plug, the kernel was still running. Then you could e.g. use the whole value to trace kernel locks and major mutexes etc.

3. Actually, if you can ping the kernel and that works, you could abuse the network stack to get data out!! E.g., pinging with certain bits set in the packet could trigger some magic code that returns system info. ICMP PING tunneling is a thing, and you could totally use this.
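
For item 1, here is a rough sketch of the "flush just that one file" idea, in Python rather than the C suggested above. The function names and the heartbeat path are made up for illustration, and it assumes a Unix-like system (it falls back to `os.fsync` where `os.fdatasync` is missing):

```python
import os
import time
from typing import Optional

# os.fdatasync is not available everywhere; fall back to os.fsync if absent.
_flush = getattr(os, "fdatasync", os.fsync)


def write_heartbeat(path: str, stamp: str) -> None:
    """Overwrite `path` with `stamp` and flush that one file's data to disk."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, (stamp + "\n").encode())
        _flush(fd)  # flushes only this file, unlike a system-wide `sync`
    finally:
        os.close(fd)


def run_heartbeat(path: str, interval: float = 1.0,
                  beats: Optional[int] = None) -> None:
    """Write a timestamp every `interval` seconds; beats=None loops forever."""
    count = 0
    while beats is None or count < beats:
        write_heartbeat(path, time.strftime("%Y-%m-%d %H:%M:%S"))
        count += 1
        time.sleep(interval)
```

Leave `run_heartbeat("/var/tmp/heartbeat")` running, and after the next freeze the last timestamp in the file shows roughly when userspace stopped making progress.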

"But... I just want to use my computer..."

That's fair - waiting for someone else to fix everything works too :P


> Interesting. Pinging is handled entirely by the kernel. I wonder if the ACK code path has to enter and leave userspace before the response comes out the network card?

Normally there is an incoming backlog, i.e. the kernel accepts (SYN/ACKs) incoming connections before accept(2) happens; accept(2) just pops a connection off the backlog.
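
A small Python sketch of that behaviour on loopback (the function name is made up for illustration): the client's connect() completes even though the server has not called accept() yet, because the kernel finishes the handshake on its own.

```python
import socket


def connect_without_accept() -> bool:
    """Return True if connect() completes before the server calls accept()."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.settimeout(5)
    client.settimeout(5)
    try:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)               # listening, but accept() not called yet
        # The kernel answers the SYN itself, so this returns with no accept().
        client.connect(server.getsockname())
        handshake_done = True
        # Only now does user space pop the finished connection off the backlog.
        conn, _addr = server.accept()
        conn.close()
    except OSError:
        handshake_done = False
    finally:
        client.close()
        server.close()
    return handshake_done
```

If that holds on the frozen box too, a missing SYN/ACK would suggest the problem is not simply sshd being stuck in userspace.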


I had a similar problem with my Ryzen 1700/ASRock X370 Taichi: every time I left the computer idle, when I came back it was frozen. I didn't try pinging to see if it responded.

What solved the problem for me completely was blacklisting the nouveau module for NVIDIA (about two weeks without a single freeze). In my case it was an option because I use an AMD GPU for Linux and an NVIDIA one for passthrough, so I have no need for the nouveau module to be loaded. BTW this is where I got the hint: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085

What I got from this is that it is not a nouveau driver bug, but rather a hardware bug that has a greater chance of being triggered if the module is loaded. The same thread suggests that disabling ASLR is also a valid workaround.


Thanks for that advice!

Unfortunately, I have an nVidia GPU, and I use the nouveau driver, because apparently using nVidia's proprietary driver on Tumbleweed is a huge PITA. Maybe I will have to bite that bullet eventually, though.


> However, the machine does freeze (almost) completely when left idle for a while (usually > 1 hour).

Try disabling the C6 power state from BIOS: https://www.reddit.com/r/hardware/comments/6rklcf/ryzen_segf...


I'll definitely try this. I downgraded my latest BIOS because I couldn't stand the random crashes that had no rhyme or reason (crashed during gaming, youtube, encrypting my drives, emailing, browsing, among other activities).


Thanks for the hint!


I hoped this just affected Ryzen CPUs, but this Reddit post indicates that it affects Epyc also: https://www.reddit.com/r/Amd/comments/6rmq6q/epyc_7551_minin...

The first post on AMD's community forum (https://community.amd.com/thread/215773?start=0&tstart=0) is almost three months old, so AMD have known about this for a long time. If it's not something that can be fixed in a UEFI update, then it's bad news for everyone: a weakened AMD means more stagnation in amd64.


So he's got a segfault every couple of minutes, wow... I've been running the same test for over 4 hours now on my Ryzen 1700 (and I've had several uneventful 30-40 minute runs before). To date, I only got one "internal compiler error: Illegal instruction" but no segfaults.

Whatever it is, it doesn't affect every chip the same way.


> Whatever it is, it doesn't affect every chip the same way.

If it is marginal timing in some part of the chip, then that, combined with statistical process and environment variations and the increasingly tiny geometries (which serve to amplify the variation), means the problem could really occur quite randomly. Modern CPUs are pushing the limits in more ways than one, and IMHO this is what happens when they go too far.


There were some comments that indicated this scales up with the number of CCXs on a given CPU. https://www.reddit.com/r/Amd/comments/6ltdqd/comment/djx6g9r

There is also a PoC on GitHub for triggering this on Windows. I'm really wondering what this issue will lead to.


That's kind of how interrupts work though. It's a random disruption of normal control flow. Something like this could fail a different way every time you run it.

If IRETQ is failing to RET under load then this is a huge issue. How do you even fix that? Load balance interrupt processing across cores? Throttle any process in a tight loop? It's all ugly hacks. AMD needs to fix this.


> Whatever it is, it doesn't affect every chip the same way.

Such uncertainty is bad. Segfaults are not the only issue; the biggest risk is ending up with internally corrupted data.

I've stopped using the Ryzen junk I bought on its release day.


Not the first bug reported for Ryzen. Weren't there a couple of others too, one with Linux locking up and another triggered by the OCaml compiler emitting opcodes referring to the AH/BH/CH/DH registers in a tight loop?

Edit: Sorry, the OCaml bug was Intel Skylake. It's interesting how so many new CPUs have breaking bugs. Feels like it's been quiet since the original Pentium F00F bug and then all of a sudden everyone's new CPUs break.


The OCaml compiler found a bug in Intel's Skylake/Kaby Lake. If they found one for Ryzen too, I haven't heard about it.

https://lists.debian.org/debian-devel/2017/06/msg00308.html


You are right, of course.


> It's interesting how so many new CPUs have breaking bugs.

Dan Luu observed the same thing, and thinks it will continue to be very likely: https://danluu.com/cpu-bugs/


Cool article, thanks a lot for sharing.

It mentions something that really worries me: Intel knowingly asked for its own validation efforts to be reduced so it could put half-baked products on the market faster. Pretty sad that both Intel and AMD seem to be cutting corners just to be a little bit faster than the competition.


The OCaml thing was Skylake. But at least that one was fixed back in May.


> When a cpu-thread stalls in this manner it appears to stall INSIDE the microcode for IRETQ. It doesn't make it to the return pc, and the cpu thread cannot take any IPIs or other hardware interrupts while in this state.

So maybe fixable with a microcode update?


Most CPU errata are. (Most of the rest are ignorable.)


Aren't microcode updates usually "disable some feature" updates?


No. That's exceptionally rare and the worst case scenario. See Intel and TSX (transactional memory).


Phoronix: "Ryzen-Test & Stress-Run Make It Easy To Cause Segmentation Faults On Zen CPUs"

https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Te...


Newer: "50+ Segmentation Faults Per Hour: Continuing To Stress Ryzen" (5 August 2017)

http://www.phoronix.com/scan.php?page=article&item=ryzen-seg...


Does this affect only BSDs then?

I have a Ryzen machine I bought and put into a Jenkins cluster. It does builds on a bunch of VMs occasionally and pegs the CPUs. I have had no issue so far.


It probably affects Windows and every other OS as well. It's just less likely to be seen by their typical user workloads, or more likely to be written off as some other issue when it does happen.


It affects Linux as well.



For some reason my boss was super gung-ho about Ryzen and so now I have a work PC that randomly freezes at unknown intervals. Sometimes twice in a day, but usually more like once a week or week and a half. They're hard freezes, and typically nothing gets logged - you see a normal entry in syslog for a normal system event at a random time and then the next entry will be from when you got in to work and had to hard reset. Pretty sure at this point it's either CPU or mobo related (though we don't have the tools to verify the power supply under load), but we have no real means of diagnosing the problem.


I wish motherboards had built-in PSU testers. It would be super simple to have a cheap ADC measuring the voltage on every power line all the time, and then have some way for software to access the current voltage and the minimum and maximum seen in the last few seconds.

That could then be paired with a BIOS which displays a resettable boot-time warning if the PSU has been misbehaving.
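
The software side of that could be tiny. A minimal Python sketch of the bookkeeping, assuming some hypothetical driver hands us timestamped ADC readings; the class name, the 5-second window and the 5% tolerance are all assumptions for illustration, not any real motherboard API:

```python
from collections import deque
from typing import Deque, Optional, Tuple


class RailMonitor:
    """Track min/max voltage on one rail over the last `window_s` seconds."""

    def __init__(self, nominal_v: float, tolerance: float = 0.05,
                 window_s: float = 5.0) -> None:
        self.nominal_v = nominal_v
        self.tolerance = tolerance   # allowed deviation as a fraction of nominal
        self.window_s = window_s
        self.samples: Deque[Tuple[float, float]] = deque()
        self.warning = False         # latched until reset_warning() is called

    def add_sample(self, t: float, volts: float) -> None:
        """Record a reading taken at time `t` (seconds) and drop stale ones."""
        self.samples.append((t, volts))
        while self.samples and self.samples[0][0] < t - self.window_s:
            self.samples.popleft()
        if abs(volts - self.nominal_v) > self.tolerance * self.nominal_v:
            self.warning = True

    def min_max(self) -> Optional[Tuple[float, float]]:
        """Return (min, max) over the current window, or None if empty."""
        if not self.samples:
            return None
        volts = [v for _, v in self.samples]
        return min(volts), max(volts)

    def reset_warning(self) -> None:
        """Clear the latched warning, like dismissing it at boot."""
        self.warning = False
```

The latched `warning` flag is the part the BIOS would show at boot: one bad sample sets it, and it stays set until someone clears it.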


I wonder if the Ryzen bugs can really be fixed with a microcode update...


They will probably just issue an update to turn off hyperthreading.


Ryzen CPUs aren't made by Intel.


Even with SMT disabled it causes the faults.


I don't think that is the case for this particular issue. There may be other reported SMT bugs (still some instability) but here it has to be a pair of HTs:

    if one hyperthread is in a cpu-bound loop of any kind
    (can be in user mode), and the other hyperthread is 
    returning from an interrupt via IRETQ ...
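
To see which logical CPUs actually form such a pair on Linux, sysfs exposes each CPU's siblings in `thread_siblings_list`. A small Python sketch for reading it (the function names are made up for illustration; this only identifies the pair, it does not reproduce the bug):

```python
from typing import List


def parse_cpu_list(text: str) -> List[int]:
    """Parse a kernel cpulist string such as '0,8' or '2-3,10-11'."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def thread_siblings(cpu: int) -> List[int]:
    """Return the logical CPUs sharing a physical core with `cpu` (Linux only)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as f:
        return parse_cpu_list(f.read())
```

With SMT on, `thread_siblings(0)` should list two logical CPUs; with SMT off in the BIOS it should list only one.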


I'm thinking of building a new PC this autumn or spring, but news like this about the new AMD processors is making me a bit uneasy about committing to them. I don't think bugs like this one and the one with GCC crashing (might be the same) will affect me, but it's a risk I'm not sure I want to take.


Note that we still don't really know what's going on; just that this change seems to alleviate the symptoms.


Potentially related to this issue over 5 years ago? https://it.slashdot.org/story/12/03/06/0136243/amd-confirms-...


Seems awfully unlikely.


Dang, I guess I'm going to wait for the B stepping of the Ryzen :-)


Original Ryzen is B1 stepping. EPYC should be already a newer one, with ThreadRipper some say it's the old B1. So you might be looking for B3/C stepping instead ;-)


My mantra for stuff like this: Second best is good enough.


Uh oh, this might be it for AMD this year



