It's like we have all collectively forgotten that the first release of a hardware or software project is just an expensive beta. I don't buy 1st gen microarches because I'm not interested in paying top dollar to be a tester. What I find odd is that apparently both Intel and AMD have also forgotten this: Intel seems to be moving toward making their enterprise customers the beta testers, while AMD seems so desperate for market share that it released Zen as a volume product before releasing it as a high-end one. That means that if AMD has to do a recall, it loses its entire margin and also has to replace a large number of devices.
I got used to the idea of staying away from the cutting edge in the early 2000s. The original Athlon and Thunderbird processors were groundbreaking in performance, but consumer cooling and PSU design had not caught up with their power and heat output demands - the result was uncomfortably noisy and hot at best. The high end parts really are like Ferraris in that they do take more of your time and care to deal with issues like that.
Now when I look at new machines and components I look for something slightly middle-of-the-road, mainstream-leaning, with low power draw but quality components, and later revisions of an architecture, since system performance is a more-than-the-sum-of-its-parts game. I actually priced out a Ryzen 3 build today, but I might be better served with the last gen APUs.
I'm not really all that big a fan of APUs. Their niche is decent integrated graphics on the cheap, but for low-end serveresque or office-box usage I'd go with the Pentium G4560. It has ECC too if that's your thing. Rumors are that 4C8T i3s may be coming with Coffee Lake too - obviously no efficiency step up but it'll be price competitive with Ryzen (right now the i3s and locked i5s usually aren't good value).
You can sometimes still find X99 stuff on clearance too - the 5820K and 6800K are essentially power- and performance-competitive with Ryzen 5 in all respects, plus you also get quad-channel RAM and more lanes, the caveat being they're usually more expensive than Ryzen 5. But, they're also mature at this point, and v3 Xeons should even be starting to come out from server farms within a year or two.
The bigger Coffee Lake parts will be tinkerer's specials. You'll need to delid for sure, I totally bet Intel isn't going to do shit about the TIM problem on SKL-X. It's probably going to be nearly as big a die as X99 was anyway, if they could possibly manage to solder it they really need to do it. Either way it's going to eat a lot of power, but they should be fast. Kaby with more cores is a pretty potent formula.
Of course there's no guarantees, SKL-X turned out to be shit for gaming.
Normally Ryzen 5 and 7 dominate the value charts here of course. Intel's products are really only compelling for people who game at the moment. X99 is good, the 7700K is good, the 7600K is good (really guys, hyperthreading only gets you roughly a 10% average speedup in games). The 7350K can be OK for ricer low-end builds - if your game of choice is single-threaded (virtually any Source game) it really cooks, and Microcenter has it for $130 (minus a $30 bundle discount if you want a mobo). The G4560 (2C4T) was a value changer for Intel in the low-end market, it's a good office box and it has ECC support for light server usage too. There are rumors Intel is choking down on production because it's eating up i3 sales, which is hilarious, because a price cut is exactly what 2C4T needs.
But really, if you count out Ryzen and Threadripper that leaves Bulldozer, Fusion processors, and Intel products. Bulldozer is mature but slow and hot and really only works well on things like video encoding. Fusion is a laptop processor, and the Pentium beats it. Intel has nothing worthwhile in the "boring office box" category between the G4560 and the X99 range, their consumer stack is almost only relevant in gaming otherwise.
Yeah, I was super bummed when I heard Threadripper was going to be on the older stepping. Otherwise I'd be really interested in it (especially with official ECC support). That would be a sick little server box. I'm thinking about doing a NAS build on X99 instead - I can build something basically the size of a DS1817+ but with 2x16 lane single slots free for NVMe sleds/infiniband.
The other thing is I'd love to see more done with Kabini. There definitely is a place for very cheap shitbox computers - you used to be able to get an AM1 mobo + Athlon 5350 CPU for $40 together, you could do a full low-end build for $200. I was really disappointed to see it go. The CPU supports ECC but none of the motherboards do, give me an ECC mITX server board with an onboard SAS controller! It would be very competitive with the Avoton C2550/C2750 line (which are randomly shitting themselves nowadays unfortunately...) and could be used for NAS appliances. DDR3 is dead now... but Kabini can support DDR3L.
If you don't need good single-thread performance and don't care that much about performance per watt, you can get used servers for very little money with lots of DDR3 memory (128GB+), such as dual-CPU X5670 systems. If you don't do anything fancy you get 80% of the performance for 20% of the price of a new system.
You can also buy used Xeon workstations and put in a new GPU card. The main problem is that the cooling isn't especially quiet, and they have proprietary motherboards and power supplies.
Not necessarily. I have an HP Z400 workstation with what appears to be an ATX PSU... but some of the pins are swapped and it won't actually boot with a non-OEM PSU. You can make an adapter, but in general it's not a safe assumption that this class of systems is compatible with off-the-shelf PSUs. These are OEM systems and the only thing that is guaranteed to work is OEM parts.
> Of course there's no guarantees, SKL-X turned out to be shit for gaming.
Only relatively. The performance is good, it just does not beat the i7-7700K. Which is no surprise given current games' reliance on fast cores. But you can absolutely game with Skylake-X, and get high FPS.
I'd also stay away from last generation APUs. They are slow, as it's basically Bulldozer with a GPU on top. Zen APUs coming out early 2018 might be a different story, and maybe they will have fixed some bugs by then.
It really is a pity the Pentium G4560 is getting so expensive. Whether that's a production reduction or miners buying it for their boxes, at the higher price it is not really competitive against the Ryzen 3 1200.
AMD's not doing a recall, it works decently enough for most applications. Their response is going to be "if it crashes your application, turn SMT off".
Consider they didn't even do a recall when Phenom had a showstopping TLB bug, they shipped a BIOS patch that disabled the affected TLB functionality.
And remember, Epyc is on a new stepping of the silicon, it's possible this is already fixed on it. (Threadripper is not, however)
I had a similar scenario occur on my Ryzen 1700 w/ Gigabyte B350 Motherboard. Nothing in the BIOS seemed to help, and updating to the latest version of the firmware didn't do much either.
Eventually I looked into some kernel docs and decided that setting the IOMMU mode to pt during bootup might work. Specifically, I added the following to my grub config:
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"
Not sure if this will help any of you, but it did completely eliminate the problems I had.
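For anyone trying the same workaround: after regenerating the GRUB config and rebooting, it is worth confirming that the parameter actually reached the kernel, which on Linux can be read from /proc/cmdline. A minimal Python sketch of that check; the helper names `kernel_param` and `read_cmdline` are made up for illustration:

```python
def kernel_param(cmdline, name):
    """Return the value of `name` on a kernel command line.

    Returns None if the parameter is absent and "" for a bare flag
    such as `quiet`. If the parameter appears twice, the last one wins.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return f.read().strip()


# Typical use on the affected machine:
#   print(kernel_param(read_cmdline(), "iommu"))
```

If this reports None after a reboot, the GRUB change most likely never made it into the generated boot config.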
Shameless plug, wrote a blog about my investigations into it:
I bought a custom-built Ryzen-based PC around Easter, and I have experienced some issues with it; I am not sure, though, where to put the blame (CPU, motherboard/firmware, operating system (which is openSUSE Tumbleweed)).
Under heavy load, the machine has performed most gracefully. However, the machine does freeze (almost) completely when left idle for a while (usually > 1 hour). When it happens, it sometimes still responds to pings, but nothing more; if I try to ssh into it from my laptop, I do not even receive a TCP ACK.
Unfortunately, I guess, there is no Kernel Panic, so no memory dump I could inspect or send to somebody who actually knows how to make sense of it.
On the upside, I have gotten into the habit of putting the machine into standby when I leave it alone for more than a couple of minutes, and I was pleasantly surprised that Suspend-to-Disk is a very acceptable option with an M.2 SSD. ;-P
Asus (who built my mainboard) releases firmware updates (which include microcode updates) on a fairly regular basis, so I hope this problem will be fixed eventually. I knew there was a risk of something like this happening when I got this machine, and overall, I am not disappointed. Otherwise, I am _very_ happy with the machine.
> When it happens, it sometimes still responds to pings, but nothing more; if I try to ssh into it from my laptop, I do not even receive a TCP ACK.
Interesting. Pinging is handled entirely by the kernel. I wonder if the ACK code path has to enter and leave userspace before the response comes out the network card?
If disabling C6 (as per other comments) doesn't fix it, one possible place to start would be:
1. Something like "while true; do date > file; sync; sleep 1; done" to track when userspace dies. (Okay, a proper C program that does fdatasync() just for that file would probably be better.)
2. Get inspired by http://elixir.free-electrons.com/linux/latest/source/drivers..., which borrows your system's RTC as non-volatile storage with simple write semantics - maybe you can borrow the year/month fields to store the current HH:MM. If on reboot the stored value matches wall clock time from another source (synced with the PC beforehand) at the moment you pull the plug, the kernel was still running. Then you could eg use the whole value to trace kernel locks and major mutexes etc.
3. Actually, if you can ping the kernel and that works, you could abuse the network stack to get data out!! Eg, pinging with certain bits set in the packet could trigger some magic code that returns system info. ICMP PING tunneling is a thing, and you could totally use this.
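Step 1 with a per-file flush instead of a global sync could look roughly like this. It is a Python sketch rather than the C program mentioned above, and the file name and the function names are placeholders. `os.fdatasync` is only available on Unix-like systems.

```python
import os
import time


def write_heartbeat(path, stamp):
    """Overwrite `path` with `stamp` and flush that one file's data to disk."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, stamp.encode() + b"\n")
        # Flushes only this file's data, unlike `sync`, which flushes everything.
        os.fdatasync(fd)
    finally:
        os.close(fd)


def run(path="heartbeat.txt", interval=1.0):
    """Record the wall clock time once per interval until userspace dies."""
    while True:
        write_heartbeat(path, time.strftime("%Y-%m-%d %H:%M:%S"))
        time.sleep(interval)
```

After the next hard reset, the timestamp left in the file is roughly the last second in which userspace was still being scheduled and the disk was still accepting writes.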
"But... I just want to use my computer..."
That's fair - waiting for someone else to fix everything works too :P
> Interesting. Pinging is handled entirely by the kernel. I wonder if the ACK code path has to enter and leave userspace before the response comes out the network card?
Normally there is an incoming backlog, i.e. the kernel accepts (SYN/ACKs) incoming connections before accept(2) happens; accept(2) just pops a connection off the backlog.
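This is easy to see on a healthy machine: a client can connect to a listening socket even if the server never calls accept(). A small Python sketch over loopback (the function name is made up, and the port is whatever the OS hands out):

```python
import socket


def connect_without_accept():
    """Return True if connect() succeeds although the server never accepts."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)               # handshakes are now completed by the kernel
    port = server.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2.0)
    try:
        # No accept() call anywhere; the connection just waits in the backlog.
        client.connect(("127.0.0.1", port))
        return True
    except OSError:
        return False
    finally:
        client.close()
        server.close()
```

So if the frozen box does not even answer the SYN on port 22, that points at something in the kernel's TCP path being stuck, not just at sshd no longer getting scheduled.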
I had a similar problem with my Ryzen 1700/ASRock X370 Taichi: every time I left the computer idle, when I came back it was frozen. I didn't try pinging to see if it responded.
What solved the problem for me completely was blacklisting the nouveau module for nvidia (about two weeks without a single freeze). In my case it was an option because I use an AMD GPU for Linux and an nvidia card for passthrough, so I have no need for the nouveau module to be loaded. BTW this is where I got the hint: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085
What I got from this is that it is not a nouveau driver bug, but rather a hardware bug that has a greater chance of being triggered if the module is loaded. The same thread suggests that disabling ASLR is also a valid workaround.
Unfortunately, I have an nVidia GPU, and I use the nouveau driver, because apparently using nVidia's proprietary driver on Tumbleweed is a huge PITA. Maybe I will have to bite that bullet eventually, though.
I'll definitely try this. I downgraded my latest BIOS because I couldn't stand the random crashes that had no rhyme or reason (crashed during gaming, youtube, encrypting my drives, emailing, browsing, among other activities).
The first post on AMD's community forum (https://community.amd.com/thread/215773?start=0&tstart=0) is almost three months old, so AMD have known about this for a long time. If it's not something that can be fixed in a UEFI update, then it's bad news for everyone: a weakened AMD means more stagnation in amd64.
So he's got a segfault every couple of minutes, wow... I've been running the same test for over 4 hours now on my Ryzen 1700 (and I've had several uneventful 30-40 minute runs before). To date, I only got one "internal compiler error: Illegal instruction" but no segfaults.
Whatever it is, it doesn't affect every chip the same way.
> Whatever it is, it doesn't affect every chip the same way.
If it is marginal timing in some part of the chip, then that, combined with statistical process and environmental variation and the increasingly tiny geometries (which serve to amplify the variation), means the problem could really occur quite randomly. Modern CPUs are pushing the limits in more ways than one, and IMHO this is what happens when they go too far.
That's kind of how interrupts work though. It's a random disruption of normal control flow. Something like this could fail a different way every time you run it.
If IRETQ is failing to RET under load then this is a huge issue. How do you even fix that? Load balance interrupt processing across cores? Throttle any process in a tight loop? It's all ugly hacks. AMD needs to fix this.
Not the first bug reported for Ryzen. Weren't there a couple of others too, one with Linux locking up and another triggered by the OCaml compiler emitting opcodes referring to the AH/BH/CH/DH registers in a tight loop?
Edit: Sorry, the OCaml bug was Intel Skylake. It's interesting how so many new CPUs have breaking bugs. Feels like it's been quiet since the original Pentium F00F bug and then all of a sudden everyone's new CPUs break.
It mentions something that I was really worried about - Intel knowingly requested that its own validation efforts be reduced so it can get half-baked products onto the market faster. Pretty sad that both Intel and AMD seem to be cutting corners just to be a little bit faster than the competition.
> When a cpu-thread stalls in this manner it appears to stall INSIDE the microcode for IRETQ. It doesn't make it to the return pc, and the cpu thread cannot take any IPIs or other hardware interrupts while in this state.
I have a Ryzen machine I bought and put into a Jenkins cluster. It does builds on a bunch of VMs occasionally and pegs the CPUs. I have had no issue so far.
It probably affects Windows and every other OS as well. It's just less likely to be seen by their typical user workloads, or more likely to be written off as some other issue when it does happen.
For some reason my boss was super gung-ho about Ryzen and so now I have a work PC that randomly freezes at unknown intervals. Sometimes twice in a day, but usually more like once a week or week and a half. They're hard freezes, and typically nothing gets logged - you see a normal entry in syslog for a normal system event at a random time and then the next entry will be from when you got in to work and had to hard reset. Pretty sure at this point it's either CPU or mobo related (though we don't have the tools to verify the power supply under load), but no real means of diagnosing the problem.
I wish motherboards had built in PSU testers. It would be super simple to have a cheap ADC measuring the voltage on every power line all the time, and then have some way that software could access the current voltage and minimum and maximum seen in the last few seconds.
That could then be paired with a BIOS which displays a boot-time, resettable warning if the PSU has been misbehaving.
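The software side of that would be tiny. A rough Python sketch of the "current, minimum and maximum over the last few seconds" bookkeeping, assuming some ADC driver feeds in timestamped readings. The class name, the 5-second window and the 5% tolerance are all assumptions for illustration, not any real board's interface:

```python
from collections import deque


class RailMonitor:
    """Keep recent voltage samples for one power rail."""

    def __init__(self, window_s=5.0):
        self.window_s = window_s
        self.samples = deque()  # (timestamp_seconds, volts), oldest first

    def add(self, t, volts):
        """Record a sample and drop everything older than the window."""
        self.samples.append((t, volts))
        while self.samples and self.samples[0][0] < t - self.window_s:
            self.samples.popleft()

    def summary(self):
        """Return (current, minimum, maximum) over the window."""
        volts = [v for _, v in self.samples]
        return volts[-1], min(volts), max(volts)


def out_of_tolerance(volts, nominal, tolerance=0.05):
    """True if a reading deviates from nominal by more than the tolerance.

    The 5% default is an assumed limit; check the PSU spec for each rail.
    """
    return abs(volts - nominal) > nominal * tolerance
```

The firmware would then only need to latch a flag whenever `out_of_tolerance` fires and show it at the next boot until the user clears it.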
I don't think that is the case for this particular issue. There may be other reported SMT bugs (still some instability) but here it has to be a pair of HTs:
> if one hyperthread is in a cpu-bound loop of any kind (can be in user mode), and the other hyperthread is returning from an interrupt via IRETQ ...
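If you want to know which logical CPUs are actually the two hyperthreads of one core (for pinning a test, say), Linux exposes that in sysfs as thread_siblings_list. A small Python sketch for reading it; the function names are made up, and the parser only handles the usual comma and range syntax such as "0,8" or "0-1":

```python
def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,8' or '0-1' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        lo, sep, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi if sep else lo) + 1))
    return cpus


def read_siblings_sysfs(cpu):
    """Read the raw sibling list for one logical CPU (Linux only)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as f:
        return f.read()


def sibling_groups(read_siblings, cpu_ids):
    """Return the distinct groups of logical CPUs that share a physical core."""
    groups = {tuple(parse_cpu_list(read_siblings(cpu))) for cpu in cpu_ids}
    return sorted(groups)


# Typical use on a 16-thread part:
#   print(sibling_groups(read_siblings_sysfs, range(16)))
```

The numbering of siblings varies between systems, which is why it is safer to read it than to guess.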
I'm thinking of building a new PC this autumn or spring but news like this about the new AMD processors is making me a bit uneasy about committing to them. I don't think bugs like this and the one with GCC crashing (might be the same) will affect me but it's a risk I'm not sure I want to take.
Original Ryzen is B1 stepping. EPYC should already be a newer one; with ThreadRipper some say it's the old B1. So you might be looking for B3/C stepping instead ;-)