This post is fantastic. I wish there was more workstation porn like this for those of us who are not into the RGB light show ripjaw hacksaw aorus elite novelty stuff that gamers are so into. Benchmarks in the community are almost universally focused on gaming performance and FPS.
I want to build an epic rig that will last a long time with professional grade hardware (with ECC memory for instance) and would love to get a lot of the bleeding-edge stuff without compromising on durability. Where do these people hang out online?
Thanks! If you're interested in building a Threadripper Pro WX-based system like mine, AMD apparently starts selling the CPUs independently from March 2021 onwards:
Previously you could only get this CPU when buying the Lenovo ThinkStation P620 machine. I'm pretty happy with Lenovo ThinkStations though (I bought a P920 with dual Xeons 2.5 years ago).
My only quibble with that board is that I worry about how easily the chipset fan can be replaced. In my experience that exact type of fan will inevitably fail in a moderately dusty environment... And it doesn't look like anything you could swap with the common industry-standard 40mm or 60mm 12VDC fans that come in various thicknesses.
Fortunately you can often swap the complete heatsink and fan combo on the chipset with a different one. If the mounting method is strange one can use thermal epoxy or thermal adhesive tape.
Even LinusTechTips has some decent content for server hardware, though they stay fairly superficial. And the forum definitely has people who can help out: https://linustechtips.com/
And the thing is, depending on what metric you judge performance by, the enthusiast hardware may very well outperform the server hardware. For something that is sensitive to memory, e.g., you can get much faster RAM in enthusiast SKUs (https://www.crucial.com/memory/ddr4/BLM2K8G51C19U4B) than you'll find in server hardware. Similarly, the HEDT SKUs out-clock the server SKUs for both Intel and AMD.
I have a Threadripper system that outperforms most servers I work with on a daily basis, because most of my workloads, despite being multi-threaded, are sensitive to clock speed.
No one's using "gamer NICs" for high speed networking. Top of the line "gaming" networking is 802.11ax or 10GbE. 2x200Gb/s NICs are available now.
Gaming parts are strictly single socket - software that can take advantage of >64 cores will need server hardware - either one of the giant Ampere ARM CPUs or a 2+ socket system.
If something must run in RAM and needs TB of RAM, well then it's not even a question of faster or slower. The capability only exists on server platforms.
Some workloads will benefit from the performance characteristics of consumer hardware.
Workstations and desktops are distinct market segments. The machine in the article uses a workstation platform. And the workstation processors available in that Lenovo machine clock slower than something like a 5950X mainstream processor. The RDIMMs you need to get to 1TB in that machine run much slower than the UDIMMs I linked above.
I'm with you on this, I just built a (much more modest than the article's) workstation/homelab machine a few months ago, to replace my previous one which was going on 10 years old and showing its age.
There's some folks in /r/homelab who are into this kind of thing, and I used their advice a fair bit in my build. While it is kind of mixed (there's a lot of people who build pi clusters as their homelab), there's still plenty of people who buy decommissioned "enterprise" hardware and make monstrous-for-home-use things.
Look at purchasing used enterprise hardware. You can buy a reliable X9 or X10 generation Supermicro server (rack or tower) for around a couple hundred dollars.
I've been planning to do this, but enterprise hardware seems to require a completely different set of knowledge about how to purchase and maintain it, especially as a consumer.
The barrier to entry isn't as low as with consumer desktops, but I suppose that's the point. Still, it would be nice if there were a guide to help me make good decisions when starting out.
Downside of buying enterprise for home use is noise - their turbofan coolers are insanely loud, while consumer-grade 120mm coolers (Noctua et al) are nearly silent.
Another downside is power consumption at rest. A Supermicro board with 2x Xeons uses 80 watts at minimum. Add a 10Gbit switch and a few more peripherals and you’re looking at an additional $/€80 per month electricity bill. Year after year, that adds up to roughly $/€10,000 over 10 years.
Of course that is nothing compared to what you’d pay at Google/Azure/AWS for the AMD machine of this news item :-)
12V-only PSUs like OEMs use, or ATX12VO, in combination with a motherboard without IPMI (similar to the German Fujitsu motherboards), have significantly lower power consumption at rest. Somewhere around 8-10 watts without HDDs. Much better for home use IMHO.
In the US, electricity rates are typically much cheaper than in the EU. My rate is roughly €0.08/kWh, for example, and I don't get any subsidies to convert to solar, so there's no way it would pay off for me within 15 years (longer than most people here expect to stay in a home). Other US states subsidize so heavily, or have such high electricity rates, that most people have solar panels (see: Hawaii, with among the highest electricity costs in the US).
Regardless of electricity cost, all that electricity usage winds up with a lot of heat in a dwelling. To help offset the energy consumption in the future I plan to use a hybrid water heater that can act as a heat pump and dehumidifier and capture the excess heat as a way to reduce energy consumption for hot water.
It’s mostly about the chassis though - density is important with enterprise gear, and noise level is almost irrelevant, hence small chassis with small, loud fans.
I’ve got a 16-bay 3.5” Gooxi chassis that I’ve put a Supermicro motherboard + Xeon in.
I got this specific NAS chassis because it has a fan wall with 3x 120mm fans, not because I need the bays.
With a few rather cool SSDs for storage and quiet Noctua fans it is barely a whisper.
Also - vertical rack mounting behind a closet door!
I can have a massive chassis that takes up basically no space at all. Can’t believe I didn’t figure that one out earlier...
Mostly yes, because server chassis are very compact and sometimes use proprietary connectors and fans. Still, many people have done that with good results; have a look on YouTube to see which server models are best suited for that kind of customization.
I've not been successful trying this with HPE servers. Most server fans (Foxconn/Delta) run at 2.8 amps or higher.
I'm not aware of any "silent" gaming-grade fans that use more than 0.38 amps.
That's not even considering the CFM.
Amps * Volts is power. Power is a proxy (a moderately good one) for air movement (a mix of volume/mass at a specific [back-]pressure).
It’s not likely that a silent 2W fan will move a similar amount of air as the stock 14W fans. The enterprise gear from HPE is pretty well engineered; I’m skeptical that they over-designed the fans by a 7x factor.
Operating voltage tells you “this fan won’t burn up when you plug it in”. It doesn’t tell you “will keep the components cool”.
Though I have to wonder.... would these be good gaming systems? Are there any scenarios where the perks (stupid numbers of cores, 8-channel memory, 128 PCI-E lanes, etc) would help?
Check out HardForum. Lots of very knowledgeable people on there helped me mature my hardware-level knowledge, back when I was building 4-CPU, 64-core Opteron systems. Also decent banter.
Happy to help if you want feedback. Servethehome forums are also a great resource of info and used hardware, probably the best community for your needs.
Author here: This article was intended to explain some modern hardware bottlenecks (and non-bottlenecks), but unexpectedly ended up covering a bunch of Linux kernel I/O stack issues as well :-) AMA
I just love this article. Especially when the norm is always about scaling out instead of scaling up. We can have 128-core CPUs, 2TB of memory, and PCI-E 4.0 SSDs (and soon PCI-E 5.0). We could even fit a petabyte of SSD storage in 1U.
I remember WhatsApp used to serve its 500M users with only a dozen large FreeBSD boxes (only to be taken apart by Facebook).
So Thank you for raising awareness. Hopefully the pendulum is swinging back to conceptually simple design.
>I also have a 380 GB Intel Optane 905P SSD for low latency writes
I would love to see that. Although I am waiting for someone to do a review of the Optane SSD P5800X [1]. Random 4K IOPS up to 1.5M with latency below 6µs.
>> I remember WhatsApp used to operate its 500M user with only a dozen of large FreeBSD boxes.
With 1TB of RAM you can have 256 bytes for every person on earth live in memory. With SSD either as virtual memory or keeping an index in RAM, you can do meaningful work in real time, probably as fast as the network will allow.
When I first moved to the bay area, the company that hired me asked me what kind of computer I wanted and gave me a budget (like $3000 or something)... I spent a few days crafting a parts list so I could build an awesome workstation. Once I sent it over they were like "Uh, we just meant which macbook do you want?" and kind of gave me some shade about it. They joked, so how are you going to do meetings or on call?
I rolled with it, but really wondered if they knew I could get 2x the hardware, and have a computer at home and at work, for less money than the MBP... Most people didn't seem to understand that laptop CPUs are not the same as desktop/workstation ones, especially once they hit thermal throttling.
At my last-but-one job, my boss offered me an iMac Pro; I asked if I could just have the equivalent money for hardware and he said sure.
Which is how I ended up with an absolute monster of a work machine, these days I WFH and while work issued me a Macbook Pro it sits on the shelf behind me.
Fedora on a (still fast) Ryzen/2080 and 2x4K 27" screens vs a Macbook Pro is a hilarious no brainer for me.
Upgrading soon, but can't decide whether I need the 5950X or merely want it - realistically, except for gaming I'm nowhere near tapping out this machine (and it's still awesome for that and VR, which is why the step-son is about to get, in his words, a "sick" PC).
I mean it would have been a totally valid answer to say that you intended to use a $600 laptop as effectively a thin client, and spend $2400 on a powerful workstation PC to drive remotely.
I was a MacBook Pro user for a decade+, then dropped laptops for desktop machines: first an iMac Pro, currently a 12-core Ryzen. I no longer understand why I had a laptop for so long. Status, I guess (only talking about me).
You should look at CPU usage. There is a good chance all your interrupts are hitting CPU 0. You can run hwloc to see which chiplet the PCIe cards are attached to and handle interrupts on those cores.
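For example, something along these lines (a rough sketch; the grep pattern depends on your device names):
$ grep -i nvme /proc/interrupts   # one column per CPU - shows which CPUs service the NVMe IRQs
$ lstopo-no-graphics              # from the hwloc package - shows which PCIe devices sit under which chiplet/NUMA node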
Thanks for the "hwloc" tip. I hadn't thought about that.
I was thinking of doing something like that. Weirdly I got sustained throughput differences when I killed & restarted fio. So, if I got 11M IOPS, it stayed at that level until I killed fio & restarted. If I got 10.8M next, it stayed like it until I killed & restarted it.
This makes me think that I'm hitting some PCIe/memory bottleneck, dependent on process placement (which process happens to need to move data across infinity fabric due to accessing data through a "remote" PCIe root complex or something like that). But then I realized that Zen 2 has a central IO hub again, so there shouldn't be a "far edge of I/O" like on current gen Intel CPUs (?)
But there's definitely some workload placement and I/O-memory-interrupt affinity that I've wanted to look into. I could even enable the NUMA-like-mode from BIOS, but again with Zen 2, the memory access goes through the central infinity-fabric chip too, I understand, so not sure if there's any value in trying to achieve memory locality for individual chiplets on this platform (?)
So there are two parts to CPU affinity: a) the CPU assigned to the SSD for handling interrupts, and b) the CPU assigned to fio. numactl is your friend for experimenting with changing fio affinity.
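For example, something like this (a sketch; the node number depends on your NPS/BIOS setting and the device path is an example):
$ numactl --cpunodebind=0 --membind=0 fio --name=randread --ioengine=io_uring --rw=randread --bs=4k --iodepth=32 --direct=1 --filename=/dev/nvme0n1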
The PCIe is all on a single IO die, but internally it is organized into quadrants that can produce some NUMA effects. So it is probably worth trying out the motherboard firmware settings to expose your CPU as multiple NUMA nodes, and using the FIO options to allocate memory only on the local node, and restricting execution to the right cores.
Yep, I enabled the "numa-like-awareness" in BIOS and ran a few quick tests to see whether the NUMA-aware scheduler/NUMA balancing would do the right thing and migrate processes closer to their memory over time, but didn't notice any benefit. But yep I haven't manually locked down the execution and memory placement yet. This placement may well explain why I saw some ~5% throughput fluctuations only if killing & restarting fio and not while the same test was running.
I have done some tests on AMD servers and I found that the Linux scheduler does a pretty good job.
I do however get noticeable (a couple percent) better performance by forcing the process to run on the correct numa node.
Make sure you get as many numa domains as possible in your BIOS settings.
I recommend using numactl with the cpu-exclusive and mem-exclusive flags. I have noticed a slight performance drop when the RAM cache fills beyond the sticks local to the CPUs doing the work.
One last comment is that you mentioned interrupts being "striped" among CPUs. I would recommend pinning the interrupts from one disk to one NUMA-local CPU and using numactl to run fio for that disk on the same CPU.
An additional experiment is to, if you have enough cores, pin interrupts to CPUs local to disk, but use other cores on the same numa node for fio. That has been my most successful setup so far.
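Roughly like this, as a sketch (the IRQ number, CPU and node IDs are placeholders you'd read from /proc/interrupts and sysfs; irqbalance may need to be stopped or it will undo manual affinity):
$ cat /sys/class/nvme/nvme0/device/numa_node          # which NUMA node the disk hangs off
$ grep nvme0q /proc/interrupts                        # find the disk's IRQ numbers
$ echo 4 | sudo tee /proc/irq/123/smp_affinity_list   # pin IRQ 123 to CPU 4 on that node
$ numactl --cpunodebind=0 --membind=0 fio ...         # run fio on other cores of the same node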
I have the same box, but with the 32 core CPU and fewer NVMe drives. I've not poked at all the PCIe slots yet, but all that I've looked at are in NUMA node 1. This includes the on board M.2 slots. It is in NPS=4 mode.
Mine goes only up to 2 NUMA nodes (as shown in numactl --hardware), despite setting NPS4 in BIOS. I guess it's because I have only 2 x 8-core chiplets enabled (?)
I think that in addition to allocating a queue per CPU, you need to be able to allocate a MSI(-X) vector per CPU. That shouldn't be a problem for the Samsung 980 PRO, since it supports 128 queues and 130 interrupt vectors.
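You can check the advertised vector count with lspci (the bus address and the exact output line here are just an example):
$ sudo lspci -vv -s 01:00.0 | grep -i msi-x
	Capabilities: [b0] MSI-X: Enable+ Count=130 Masked-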
Good question. I don't ever read kernel code as a starting point, only if some profiling or tracing tool points me towards an interesting function or codepath. And interesting usually is something that takes most CPU in perf output or some function call with an unusually high latency in ftrace, bcc/bpftrace script output. Or just a stack trace in a core- or crashdump.
As far as mindset goes - I try to apply the developer mindset to system performance. In other words, I don't use much of what I call the "old school sysadmin mindset", from a time where better tooling was not available. I don't use systemwide utilization or various get/hit ratios for doing "metric voodoo" of Unix wizards.
The developer mindset dictates that everything you run is an application. The JVM is an application. The kernel is an application. Postgres and Oracle are applications. All applications execute one or more threads that either run on CPU or do not run on CPU. There are only two categories of reasons why a thread does not run on CPU (is sleeping): the OS put the thread to sleep (involuntary blocking), or the thread voluntarily wanted to go to sleep (for example, it realized it can't get some application-level lock).
And you drill down from there. Your OS/system is just a bunch of threads running on CPU, sleeping and sometimes communicating with each other. You can directly measure all of these things easily nowadays with profilers, no need for metric voodoo.
I have written my own tools to complement things like perf, ftrace and the BPF stuff - as a consultant I regularly see 10+ year old Linux versions, etc. - and I find that sampling thread states from the /proc filesystem is a really good (and flexible) starting point for system performance analysis and even some drilldown - all this without having to install new software or upgrade to the latest kernels. Some of the tools I showed in my article too:
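To give the gist of the /proc sampling approach, here's a minimal sketch (not one of my actual tools - it just counts all threads on the system by state once per second):
$ while true; do grep -h '^State:' /proc/[0-9]*/task/[0-9]*/status 2>/dev/null | awk '{ s[$2]++ } END { for (k in s) printf "%s=%d ", k, s[k]; print "" }'; sleep 1; done
R means runnable/on CPU, D is uninterruptible sleep (often I/O) and S is voluntary sleep - real tools then drill down into stack traces and wait channels from there.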
Excellent article, thank you!
I really like the analysis and profiling part of the evaluation.
I also have some experience with I/O performance in Linux -- we measured 30GiB/s in a PCIe Gen3 box (shameless plug [0]).
I have one question / comment: did you use multiple jobs for the BW (large IO) experiments? If yes, then did you set randrepeat to 0? I'm asking this because fio by default uses the same sequence of offsets for each job, in which case there might be data re-used across jobs. I had verified that with blktrace a few years back, but it might have changed recently.
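For reference, I'd check it with something like this (just a sketch; the device path and sizes are examples):
$ fio --name=bw-test --ioengine=io_uring --rw=randread --bs=1M --iodepth=32 --direct=1 --numjobs=4 --randrepeat=0 --filename=/dev/nvme0n1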
Looks interesting! I wonder whether there'd be interesting new database applications on NVMe when doing I/Os as small as 512 bytes (with a more efficient "IO engine" than Linux bio, which has too high a CPU overhead with such small requests).
I mean, currently OLTP RDBMS engines tend to use 4k, 8k (and some) 16k block size and when doing completely random I/O (or, say traversing an index on customer_id that now needs to read random occasional customer orders across years of history). So you may end up reading 1000 x 8 kB blocks just to read 1000 x 100B order records "randomly" scattered across the table from inserts done over the years.
Optane persistent memory can do small, cache line sized I/O I understand, but that's a different topic. When being able to do random 512B I/O on "commodity" NVMe SSDs efficiently, this would open some interesting opportunities for retrieving records that are scattered "randomly" across the disks.
edit: to answer your question, I used 10 separate fio commands with numjobs=3 or 4 for each and randrepeat was set to default.
At Netflix, I'm playing with an EPYC 7502P with 16 NVMe drives and dual 2x100GbE Mellanox ConnectX-6 Dx NICs. With hardware kTLS offload, we're able to serve about 350Gb/s of real customer traffic. This goes down to about 240Gb/s when using software kTLS, due to memory bandwidth limits.
>we're able to serve about 350Gb/s of real customer traffic.
I still remember the post about breaking the 100Gbps barrier, that was maybe in 2016 or '17? And it wasn't that long ago it was 200Gbps, and if I remember correctly it was hitting a memory bandwidth barrier as well.
And now 350Gbps?!
So what's next? Wait for DDR5? Or moving to some memory controller black magic like POWER10?
Yes, before hardware inline kTLS offload, we were limited to 200Gb/s or so with Naples. With Rome, it's a bit higher. But hardware inline kTLS with the Mellanox CX6-DX eliminates memory bandwidth as a bottleneck.
The current bottleneck is IO related, and it's unclear what the issue is. We're working with the hardware vendors to try to figure it out. We should be getting about 390Gb/s.
> But hardware inline kTLS with the Mellanox CX6-DX eliminates memory bandwidth as a bottleneck.
For a while now I had operated under the assumption that CPU-based crypto with AES-GCM was faster than most hardware offload cards. What makes the Mellanox NIC perform better?
I.e.: Why does memory bandwidth matter to TLS? Aren't you encrypting data "on the fly", while it is still resident in the CPU caches?
> We're working with the hardware vendors to try to figure it out. We should be getting about 390Gb/s
Something I explained to a colleague recently is that a modern CPU gains or loses more computing power from a 1°C temperature difference in the room's air than my first four computers had combined.
You're basically complaining that you're missing a mere 10% of the expected throughput. But put in absolute terms, that's 40 Gbps, which is about 10x more than what a typical server in 2020 can put out on the network. (Just because you have 10 Gbps NICs doesn't mean you can get 10 Gbps! Try iperf3 and you'll be shocked that you're lucky if you can crack 5 Gbps in practice.)
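E.g. a quick sanity check (host name is a placeholder):
$ iperf3 -s                       # on the receiving machine
$ iperf3 -c testhost -t 30        # single TCP stream from the sender
$ iperf3 -c testhost -t 30 -P 4   # 4 parallel streams - often needed to get anywhere near line rate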
> For a while now I had operated under the assumption that CPU-based crypto with AES-GCM was faster than most hardware offload cards. What makes the Mellanox NIC perform better?
> I.e.: Why does memory bandwidth matter to TLS? Aren't you encrypting data "on the fly", while it is still resident in the CPU caches?
It may depend on what you're sending. Netflix's use case is generally sending files. If you're doing software encryption you would load the plaintext file into memory (via the filesystem/unified buffer cache), then write the (session-specific) encrypted data into separate memory, then hand that memory to the NIC to send out.
If the NIC can do the encryption, you would load the plain text into memory, then tell the NIC to read from that memory to encrypt and send out. That saves at least a write pass, and probably a read pass. (256 MB of L3 cache on latest EPYC is a lot, but it's not enough to expect cached reads from the filesystem to hit L3 that often, IMHO)
If my guesstimate is right, a cold file would go from hitting memory 4 times to hitting it twice. And a file in disk cache would go from 3 times to once; the CPU doesn't need to touch the memory if it's in the disk cache.
Note that this is a totally different case from encrypting dynamic data that's necessarily touched by the CPU.
> You're basically complaining that you're unable to get a mere 10% of the expected throughput. But put in absolute terms, that's 40 Gbps, which is about 10x more than what a typical server in 2020 can put out on the network. (Just because you have 10 Gbps NICs doesn't mean you can get 10 Gbps! Try iperf3 and you'll be shocked that you're lucky if you can crack 5 Gbps in practice)
I had no problem serving 10 Gbps of files on a dual Xeon E5-2690 (v1; a 2012 CPU), although that CPU isn't great at AES, so I think it only did 8 Gbps or so with TLS; the next round of servers for that role had 2x 10G and 2690 v3 or v4 (2014 or 2016; but I can't remember when we got them) and thanks to better AES instructions, they were able to do 20 G (and a lot more handshakes/sec too). If your 2020 servers aren't as good as my circa 2012 servers were, you might need to work on your stack. OTOH, bulk file serving for many clients can be different than a single connection iperf.
> If my guestimate is right, a cold file would go from hitting memory 4 times to hitting it twice. And a file in disk cache would go from 3 times to once; the CPU doesn't need to touch the memory if it's in the disk cache.
> I.e.: Why does memory bandwidth matter to TLS? Aren't you encrypting data "on the fly", while it is still resident in the CPU caches?
I assume NF's software pipeline is zero copy, so if TLS is done in the NIC data only gets read from memory once when it is DMA'd to the NIC. With software TLS you need to read the data from memory (assuming it's not already in cache, which given the size of data NF deals with is unlikely), encrypt it, then write it back out to main memory so it can be DMA'd to the NIC. I know Intel has some fancy tech that can DMA directly to/from the CPU's cache, but I don't think AMD has that capability (yet).
> Try iperf3 and you'll be shocked that you're lucky if you can crack 5 Gbps in practice
Easy line rate if you crank the MTU all the way to 9000 :D
> modern CPU gains or loses more computer power from a 1° C temperature difference in the room's air
If you're using the boost algorithm rather than a static overclock, and when that boost is thermally limited rather than current limited. With a good cooler it's not too hard to always have thermal headroom.
> Easy line rate if you crank the MTU all the way to 9000 :D
In my experience jumbo frames provide at best an improvement of about 20% in rare cases, such as ping-pong UDP protocols such as TFTP or Citrix PVS streaming.
Which NICs would you recommend for me to buy for testing at least 1x100 Gbps (ideally 200 Gbps?) networking between this machine (PCIe 4.0) and an Intel Xeon one that I have with PCIe 3.0. Don't want to spend much money, so the cards don't need to be too enterprisey, just fast.
And - do such cards even allow direct "cross" connection without a switch in between?
Great article, I learned! Can you tell me if you looked into aspects of the NVMe device itself, such as whether it supports 4K logical blocks instead of 512B? Use `nvme id-ns` to read out the supported logical block formats.
Doesn't seem to support 4k out of the box? Some drives - like Intel Optane SSDs - allow changing this in firmware (and reformatting) with a manufacturer's utility...
$ lsblk -t /dev/nvme0n1
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA WSAME
nvme0n1 0 512 0 512 512 0 none 1023 128 0B
$ sudo nvme id-ns -H /dev/nvme0n1 | grep Size
LBA Format 0 : Metadata Size: 0 bytes - Data Size: 512 bytes - Relative Performance: 0 Best (in use)
Thanks for checking. SSD review sites never mention this important detail. For some reason the Samsung datacenter SSDs support the 4K LBA format even though they are very similar to the retail SSDs, which don't seem to. I have a retail 970 Evo that only provides 512B.
I just checked my logs, and none of Samsung's consumer NVMe drives have ever supported sector sizes other than 512B. They seem to view this feature as part of their product segmentation strategy.
Some consumer SSD vendors do enable 4kB LBA support. I've seen it supported on consumer drives from WD, SK hynix and a variety of brands using Phison or SMI SSD controllers (including Kingston, Seagate, Corsair, Sabrent). But I haven't systematically checked to see which brands consistently support it.
At least early WD Black models don't really seem to have 4K LBA support. The format option is listed, but it refuses to actually run the command to reformat the drive to the new "sector" size.
Put your system to sleep and wake it back up. (I use `rtcwake -m mem -s 10`). Power-cycling the drive like this resets whatever security lock the motherboard firmware enables on the drive during the boot process, allowing the drive to accept admin commands like NVMe format and ATA secure erase that would otherwise be rejected. Works on both the WD Black SN700 and SN750 models, doesn't seem to be necessary on the very first (Marvell-based) WD Black or the latest SN850.
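After that power-cycle trick, the reformat itself would be something like this (the LBA format index to pass depends on what `nvme id-ns -H` lists for your drive, and this destroys all data on it):
$ sudo nvme id-ns -H /dev/nvme0n1 | grep 'Data Size'   # find the index of the 4096-byte format
$ sudo nvme format /dev/nvme0n1 --lbaf=1               # example index - wipes the drive!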
I think that's the second-gen WD Black, but the first one that had their in-house SSD controller rather than a third-party controller. The marketing and packaging didn't prominently use a more specific model number to distinguish it from the previous WD Black, but on the drive's label it does say "PC SN700". Also, the first-gen WD Black was 256GB and 512GB capacities, while the later generations are 250/500/1000/2000GB. Firmware version strings for the first-gen WD Black were stuff like "B35200WD", while the SN700/720/730/750 family have versions like "102000WD" and "111110WD". So I would definitely expect your drive to require the sleep-wake cycle before it'll let you reformat to 4k sectors.
But this thread gets into details that are more esoteric than what I cover in most reviews, which are written with a more Windows-oriented audience in mind. Since I do most of my testing on Linux and have an excess of SSDs littering my office, I'm well-equipped to participate in a thread like this.
I highly recommend reddit.com/r/NewMaxx as the clearinghouse for consumer SSD news and Q&A. I'm not aware of a similarly comprehensive forum for enterprise storage, where this thread would probably be a better fit.
Regardless of what sector size you configure the SSD to expose, the drive's flash translation layer still manages logical to physical mappings at a 4kB granularity, the underlying media page size is usually on the order of 16kB, and the erase block size is several MB. So what ashift value you want to use depends very much on what kind of tradeoffs you're okay with in terms of different aspects of performance and write endurance/write amplification. But for most flash-based SSDs, there's no reason to set ashift to anything less than 12 (corresponding to 4kB blocks).
There are downsides to forcing the OS/FS to always use larger block sizes for IO. You might simply be moving some write amplification out of the SSD and into the filesystem, while losing some performance in the process. Which is why it really depends on your workload, and to some extent on the specific SSD in question. I'm not convinced that ashift=14 is a sensible one size fits all recommendation, even if we're talking only about recent-model consumer-grade NAND SSDs.
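If you do go with 4kB, a minimal ZFS sketch (pool and device names are placeholders; `zdb` is one way to verify what the pool recorded):
$ sudo zpool create -o ashift=12 tank /dev/nvme0n1
$ zdb -C tank | grep ashift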
3) Why? As a performance troubleshooter consultant+trainer, I regularly have to reproduce complex problems that show up only under high concurrency & load - stuff that you can't just reproduce in a VM in a laptop.
4) Fun - seeing if the "next gen" hardware's promised performance is actually possible!
FYI I have some videos from my past complex problem troubleshooting adventures, mostly Oracle stuff so far and some Linux performance troubleshooting:
What I find interesting about the performance of this type of hardware is how it affects the software we are using for storage.
The article talked about how the Linux kernel just can't keep up, but what about databases or KV stores? Are the trade-offs those types of solutions make still valid for this type of hardware?
RocksDB, and LSM algorithms in general, seem to be designed with the assumption that random block I/O is slow. It appears that, for modern hardware, that assumption no longer holds, and the software only slows things down [0].
I have personally found that making even the most primitive efforts at single-writer principle and batching IO in your software can make many orders of magnitude difference.
Saturating an NVMe drive with a single x86 thread is trivial if you change how you play the game. Using async/await and yielding to the OS is not going to cut it anymore. Latency with these drives is measured in microseconds. You are better off doing micro-batches of writes (10-1000 µs wide) and pushing these to disk with a single thread that monitors a queue in a busy-wait loop (sort of like the LMAX Disruptor but even more aggressive).
Thinking about high core count parts, sacrificing an entire thread to busy waiting so you can write your transactions to disk very quickly is not a terrible prospect anymore. This same ideology is also really useful for ultra-precise execution of future timed actions. Approaches in managed languages like Task.Delay or even Thread.Sleep are insanely inaccurate by comparison. The humble while(true) loop is certainly not energy efficient, but it is very responsive and predictable as long as you don't ever yield. What's one core when you have 63 more to go around?
Isn't the use or non-use of async/await a bit orthogonal to the rest of this?
I'm not an expert in this area, but wouldn't it be just as lightweight to have your async workers pushing onto a queue, and then have your async writer only wake up when the queue is at a certain level to create the batched write? Either way, you won't be paying the OS context switching costs associated with blocking a write thread, which I think is most of what you're trying to get out of here.
Right, I agree. I'd go even further and say that async/await is a great fit for a modern asynchronous I/O stack (not read()/write()). Especially with io_uring using polled I/O (the worker thread is in the kernel, all the async runtime has to do is check for completion periodically), or with SPDK if you spin up your own I/O worker thread(s) like @benlwalker explained elsewhere in the thread.
Very interesting. I'm currently designing and building a system which has a separate MCU just for timing-accurate stuff, rather than taking on the burden of realtime kernel stuff, but I never considered just dedicating a core. Then I could also use that core specifically to handle some IO queues too perhaps, so it could do double duty and not necessarily be wasteful. Thanks... now I need to go figure out why I either didn't consider that - or perhaps I did and discarded it for some reason beyond me right now. Hmm... thought-provoking post of the day for me.
The authors of the article I linked to earlier came to the same conclusions. And so did the SPDK folks. And the kernel community (or axboe :)) when coming up with io_uring.
I'm just hoping that we will see software catching up.
>Latency with these drives is measured in microseconds.
For context and to put numbers around this, the average read latency of the fastest, latest-generation PCIe 4.0 x4 U.2 enterprise drives is 82-86µs, and the average write latency is 11-16µs.
ScyllaDB had a blog post once about how surprisingly little CPU time is available to process packets on the fastest modern networks, like 40Gbit and up.
I can't find it now. I think they were trying to say that Cassandra can't keep up because of the JVM overhead and you need to be close to the metal for extreme performance.
This is similar. Huge amounts of flooding I/O from modern PCIe SSDs really closes the traditional gap between CPU and "disk".
The biggest limiter in the cloud right now is the EBS/SAN. Sure, you can use local storage in AWS if you don't mind it disappearing, but while gp3 is an improvement, it pales next to stuff like this.
Also, this is fascinating:
"Take the write speeds with a grain of salt, as TLC & QLC cards have slower multi-bit writes into the main NAND area, but may have some DIMM memory for buffering writes and/or a “TurboWrite buffer” (as Samsung calls it) that uses part of the SSDs NAND as faster SLC storage. It’s done by issuing single-bit “SLC-like” writes into TLC area. So, once you’ve filled up the “SLC” TurboWrite buffer at 5000 MB/s, you’ll be bottlenecked by the TLC “main area” at 2000 MB/s (on the 1 TB disks)."
I didn't know controllers could swap between TLC/QLC and SLC.
Hi! From ScyllaDB here. There are a few things that help us really get the most out of hardware and network IO.
1. Async everywhere - We use AIO and io_uring to make sure that your inter-core communications are non-blocking.
2. Shard-per-core - It also helps if specific data is pinned to a specific CPU, so we partition on a per-core basis. Avoids cross-CPU traffic and, again, less blocking.
3. Schedulers - Yes, we have our own IO scheduler and CPU scheduler. We try to get every cycle out of a CPU. Java is very "slushy" and though you can tune a JVM it is never going to be as "tight" performance-wise.
4. Direct-attached NVMe > networked-attached block storage. I mean... yeah.
We're making Scylla even faster now, so you might want to check out our blogs on Project Circe:
Yes, a number of articles about these newer TLC drives talk about it. The end result is that an empty drive is going to benchmark considerably differently from one that is 99% full of incompressible files.
Thanks for sharing this article - I found it very insightful. I've seen similar ideas being floated around before, and they often seem to focus on what software can be added on top of an already fairly complex solution (while LSM can appear to be conceptually simple, its implementations are anything but).
To me, what the original article shows is an opportunity to remove - not add.
Reminds me of the Solid-State Drive checkbox that VirtualBox has for any VM disks. Checking it will make sure that the VM hardware emulation doesn't wait for the filesystem journal to be written, which would normally be advisable with spinning disks.
If you think about it from the perspective of the authors of large-scale databases, linear access is still a lot cheaper than random access in a datacenter filesystem.
Plug for a post I wrote a few years ago demonstrating nearly the same result but using only a single CPU core: https://spdk.io/news/2019/05/06/nvme/
This is using SPDK to eliminate all of the overhead the author identified. The hardware is far more capable than most people expect, if the software would just get out of the way.
When I have more time again, I'll run fio with the SPDK plugin on my kit too. And I would be interested in seeing what happens when doing 512B random I/Os.
The system that was tested there was PCIe bandwidth constrained because this was a few years ago. With your system, it'll get a bigger number - probably 14 or 15 million 4KiB IO per second per core.
But while SPDK does have an fio plug-in, unfortunately you won't see numbers like that with fio. There's way too much overhead in the tool itself. We can't get beyond 3 to 4 million with that. We rolled our own benchmarking tool in SPDK so we can actually measure the software we produce.
Since the core is CPU bound, 512B IO are going to net the same IO per second as 4k. The software overhead in SPDK is fixed per IO, regardless of size. You can also run more threads with SPDK than just one - it has no locks or cross thread communication so it scales linearly with additional threads. You can push systems to 80-100M IO per second if you have disks and bandwidth that can handle it.
Yeah, that's what I wondered - I'm OK with using multiple cores; would I get even more IOPS when doing smaller I/Os? Is the benchmark tool you used part of the SPDK toolkit (and easy enough to run)?
Whether you get more IOPs with smaller I/Os depends on a number of things. Most drives these days are natively 4KiB blocks and are emulating 512B sectors for backward compatibility. This emulation means that 512B writes are often quite slow - probably slower than writing 4KiB (with 4KiB alignment). But 512B reads are typically very fast. On Optane drives this may not be true because the media works entirely differently - those may be able to do native 512B writes. Talk to the device vendor to get the real answer.
For at least reads, if you don't hit a CPU limit you'll get 8x more IOPS with 512B than you will with 4KiB with SPDK. It's more or less perfect scaling. There's some additional hardware overheads in the MMU and PCIe subsystems with 512B because you're sending more messages for the same bandwidth, but my experience has been that it is mostly negligible.
The benchmark builds to build/examples/perf and you can just run it with -h to get the help output. Random 4KiB reads at 32 QD to all available NVMe devices (all devices unbound from the kernel and rebound to vfio-pci) for 60 seconds would be something like:
perf -q 32 -o 4096 -w randread -t 60
You can specify to test only specific devices with the -r parameter (by BUS:DEVICE:FUNCTION, essentially). The tool can also benchmark kernel devices. Using -R will turn on io_uring (otherwise it uses libaio), and you simply list the block devices on the command line after the base options.
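Something like this, I believe (device paths are examples):
$ perf -q 32 -o 4096 -w randread -t 60 -R /dev/nvme0n1 /dev/nvme1n1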
Yeah, this has been going on for a while. Before SPDK it was done with custom kernel bypasses and fast InfiniBand/FC arrays. I was involved with a similar project in the early 2000s, where at the time the bottleneck was the shared Xeon bus, and then it moved to the PCIe bus with Opterons/Nehalem+. In our case we ended up spending a lot of time tuning the application to avoid cross-socket communication as well, since that could become a big deal (of course after careful card placement).
But SPDK has a problem you don't have with bypasses and io_uring, in that it needs the IOMMU enabled, and that can itself become a bottleneck. There are also issues for some applications that want to use interrupts rather than poll everything.
What's really nice about io_uring is that it sort of standardizes a large part of what people were doing with bypasses.
Yeah, I used the formally incorrect GB in the title when I tried to make it look as simple as possible... GiB just didn't look as nice in the "marketing copy" :-)
I may have missed using the right unit in some other sections too. At least I hope that I've conveyed that there's a difference!
> For final tests, I even disabled the frequent gettimeofday system calls that are used for I/O latency measurement
I was knocking up some profiling code and measured the performance of gettimeofday as a proof-of-concept test.
The performance difference between running the test on my personal desktop Linux VM versus running it on a cloud instance Linux VM was quite interesting (cloud was worse)
I think I read somewhere that cloud instances cannot use the VDSO code path because your app may be moved to a different machine. My recollection of the reason is somewhat cloudy.
Does anyone have advice on optimizing a Windows 10 system? I have a Haswell workstation (E5-1680 v3) that I find reasonably fast and that works very well under Linux. In Windows, I get lost. I tried to run the UserBenchmark suite, which told me I'm below median for most of my components. Is there any good advice on how to improve that? Which tools give good insight into what the machine is doing under Windows?
I'd like first to try to optimize what I have, before upgrading to the new shiny :).
Have you checked whether using the fio options (--iodepth_batch_*) to batch submissions helps? Fio doesn't do that by default, and I found that it can be a significant benefit.
In particular, submitting multiple requests at once can amortize the cost of ringing the NVMe doorbell (the expensive part, as far as I understand it) across multiple requests.
I tested various fio options, but didn't notice this one - I'll check it out! It might explain why I still kept seeing lots of interrupts raised even though I had enabled the I/O completion polling instead, with io_uring's --hipri option.
edit: I ran a quick test with various IO batch sizes and it didn't make a difference - I guess because thanks to using io_uring, my bottleneck is not in IO submission, but deeper in the block IO stack...
I think on recent kernels, using the hipri option doesn't get you interrupt-free polled IO unless you've configured the nvme driver to allocate some queues specifically for polled IO. Since these Samsung drives support 128 queues and you're only using a 16C/32T processor, you have more than enough for each drive to have one poll queue and one regular IO queue allocated to each (virtual) CPU core.
It's terribly documented :(. You need to set nvme.poll_queues to the number of queues you want before the disks are attached, i.e. either at boot, or you need to set the parameter and then cause the NVMe devices to be rescanned (you can do that in sysfs, but I can't immediately recall the steps with high confidence).
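Roughly something like this, I think (the sysfs remove/rescan dance is one way to re-probe the device; exact steps may vary by kernel):
# at boot: add nvme.poll_queues=8 to the kernel command line, or at runtime:
$ echo 8 | sudo tee /sys/module/nvme/parameters/poll_queues
$ echo 1 | sudo tee /sys/class/nvme/nvme0/device/remove   # detach the PCIe device...
$ echo 1 | sudo tee /sys/bus/pci/rescan                   # ...and re-probe it with poll queues allocated
$ cat /sys/block/nvme0n1/queue/io_poll                    # should now report 1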
> I tested various fio options, but didn't notice this one - I'll check it out! It might explain why I still kept seeing lots of interrupts raised even though I had enabled the I/O completion polling instead, with io_uring's --hipri option.
I think that should be independent.
> edit: I ran a quick test with various IO batch sizes and it didn't make a difference - I guess because thanks to using io_uring, my bottleneck is not in IO submission, but deeper in the block IO stack...
It probably won't get you drastically higher speeds in an isolated test - but it should help reduce CPU overhead. E.g. on one of my SSDs
fio --ioengine io_uring --rw randread --filesize 50GB --invalidate=0 --name=test --direct=1 --bs=4k --numjobs=1 --registerfiles --fixedbufs --gtod_reduce=1 --iodepth 48
uses about 25% more CPU than when I add --iodepth_batch_submit=0 --iodepth_batch_complete_max=0. But the resulting iops are nearly the same as long as there are enough cycles available.
This is via filesystem, so ymmv, but the mechanism should be mostly independent.
Nice follow-up @ttanelpoder to "RAM is the new disk" (2015)[1], which we talked about not even two weeks ago!
I was quite surprised to hear in that thread that AMD's Infinity Fabric was so oversubscribed. There's 256GB/s of PCIe on a 1P, but it seems like this 66GB/s is all the fabric can do. A little under a 4:1 oversubscription!
I'd been going off this link[1] from the previous "RAM is the new disk" thread, but I think last time I read it I'd only counted one Infinity Fabric Inter-Socket link on the 1P diagram (which provides the PCIe). On review, I'm willing to bet that the PCIe lanes aren't all sharing the one IFIS. The diagram is there to give an idea, not to show the actual configuration.
I love articles like these, taking a deep dive into achieving the absolute best on neglected metrics like IO. I am trying to get very high resolution (~ 2 * 10^6 samples/s) voltage and current measurements of a sensor for a control system. Has anyone tried that? Should it be done through PCIe?
When I bought a bunch of NVMe drives, I was disappointed with the maximum speed I could achieve with them, given my knowledge and the time I had available back then. Thanks for making this post; it gives me more points of insight into the problem.
I'm on the same page with your thesis that "hardware is fast and clusters are usually overkill," and disk I/O was a piece that I hadn't really figured out yet despite making great strides in the software engineering side of things. I'm trying to make a startup this year and disk I/O will actually be a huge factor in how far I can scale without bursting costs for my application. Good stuff!
> Shouldn’t I be building a 50-node cluster in the cloud “for scalability”? This is exactly the point of my experiment - do you really want to have all the complexity of clusters or performance implications of remote storage if you can run your I/O heavy workload on just one server with local NVMe storage?
Anyone have a story to share about their company doing just this? "Scale out" has basically been the only acceptable answer across most of my career. Not to mention High Availability.
You can get high availability without a "distributed system"; just an active/passive failover cluster may be enough for some requirements. Even failover (sometimes seamless) on a VMware cluster can help with planned maintenance scenarios without downtime, etc.
Another way of achieving HA together with satisfying disaster recovery requirements is replication (either app-level or database log replication, etc). So, no distributed system is necessary unless you have legit scaling requirements.
If you work on ERP-like databases for traditional Fortune 500-like companies, few people run such "sacred monolith" applications on modern distributed NoSQL databases, it's all Oracle, MSSQL or some Postgres nowadays. Data warehouses used to be all Oracle, Teradata too - although these DBs support some cluster scale-out, they're still "sacred monoliths" from a different era (they are still doing - what they were designed for - very well). Now of course Snowflake, BigQuery, etc are taking over the DW/analytics world for new greenfield projects, existing systems usually stay as they are due to lock-in & extremely high cost of rewriting decades of existing reports and apps.
U.2 form factor drives (also NVMe protocol) can still achieve higher IOPS (particularly writes) than M.2 form factor drives (especially M.2 2280), with higher durability, but you'll need your own controllers, which are sparse on the market for the moment. Throughput (MB/sec, not IOPS) will be about the same, but the U.2 drives can sustain it for longer.
U.2 means more NAND to parallelize over, more spare area (and higher overall durability), potentially larger DRAM caches, and a far larger area to dissipate heat. Plus it has all the fancy bleeding-edge features you aren't going to see on consumer-grade drives.
-- -----
The big issue with U.2 for "end user" applications like workstations is you can't get drivers from Samsung for things like the PM1733 or PM9A3 (which blow the doors off the 980 Pro, especially for writes and $/GB, plus other neat features like Fail-In-Place) unless you're an SI, in which case you also co-developed the firmware. The same goes for SanDisk, KIOXIA and other makers of enterprise SSDs.
The kicker is that enterprise U.2 drives are about the same $/GB as SATA drives, but being NVMe PCIe 4.0 x4, they blow the doors off just about everything. There's also the EDSFF, NF1 and now E1.L form factors, but U.2 is very prevalent. Enterprise SSDs are attractive as that's where the huge volume is (hence the low $/GB), but end-user support is really limited. You can use "generic drivers", but you won't see anywhere near the peak performance of the drives.
The good news is both Micron and Intel have great support for end users, where you can get optimized drivers and updated firmware. Intel has the D7-P5510 probably hitting VARs and some retail sellers (maybe Newegg) within about 60 days. Similar throughput to the Samsung drives, far more write IOPS (especially sustained), lower latencies, FAR more durability (with a big warranty), far more capacity, and not too bad a price (looking like ~$800USD for 3.84TB with ~7.2PB of warrantied writes over 5 years).
-- -----
My plan once Genesis Peak (Threadripper 5XXX) hits is four 3.84TB Intel D7-P5510s in RAID10, connected to a HighPoint SSD7580 PCIe 4.0 x16 controller. Figure ~$4,000 for a storage setup of ~7.3TB usable space after formatting, ~26GB/sec peak reads, ~8GB/sec peak writes, with 2.8M 4K read IOPS, 700K 4K write IOPS, and ~14.3PB of warrantied write durability.
How would a model-specific driver for something that speaks NVMe even work? Is it for Linux? Is it open? Is it just modifications to the stock Linux NVMe driver that take some drive specifics into account? Or is it some stupid proprietary NVMe stack?
I think he may have meant you can't get the drives, not the drivers. Samsung, Kioxia, etc. enterprise NVMe SSDs work fine with standard Linux NVMe drivers and I don't think they offer custom NVMe drivers except possibly for Windows. The problem is that their enterprise drives mostly aren't sold at retail. If you aren't buying them as part of a big B2B deal, you simply can't acquire the hardware.
It was less than a quarter century ago, in 1997, that Microsoft and Compaq launched the TerraServer, which was a wordplay on terabyte -- it stored a terabyte of data and it was a Big Deal. Today that's not storage, that's main RAM, unencumbered by NUMA.
Great article. Did you consider doing Optane tests? I built a 3990X workstation with all Optane drives and I get blazing fast access times, but 3GB/s top speeds. It might be interesting to look at them for these tests, especially in time-sensitive scenarios.
I have 2 Optane 905P M.2 cards and I intend to run some database engine tests, putting their transaction logs (and possibly temporary spill areas for sorts, hashes) on Optane.
When I think about Optane, I think about optimizing for low latency where it's needed and not that much about bandwidth of large ops.
Lovely article, zero fluff, tons of good content and modest to boot. Thank you for this write-up, I'll pass it around to some people who feel that the need for competent system administration skills has passed.
I wonder, is the increasing temperature of the M.2 NVMe disks affecting the measured performance? Or is the P620 cooling system efficient enough to keep the temperature of that many disks low?
Both quad-SSD adapters had fans on them and the built-in M.2 drives had heatsinks, right in front of one large chassis fan & air intake. I didn't measure the SSD temperatures, but the I/O rate didn't drop over time. I was bottlenecked by CPU when doing small I/O tests; I monitored the current MHz from /proc/cpuinfo to make sure that the CPU speeds didn't drop below their nominal 3.9 GHz (and they didn't).
Btw, even the DIMMs have dedicated fans and enclosure (one per 4 DIMMs) on the P620.
The ASUS one doesn't have its own RAID controller nor PCIe switch onboard. It relies on the motherboard-provided PCIe bifurcation and if using hardware RAID, it'd use AMD's built-in RAID solution (but I'll use software RAID via Linux dm/md). The HighPoint SSD7500 seems to have a proprietary RAID controller built in to it and some management/monitoring features too (it's the "somewhat enterprisey" version)
The HighPoint card doesn't have a hardware RAID controller, just a PCIe switch and an option ROM providing boot support for their software RAID.
PCIe switch chips were affordable in the PCIe 2.0 era when multi-GPU gaming setups were popular, but Broadcom decided to price them out of the consumer market for PCIe 3 and later.
This article focuses on IOPS and throughput, but what is also important for many applications is I/O latency, which can be measured with ioping (apt-get install ioping). Unfortunately, even 10x PCIe 4.0 NVMe do not provide any better latency than a single NVMe drive. If you are constrained by disk latency then 11M IOPS won't gain you much.
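E.g. a quick check (the device path is an example; the default mode is reads):
$ sudo ioping -c 10 -D /dev/nvme0n1    # -D = direct I/O, bypassing the page cache
$ sudo ioping -c 10 -D /mnt/somedir    # or against a directory on the filesystem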
> Does this come up in practice? What kind of use cases suffer from disk latency?
One popular example is HFT.
And from my experience on a desktop PC it is better to disable swap and have the OOM killer do its work, instead of swapping to disk, which makes my system noticeably laggy, even with a fast NVMe drive.
I'm somewhat curious what happens to the long-standing 4P/4U servers from companies like Dell and HP. Ryzen/EPYC has really made going past 2P/2U a much rarer need.
Indeed, 128 EPYC cores in 2 sockets (with total 16 memory channels) will give a lot of power. I guess it's worth mentioning that the 64-core chips have much lower clock rate than 16/32 core ones though. And with some expensive software that's licensed by CPU core (Oracle), you'd want faster cores, but possibly pay a higher NUMA price when going with a single 4 or 8 sockets machine for your "sacred monolith".
At least when I was actively looking at hardware (2011-2018), 4 socket Xeon was available off the shelf, but at quite the premium over 2 socket Xeon. If your load scaled horizontally, it still made sense to get a 2P Xeon over 2x 1P Xeon, but 2x 2P Xeon was way more cost efficient than a 4P Xeon. 8P or 16P seemed to exist, but maybe only in catalogs.
I'm not really in the market anymore, but Epyc looks like 1P is going to solve a lot of needs, and 2P will be available at a reasonable premium, but 4P will probably be out of reach.
There always seems to be buyers for more exotic high end hardware. That market has been shrinking and expanding, well since the first computer, as mainstream machines become more capable and people discover more uses for large coherent machines.
But users of 16 socket machines, will just step down to 4 socket epyc machines with 512 cores (or whatever). And someone else will realize that moving their "web scale" cluster from 5k machines, down to a single machine with 16 sockets results in lower latency and less cost. (or whatever).
Would love to see some very dense blade-style Ryzen offerings. The 4x 2P-nodes-in-2U format is great. A good way to share power supplies, fans, and chassis, and ideally a multi-homed NIC too.
Turn those sleds into blades though, put 'em on their side, and go even denser. It should be a way to save costs, but density alas is a huge upsell, even though it should be a way to scale costs down.
You might be able to buy a smaller server, but the rack density doesn't necessarily change. You still have to worry about cooling and power, so lots of DCs would have 1/4 or 1/2 racks.
Sure. I wasn't really thinking of density, just the interesting start of the "death" of 4 socket servers. Being an old-timer, it's interesting to me because "typical database server" has been synonymous with 4P/4U for a long, long time.
I've been thinking about this. Would traditional co-location (e.g. 2x 2U from DELL) in a local data center be cheaper if e.g. you're serving local (country-wise) market?
Depends on how long you need the server, and the ownership model you've chosen to pursue for it.
If you purchase a server and stick it in a co-lo somewhere, and your business plans to exist for 10+ years — well, is that server still going to be powering your business 10 years from now? Or will you have moved its workloads to something newer? If so, you'll probably want to decommission and sell the server at some point. The time required to deal with that might not be worth the labor costs of your highly-paid engineers. Which means you might not actually end up re-capturing the depreciated value of the server, but instead will just let it rot on the shelf, or dispose of it as e-waste.
Hardware leasing is a lot simpler. When you lease servers from an OEM like Dell, there's a quick, well-known path to getting the EOLed hardware shipped back to Dell and the depreciated value paid back out to you.
And, of course, hardware renting is simpler still. Renting the hardware of the co-lo (i.e. "bare-metal unmanaged server" hosting plans) means never having to worry about the CapEx of the hardware in the first place. You just walk away at the end of your term. But, of course, that's when you start paying premiums on top of the hardware.
Renting VMs, then, is like renting hardware on a micro-scale; you never have to think about what you're running on, as — presuming your workload isn't welded to particular machine features like GPUs or local SSDs — you'll tend to automatically get migrated to newer hypervisor hardware generations as they become available.
When you work it out in terms of "ten years of ops-staff labor costs of dealing with generational migrations and sell-offs" vs. "ten years of premiums charged by hosting rentiers", the pricing is surprisingly comparable. (In fact, this is basically the math hosting providers use to figure out what they can charge without scaring away their large enterprise customers, who are fully capable of taking a better deal if there is one.)
> If you purchase a server and stick it in a co-lo somewhere, and your business plans to exist for 10+ years — well, is that server still going to be powering your business 10 years from now? Or will you have moved its workloads to something newer?
Which, if you have even the remotest fiscal competence, you'll have funded by using the depreciation of the book value of the asset after 3 years.
The Linux page cache doesn't use hugepages, but when doing direct I/O into application buffers it would definitely make sense to use hugepages for that. I plan to run tests on various database engines next - and many of them support using hugepages (for shared memory areas at least).
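A quick way to experiment with that in fio, as a sketch (sizes and the device path are arbitrary; the huge pages need to be reserved first):
$ sudo sysctl vm.nr_hugepages=1024     # reserve 1024 x 2MB huge pages
$ fio --name=hugepage-test --ioengine=io_uring --rw=randread --bs=4k --iodepth=32 --direct=1 --iomem=shmhuge --filename=/dev/nvme0n1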