Eric Brandwine (VP/DE @ AWS) said publicly in 2019 that EC2 had never scheduled different tenants on the same physical core at the same time, even before we learned about things like MDS.
How does that work for burstable VMs? Surely they must be oversubscribed on a per-core basis? Are they sharing cores, but doing scheduling at a very fine granularity (e.g. millisecond)?
I don't think it really matters; the goalposts will move to LLC attacks. As long as you have caching involved, I wouldn't bet anything can run "safely".
I wonder if per-tenant cache structures might become more popular in the future. It isn't unimaginable that different tenants will be running different things anyway, so split caches might still keep a reasonable amount of effectiveness.
That's a really good talk. Wish I'd watched it around the time the speculative execution vulnerabilities were all the talk.
I'm not qualified in computer science or programming but I could easily follow along with his explanation about challenging CPU architectural concepts. Gives me a great deal of confidence that AWS not only know what they're doing, but can communicate it well to management.
re:Invent 2019 was a magical place. Watching many of the videos from that conference helped my career a lot. For the most part, the quality of the content is outstanding.
I've been thinking along similar lines, and also wondering what percentage of workloads still see significant benefits from SMT. It made a lot more sense in 2002, when it was still common to find servers with only one core, but now even a phone has half a dozen cores, laptops are pushing into the tens of cores, and servers are an order of magnitude beyond that.
I remember disabling SMT in the mid-2000s because it was a performance degradation for some memory-sensitive code one of my users was running. I'd be really curious what percentage of workloads are not limited by I/O but still stay below the level where resource contention between threads lowers performance.
DDR4 RAM has roughly the same latency (in nanoseconds) as DDR2 RAM. DDR5 will probably have about the same latency as well.
In 2002, our processors were so narrow they could only execute maybe 2 instructions per clock tick. Today, we've got CPUs that hit 4 (x86 L1 cache), 6 (x86 uop cache), or 8 (Apple M1) instructions per tick.
With many problems still memory-latency bound (aka: any Java code going through blah->foo()->bar()... look at all those pointer indirections!!), it makes sense to "find other work to do" while the CPU stalls on memory latency.
The RAM supports higher bandwidth. The CPU supports more instructions-per-tick. Spending effort to find work to do just makes sense.
Oh, I'm aware of that concern but I was wondering how many benchmarks show a significant win. After you consider the impacts on the processor caches, and the fraction of apps which are performance sensitive but still end up limited on cache-unfriendly random memory access, it seems like it could be an increasingly niche concern where most of the code people run either isn't performance sensitive or has gotten better about memory access patterns after decades of hardware trending in that direction.
Apple's highly-competitive performance with processors which don't support SMT suggests that this is not such a clear win, especially given the non-trivial cost of the security mitigations.
Thanks - that's the kind of spread I was curious about, since there are some other factors, like the cost of Spectre mitigations, which have added additional dimensions to the comparisons.
It is very rare for programs to be memory-bandwidth bound. It usually takes a lot of optimization just to get to that point, as well as some disregard for memory bandwidth on top of that (such as looping through large arrays, only doing one simple calculation to each index, then doing that on many cores).
The vast majority of what people run is memory latency bound and in those cases using extra threads makes sense so that the explicit parallelism can compensate for memory latency.
> (such as looping through large arrays, only doing one simple calculation to each index, then doing that on many cores).
...which perfectly describes a parallelized mat-vec-mult. Yes, that's not common in most applications, but I'd have a hard time naming a more basic operation in scientific (and related) computations.
We are saying the same thing here, though I think you are missing the point that this is all a response to someone asking if SMT is useful anymore since there are many cores in almost every CPU.
The answer is that it is absolutely still useful since your example is niche and most software/systems can still benefit from being able to work around memory latency with more threads.
There are different ways to do that. Apple put effort into its decoders, which allowed it to increase single-threaded performance. Intel realistically cannot do that due to limitations of the x86 instruction format, so they worked around it via hyper-threading, which allows them to decode more in parallel.
It has decreased, but halving a latency over 20-25 years is not exactly the kind of progress people intuitively associate with semiconductors.
What has decreased quite a lot, though, is the transfer time, which is of course stacked on top of the CAS latency (CL). CPUs always have to fetch a full cache line (usually 64 bytes), and the time to get 64 bytes out of memory has more than halved each generation.
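As a back-of-the-envelope illustration of that second point (the transfer rates and burst length below are representative single-channel figures I'm assuming, not measurements):

    #include <stdio.h>

    /* Time to stream one 64-byte cache line over a 64-bit channel is roughly
       beats_per_line / transfer_rate; CAS latency comes on top of this. */
    int main(void) {
        struct { const char *name; double transfers_per_s; } gen[] = {
            {"DDR2-800",  800e6},
            {"DDR3-1600", 1600e6},
            {"DDR4-3200", 3200e6},
        };
        const double beats = 8.0; /* 64 bytes / 8 bytes per beat on a 64-bit bus */
        for (int i = 0; i < 3; i++)
            printf("%-10s ~%.1f ns per 64-byte burst\n",
                   gen[i].name, beats / gen[i].transfers_per_s * 1e9);
        return 0;
    }

That works out to roughly 10 ns, 5 ns and 2.5 ns per burst across those generations, while CL measured in nanoseconds has barely moved.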
I would say most workloads benefit from SMT. I've done work disabling and enabling it and rerunning benchmarks to understand the impact.
As a gross overgeneralization, disabling it can improve latency for workloads that are compute-intensive. That is not most workloads, and almost everything will see some improvement and an increase in throughput from having SMT turned on at lower utilizations.
As you push the workload beyond 60-70% utilization, the wins from SMT fall away, but they generally don't hurt you. What happens is that your processor overperforms at lower utilizations, and as load rises the performance curve bends back toward where it would have been without SMT.
The only workload I've done lately that buried all the cores was a significantly parallel CSV import into clickhouse from compressed zips of multiple CSVs, while filtering the CSVs with python. Running nproc=cores was 5% slower than nproc=cores+smt in overall throughput.
I disabled SMT in 2007 after reading Colin Percival's thoughts on Intel's hyperthreading. I have disabled it on every computer I have had since, since the fixes didn't inspire much confidence back then.
I have done benchmarks from time to time and it doesn't really matter much for my workloads.
If you are trying to prevent attacker code communicating with another piece of attacker code on another core: No. It isn't secure. There are a lot of side channels.
If you are trying to prevent attacker code figuring out what code is running on another core: No. Patterns of L3 cache eviction and memory bus contention will probably leak sufficient information to have a good guess what another core is up to.
If you are trying to prevent attacker code figuring out data in another core's memory or registers: Probably, yes. As long as the core is doing something with no data dependent branches, or instructions with data dependent timings, it should be safe.
Very little code is written to have no data dependent branches, or instructions with data dependent timings. Some well-written cryptography code has that, but that's about it.
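For a concrete picture of what "no data-dependent branches" means, here's a toy comparison (a sketch, not taken from any particular crypto library):

    #include <stddef.h>
    #include <stdint.h>

    /* Early-exit compare: the loop length and branch pattern depend on where
       the first mismatch is, so execution time leaks facts about the secret. */
    int leaky_compare(const uint8_t *secret, const uint8_t *guess, size_t n) {
        for (size_t i = 0; i < n; i++) {
            if (secret[i] != guess[i])  /* data-dependent branch */
                return 0;
        }
        return 1;
    }

    /* Constant-time compare: always touches all n bytes and takes no
       secret-dependent branches; only the final result depends on the data. */
    int ct_compare(const uint8_t *secret, const uint8_t *guess, size_t n) {
        uint8_t diff = 0;
        for (size_t i = 0; i < n; i++)
            diff |= secret[i] ^ guess[i];
        return diff == 0;
    }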
> Very little code is written to have no data dependent branches, or instructions with data dependent timings
Ooh I have some idea on how to use data dependent branches in your own code to leak critical information through side channels (Spectre) but how do you leak information from others' data dependent branches and instructions?
Imagine code you want to attack has an array lookup based on some secret data:
x = array[secretdata].
The entry in the array will be loaded into the cpu core via the L3 cache. If the L3 cache is already full, then one entry will have to be evicted from the cache. The evicted entry will depend on the memory address that is being accessed.
Another CPU, running an attacker's code, can repeatedly try to load stuff from L3 and detect what was evicted and when, to learn the secretdata.
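To make the probing half concrete, here's a rough sketch in the style of a Flush+Reload probe (a close cousin of the L3-eviction approach described above; real prime-and-probe needs eviction sets sized to the actual L3). The buffer size and cycle threshold are made-up placeholders; a real exploit has to calibrate against the specific CPU and do far more careful statistics:

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtscp, _mm_clflush (x86 GCC/Clang) */

    #define LINE   64
    #define NLINES 256       /* hypothetical: one probe line per possible byte value */

    static uint8_t probe[NLINES * LINE];

    /* Time one load; a fast load means the line was already in cache, i.e.
       someone (the victim, or a speculative access) touched it. */
    static uint64_t time_load(volatile uint8_t *p) {
        unsigned aux;
        uint64_t t0 = __rdtscp(&aux);
        (void)*p;
        uint64_t t1 = __rdtscp(&aux);
        return t1 - t0;
    }

    int main(void) {
        /* Flush step: push all of our probe lines out of the cache. */
        for (int i = 0; i < NLINES; i++)
            _mm_clflush(&probe[i * LINE]);

        /* ... victim runs x = array[secretdata] somewhere around here ... */

        /* Reload step: see which lines come back fast and infer, from the
           timing, which addresses got touched. */
        for (int i = 0; i < NLINES; i++) {
            uint64_t dt = time_load(&probe[i * LINE]);
            if (dt < 100)   /* placeholder threshold; must be calibrated per CPU */
                printf("line %d looks cached (%llu cycles)\n",
                       i, (unsigned long long)dt);
        }
        return 0;
    }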
I'm a simple guy, I trust the kernel and its devs. If kernel says to me via /sys/ that on this CPU I have to disable SMT if I want to get rid of "Vulnerable" messages, I do so.
The kernel considers recent Intel CPUs not vulnerable even with SMT; that's good enough for me.
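Those verdicts live in /sys/devices/system/cpu/vulnerabilities, one file per issue; `grep . /sys/devices/system/cpu/vulnerabilities/*` shows them all. The same thing as a small C sketch:

    #include <dirent.h>
    #include <stdio.h>

    /* Print the kernel's verdict for each known CPU vulnerability. Entries
       read "Not affected", "Mitigation: ...", or "Vulnerable...". */
    int main(void) {
        const char *dir = "/sys/devices/system/cpu/vulnerabilities";
        DIR *d = opendir(dir);
        if (!d) { perror(dir); return 1; }

        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            if (e->d_name[0] == '.')
                continue;
            char path[512], line[256];
            snprintf(path, sizeof path, "%s/%s", dir, e->d_name);
            FILE *f = fopen(path, "r");
            if (!f)
                continue;
            if (fgets(line, sizeof line, f))
                printf("%-20s %s", e->d_name, line); /* line keeps its newline */
            fclose(f);
        }
        closedir(d);
        return 0;
    }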
Actually, that's a common misconception. When "jump_off_a_cliff_ratio" is set to a value other than "-1", then "jump_off_a_cliff" is ignored. The ratio is divided by 20000, so if you set it to 10000 your system will only jump off a cliff if 50% of friends do so.
The default setting is "15000", but some distros change it (e.g. Arch sets it to "0" so that your system always jumps first--more performant, but less safe).
Conveniently, none of this is documented anywhere, all the other information you can find online is years out of date, and most sysadmins just accept the risk of cliff-jumping rather than trying to tune it manually.
This is a red herring. Where are all the people that are unsure of the security of virtualization? There is some Xen vulnerability every year that is the equivalent of the house burning down, but where is the Azure support article that suggests disabling Xen? (Oh right, you can't.)
I mean, the article is totally right of course, you shouldn't run untrusted code - THAT INCLUDES THE FREAKING HYPERVISOR.
I think you inadvertently hit on why this is a good discussion to have: security is a concern across the entire stack and attackers have many points where they can successfully compromise your system. It's much harder to replace Xen than it is to disable SMT, so it's much more feasible for someone to consider the latter since it's a single configuration point with very little chance of side-effects other than cost whereas touching Xen would have massive ripple effects for ops & potentially compatibility.
> so it's much more feasible for someone to consider the latter
And conversely it's much more beneficial for an attacker to consider the former (attacking the hypervisor). This exacerbates the situation by making the red team more aggressive on one side whilst the blue team is busy debating whether they should disable SMT.
I mean, that's always possible but it was always the case that the blue team needs to worry about the entire stack and one of those two choices has several orders of magnitude more work involved. The blue team could disable SMT and switch back to debating Xen with almost no discernible impact on the time needed for the latter and an entire branch of the threat tree removed.
There's definitely a bias in security toward "sexy" vulnerabilities that are novel and creative but hard to exploit in practice vs boring "forgot to bounds check" or "off by one logic error" type vulnerabilities that are pervasive.
SMT (a.k.a. hyperthreading) is kind of a security minefield, but so is out of order execution and crufty instruction sets like x86_64. But vulnerabilities of this sort tend to be hard to exploit in the wild. Not saying you shouldn't be concerned about them, but how concerned you are should depend on your threat model. If I were really paranoid and could pick I'd choose something like a low-power simple ARM or RISC-V core to do critical computation. Alternatively you could run your secure compute on any system as long as nobody is allowed to run any other code on that system or have physical access to it.
BTW I am deeply surprised that there has never been a cloud hypervisor doomsday vulnerability that has brought down AWS or some other huge provider. Back when virtualization in the cloud got big I would have bet serious money that this would have happened by now.
To quote James Mickens, "large swathes of the security community are fixated on avant-garde horrors such as the fact that, during solar eclipses, pacemakers can be remotely controlled with a garage door opener and a Pringles can."
> BTW I am deeply surprised that there has never been a cloud hypervisor doomsday vulnerability that has brought down AWS or some other huge provider. Back when virtualization in the cloud got big I would have bet serious money that this would have happened by now.
Meh, the vast majority of rational hackers will just sell a zero day capable of doing that either to the vendor themselves or to a nation state for several million dollars. I suppose they could bring down AWS for "fun", or try to short Amazon's stock and make a personal profit, but the consequences of getting caught are devastating. Selling the vuln is practically risk free and nets life changing money.
>BTW I am deeply surprised that there has never been a cloud hypervisor doomsday vulnerability that has brought down AWS or some other huge provider. Back when virtualization in the cloud got big I would have bet serious money that this would have happened by now.
I think at this point anyone that's capable of launching such an attack would much rather it be as transparent as possible as they'd have far, far more to gain from not disrupting services. Aside from ransomware attacks, I think the general trend has always been towards those type of attacks as they're way more likely to be profitable
> SMT (a.k.a. hyperthreading) is kind of a security minefield, but so is out of order execution and crufty instruction sets like x86_64. But vulnerabilities of this sort tend to be hard to exploit in the wild.
I'm not sure this is even true, so much as that it's just easier to do it some other way.
You could use speculative execution by measuring timing to extract secrets to get privilege escalation, but then you have to sit there with the CPU at 100% for ten minutes while doing timing statistics. Whereas you pull the latest 0-day off a list somewhere, and the target is going to install the patch tomorrow but hasn't yet, and that pops the system in less than a second. So people do that instead.
> BTW I am deeply surprised that there has never been a cloud hypervisor doomsday vulnerability that has brought down AWS or some other huge provider. Back when virtualization in the cloud got big I would have bet serious money that this would have happened by now.
The assumption there is that the goal of somebody who got in would be an externally visible denial of service, as opposed to e.g. data exfiltration which might never be detected.
> You could use speculative execution by measuring timing to extract secrets to get privilege escalation, but then you have to sit there with the CPU at 100% for ten minutes while doing timing statistics.
Code wins arguments:
1. provide the exploit code that does it, i.e. I go to a website and it reads content that I have in a different browser tab.
2. provide the exploit code that while running on one VM on a host that does not have mitigations enabled reads a value of a file in a different VM.
The VM ones don't really have anything to do with it being a VM. Being a separate VM just doesn't save you. The VM is still a thread running on the same core as the attacker.
The "If we play very hard with user browser we can sometimes get something out of user browser. We do not know what that something is, we need to know what kind of a system user has, we should profile profile it, etc." is not impressive as a demo. I cannot make it work under a Debian 10, Chromium and i4970K and i7-6500U. This not not Elon Musk throwing a brick and brick breaking a windshield. It is not having bricks to throw.
The demo is "Pull up website A into a tab 1. Go to website B in a tab 2. Website B says 'You have website A' opened". That's a demo.
> The VM ones don't really have anything to do with it being a VM. Being a separate VM just doesn't save you.
If that's the case, Google should be able to develop a demo attack where the content of a file in VM A is readable in VM B in a jiffy.
These are timing attacks. They're heavily dependent on the specifics of the target. The code has to know what it's measuring in minute detail, because the timing is affected by nearly everything. The exact instructions being executed by the target program, the size and associativity of the processor caches, everything.
That doesn't mean you can't do it. It does mean you can't create a generic exploit that works against all software and all processors, instead of one targeting a specific application on a specific model of CPU. Then people say "it doesn't work for me" because they're using different code or hardware and the code has to be tailored for that, which is work, which nobody is going to do for free to appease hecklers.
> If that's the case Google should be able to develop an demo attack where content of a file in a VM A be readable in a VM B in a jiffy.
In a VM doing what? It can't just be idle. It has to be executing some code whose timing you can measure to extract its secrets, and the secrets in the address space of that process have to be useful in order to get the file, e.g. a password that can be used to sign into the VM. Then the exploit has to be tailored to that software and hardware.
The fact that nobody is willing to do this over a message board post is not proof that nobody can do it if there was a few thousand dollars in it for them. These exploits are not the low hanging fruit; that isn't the same thing as being impossible.
> These are timing attacks. They're heavily dependent on the specifics of the target. The code has to know what it's measuring in minute detail, because the timing is affected by nearly everything. The exact instructions being executed by the target program, the size and associativity of the processor caches, everything.
So they are absolutely positively irrelevant in the real world.
> In a VM doing what? It can't just be idle. It has to be executing some code whose timing you can measure to extract its secrets, and the secrets in the address space of that process have to be useful in order to get the file, e.g. a password that can be used to sign into the VM. Then the exploit has to be tailored to that software and hardware.
So again, they are absolutely irrelevant in the real world.
The security industry became an industry of chicken littles. 99.9999% of modern attacks were executed successfully because someone was running code that had SQL injections, direct variable substitution before calling system(), a lack of user input sanitization and validation, or yet another multi-gigabyte unaudited pile of junk dependencies a-la left-pad.js, not because someone mounted a timing attack. Except that dealing with those issues is not sexy, so instead we get everyone freaking out about some internet villain doing something (that no one else can do) to the CPU of a random Joe (the villain knows everything about Joe's machine and the processes running on it and can even control it) to mount a sophisticated attack, never mind that he could just make Joe install a piece of code that will run with elevated privileges for 0.0001% of the effort.
This is why execs are considering security people to be snake oil salesmen.
As expected from the "security" experts -- it is all chicken little "Sky is falling" talk and zero demonstrations.
This is why pretty much no one takes it seriously: it used to be that the fancy vulnerabilities came with a demo:
$ ./some-exploit-code
#
some-exploit-code won the argument. No one could argue that it did not matter as anyone who managed to get code executed on a target server got #. These days we just get lots of "OMG! We are all going to get pwned!" handwaving and zero demonstration that can be done in front of execs.
> BTW I am deeply surprised that there has never been a cloud hypervisor doomsday vulnerability that has brought down AWS or some other huge provider.
There have been multiple cases of significant vulnerabilities where cloud providers had to perform patching and restart their services. No one, to my knowledge, has exploited those maliciously, but the vulns have had impact.
This would be an issue where no virtualisation was done at all, perhaps in some sort of critical situation where the risk of virtualisation was unacceptable. So the subject of virtualisation is, well, a red herring.
Recent Linux patches allow the scheduler to only schedule processes with the same seed on the same core. The “seed” would be something tied to the user, or something else… I don’t think they’re merged yet… but that might be a solution.
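For reference, what eventually got merged is "core scheduling", driven through prctl(). A minimal sketch, assuming a kernel and headers new enough (roughly 5.14+) to provide the PR_SCHED_CORE constants; treat the details as something to verify against your kernel:

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <linux/prctl.h>
    #include <unistd.h>

    /* Assumption: kernel >= 5.14 with CONFIG_SCHED_CORE and uapi headers that
       define PR_SCHED_CORE. Tasks that share a "cookie" may run on sibling
       hyperthreads; tasks with different cookies are never co-scheduled. */
    int main(void) {
        /* Give this whole thread group its own core-scheduling cookie. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, getpid(),
                  PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) != 0) {
            perror("prctl(PR_SCHED_CORE)");
            return 1;
        }
        printf("core scheduling cookie created for pid %d\n", getpid());
        /* fork/exec the untrusted work here; children inherit the cookie */
        return 0;
    }

On a shared login server, the idea would be to hand each user's session its own cookie so two different users never end up on the same physical core at the same time, which is roughly the "seed" idea above.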
>> We're planning to get some high core count machines to be new compute machines in our environment of general multi-user Unix login servers
I think the "general multi-user Ubix login" thing assumes different users on the same CPU. Unless they're actually going to have more CPUs than users. It seems like sharing a CPU is actually less secure than sharing one via SMT because of all those SMT-specific exploits (in addition to whatever exploits might be possible due to CPU sharing).
Sharing an SMT is IMHO less secure, because the adversary can probe while the victim is running. Sharing a CPU means you can only probe the effects.
It would make sense to me for a traditional multiuser shell server to have a scheduling policy prohibiting SMT sharing between users. As long as the scheduler work is not expensive, it would seem almost no-cost vs disabling SMT in case all users only run a single thread, but you can still get whatever benefits SMT provides within a user's workload.
And... can each browser tab run as a different user? (or more likely, there's a way to have the browser opt in to some security policy that would effectively do the same thing).
I haven't followed it or tried it, but supposedly you can boot up a pure open source Darwin system. Certainly it has been very useful for me as a cross-platform developer to be able to browse its source and understand quirks of system calls etc. Are you saying that key parts you'd like to review are closed source? Hmm, the new file system is closed source, which is a clue, but I don't know to what extent Darwin is "different" from what macOS is really running...
I haven’t kept up with this too much lately, but to my knowledge there is very little documentation or working code to set up a Darwin environment from scratch with FOSS, and there aren’t any modern Darwin “distributions” because of this difficulty and lack of documentation and code.
I wish this weren’t the case, but whatever PureDarwin has right now is probably about the closest you can get to a Darwin environment that boots without nonfree code.
Leaving SMT enabled is most likely what you want to do. It's more performance, and the attack vector on a desktop machine for this is incredibly small. You'd need to run untrusted code (not hard, JavaScript is everywhere after all) but that JS code would need to have something to attack on a neighboring thread (unlikely) and know to attack that neighboring thread (even more unlikely) and not get mitigated by any of the browser spectre mitigations that have been in place for a while now (again, unlikely)
same. If someone wants to breach me they are going to use a zero day anyway, or use the days between zero and when I apply the patch. Or breach my windows box, or intercept a package or something. SIM swap, etc.
Now if only I could figure out how to disable Windows Defender reliably.
These days I mostly have no idea what the hell any of my equipment is running. There is a whole industry that popped up to write tools to work out what the hell you are running and work out how screwed you are. Doesn't feel healthy to me.
Well... aren't these attacks only a concern if you have adversarial code running on adjacent cores? So for a desktop user the threat model really doesn't change. Just do your best to keep malicious programs off of your machine. If someone can already run a speculative execution attack against your local machine, there are much easier attacks to run.
How closely have you followed the layers of mitigations which the major browser engines have implemented? I think a more interesting form of this question is asking how often people are exposed to malicious code which isn't covered by the browser sandboxes.
Browser vendors have been pretty quick to add mitigations, disable attack surface, etc. Someone comes up with a new way to attack systems via JS, but vendors are all headed in a safer direction.
People were doing PoCs before the browser makers nerfed the hell out of the timing APIs, especially from WebAssembly, and auto-update distributed the nerfs out.
Browser manufacturers changed the behavior of APIs like performance.now() [0] to add slight rounding 'errors' or limit the granularity of timing functions. That function used to report on the microsecond level in (most) browsers, but now is generally limited to around 1ms due to timing attacks.
Yes, like the SharedArrayBuffer that browsers disabled. There are probably roughly infinite ways to construct clocks. Closing the obvious leaks has been a stop gap while other mitigation techniques are rolling out, like site isolation.
'Nerf' is a term used for when game developers update a game and reduce the effectiveness of something. 'My sword attack ability was nerfed in the latest patch'. In this context I suppose the API capabilities was reduced in browser updates.
If these servers are "general multi-user environments", are you sure you're really getting any benefit at all out of SMT? It's never a magical 200% performance silver bullet, not even for highly specialized workloads.
The goal of SMT isn't to double performance; it's having more than one thread make better use of the execution units within a CPU core.
The example frequently given is running integer-heavy and floating-point-heavy code at the same time - they don't use the same execution units, so together they can better utilize the available resources.
In the real world it's more likely that one thread is waiting on memory access or some other part of the system and the other thread can proceed during that wait.
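If you want to measure it for your own workload, a rough harness is to pin two busy threads either to SMT siblings or to two separate physical cores and compare the run times. The CPU numbers below are assumptions; check /sys/devices/system/cpu/cpu0/topology/thread_siblings_list for your machine's actual sibling pairs:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Pin the calling thread to one CPU, then spin on a dummy integer kernel.
       Replace the loop body with something closer to your real workload. */
    static void *work(void *arg) {
        int cpu = (int)(intptr_t)arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);

        volatile uint64_t x = 0;
        for (uint64_t i = 0; i < 2000000000ull; i++)
            x += i ^ (x >> 3);
        return NULL;
    }

    int main(void) {
        /* Assumption: CPUs 0 and 1 are SMT siblings, CPUs 0 and 2 are distinct
           physical cores -- verify with thread_siblings_list before trusting it. */
        int smt_pair[2]  = {0, 1};
        int phys_pair[2] = {0, 2};
        int *pair = smt_pair;   /* switch to phys_pair for the comparison run */

        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, work, (void *)(intptr_t)pair[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        return 0;   /* compare wall-clock time of the two runs, e.g. with `time` */
    }

Build with `cc -O2 -pthread` and run each pinning choice under `time`; the gap between the two runs is roughly what SMT sharing costs (or buys) for that kernel.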
"(For reasons beyond the scope of this entry, we can assume that SMT is worthwhile for us.)"
This is also coming from utoronto.ca - so... I guess the question I would have is actually more along the lines of whether there are actually adverse effects from exploiting SMT to snoop on other processes?
Is this going to be a whole lot of engineering students doing finite element analysis homework? If so, what do you get out of an SMT exploit? There are probably a lot easier ways to cheat on your homework.
Are they doing analysis of medical data that hasn't yet been anonymized? Perhaps that's a bigger deal. Maybe.
1. There is no such thing as perfect security
2. Security is an economic decision. You have to know what you stand to lose before you decide if buying 2x the number of processors and disabling SMT is worth it.
I've found SMT to give a 20-25 percent performance improvement on parallel workloads using a lot of the same data, so the threads shouldn't be polluting each other's caches. Different users would obviously not get the same improvement.
I'm not sure 20-25 percent is even worth having SMT, given that it's like going from 4 cores to 5.
> I'm not sure 20-25 percent is even worth having SMT, given that it's like going from 4 cores to 5.
Can you think of another 25% bonus that costs nothing in terms of chip area? SMT doesn't add any new registers (the register files and retirement buffers are the same). SMT doesn't add cache, or pipelines, or anything.
SMT just requires the chip's resources to be partitioned up in some cases, so it's a complication to add to a chip, but it doesn't seem to take up much area at all.
Normally to get a 25% bonus, you'd have to double the L2 cache or something similar (and as the 3x L3 cache tests from Microsoft prove for Milan-X, even 300% L3 cache isn't anywhere close to 300% faster, despite taking up significantly more die area... albeit cheap die area in the form of 3D-stacked chiplets, but still, a lot more silicon is being thrown at the problem there... compared to nearly zero silicon in the case of SMT).
AMD Zen 2 has 180 registers in its 64-bit register file (an additional set of registers, also nearly 200, covers the 256-bit vector YMM registers). The 16 architectural registers (RAX, RBX, etc.) are mapped onto these 180 registers for out-of-order execution.
SMT is simply two threads splitting this 180-register file between each other.
---------
That's how out-of-order execution is accomplished: the "older" registers are remembered so that later code can be executed while earlier instructions are still in flight.
In practice, instructions like "xor rax, rax" are implemented as "malloc-new-register from the file, then set new register to zero".
As such, all you need is the "architectural file", saying "Thread#0's RAX instruction-pointer #0x80084412 is really RegisterFile[25]", and "Thread#0's RAX instruction pointer #0x80084418 is RegisterFile[26]", and finally "Thread#1's RAX instruction pointer #9001beef is RegisterFile[27]".
That allows all 3 contexts to execute out-of-order and concurrently. (In this case, the decoder has decided that Thread#0 has two things to execute in parallel for some reason, while Thread#1 only has one RAX allocated in its stream)
The retirement queue puts everything back into the original, programmer-intended order before anyone notices what's up. Different threads on different sections of code will, in practice, use up a different number of registers.
A "linked-list" loop, like "while(node) node = node->next;" will be sequential and use almost none of the register file (impossible to extract parallelism, even with all of our Out-Of-Order tricks we have today), meaning Thread#2 can SMT and maybe use all 180-registers in the file for itself.
-----
Side note: this is why I think Apple's M1 design is kind of insane/terrible. Apple's register file is something like 380+ registers or thereabouts (on an ARM architecture that only shows 31 registers to the assembly language), but it does NOT in fact use SMT.
That is to say: Apple expects a singular thread to actually use those 380+ registers (!!!), and not really benefit from additional threads splitting up that absurdly huge register file. I'm sure Apple has done their research / simulations or whatever, and maybe it works out for the programs they've benchmarked / studied. But... I have to imagine that implementing SMT onto the Apple chips would be one of the easier ways to improve multithreaded performance.
Great summary, though given the overall opinion of the M1's performance, they must've done something right. Maybe it's due in part to ARM being less strict on memory ordering, which would allow the M1 to make use of those registers.
It's no surprise that a core with a register file sized at 300+ would outperform a core with a register file of size 180. Significantly more parallelism can be found.
Apple's main risk was deciding that such large register files were in fact useful to somebody.
For highly optimized workloads (e.g. x264 video encoding or raytracers) I've seen about the same 20-25% increase at most. On general heavy and diverse workloads the number is rarely ever above 5%, while the power consumption and the security concerns remain all the same. I personally disable SMT on everything I can.
OpenBSD defaults to turning SMT off these days because the core team is of the same opinion. Theo actually expressed a similar opinion over 15 years ago.