Speculating the entire x86-64 instruction set in seconds with one weird trick

woodruffw · on March 25, 2021

This is a really clever technique! I was impressed by sandsifter[1] when it originally came out, and this seems an awful lot faster and less prone to false negatives (since it's purely speculative and doesn't require sandsifter's `#PF` hack).

At the risk of unwarranted self-promotion: the other side of this equation is fidelity in software instruction set decoders. x86's massive size and layers of historical complexity make it among the most difficult instruction formats to accurately decode; I've spent a good part of the last two years working on a fuzzer that's discovered thousands of bugs in various popular x86 decoders[2][3].

[1]: https://github.com/xoreaxeaxeax/sandsifter

[2]: https://github.com/trailofbits/mishegos

[3]: https://ww.easychair.org/publications/preprint_download/1LHr

blight · on March 25, 2021

be sure to check out the data set extracted by this research over at https://haruspex.can.ac/

alpb · on March 25, 2021

Off-topic: This was posted yesterday, but got no attention. (I tried to re-post yesterday, but got redirected to the existing post.) https://news.ycombinator.com/item?id=26576032 I wonder what makes HN disallow reposting of the same URL in a short periods of time but allowing it in long-term.

cbhl · on March 26, 2021

If I recall correctly, this heuristic was found experimentally:

Once upon a time, reposts were allowed, and so if something was topical and popular, sometimes you'd see multiple items on the front page pointing to the same URL. Sometimes the second or third posting would get more karma simply because of an editorialized title. This was undesirable, so they added the repost merging.

Later, an emergent behavior was that posts that report on something found on another site and had been previously posted would find their way to the front page. Some subset of HN visitors would be newer than the original post, and simply upvote something they hadn't seen before. But also there would be times where the post would be interesting with the added value of hindsight, or would provide context to the topic-of-the-day. So at some point it was decided that reposts would be allowed in the long-term, since sometimes they had value.

Thus we're at the state we have today.

dan-robertson · on March 26, 2021

Sometimes the moderators encourage reposts of unnoticed articles they thought were good (by reaching out to the poster directly). Although I don’t know if that’s what happened here. It could have been a sufficiently different url.

tomcam · on March 25, 2021

I think the first thing you described is called spamming

Iv · on March 26, 2021

tl;dr: he used a counter provided by intel that describes the total number of microcode instructions translated. He tried thousands of possible opcodes in "speculation mode" (this is the mode CPUs use to calculate both forks of a branch while waiting for the branch to be decided) and checked when an anomalous number of microcode instructions were translated.

He found 13 likely candidates for unpublished ops, including the 2 that were recently found. Also a few unpublished quirks of some known instructions.

tux3 · on March 26, 2021

A technical nitpick: speculation doesn't check both forks of a branch, it has to pick a side!

The CPU tries very hard to guess which way branches go and that allows it to speculate much further than if it tried every possible combinations.

What happens in this post is that the author writes a CALL instruction, but then manipulates the stack so that it doesn't actually return where the CPU expects it to. So the CPU will speculatively execute the instructions that follow the CALL linearly, even though they are never actually reached!

Iv · on March 26, 2021

Oh I did not know that about speculation!

alfiedotwtf · on March 26, 2021

Good article!

Now someone needs to do a follow-up to see what software uses these hidden opcodes :)

TrainedMonkey · on March 25, 2021

This just cinched to me that we need to sunset hardware x86 and run code that cannot be recompiled on emulators. x86 had a good run, but it becoming increasingly obvious that maintaining backward compatibility in modern high performance parts is incredibly expensive and bug ridden.

At this point sunsetting x86 is not even a pipe dream, most people carry ARM powered computers in their pockets and Apple recently demonstrated that it can be quite successful in high performance devices.

monocasa · on March 25, 2021

While some of the specifics are obviously x86 specific, I'm not sure that the underlying issues here are. From undocumented instructions, to lack of documentation about speculation barriers most of the root issues you see here are equally applicable to the cores in an M1.

als0 · on March 25, 2021

> to lack of documentation about speculation barriers

Can you elaborate on this? I was having a look at the M1 instructions here [1] and it seems that they implement at least one of the barriers, CSDB, which is actually documented by ARM[2].

[1] https://dougallj.github.io/applecpu/firestorm-int.html

[2] https://developer.arm.com/documentation/ddi0596/2020-12/Base...

monocasa · on March 25, 2021

There's explicit speculation barriers that they document, but they don't tell you what _other_ instructions are speculation barriers simply due to microarchitectural compromises like what you're seeing in this article.

als0 · on March 25, 2021

OK thanks for clarifying. If there are explicit instructions for blocking speculation then why are you concerned about implicit barriers?

monocasa · on March 25, 2021

I'm not. My point is that this article was being unfairly interpreted by the parent as some black mark against x86. There's plenty to hold against it, but not really anything that's documented here. I probably should have made that clearer.

phkahler · on March 25, 2021

Seems like implicit undocumented barriers could be used in critical code to provide unfair performance advantage. Maybe. Might be a bit of a stretch.

mhh__ · on March 25, 2021

I'm kind of curious whether Apple aren't documenting M1 because they can't be bothered, can't (not organized enough yet), or because it's their toy not ours.

monocasa · on March 25, 2021

D) Scared that disclosing microarchitectural details will open them up to more patent fights that they have to pay off without a lot of benefit to endusers or developers.

E) All of the above

marcan_42 · on March 26, 2021

E) Because they are doing things nobody else is allowed to (extending the ARMv8-A instruction set outside of the implementation-defined register mechanism, and blatantly breaking ARM architectural requirements like hard-coding VHE mode to on), and ARM lawyers are on the edge of their seats.

No, really, Apple are going out of their way not to document certain details about the M1, and this is almost certainly the reason. ARM are scared to death of fragmentation, Apple somehow managed to get away with doing it anyway, and ARM do not want anyone else to get any ideas.

skissane · on March 26, 2021

It is at least possible (if not likely) that Apple has a special deal with ARM that lets them get away with all this.

Which may mean that ARM wants them to be quiet about. Not because Apple is breaking the rules, but because ARM doesn't want to rub into its other customers the fact that Apple has a special deal that other customers aren't being offered.

saagarjha · on March 26, 2021

This sounds the most likely, tbh. Apple has been breaking the spec for years, but they’ve been very testy about ever mentioning their custom extensions and ARM has probably agreed to look the other way as long as they don’t draw attention to the fact.

skissane · on March 26, 2021

> ARM has probably agreed to look the other way as long as they don’t draw attention to the fact

That makes it sound like it is some kind of "tacit understanding", "I won't say anything if you don't say anything". I'm suggesting that maybe Apple's license agreement explicitly states that they are allowed to do all this, but also maybe its confidentiality clauses prohibit either party from publicly acknowledging that fact without the other party's explicit permission.

A company like Apple – who has a well-funded legal department, and I've never heard any suggestion that they are anything other than competent – wouldn't set up a multi-billion dollar business on the foundation of "agreed to look the other way". They'd have it all set out in writing, crystal clear and totally secret.

marcan_42 · on March 26, 2021

The difference is until now nobody could use those extensions, especially all the kernel mode stuff.

Now they can, since the M1 is an open platform (running third-party kernel code is supported).

monocasa · on March 26, 2021

AMX on the A13 was available from EL0 with no real issues.

marcan_42 · on March 27, 2021

That was one thing, and still undocumented. Suddenly you can also access all of these other things:

https://github.com/AsahiLinux/docs/wiki/HW:Apple-Instruction...

Plus their funky intel-emulation related CPU features which introduce architectural EL0 state (SSE-specific FP flags, AP flags). Plus their hardcoded VHE=1 spec breakage now becomes relevant at EL2. And almost certainly more things we haven't figured out yet.

davidw · on March 26, 2021

I am old enough to remember the "good old days" when there were OS/chip pairs from all the big companies. IBM, Sun, HP, SGI, DEC all had their own Unix versions that ran on their own hardware. That's part of the reason that tools like 'configure' are so ugly.

monocasa · on March 26, 2021

If Apple cared what ARM thought, they would have finished implementing the spec. AArch64 is as much Apple's as it is ARM's and I doubt they're scared. They have a do wtf you want license and ultimately don't really care if the rest of ARM withers and dies.

marcan_42 · on March 26, 2021

There is no such "do wtf you want" license. The ARM architectural license allows licensees (like Apple, Cavium, Samsung, and others) to implement CPUs that comply with the architecture. Some things are off-limits, like custom instructions. Apple broke those rules. They probably got away with doing that as a "special exception" under the understanding that they were doing so for "embedded" use cases, i.e. iOS devices that will only ever run iOS, but now that their designs are in an open platform (M1), things have changed, and ARM is not happy.

monocasa · on March 26, 2021

Apple literally designed large parts of AArch64. There's no do wtf you want license that you can go out and buy, but Apple has one none the less from the sheer fact that it's partially their's to begin with and they negotiated it with ARM before AArch64 was even a twinkle in their eyes.

ARM probably has some trademark mechanisms that they could in some world use to enforce compliance from Apple, but Apple has been super duper careful to refer to it as Apple Silicon rather than ARM.

skissane · on March 26, 2021

> There is no such "do wtf you want" license

How do you know? ARM's license agreements with Apple are confidential and just about nobody knows what's actually in them (and the few who have seen them aren't allowed to talk about it.)

The standard terms which ARM offers to its other customers may be quite different from the terms it has negotiated with Apple. Apple is a customer with unique clout and leverage, which means they can extract terms in negotiations which other customers may not be able to.

If ARM offers Apple (or any other customer) "special terms", it wants to keep that fact confidential, so that other customers who have the (less generous) standard terms don't start pressuring for special terms for them as well.

marcan_42 · on March 26, 2021

I know, because Apple went through the trouble of close-sourcing the M1 bring-up code in XNU source dumps... because it suddenly contains secret new instructions, while until now ARM XNU builds were largely public.

These instructions enable an entire new set of parallel execution modes, among other things, to run more protected kernel code. Completely nonstandard stuff.

Also, Apple somehow managed to push custom kernel support to the documentation of kmutil, but not the actual binary, until several macOS releases later. That would only happen if the latter got pulled for some extraneous reason.

I've been working on this since December; the entire situation is all showing signs pointing straight at ARM's lawyers having gotten involved to some extent.

skissane · on March 26, 2021

I think there is another explanation for this than "they are breaking ARM's rules and either trying to hide it or getting in trouble"

It is: "They are following ARM's special rules for Apple only but they aren't allowed to tell anyone that those special rules for Apple only exist"

They might decide to leave certain processor extensions undocumented, even close source code which uses them, because they are worried disclosing them would implicitly reveal the existence of the "special rules for Apple", possibly violating the confidentiality clauses of the ARM-Apple contract containing those special rules.

So I still don't see how you know that "There is no such 'do wtf you want' license", when the behaviour you describe is compatible with their being one, but its existence being secret

marcan_42 · on March 27, 2021

That's what I said: that they got a special exemption (which is not part of ARM's official license offering) to allow them to do this for embedded use cases (iOS devices), which implied that they wouldn't be documenting or having other people use these instructions.

The problem is that the M1 threw a wrench in the works because now suddenly running non-Apple code that can use these instructions is allowed, and everyone else can see that they exist, so suddenly the existence of that special exemption is public.

skissane · on March 27, 2021

> that they got a special exemption (which is not part of ARM's official license offering

I agree except I'd say that it is part of ARM's official license offering, but only if your name is "Apple".

monocasa · on March 26, 2021

The kmutil stuff sounds way more like they cut it at the last second for some QA issue, but the documentation tech writers are a different team and that part was missed. They wouldn't have figured out issues with lawyers this quickly.

The bring up stuff could be explained by being scared about some patent nonsense.

teclordphrack2 · on March 26, 2021

"There is no such "do wtf you want" license."

I have to say there is a "do wtf you want" license.

Hear me out. Say you like to speed and you have the money to pay fines and were okay with that. To a person like that what most people see as a "fine" becomes a "tax/fee."

What am I getting at? Apples probably has a hand full of multi-layered legal strategies that ARMs legal team is somewhat aware of. Apple has the money to throw at it. Heck, they can start investing any estimated payout and if they do lose they will of still profited the interest from that investment plus the increased sales they think they are going to get.

matthewmacleod · on March 25, 2021

This is such a weird meme to me. In what way is the thing you described at all "increasingly obvious"? We see this kind of statement all the time, but almost never accompanied by any sort of actual rationale for it.

retrac · on March 25, 2021

VME support was broken on Ryzen for a while before a microcode patch came out. The VME instructions are a relatively obscure, and originally Intel proprietary extension, to i386's virtual 8086 mode. Introduced in the 90s to speed up DOS virtual machines under OS/2 and NT, I think.

I don't know much about how stuff is implemented these days, but virtual 8086 mode involves some page table muckery and similar. Surely implementing it creates a larger exposure area, from a security standpoint.

mhh__ · on March 25, 2021

> but it becoming increasingly obvious that maintaining backward compatibility in modern high performance parts is incredibly expensive and bug ridden

Moving everything to ARM will not be cheap. As for bugs that is entirely dependent on the company making the chip, which ones do you have in mind? (Also recall that M1 is vulnerable to Spectre too).

I kind of hope X86s days are numbered as well, but I'm not looking forward to an ARM monoculture.

als0 · on March 25, 2021

> Moving everything to ARM will not be cheap

I thought the same thing until I saw how well Apple's Rosetta 2 works. Now that we've seen what's possible, I'm hoping that the x64 emulation in Windows/ARM will rise to the challenge set by Apple.

monocasa · on March 25, 2021

I imagine Apple has a patent on the cute optional TSO memory model that makes it work.

duskwuff · on March 25, 2021

I don't think they do.

The memory model itself isn't patentable; the TSO memory model already existed on x86, as well as many other architectures before that. Having an option to enable/disable TSO might be patentable, but it'd be a stretch; there's plenty of precedent for allowing a CPU to select between operating modes at runtime. (For example, many PowerPC parts could be switched between little-endian and big-endian modes at runtime, and some developers even used this to assist in emulating x86!)

What's more likely is simply that other ARM SoC vendors aren't implementing their own cores (e.g. they're using a standard Cortex-Ax design from ARM), so they can't add deeply integrated features like TSO themselves.

monocasa · on March 25, 2021

I mean, selectable, cheap, memory model changes are novel and I pretty much guarantee they have patents on some aspects of it. Depending on the patent there might be some cute ways to work around it, but it's going to take some work on your part. They might even have patents on other attempts to get to the same effect that they might have tried out.

PowerPC's biendianness would be patentable too if it had been an absolutely ancient technique about as old as computers themselves.

And there's people other than ARM and Apple making ARM cores. Samsung's Exynos M5 should be coming out before too long for instance.

gpderetta · on March 26, 2021

I believe POWER also has selectable TSO Mode. So did older sparks (before TSO became the only mode available).

monocasa · on March 26, 2021

> I believe POWER also has selectable TSO Mode

It does not.

gpderetta · on March 28, 2021

I believe it is called "Strong Access Ordering Mode" and it is defined in the ISA, but I can't find any good reference online.

edit: this patent from 2012 from IBM references SAO:

https://patents.google.com/patent/WO2012101538A1/en

anarazel · on March 26, 2021

Sparc did optional TSO 20+ years ago...

saagarjha · on March 26, 2021

TSO is just one part of Rosetta. Another significant performance win is that it’s sloppy with keeping the runtime out of the emulated address space to save on a (slow) software MMU, which may not be feasible for other implementations.

monocasa · on March 26, 2021

I'm not following, why would it require a soft MMU? They're not changing out the semantics of the virtual address space.

And the runtime isn't a security boundary so there's no reason to hide it.

I also don't see how what you're saying requires hardware support.

saagarjha · on March 26, 2021

This is less a security thing and more an emulation fidelity thing. A “true” emulator will make itself invisible from the running program so that it runs as if it was on the real system, but a Rosetta program can do things like modify the emulator or control flow in ways that are not possible on a real system because the runtime is mapped into the same address space without protection. For most Mac software this is not a problem because they will not try to access wild memory or enforce mappings, but on other platforms I can see this being infeasible (on Linux I can imagine programs that get upset if they can’t control the full address space outside of what the kernel reserves, for example).

monocasa · on March 26, 2021

It's extremely standard for user space only emulators to not attempt to hide themselves from the emulated code. Every implementation I can think of that derives from HP Dynamo ideas doesn't attempt to since fundamentally you need to have the translated code cache mapped anyway to be able to execute it. You're never going to be able to hide yourself completely wiht any useful perf without transmeta like special mmu hardware that has completely separate translations for I and D fetches.

You can see how the build directories for qemu are {architecture}-softmmu for the full system emulators, and {architecture}-linux-user for the user mode emulators for another example of this but with Linux binary emulation.

marcan_42 · on March 26, 2021

They don't. A few other ARM CPUs implement TSO already (by default, due to their design), it's just not an architectural/standard feature that software can rely on.

monocasa · on March 26, 2021

Right, other cores implement TSO by (because a stupid simple core will be TSO sort of by default), but it's a pretty unique thing to have a runtime TSO switch.

And it'd be incredibly stupid of them not to have a patent on it unless there's some clear prior art for exactly their implementation since we're in a first to file system in the US. Otherwise they're asking for someone else to patent it and troll them for an easy cash grab.

anarazel · on March 26, 2021

Sparc v9 had registers to change the consistency model. 1993.

See "5.2.1.4 PSTATE_mem_model (MM)" in https://cr.yp.to/2005-590/sparcv9.pdf

monocasa · on March 26, 2021

Fair enough. That's what I get for talking in absolute terms.

saagarjha · on March 26, 2021

Nvidia’s Carmel is sequentially consistent.

monocasa · on March 26, 2021

My whole point as been around _selectable_ consistency, which is a rarity. Yes, there are many cores with stricter than necessary consistency, but very very very few allow you to switch consistency models at runtime.

Hence my original statement about the "cute _optional_ TSO memory model".

mhh__ · on March 25, 2021

And to be able to use Rosetta 2 I only have to spend the combined value of all my computers again? Double that if I want the same amount of RAM and storage as I have now

als0 · on March 25, 2021

You don't have to use Rosetta 2 - that was an example of a good implementation. I did mention the announced x64 emulation on Windows, but that's only a preview release at the moment. If you're a Linux user the only thing I'm aware of is QEMU TCG but there might be faster projects out there.

mhh__ · on March 25, 2021

My point was that it doesn't suggest cheap any time soon.

muricula · on March 25, 2021

Many years ago I worked with a greybeard AMD hardware designer. He told me that they commissioned a study about whether it made sense to ditch backwards compatibility, and realized that the parts of the CPU needed to support backwards compatibility they were willing to ditch contributed to less than 1% of die area, and of course were all already designed and battle tested.

Unfortunately I don't have a better source for this than an anecdote.

jeffbee · on March 25, 2021

You could lose the entire x86-specific part of the die in a corner of a 512x512 FMAC unit. Die cost due to x86 complexity seemed like something worth attacking in 1990 when RISC was gaining mindshare but over the years the complexity of that has not expanded very much while the other stuff on the die has gotten much larger.

mhh__ · on March 25, 2021

That probably depends on whether you mean old instructions now cause a fault or moving x86 to a new format entirely that doesn't need a decoder from hell

sesuximo · on March 25, 2021

A lot of code doesn’t really care about speculative execution. It would be a shame to throw out decades of development and thousands of cpus just for one use case that didn’t totally fit it.