Speculating the entire x86-64 instruction set in seconds with one weird trick (can.ac)
208 points by muricula on March 25, 2021 | 65 comments



This is a really clever technique! I was impressed by sandsifter[1] when it originally came out, and this seems an awful lot faster and less prone to false negatives (since it's purely speculative and doesn't require sandsifter's `#PF` hack).

At the risk of unwarranted self-promotion: the other side of this equation is fidelity in software instruction set decoders. x86's massive size and layers of historical complexity make it among the most difficult instruction formats to accurately decode; I've spent a good part of the last two years working on a fuzzer that's discovered thousands of bugs in various popular x86 decoders[2][3].

[1]: https://github.com/xoreaxeaxeax/sandsifter

[2]: https://github.com/trailofbits/mishegos

[3]: https://www.easychair.org/publications/preprint_download/1LHr


Be sure to check out the data set extracted by this research over at https://haruspex.can.ac/


Off-topic: This was posted yesterday, but got no attention. (I tried to re-post yesterday, but got redirected to the existing post.) https://news.ycombinator.com/item?id=26576032 I wonder what makes HN disallow reposting of the same URL within a short period of time but allow it in the long term.


If I recall correctly, this heuristic was found experimentally:

Once upon a time, reposts were allowed, and so if something was topical and popular, sometimes you'd see multiple items on the front page pointing to the same URL. Sometimes the second or third posting would get more karma simply because of an editorialized title. This was undesirable, so they added the repost merging.

Later, an emergent behavior was that reposts of things which had previously been submitted would find their way to the front page. Some subset of HN visitors would be newer than the original post, and would simply upvote something they hadn't seen before. But there would also be times when the repost was interesting with the added value of hindsight, or provided context for the topic-of-the-day. So at some point it was decided that reposts would be allowed in the long term, since sometimes they had value.

Thus we arrived at the state we have today.


Sometimes the moderators encourage reposts of unnoticed articles they thought were good (by reaching out to the poster directly). Although I don’t know if that’s what happened here. It could have been a sufficiently different url.


I think the first thing you described is called spamming


tl;dr: he used a counter provided by Intel that describes the total number of microcode instructions translated. He tried thousands of possible opcodes in "speculation mode" (this is the mode CPUs use to calculate both forks of a branch while waiting for the branch to be decided) and checked when an anomalous number of microcode instructions were translated.

He found 13 likely candidates for unpublished ops, including the 2 that were recently found. Also a few unpublished quirks of some known instructions.
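
For the curious, here's a minimal sketch of how reading such a counter could look on Linux via perf_event_open. This is my own illustration, not the author's harness, and the raw event encoding (0x010E, i.e. UOPS_ISSUED.ANY on many recent Intel cores) is an assumption you'd want to verify against your CPU model's event tables:

    /* Hedged sketch: sample a raw uop counter around a test gadget.
       The encoding 0x010E (event 0x0E, umask 0x01) is UOPS_ISSUED.ANY
       on many Intel cores -- an assumption, check your model's tables. */
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.type = PERF_TYPE_RAW;
        attr.size = sizeof attr;
        attr.config = 0x010E;              /* (umask << 8) | event */
        attr.exclude_kernel = 1;           /* count user-space uops only */

        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        uint64_t before, after;
        read(fd, &before, sizeof before);  /* counter value before... */
        /* ... run a candidate-instruction gadget here ... */
        read(fd, &after, sizeof after);    /* ...and after */
        printf("uops issued: %llu\n", (unsigned long long)(after - before));
        close(fd);
        return 0;
    }

An anomalous delta for a supposedly invalid opcode is the tell the article looks for.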


A technical nitpick: speculation doesn't check both forks of a branch, it has to pick a side!

The CPU tries very hard to guess which way branches go, and that allows it to speculate much further than if it tried every possible combination.

What happens in this post is that the author writes a CALL instruction, but then manipulates the stack so that it doesn't actually return where the CPU expects it to. So the CPU will speculatively execute the instructions that follow the CALL linearly, even though they are never actually reached!
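
If it helps, here's roughly what such a gadget might look like (GCC extended asm, x86-64). This is my own sketch of the idea, not the author's actual code, and the candidate bytes 0x0F 0x0A are just a placeholder for whatever opcode is being probed:

    /* Hedged sketch: the return stack buffer predicts that the RET
       returns to the bytes after the CALL, so they are fetched and
       executed speculatively -- but they never retire, because the
       architectural return address has been rewritten. */
    void speculate_candidate(void) {
        asm volatile(
            "call 1f\n\t"               /* predictor expects RET -> 0: */
            "0: .byte 0x0f, 0x0a\n\t"   /* candidate bytes, speculated only */
            "ud2\n\t"                   /* fence off the speculative window */
            "1: lea 2f(%%rip), %%rax\n\t"
            "mov %%rax, (%%rsp)\n\t"    /* rewrite the return address... */
            "ret\n\t"                   /* ...so the real return goes to 2: */
            "2:"
            ::: "rax", "memory");
    }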


Oh I did not know that about speculation!


Good article!

Now someone needs to do a follow-up to see what software uses these hidden opcodes :)


This just cinched it for me: we need to sunset hardware x86 and run code that cannot be recompiled on emulators. x86 had a good run, but it's becoming increasingly obvious that maintaining backward compatibility in modern high-performance parts is incredibly expensive and bug-ridden.

At this point sunsetting x86 is no longer a pipe dream: most people carry ARM-powered computers in their pockets, and Apple recently demonstrated that it can be quite successful in high-performance devices.


While some of the specifics are obviously x86-specific, I'm not sure that the underlying issues here are. From undocumented instructions to lack of documentation about speculation barriers, most of the root issues you see here are equally applicable to the cores in an M1.


> to lack of documentation about speculation barriers

Can you elaborate on this? I was having a look at the M1 instructions here [1] and it seems that they implement at least one of the barriers, CSDB, which is actually documented by ARM[2].

[1] https://dougallj.github.io/applecpu/firestorm-int.html

[2] https://developer.arm.com/documentation/ddi0596/2020-12/Base...


There are explicit speculation barriers that they document, but they don't tell you which _other_ instructions are speculation barriers simply due to microarchitectural compromises, like what you're seeing in this article.


OK thanks for clarifying. If there are explicit instructions for blocking speculation then why are you concerned about implicit barriers?


I'm not. My point is that this article was being unfairly interpreted by the parent as some black mark against x86. There's plenty to hold against it, but not really anything that's documented here. I probably should have made that clearer.


Seems like implicit undocumented barriers could be used in critical code to provide unfair performance advantage. Maybe. Might be a bit of a stretch.


I'm kind of curious whether Apple aren't documenting M1 because they can't be bothered, can't (not organized enough yet), or because it's their toy not ours.


D) Scared that disclosing microarchitectural details will open them up to more patent fights that they have to pay off without a lot of benefit to endusers or developers.

E) All of the above


E) Because they are doing things nobody else is allowed to (extending the ARMv8-A instruction set outside of the implementation-defined register mechanism, and blatantly breaking ARM architectural requirements like hard-coding VHE mode to on), and ARM lawyers are on the edge of their seats.

No, really, Apple are going out of their way not to document certain details about the M1, and this is almost certainly the reason. ARM are scared to death of fragmentation, Apple somehow managed to get away with doing it anyway, and ARM do not want anyone else to get any ideas.


It is at least possible (if not likely) that Apple has a special deal with ARM that lets them get away with all this.

Which may mean that ARM wants them to be quiet about it. Not because Apple is breaking the rules, but because ARM doesn't want to rub its other customers' noses in the fact that Apple has a special deal that they aren't being offered.


This sounds the most likely, tbh. Apple has been breaking the spec for years, but they’ve been very testy about ever mentioning their custom extensions and ARM has probably agreed to look the other way as long as they don’t draw attention to the fact.


> ARM has probably agreed to look the other way as long as they don’t draw attention to the fact

That makes it sound like it is some kind of "tacit understanding", "I won't say anything if you don't say anything". I'm suggesting that maybe Apple's license agreement explicitly states that they are allowed to do all this, but also maybe its confidentiality clauses prohibit either party from publicly acknowledging that fact without the other party's explicit permission.

A company like Apple – who has a well-funded legal department, and I've never heard any suggestion that they are anything other than competent – wouldn't set up a multi-billion dollar business on the foundation of "agreed to look the other way". They'd have it all set out in writing, crystal clear and totally secret.


The difference is that until now nobody could use those extensions, especially all the kernel-mode stuff.

Now they can, since the M1 is an open platform (running third-party kernel code is supported).


AMX on the A13 was available from EL0 with no real issues.


That was one thing, and still undocumented. Suddenly you can also access all of these other things:

https://github.com/AsahiLinux/docs/wiki/HW:Apple-Instruction...

Plus their funky Intel-emulation-related CPU features, which introduce architectural EL0 state (SSE-specific FP flags, AP flags). Plus their hardcoded VHE=1 spec breakage, which now becomes relevant at EL2. And almost certainly more things we haven't figured out yet.


I am old enough to remember the "good old days" when there were OS/chip pairs from all the big companies. IBM, Sun, HP, SGI, DEC all had their own Unix versions that ran on their own hardware. That's part of the reason that tools like 'configure' are so ugly.


If Apple cared what ARM thought, they would have finished implementing the spec. AArch64 is as much Apple's as it is ARM's and I doubt they're scared. They have a do wtf you want license and ultimately don't really care if the rest of ARM withers and dies.


There is no such "do wtf you want" license. The ARM architectural license allows licensees (like Apple, Cavium, Samsung, and others) to implement CPUs that comply with the architecture. Some things are off-limits, like custom instructions. Apple broke those rules. They probably got away with doing that as a "special exception" under the understanding that they were doing so for "embedded" use cases, i.e. iOS devices that will only ever run iOS, but now that their designs are in an open platform (M1), things have changed, and ARM is not happy.


Apple literally designed large parts of AArch64. There's no "do wtf you want" license that you can go out and buy, but Apple has one nonetheless, from the sheer fact that it's partially theirs to begin with and they negotiated it with ARM before AArch64 was even a twinkle in their eyes.

ARM probably has some trademark mechanisms that they could in some world use to enforce compliance from Apple, but Apple has been super duper careful to refer to it as Apple Silicon rather than ARM.


> There is no such "do wtf you want" license

How do you know? ARM's license agreements with Apple are confidential and just about nobody knows what's actually in them (and the few who have seen them aren't allowed to talk about it.)

The standard terms which ARM offers to its other customers may be quite different from the terms it has negotiated with Apple. Apple is a customer with unique clout and leverage, which means they can extract terms in negotiations which other customers may not be able to.

If ARM offers Apple (or any other customer) "special terms", it wants to keep that fact confidential, so that other customers who have the (less generous) standard terms don't start pressuring for special terms for them as well.


I know, because Apple went through the trouble of close-sourcing the M1 bring-up code in XNU source dumps... because it suddenly contains secret new instructions, while until now ARM XNU builds were largely public.

These instructions enable an entire new set of parallel execution modes, among other things, to run more protected kernel code. Completely nonstandard stuff.

Also, Apple somehow managed to push custom kernel support to the documentation of kmutil, but not the actual binary, until several macOS releases later. That would only happen if the latter got pulled for some extraneous reason.

I've been working on this since December; the entire situation shows signs pointing straight at ARM's lawyers having gotten involved to some extent.


I think there is another explanation for this besides "they are breaking ARM's rules and either trying to hide it or getting in trouble".

It is: "They are following ARM's special rules for Apple only but they aren't allowed to tell anyone that those special rules for Apple only exist"

They might decide to leave certain processor extensions undocumented, and even close-source code which uses them, because they are worried disclosing them would implicitly reveal the existence of the "special rules for Apple", possibly violating the confidentiality clauses of the ARM-Apple contract containing those special rules.

So I still don't see how you know that "There is no such 'do wtf you want' license", when the behaviour you describe is compatible with there being one, but its existence being secret.


That's what I said: that they got a special exemption (which is not part of ARM's official license offering) to allow them to do this for embedded use cases (iOS devices), which implied that they wouldn't be documenting or having other people use these instructions.

The problem is that the M1 threw a wrench in the works because now suddenly running non-Apple code that can use these instructions is allowed, and everyone else can see that they exist, so suddenly the existence of that special exemption is public.


> that they got a special exemption (which is not part of ARM's official license offering

I agree except I'd say that it is part of ARM's official license offering, but only if your name is "Apple".


The kmutil stuff sounds way more like they cut it at the last second for some QA issue, but the documentation tech writers are a different team and that part was missed. They wouldn't have figured out issues with lawyers this quickly.

The bring up stuff could be explained by being scared about some patent nonsense.


"There is no such "do wtf you want" license."

I have to say there is a "do wtf you want" license.

Hear me out. Say you like to speed, you have the money to pay fines, and you're okay with that. To a person like that, what most people see as a "fine" becomes a "tax/fee."

What am I getting at? Apple probably has a handful of multi-layered legal strategies that ARM's legal team is somewhat aware of. Apple has the money to throw at it. Heck, they can start investing any estimated payout, and if they do lose they will still have profited the interest from that investment, plus the increased sales they think they are going to get.


This is such a weird meme to me. In what way is the thing you described at all "increasingly obvious"? We see this kind of statement all the time, but almost never accompanied by any sort of actual rationale for it.


VME support was broken on Ryzen for a while before a microcode patch came out. The VME instructions are a relatively obscure, originally Intel-proprietary extension to i386's virtual 8086 mode. Introduced in the 90s to speed up DOS virtual machines under OS/2 and NT, I think.

I don't know much about how stuff is implemented these days, but virtual 8086 mode involves some page table muckery and similar. Surely implementing it creates a larger exposure area, from a security standpoint.


> but it's becoming increasingly obvious that maintaining backward compatibility in modern high-performance parts is incredibly expensive and bug-ridden

Moving everything to ARM will not be cheap. As for bugs, that is entirely dependent on the company making the chip; which ones do you have in mind? (Also recall that the M1 is vulnerable to Spectre too.)

I kind of hope x86's days are numbered as well, but I'm not looking forward to an ARM monoculture.


> Moving everything to ARM will not be cheap

I thought the same thing until I saw how well Apple's Rosetta 2 works. Now that we've seen what's possible, I'm hoping that the x64 emulation in Windows/ARM will rise to the challenge set by Apple.


I imagine Apple has a patent on the cute optional TSO memory model that makes it work.


I don't think they do.

The memory model itself isn't patentable; the TSO memory model already existed on x86, as well as many other architectures before that. Having an option to enable/disable TSO might be patentable, but it'd be a stretch; there's plenty of precedent for allowing a CPU to select between operating modes at runtime. (For example, many PowerPC parts could be switched between little-endian and big-endian modes at runtime, and some developers even used this to assist in emulating x86!)

What's more likely is simply that other ARM SoC vendors aren't implementing their own cores (e.g. they're using a standard Cortex-Ax design from ARM), so they can't add deeply integrated features like TSO themselves.


I mean, selectable, cheap memory-model changes are novel, and I pretty much guarantee they have patents on some aspects of it. Depending on the patent there might be some cute ways to work around it, but it's going to take some work on your part. They might even have patents on other approaches to the same effect that they tried out.

PowerPC's bi-endianness would be patentable too if it hadn't been an absolutely ancient technique, about as old as computers themselves.

And there are people other than ARM and Apple making ARM cores. Samsung's Exynos M5 should be coming out before too long, for instance.


I believe POWER also has a selectable TSO mode. So did older SPARCs (before TSO became the only mode available).


> I believe POWER also has a selectable TSO mode

It does not.


I believe it is called "Strong Access Ordering Mode" and it is defined in the ISA, but I can't find any good reference online.

edit: this 2012 patent from IBM references SAO:

https://patents.google.com/patent/WO2012101538A1/en


SPARC did optional TSO 20+ years ago...


TSO is just one part of Rosetta. Another significant performance win is that it’s sloppy with keeping the runtime out of the emulated address space to save on a (slow) software MMU, which may not be feasible for other implementations.


I'm not following; why would it require a soft MMU? They're not changing the semantics of the virtual address space.

And the runtime isn't a security boundary so there's no reason to hide it.

I also don't see how what you're saying requires hardware support.


This is less a security thing and more an emulation fidelity thing. A “true” emulator will make itself invisible from the running program so that it runs as if it was on the real system, but a Rosetta program can do things like modify the emulator or control flow in ways that are not possible on a real system because the runtime is mapped into the same address space without protection. For most Mac software this is not a problem because they will not try to access wild memory or enforce mappings, but on other platforms I can see this being infeasible (on Linux I can imagine programs that get upset if they can’t control the full address space outside of what the kernel reserves, for example).


It's extremely standard for user-space-only emulators to not attempt to hide themselves from the emulated code. Every implementation I can think of that derives from HP Dynamo ideas doesn't attempt to, since fundamentally you need to have the translated code cache mapped anyway to be able to execute it. You're never going to be able to hide yourself completely with any useful perf without Transmeta-like special MMU hardware that has completely separate translations for I and D fetches.

For another example of this, but with Linux binary emulation: the build directories for QEMU are {architecture}-softmmu for the full-system emulators and {architecture}-linux-user for the user-mode emulators.


They don't. A few other ARM CPUs implement TSO already (by default, due to their design); it's just not an architectural/standard feature that software can rely on.


Right, other cores implement TSO by default (because a stupid simple core will be TSO sort of by default), but it's a pretty unique thing to have a runtime TSO switch.

And it'd be incredibly stupid of them not to have a patent on it unless there's some clear prior art for exactly their implementation, since we're in a first-to-file system in the US. Otherwise they're asking for someone else to patent it and troll them for an easy cash grab.


SPARC v9 had registers to change the consistency model. 1993.

See "5.2.1.4 PSTATE_mem_model (MM)" in https://cr.yp.to/2005-590/sparcv9.pdf


Fair enough. That's what I get for talking in absolute terms.


Nvidia’s Carmel is sequentially consistent.


My whole point has been around _selectable_ consistency, which is a rarity. Yes, there are many cores with stricter-than-necessary consistency, but very very very few allow you to switch consistency models at runtime.

Hence my original statement about the "cute _optional_ TSO memory model".


And to be able to use Rosetta 2, I only have to spend the combined value of all my computers again? Double that if I want the same amount of RAM and storage as I have now.


You don't have to use Rosetta 2 - that was an example of a good implementation. I did mention the announced x64 emulation on Windows, but that's only a preview release at the moment. If you're a Linux user, the only thing I'm aware of is QEMU TCG, but there might be faster projects out there.


My point was that it doesn't suggest cheap any time soon.


Many years ago I worked with a greybeard AMD hardware designer. He told me that they commissioned a study about whether it made sense to ditch backwards compatibility, and realized that the parts of the CPU needed to support the backwards compatibility they were willing to ditch contributed less than 1% of die area, and of course were all already designed and battle-tested.

Unfortunately I don't have a better source for this than an anecdote.


You could lose the entire x86-specific part of the die in a corner of a 512x512 FMAC unit. Die cost due to x86 complexity seemed like something worth attacking in 1990, when RISC was gaining mindshare, but over the years that complexity has not expanded very much while the other stuff on the die has gotten much larger.


That probably depends on whether you mean having old instructions now cause a fault, or moving x86 to a new format entirely that doesn't need a decoder from hell.


A lot of code doesn't really care about speculative execution. It would be a shame to throw out decades of development and thousands of CPUs just for one use case that didn't totally fit it.




