Box86/Box64 vs. QEMU vs. FEX (Vs Rosetta2)

ThatPlayer · on July 20, 2022

I've set up box86 on an ARM gaming handheld with a Qualcomm SDM845 recently and it's pretty amazing what it can do. The SDM845 has pretty good Linux support (with postmarketOS supporting the mainline kernel [0]). The open source drivers for the Adreno GPU even support Vulkan and full desktop OpenGL.

With box86/box64, I've been able to run Steam and even Wine/Proton with DXVK translating DirectX to Vulkan. I can even run older 3D Windows games like Skyrim! Though it did glitch on the infamous cart intro.

[0] https://wiki.postmarketos.org/wiki/SDM845_Mainlining

phh · on July 20, 2022

Can you share more about your setup? I also have an SDM845 gaming handheld (Ayn Odin, currently running Android), and I was contemplating installing Windows on it to get Steam, but I much prefer your way.

Do you have some clean distro that boots into something usable without mouse/keyboard? Some documentation for first stage boots? Gits?

ThatPlayer · on July 20, 2022

Yes, that's the same one I have. It's a similar install to Windows, running an edk2 bootloader on another partition. The developer for that has released a Debian 11 install that has working touch controls and a software keyboard, though I have been using a USB-C hub with an actual mouse and keyboard for setup as the UI isn't scaled well for a 5" screen.

https://github.com/ProjectValhalla/OdinMultiBootGuides

I don't think it's ready yet for full time use, as the joystick is mapped incorrectly for most games, but something to keep an eye on.

eptcyka · on July 20, 2022

What's the device you're using? Color me interested.

ThatPlayer · on July 20, 2022

The Ayn Odin. I'm not sure I'd recommend it, with all the new x86_64 gaming handhelds coming out soon with similar pricing, better GPU drivers, and no box86 compatibility issues to deal with.

https://liliputing.com/2022/06/compare-handheld-gaming-pc-sp...

simjnd · on July 20, 2022

This is a great post, ptitSeb is doing an outstanding job on Box86/64. I'm also keeping an eye on FEX, which is evolving very rapidly. There have been 4 releases since the linked blog post, which was written in late March, introducing very welcome features such as support for pressure-vessel and better OpenGL and Vulkan thunking.

olliej · on July 20, 2022

yeah, I liked how they explicitly distinguished the benchmark apps that made significant use of x87 as supporting x87 is necessarily an all software floating point implementation - it's impossible to get close to native performance for x87 heavy code on non-x86 architectures.

lunixbochs · on July 20, 2022

I don't know if I agree with "impossible". There's a lot of performance left on the table with SoftFloat. A non x86 architecture can add an 80-bit FPU if they want. There are architectures with 128-bit float, and CPUs with FPGA coprocessors. I suspect x87 is also not the most optimized path in modern x86 cpus (some instructions may even be fully emulated in microcode).

Realistically, an x87-specific JIT could do significant instruction reordering, lift/reoptimize the underlying code (much of existing x87 code was compiled a very long time ago on older compilers), and vectorize the underlying integer float emulation, or even trace and move some computation to another core or a coprocessor like a GPU or DSP (often idle in embedded cpus).

Many games work fine with x87 lowered to 64-bit or even 32-bit floats, and depending on the workload there's a middle ground where you could understand (or approximate) the current level of precision error for a value, generally run at a lower precision, and trace operations / "catch up" on precision at batched intervals.

olliej · on July 20, 2022

Sorry, it is obviously possible to add hardware support for the 80bit ieee754 format (the format itself is not great, and in reality the precision isn’t necessary in all but the most extreme cases, and those where it is are likely to prefer 128bit float), but it isn’t something that is going to happen in the real world, and even if it was we’re talking about software for generally available systems.**

You could also emulate it by arbitrarily dropping precision, but as a translator that means breaking bincompat, and more importantly breaking programs that use the 80bit format (a lot of fortran).

Obviously many games (especially old ones) perform fine as they’re only using 80bit because at the time x87 was the only hardware fp available on x86 hardware, not because they needed that precision.

Even lowering the precision of the x87 unit isn’t sufficient as that only reduces the precision of the mantissa not the exponent.

Even outside of the core arithmetic (excluding negation which is really easy in all ieee754 formats) there is a whole bunch of state that you need to keep track of to ensure identical behavior.

Obviously if you are willing to break precision guarantees, etc then breaking state isn’t a problem, but if you’re trying to be something like Rosetta - eg completely general and running anything - you don’t really have the freedom to do that.

** sorry, skim reading I missed your 128bit and x87 perf questions. Yes an emulator can (should?) use hw 128bit for the arithmetic if it’s available but on the vast majority of hardware it isn’t.

You are also right about x87 perf being slow compared to everything else, but it’s still faster than anything you can do in software (addition especially does not interact nicely) due to the GRS (guard, round, sticky bit) tracking a software impl needs to do through many bitwise operations.
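To make the bit tracking concrete, here is a minimal sketch of round-to-nearest-even on an integer significand, the kind of step a software float addition runs after aligning operands. It is an illustration only, not code from any emulator mentioned in this thread. The function name `shift_right_rne` is hypothetical, and the round bit is folded into the sticky bit for brevity.

```python
def shift_right_rne(sig: int, shift: int) -> int:
    """Shift an integer significand right by `shift` bits,
    rounding to nearest with ties going to even."""
    if shift == 0:
        return sig
    kept = sig >> shift
    # Guard bit: the first (most significant) bit that gets discarded.
    guard = (sig >> (shift - 1)) & 1
    # Sticky bit: logical OR of every discarded bit below the guard bit.
    sticky = (sig & ((1 << (shift - 1)) - 1)) != 0
    # Round up if above the halfway point, or exactly halfway and kept is odd.
    if guard and (sticky or (kept & 1)):
        kept += 1
    return kept


print(shift_right_rne(0b1011, 2))  # 2.75 rounds up
print(shift_right_rne(0b1010, 2))  # 2.5 is a tie, stays at the even value
print(shift_right_rne(0b1110, 2))  # 3.5 is a tie, moves to the even value
```

A full 80-bit add also has to handle alignment, normalization, exponent overflow and the status flags, so this is only the rounding step in isolation.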

lunixbochs · on July 20, 2022

My middle paragraph up-thread proposes that you can emulate it much faster than we're doing now, at full precision with integer SIMD and a specialized JIT. I'll reiterate the 80-bit softfloat stuff I've seen in use now is not really optimized. I suspect that beating the performance of a cpu on x87 from the era where x87 was relevant is somewhere between realistic and trivial. Beating a modern cpu on x87 from another architecture still feels possible (but it's a less useful thing to spend time on).

> Even lowering the precision of the x87 unit isn’t sufficient as that only reduces the precision of the mantissa not the exponent.

I don't know what you mean by "isn't sufficient". To be clear, I'm speaking from experience emulating x86 games on low resource arm devices, where I had success emulating x87 in lower precision.

For QEMU, IMO the bigger performance issue is that it doesn't natively JIT _any_ FPU or vector instructions, and the indirect memory mapping hurts general performance quite a bit too.

olliej · on July 21, 2022

Oh I have no idea why qemu wouldn't be doing those, but I wasn't very clear about precision.

The x87 unit has control bits that control (shocking!) the behaviour of the unit, and one of the things you can control is the precision it will operate at. People think that if the x87 unit is in the lower precision modes it's possible to simply use the common 32 or 64 bit FPUs, but the x87's 32/64 bit modes only impact the mantissa, not the exponent, so the reduced precision modes are still not interchangeable with fp32 or fp64.
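A rough sketch of why that matters, under loud assumptions: this is a toy model, not x87-accurate. It keeps a 53-bit mantissa (a Python float in [0.5, 1)) but tracks the exponent as a separate integer, standing in for the wider x87 exponent field. The helper names are hypothetical, and the model ignores the 15-bit exponent limit, denormals and status flags.

```python
import math


def to_model(x):
    """Split a double into (mantissa in [0.5, 1), integer exponent)."""
    return math.frexp(x)


def model_mul(a, b):
    # The mantissa product is rounded to 53 bits, as in fp64...
    m, e = math.frexp(a[0] * b[0])
    # ...but the exponent lives in a separate, wider integer.
    return (m, e + a[1] + b[1])


def model_div(a, b):
    m, e = math.frexp(a[0] / b[0])
    return (m, e + a[1] - b[1])


def from_model(a):
    """Convert back to a double (raises OverflowError if it does not fit)."""
    return math.ldexp(a[0], a[1])


big = 2.0 ** 1023

# Plain fp64: the intermediate product overflows to infinity and stays there.
fp64_result = (big * 4.0) / 4.0

# Wide-exponent model: the intermediate is representable, so the value survives.
wide = model_div(model_mul(to_model(big), to_model(4.0)), to_model(4.0))
model_result = from_model(wide)

print(fp64_result, model_result == big)
```

The same mantissa width gives different answers once an intermediate leaves the fp64 exponent range, which is the case where a straight swap to host fp64 would change program behaviour.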

lunixbochs · on July 20, 2022

I'm excited Rosetta2 can be used in Linux VMs as of macOS Ventura. QEMU tends to be the most accurate emulation option for me on Linux and is nowhere near the speed of Rosetta. (Rosetta is quite accurate as well, it just wasn't available for Linux). I only have one remaining edge case with Rosetta around the FPU config register not behaving quite the same way as an x86 CPU, everything else has been great.

FEX can be quite fast at some workloads, but was slower than QEMU for others, and had some glitches for me. I ended up porting my app to arm64 Linux for Linux dev on M1 rather than continue to slog through the issues I had with emulation on Linux.

> I couldn’t include FEX in the bench as it’s not compatible with the 16k page actually used on Asahi/M1.

FEX ran fine for me in a Parallels VM on M1.

IntelMiner · on July 22, 2022

Asahi is Linux on M1. Parallels runs atop OS X which uses different page sizes
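One way to see which page size a given environment reports, as a minimal sketch using only the Python standard library. The helper `matches_4k_assumption` is hypothetical and just models software that hard-codes 4 KiB pages.

```python
import mmap


def reported_page_size() -> int:
    """Page size, in bytes, as reported to this process."""
    return mmap.PAGESIZE


def matches_4k_assumption(page_size: int) -> bool:
    """True if code that hard-codes 4 KiB pages would see what it expects."""
    return page_size == 4096


size = reported_page_size()
print(size, matches_4k_assumption(size))
```

Run inside the guest or host you care about, since the value depends on the kernel the process is actually running under.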

CoastalCoder · on July 20, 2022

Anyone know why qemu is so slow vs. the others?

The article discusses differences in floating point handling and GPU passthrough, but I don't think the 7z benchmark uses either of those.

lunixbochs · on July 20, 2022

TCG has historically had more of a focus on accuracy than performance. It lifts a lot of guest architectures to a lot of host architectures, and isn't particularly specialized to any given host cpu type. It lifts many instructions to C helpers instead of bothering to jit them. Last I checked it had no vector -> vector jit. It's also not single address mapped - memory IO undergoes indirection, which is expensive. I think Rosetta for example has a shared address space for the guest and host code. Honestly on 64-bit CPUs, especially with pointer authentication on M1, the risk of the guest accidentally messing with host/jit memory is low.
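As a toy illustration of the indirection cost (not QEMU's or Rosetta's actual code; every name here is hypothetical), compare a guest load that goes through a per-page lookup table with one where the guest address is just a host offset:

```python
PAGE_BITS = 12
PAGE_SIZE = 1 << PAGE_BITS
PAGE_MASK = PAGE_SIZE - 1

# Stand-in for host RAM backing the guest.
host_memory = bytearray(4 * PAGE_SIZE)

# Hypothetical guest-page -> host-offset table (a stand-in for a software TLB).
soft_tlb = {0x10: 2 * PAGE_SIZE, 0x11: 0 * PAGE_SIZE}


def load_indirect(guest_addr: int) -> int:
    """Every access pays for a table lookup plus offset arithmetic."""
    page = guest_addr >> PAGE_BITS
    host_base = soft_tlb[page]  # a miss (KeyError here) would be the slow path
    return host_memory[host_base + (guest_addr & PAGE_MASK)]


# Shared address space: guest address is the host offset plus a constant.
GUEST_BASE = 0


def load_direct(guest_addr: int) -> int:
    """A single add; a JIT can emit this as one host load instruction."""
    return host_memory[GUEST_BASE + guest_addr]


host_memory[2 * PAGE_SIZE + 5] = 0xAB
print(load_indirect(0x10005), load_direct(2 * PAGE_SIZE + 5))
```

Both calls read the same byte; the difference is how much work sits on the path of every single guest memory access.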

yjftsjthsd-h · on July 20, 2022

Possibly because they don't care as much. Until very recently, the heaviest use of qemu was to run hardware accelerated virtual machines on the same architecture. If you're using it with KVM/HAXM/whatever, it is fast. I expect they would be happy to take performance enhancements for emulation, but that it simply hasn't been a priority.

rnk · on July 20, 2022

I don't have an Apple ARM device. I was waiting until I could run x86 VMs reasonably efficiently, because I need that all the time. Up to now it seemed the answer was that it's too slow if you want to use x86 in a basically normal way with a VM. This article suggests Rosetta2 would let you have usable performance; can someone provide the high level view? r2 was about 2/3 the speed of native exec on that last benchmark in the article.

olliej · on July 20, 2022

You would be able to run x86 code on an arm Linux vm. There is no VM option better than qemu or similar.

The problem, as far as I can infer, is that a binary translator is given a bunch of context a full VM can’t have (what random clump of bytes is an executable, etc).

rnk · on July 23, 2022

Would it be useless for trying to run an x86 os, with binaries? I really do need to occasionally run something on x86 based linux but I could probably avoid it most of the time.