I thought I was the only one… I find ARM to be quite cumbersome and odd. I wouldn’t call it RISC by any means when it comes down to the pragmatics, at least not in the traditional sense. I think its popularity stemmed from its energy efficiency rather than elegance…
RISC is great for implementation and programming simplicity. But these are not the only variables one optimizes a core for. The higher-end the implementation, the more internal complexity and special-case extensions it may include, from AES acceleration instructions to signed pointers.
ARM's popularity also stems from its licensing, AFAICT. You can license various ARM cores, from the tiniest to the beefiest, to include in your SoC or MCU, mix and match them on the die, etc. This is not true for any reasonably recent x86 architecture, either from Intel or from AMD.
Interestingly, AMD64 has had "signed" pointers of a different kind since the start: valid 64-bit virtual addresses must be in canonical form, i.e. the upper bits are a sign extension of the highest implemented address bit, even when the actual virtual address space is smaller than 64 bits. This avoids compatibility issues with future implementations supporting larger address spaces.
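To make the sign-extension rule concrete, here is a minimal sketch in Python. It assumes a 48-bit virtual address space (the width of the original AMD64 implementations; newer parts can implement more), and the function names `is_canonical` and `sign_extend` are made up for illustration, not taken from any real API.

```python
MASK64 = (1 << 64) - 1


def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits-1) of addr are all 0 or all 1.

    va_bits=48 is an assumption for illustration.
    """
    top = (addr & MASK64) >> (va_bits - 1)       # the bits that must agree
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def sign_extend(addr: int, va_bits: int = 48) -> int:
    """Build the canonical 64-bit address from the low va_bits bits."""
    low_mask = (1 << va_bits) - 1
    addr &= low_mask
    if addr & (1 << (va_bits - 1)):              # highest implemented bit set
        addr |= MASK64 ^ low_mask                # fill the upper bits with 1s
    return addr


if __name__ == "__main__":
    for a in (0x00007FFFFFFFFFFF, 0x0000800000000000, 0xFFFF800000000000):
        print(hex(a), is_canonical(a))
```

With 48 bits, the lower half ends at 0x00007FFFFFFFFFFF and the upper half starts at 0xFFFF800000000000; everything in between is non-canonical, so software cannot safely stash data in the upper bits and still expect the address to work on implementations with a wider address space.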
I think to some extent all ISAs that are asked to do performant computing inevitably drift from RISC/CISC into FISC (Fast Instruction Set Computing), in the sense that they get instructions specifically designed to accelerate the workloads they'll be executing most commonly. In ARM64's case this is things like the infamous "JavaScript" instruction (FJCVTZS), which does the double-to-int32 conversion the way JavaScript specifies it (semantics usually attributed to x86 floating-point behavior, which JavaScript adopted for speed reasons). In x86 this has led to a subset of instructions being super fast while anything off that beaten path is now slow (avoid the LOOP instruction at all costs!), as well as specific extensions being added for common workloads (AES, Crypto, FMA etc.).
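For anyone wondering what the "JavaScript" conversion actually changes, here is a rough Python sketch of the two behaviors: the wraparound (modulo 2^32) conversion that JavaScript's ToInt32 specifies, versus the saturating conversion that a plain ARM64 float-to-int instruction (FCVTZS) performs. This only models the arithmetic; the function names are made up, and it says nothing about flags or exceptions.

```python
import math


def to_int32_js(x: float) -> int:
    """JavaScript-style ToInt32: truncate, then wrap modulo 2**32."""
    if math.isnan(x) or math.isinf(x):
        return 0
    n = math.trunc(x) & 0xFFFFFFFF               # keep the low 32 bits
    return n - 0x100000000 if n >= 0x80000000 else n


def to_int32_saturating(x: float) -> int:
    """Saturating conversion: truncate, clamp to the int32 range, NaN gives 0."""
    if math.isnan(x):
        return 0
    if x >= 2**31 - 1:
        return 2**31 - 1
    if x <= -(2**31):
        return -(2**31)
    return math.trunc(x)


if __name__ == "__main__":
    for v in (3.7, -1.5, 2.0**32 + 5, 2.0**31):
        print(v, to_int32_js(v), to_int32_saturating(v))
```

Without a dedicated instruction, a JIT has to do the ordinary conversion, detect the out-of-range case, and fall back to a slower path to get the wrapped result, which is why a one-instruction version is attractive for a workload this common.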
AArch64 is only targeted at higher-end devices where you'd expect to have at least 128 megs of RAM, enough to make the memory savings from Thumb2 not noticeable in practice.
Thumb2 was remarkable for getting so close to the performance of regular Arm code, but it was always still slower for almost any task even with the reduced instruction cache pressure.
RAM isn't much of a concern compared with the significantly larger caches and ROM that lower code density demands.
AArch64 very much loses out here; microcontrollers are resource-constrained, so they won't use it.
High-performance implementations (like Apple's M1) need to work around the lower code density by making caches larger, or even adding a micro-op cache; this all has a transistor count cost, which in turn costs area, power, and maximum clock speed.