The greater simplicity of ARMv8 and its fixed-size instructions definitely helps, but Intel also runs their cores at nearly 2x higher frequency, which means a lot less logic can be squeezed into a clock cycle. That makes it much harder to make a wider processor.
Guess where some of Intel's engineers have fled to? People move around, so it's not like one company has a strangle-hold on knowledge that can't be replicated by another company, especially when one of those companies is willing to pay more for talent.
Considering that x86 is less dense than any RISC ISA, the "compression" argument behind CISC falls apart. No surprise a denser, trivial-to-decode ISA does better.
You have a source for that? The first Google result I found for research on that shows it as denser than almost every RISC ISA [1]. It’s just one study, and it predates ARM64, fwiw.
That paper uses no actual benchmarks; it grabbed a single system utility and hand-optimized it. SPEC and Geekbench show x86-64 averaging well over 4 bytes per instruction.
Sure, I never claimed it to be the be-all-end-all, just the only real source I could find. Adding "SPEC" or "geekbench" didn't really help.
Doing a little more digging, I have also found this [1], which claims "the results show that the average instruction length is about 2 to 3 bytes". On the other hand, this [2] finds that the average instruction length is 4.25 bytes.
Bytes per instruction doesn't really say anything useful about code density when talking about RISC vs. CISC though, since (arguably) the whole idea is that individual CISC instructions are supposed to do more than individual RISC instructions. A three-instruction CISC routine at five bytes each is still a win over a four-instruction RISC routine at four bytes each. Overall code size is what actually matters.
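To make that arithmetic concrete, here is a minimal sketch using only the hypothetical counts and sizes from the example above. They are illustrative figures, not measurements of any real ISA, and `routine_size` is a made-up helper name.

```python
def routine_size(instruction_count, bytes_per_instruction):
    """Total bytes for a routine whose instructions all have the same length."""
    return instruction_count * bytes_per_instruction

# Hypothetical routines from the example above, not real measurements.
cisc_bytes = routine_size(3, 5)  # three instructions, five bytes each
risc_bytes = routine_size(4, 4)  # four instructions, four bytes each

print(f"CISC routine: {cisc_bytes} bytes, RISC routine: {risc_bytes} bytes")
print("CISC is smaller" if cisc_bytes < risc_bytes else "RISC is smaller or equal")
```

The CISC routine has the larger average instruction length (5 bytes vs. 4) and still ends up smaller in total, which is why average instruction length alone can't settle the density question.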
> the M1 is fast isn’t due to technical tricks, but due to Apple throwing a lot of hardware at the problem.
Apple threw more hardware at the problem and they lowered the frequency.
By lowering the frequency relative to AMD/Intel parts, they get two great advantages: 1) they use significantly less power, and 2) they can do more work per cycle, making use of all of that extra hardware.
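A rough way to see the trade-off is the usual first-order model: throughput is roughly instructions per cycle (IPC) times clock frequency, and dynamic power scales roughly with switched capacitance times voltage squared times frequency. The sketch below plugs in made-up numbers. Every IPC, frequency, voltage, and capacitance value is an assumption for illustration, not a measurement of the M1 or of any Intel/AMD part.

```python
def throughput(ipc, freq_ghz):
    """Billions of instructions per second under the simple IPC * f model."""
    return ipc * freq_ghz

def relative_dynamic_power(capacitance, voltage, freq_ghz):
    """First-order dynamic power, C * V^2 * f, in arbitrary units."""
    return capacitance * voltage ** 2 * freq_ghz

# Hypothetical designs; all values are assumed for illustration only.
# The wide core gets a larger capacitance to stand in for its extra hardware.
narrow_fast = {"ipc": 4.0, "freq_ghz": 5.0, "voltage": 1.3, "capacitance": 1.0}
wide_slow = {"ipc": 6.0, "freq_ghz": 3.2, "voltage": 0.9, "capacitance": 1.5}

for name, d in (("narrow/fast", narrow_fast), ("wide/slow", wide_slow)):
    perf = throughput(d["ipc"], d["freq_ghz"])
    power = relative_dynamic_power(d["capacitance"], d["voltage"], d["freq_ghz"])
    print(f"{name}: throughput ~{perf:.1f} G instr/s, relative power ~{power:.2f}")
```

With these assumed numbers the two designs land at similar throughput, while the lower-clocked one draws much less power because it can run at a lower voltage and voltage enters the model squared. Real cores are messier (leakage, memory stalls, achievable IPC), so treat this as a back-of-the-envelope picture only.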