X86 Register Encoding (eklitzke.org)
100 points by eklitzke on Nov 1, 2016 | hide | past | favorite | 51 comments


LLVM doesn't actually prefer the lower registers when doing register allocation. IIRC, sunfish told me that GCC doesn't either. It would be interesting to add features that try to minimize code size to the register allocator, but no compiler I know of actually does this.

Partially as a consequence of this, the REX prefixes take up a lot of space in most x86-64 instruction streams. In fact, the average size of each instruction is almost exactly 4 bytes, exactly the same as in classic 32-bit RISC architectures. (This is why I dislike it when people link to that old Linus post about how x86 is better than RISC architectures because of code size; it may have been true then, but not now.)


LLVM has a CostPerUse attribute on register definitions that both x86-64 and ARM's Thumb2 mode use. See http://llvm.org/viewvc/llvm-project/llvm/trunk/include/llvm/...

The greedy register allocator uses the CostPerUse hints to prefer the lower registers, but it has to balance many optimization goals at once, so it is not always obvious what is going on. For example, it will:

- Defer using a callee-saved register for the first time until it is worth the cost of spilling it in the prologue. It won't create a prologue spill just to use a better register.

- Try to reuse function argument and return value registers to minimize register copies around function calls.

IIRC, the CostPerUse attribute gave a code size savings of around 1% on x86-64 and Thumb2. On both architectures, the effect is reduced by the fact that all register operands must be from the low register set before you can use the smaller encoding.


Interesting! My info was out of date then. Thanks!


> In fact, the average size of each instruction is almost exactly 4 bytes

I was curious about this claim so I wrote a quick and dirty perl script to test it out:

    vmlinuz-linux = 2.71957329365681
    bash          = 3.95324321387071
    firefox       = 3.56640365053712
    geany         = 3.44776119402985
    gzip          = 4.17925462998247
    perl          = 4.10894941634241
    python        = 3.82481751824818
    tar           = 3.93348845041101
    thunar        = 3.82471016115786
    radeon_drv.so = 4.02697782644402
Seems spot on, except for the kernel. I'm not sure why that has shorter instructions on average.

EDIT: I ran it on git too; average was 3.96747336803081. It seems Linus' x86 wizardry does not extend to userspace in this case.


The kernel is compiled with -Os for code size last I checked (which was long ago) so that might explain the difference.


It's been about ten years since I did any kernel work, but last I checked Linux doesn't use most of the fancy features of the CPU to avoid trashing more registers than necessary. If you know you don't do any floating point operations, for example, you know you don't need to save/restore the floating point registers on interrupts. The same might go for things like SSE.

Those instructions tend to be larger, so it may partly explain the difference in average instruction size. The same explanation would apply to why the Radeon driver has a higher average.


I wonder if that flag does anything at the instruction encoding level


Interesting, would you mind sharing the script or giving a short overview of how it works?


Here it is (warning: Perl): http://pastebin.com/MFmDzY0g

It calls "objdump -d" on the binary, which is binutils' disassembler.

objdump prints each instruction out in the following format:

     5dc:       67 80 7d 00 00          cmpb   $0x0,0x0(%ebp)
The first part is the instruction offset; the second is the bytes that comprise the instruction; the third part is the instruction in AT&T assembly syntax.

The script uses regexes to cut off the front and back, then counts the number of bytes (in hex format) in the second part.

Then it calculates the average for all instructions.
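For anyone who'd rather not read Perl, here's a rough Python sketch of the same approach (not the linked script; the function name and the regex are mine, written against objdump's default output layout, in which continuation lines of long instructions carry bytes but no mnemonic):

```python
import re
import subprocess
import sys

# Matches objdump -d lines like: " 5dc:\t67 80 7d 00 00 \tcmpb   $0x0,0x0(%ebp)"
LINE = re.compile(r"^\s*[0-9a-f]+:\t([0-9a-f]{2}(?: [0-9a-f]{2})*) *(\t\S.*)?$")

def average_insn_size(disasm):
    """Average instruction length, given the text output of `objdump -d`."""
    total_bytes = total_insns = 0
    for line in disasm.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        total_bytes += len(m.group(1).split())
        if m.group(2):  # a mnemonic is present, so a new instruction starts here
            total_insns += 1
    return total_bytes / total_insns

if __name__ == "__main__":
    disasm = subprocess.run(["objdump", "-d", sys.argv[1]],
                            capture_output=True, text=True, check=True).stdout
    print(average_insn_size(disasm))
```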


Might I suggest http://pastebin.com/2PHXi78A instead? It fixes a few bugs from the original and cleans up the nested loops.


You can get quite large code size savings by using the original 8 registers everywhere you can get away with 32-bit instructions - which is quite often, since they zero the upper 32 bits; e.g. XORL eax, eax will zero the full 64-bit rax. Compilers don't seem very good at this, but I find I can usually save 10% code size when coding assembly routines just by paying attention to instruction width and which registers are used where. With 64-bit instructions it doesn't matter which registers you use, since the REX prefix is needed anyway.
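To make the size difference concrete, here's a small sketch (the helper name is mine; opcodes and REX bit layout follow the Intel SDM) that hand-encodes `xor reg, reg` (opcode 0x31 /r). The REX byte is 0b0100WRXB: W selects 64-bit operand size, and R/B extend the ModRM reg/rm fields to reach r8-r15:

```python
def encode_xor_self(regnum, wide=False):
    """Encode `xor r, r` for register number 0-15; wide=True for 64-bit."""
    modrm = 0xC0 | ((regnum & 7) << 3) | (regnum & 7)  # mod=11, reg = rm = regnum
    rex = (0x40
           | (0x08 if wide else 0)          # REX.W: 64-bit operand size
           | (0x04 if regnum >= 8 else 0)   # REX.R extends the reg field
           | (0x01 if regnum >= 8 else 0))  # REX.B extends the rm field
    prefix = bytes([rex]) if (wide or regnum >= 8) else b""
    return prefix + bytes([0x31, modrm])

print(encode_xor_self(0).hex())             # xor %eax,%eax  -> 31c0   (2 bytes)
print(encode_xor_self(8).hex())             # xor %r8d,%r8d  -> 4531c0 (3 bytes)
print(encode_xor_self(0, wide=True).hex())  # xor %rax,%rax  -> 4831c0 (3 bytes)
```

So zeroing any of the low eight registers with a 32-bit xor is one byte shorter than the same operation on r8-r15 or the 64-bit form - which is exactly the saving the low registers buy you.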


Wouldn't you also have to look at the average number of instructions, not just the average instruction size? x86 instructions tend to do more, too.


The binary sizes are also similar when I last measured. Keep in mind that (a) ARM and AArch64 have quite a few addressing modes as well; (b) ARM has things like LDMIA/STMDB and AArch64's LDP/STP that compress function prologs and epilogs; (c) more registers means you don't have to spill as much; (d) three address instructions are often more compact than a MOV plus a two address instruction, which helps with typical register allocation algorithms.


Thanks for the info, to you and all the other replies!


Every time I've looked at a hot function across x86-64 and ARM(64), x86 needs more instructions too. Really the only thing x86 does more of in one instruction that compilers actually use is memory operands, and they'll readily split those if it reduces the total number of loads. Which is quite often. (Probably a bad choice, though, with modern cores.)

LEA too, if your arithmetic fits it... but then that has issue restrictions... Three-operand instructions and the various conditional stuff in ARM64 are worth more overall.


Well, to be fair, LEA does allow for more compact encoding than RISC add/shift/add if the operations actually fit the template. But it's not enough to overcome the waste elsewhere. Most of the benefit of LEA is that it's three-address rather than two and that it doesn't clobber the flags, but those are solutions to problems that RISC doesn't have in the first place.

Edit: Ah well, beaten by your edit. :)


Not necessarily. For example except for LEA, they do 2-operand arithmetic. The only thing where x86 wins hands down is addressing modes and immediate moves.


I'll grant you immediate moves, but not addressing modes. It lacks the auto-increment/update possibilities that were in the pdp-11, vax, mc68k, and power(pc), and are great for cutting down on instruction counts in loops. (Hell, on the CDC6600, address register updates were the only way you could do loads and stores; many good ideas didn't survive into the impoverished x86 era.)


HHVM's register allocator has a few tricks to minimize code size. Among other things it tends to prefer the old regs to the new x64 ones. As you might imagine code size is a significant challenge for Facebook, so these kinds of optimizations tended to give nice wins.


> As you might imagine code size is a significant challenge for Facebook

Maybe I'm missing something obvious, but why is code size a big concern for Facebook's server code?


"Because Facebook's entire code base is compiled down to a single binary executable, the company's deployment process is quite different from what you'd normally expect in a PHP environment. Rossi told me that the binary, which represents the entire Facebook application, is approximately 1.5GB in size."

http://arstechnica.com/business/2012/04/exclusive-a-behind-t...


It's not compiled ahead of time any more, but yeah, that gives you an idea of just how darn big the server side code is.


N.B. the "Volatile?" column is specific to the Windows calling conventions. Under the Sys V calling conventions (i.e. what the world outside of Redmond uses), RDI and RSI are volatile (and used for passing the first two integer arguments).


The OP perpetuates the mistaken assumption that x86-64 looks as it does because it extends good ol' 32-bit x86 encodings, which one might assume still work so well that one could run 32-bit code in 64-bit mode and have it still work.

Which is not the case at all. Those REX prefix bytes used to be perfectly good 32-bit x86 instructions that now simply don't work in 64-bit mode with their original encodings. So the "compatibility" between 32-bit and 64-bit modes is mythical -- the Opteron could have had a nice shiny new 64-bit programming model that was far less confusing than the dog's breakfast that is x86-64, but just didn't.


They probably wanted to re-use most of the x86-32 decoder that they had on the chip anyway.

Kind of strange; nowadays an embedded ARMv8 CPU will come with multiple decoders (ARM32, Thumb2, AArch64).


This. In today's world it is easy to support multiple instruction sets in the same silicon using the same registers and ALU. So why in the name of all that is holy does the x86 not have a nice clean 64-bit instruction set that you can swap in? It's all RISC under the hood anyway, why not give us access to it? The only explanation I can think of is that Intel wants to keep things complicated as a barrier to entry for competition.


> It's all RISC under the hood anyway, why not give us access to it?

Presumably because it doesn't matter that much, so it isn't worth the investment for Intel.

It feels to me like people have trouble accepting that both of these are true: (1) RISC vs. x86 doesn't matter that much in practice; (2) technically speaking, RISC is a superior design to x86.


> Presumably because it doesn't matter that much, so it isn't worth the investment for Intel.

Intel invested in a new instruction set (IA64). AMD did not.

The market chose AMD's offering.


And the market also chose ARM's AArch64 instruction set, despite the ARM installed base being wider than x86 and a backwards incompatible break from the past. The secret? Supporting both ISAs and switching instruction sets on exceptions.

The lesson I take from history is that there is no need to maintain ISA compatibility as long as the old mode is accessible on a process level.


>The secret? Supporting both ISAs and switching instruction sets on exceptions.

The secret is that Apple controls their hardware/appstore and android apps are ISA independent.


Whot?

Didn't the AMD64 architecture spank Itanium in the marketplace long before the iPhone became a big deal?


> It's all RISC under the hood anyway, why not give us access to it? The only explanation I can think of is that Intel wants to keep things complicated as a barrier to entry for competition.

It wasn't Intel but AMD that came up with x86-64. Intel (and HP) indeed wanted to give you a clean 64-bit design - Itanium. And yes, it was meant as a barrier to entry among other things. But AMD won because Itanic didn't run existing x86 software and generally turned out to be slower than Intel hyped.

As for the internal RISC, you won't get it because it changes from generation to generation (also between vendors) and these changes are part of why x86 keeps getting faster. Furthermore, AFAIK these internal ops aren't as dense as x86 to make them simpler to decode and fetching them from memory would make bandwidth more of a bottleneck.


Related to this is the PPC615, an IBM project in the mid-90s that was socket-compatible with the original Pentium and ran x86-32, PPC32, and PPC64 natively. (It got scrapped because all the signs looked like Intel was going to push IA-64 heavily and rely on its backwards compatibility for x86-32, hence an expectation that x86-32 was soon to be legacy.)


Do you have a reference for that? I've heard various things about the 615, one of which was that it wasn't ever a real project.


AFAIK the reason amd64 looks the way it does is Intel's lawyers.


That makes little sense, could you elaborate? AMD could have done anything they wanted with amd64 instruction encodings.


Well, not lawyers directly. More through lack of cooperation (due to competitive mentality).

http://www.agner.org/optimize/blog/read.php?i=25


I don't see how that matters in this case, AMD took the lead on AMD64 and Intel had to copy them afterward. They really could have done anything they wanted. As someone else here mentioned it was probably more to re-use the existing investment in x86 decoders.


Amd64 encoding, as is the case with most x86 encodings, is longer than it should be, due to the two companies not communicating/cooperating.

That is the one advantage that I think ARM has over x86/amd64: ARM encoding is much shorter.


You just repeated your point without adding anything new, that doesn't help your argument.


They couldn't use the shorter code space. So no, they could not do whatever they wanted. (note that making a completely new instruction encoding scheme would fail for obvious reasons)


I disagree, 64bit mode did not have to have any kind of instruction compatibility. It could have been a completely new instruction encoding scheme. A 64bit process must run all 64bit code, there is no invoking 32bit code from a 64bit process on any modern OS for AMD64.


And adding to that, they could've probably come up with some sort of 32-bit-per-instruction RISC structure where the makeup overlaps with the CISC x86 encoding. For example, pack the 16 bits of a normal x86 instruction into the first bits of the RISC word, then add a second source register and a bit or two to each operand to increase the number of registers.


Those instructions were hard to optimize because of partial flags stalls. In the end because of this and because of strength reduction compilers were not generating that many INC and DEC instructions.



I had a question about this statement in the article:

"The C calling convention on x86 systems specifies that callees need to save certain registers."

Practically speaking, is this the prologue that the C runtime (crt0.o) provides automatically/implicitly?


no, the compiler does this when generating code for functions and function calls

crt0.o is the glue between how the kernel loads a program into memory and how main() expects things to work. it does basic setup tasks that can vary between platforms, but is generally things like collecting command line arguments and setting up the stack. it will also invoke exit() if main() returns, since that's how the kernel expects the process to be destroyed


Thanks for the response.

When you say "how the kernel loads a program into memory" I assume you are referring to ld-linux.so.2? Is that correct?

I imagine then that ld-linux.so.2 calls __start in crt0.o and crt0.o jumps to main(). Is this correct?


http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.g... load_elf_binary()

maps the executable to memory, and if an elf interpreter (/lib/ld-linux.so) is specified (which it normally is) also the elf interpreter.

It then jumps to the entry point of the raw binary or of the elf interpreter.
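To make the PT_INTERP lookup concrete, here's a small stdlib-only sketch (the function name is mine; ELF64 little-endian only, offsets per the ELF spec) that pulls the interpreter path out of a binary's program headers, the same field load_elf_binary() consults:

```python
import struct

PT_INTERP = 3  # program header type for the interpreter path

def elf_interp(path):
    """Return the PT_INTERP path of a 64-bit little-endian ELF, or None."""
    with open(path, "rb") as f:
        data = f.read()
    assert data[:4] == b"\x7fELF" and data[4] == 2, "ELF64 only in this sketch"
    e_phoff, = struct.unpack_from("<Q", data, 0x20)            # program header table offset
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        p_type, = struct.unpack_from("<I", data, off)
        if p_type == PT_INTERP:
            p_offset, = struct.unpack_from("<Q", data, off + 8)
            p_filesz, = struct.unpack_from("<Q", data, off + 32)
            return data[p_offset:p_offset + p_filesz].rstrip(b"\0").decode()
    return None  # statically linked: no interpreter requested
```

On a typical dynamically linked Linux binary this returns something like /lib64/ld-linux-x86-64.so.2; on a static binary it returns None, which is exactly the branch load_elf_binary() takes when deciding whether to map an interpreter.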


Thanks, right: load_elf_binary() handles the ELF executable, and if there is a PT_INTERP header it maps the interpreter that header names as well.

It's easy to sometimes conceptually think or talk about "a loader" as if it's some standalone entity. However, that can be misleading (at least to me anyway), because in reality it's linked into every dynamically-linked binary, along with crt0.o, which, as the other poster mentioned, takes care of some ABI requirements and setting up the stack.

For anyone else who might be interested, this is also an illuminating source code file to read: libc-start, which would be part of crt0.o.

http://repo.or.cz/glibc.git/blob/HEAD:/csu/libc-start.c#l105

I don't have to think this low-level on any kind of regular basis, but it's a great thought exercise to do so from time to time.


> if an elf interpreter (/lib/ld-linux.so) is specified (which it normally is)

I think your "normally" means "in case of dynamically linked executables".



