LLVM doesn't actually prefer the lower registers when doing register allocation. IIRC, sunfish told me that GCC doesn't either. It would be interesting to add features that try to minimize code size to the register allocator, but no compiler I know of actually does this.
Partially as a consequence of this, the REX prefixes take up a lot of space in most x86-64 instruction streams. In fact, the average size of each instruction is almost exactly 4 bytes, exactly the same as in classic 32-bit RISC architectures. (This is why I dislike it when people link to that old Linus post about how x86 is better than RISC architectures because of code size; it may have been true then, but not now.)
The greedy register allocator uses the CostPerUse hints to prefer the lower registers, but it has to balance many optimization goals at once, so it is not always obvious what is going on. For example, it will:
- Defer using a callee-saved register for the first time until it is worth the cost of spilling it in the prologue. It won't create a prologue spill just to use a better register.
- Try to reuse function argument and return value registers to minimize register copies around function calls.
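To make the first trade-off concrete, here's a toy sketch of the kind of break-even test the allocator is making. The costs and names here are invented for illustration; the real heuristic is more involved:

```python
# Toy model of the callee-saved trade-off described above. Costs are
# invented: the allocator weighs the spill code it would otherwise emit
# against the one-time save/restore in the prologue/epilogue.
PROLOGUE_SPILL_COST = 2  # one push in the prologue plus one pop in the epilogue

def worth_using_callee_saved(spills_avoided, cost_per_spill=1):
    # Only claim a callee-saved register once the spill code it avoids
    # outweighs the fixed cost of saving and restoring it.
    return spills_avoided * cost_per_spill > PROLOGUE_SPILL_COST
```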
IIRC, the CostPerUse attribute gave a code size savings of around 1% on x86-64 and Thumb2. On both architectures, the effect is reduced by the fact that all register operands must be from the low register set before you can use the smaller encoding.
It's been about ten years since I did any kernel work, but last I checked Linux doesn't use most of the fancy features of the CPU to avoid trashing more registers than necessary. If you know you don't do any floating point operations, for example, you know you don't need to save/restore the floating point registers on interrupts. The same might go for things like SSE.
Those instructions tend to be larger, so it may partly explain the difference in average instruction size. The same explanation would apply to why the Radeon driver has a higher average.
It calls "objdump -d" on the binary, which is binutils' disassembler.
objdump prints each instruction out in the following format:
5dc: 67 80 7d 00 00 cmpb $0x0,0x0(%ebp)
The first part is the instruction offset; the second is the bytes that comprise the instruction; the third part is the instruction in AT&T assembly syntax.
The script uses regexes to cut off the front and back, then counts the number of bytes (in hex format) in the second part.
Then it calculates the average for all instructions.
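A minimal version of such a script might look like this (a sketch of the approach described above; the actual script may differ in details):

```python
import re

# Match lines like: " 5dc:\t67 80 7d 00 00 \tcmpb $0x0,0x0(%ebp)"
# Group 1 captures the column of hex bytes between the offset and the mnemonic.
LINE_RE = re.compile(r'^\s*[0-9a-f]+:\s+((?:[0-9a-f]{2}\s)+)')

def average_insn_size(objdump_lines):
    # Count the hex byte pairs on each matching line, then average.
    sizes = [len(m.group(1).split())
             for m in map(LINE_RE.match, objdump_lines) if m]
    return sum(sizes) / len(sizes)
```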
You can get quite large code size savings by using the original 8 registers everywhere you can get away with 32-bit instructions - which is quite often, since writing a 32-bit register zeroes the upper 32 bits. E.g. XORL eax, eax will zero the full 64-bit rax. Compilers don't seem very good at this, but I find I can usually save 10% code size when coding assembly routines just by paying attention to instruction width and which registers are used where. With 64-bit instructions it doesn't matter which registers you use; the REX prefix is needed anyway.
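For illustration, here are the encodings of that zeroing idiom in three forms (the bytes follow the x86-64 encoding rules; the snippet just tabulates them):

```python
# Encodings of the register-zeroing idiom. Writing a 32-bit register
# implicitly zeroes the upper 32 bits, so the first form is sufficient.
encodings = {
    "xor %eax,%eax": [0x31, 0xC0],        # 2 bytes: low register, no prefix
    "xor %rax,%rax": [0x48, 0x31, 0xC0],  # 3 bytes: REX.W forces 64-bit width
    "xor %r8d,%r8d": [0x45, 0x31, 0xC0],  # 3 bytes: REX needed just to name r8d
}
for insn, insn_bytes in sorted(encodings.items()):
    print(f"{insn:16} {len(insn_bytes)} bytes")
```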
The binary sizes are also similar when I last measured. Keep in mind that (a) ARM and AArch64 have quite a few addressing modes as well; (b) ARM has things like LDMIA/STMDB and AArch64's LDP/STP that compress function prologs and epilogs; (c) more registers means you don't have to spill as much; (d) three address instructions are often more compact than a MOV plus a two address instruction, which helps with typical register allocation algorithms.
Every time I've looked at a hot function across x86-64 and arm(64), x86 needs more instructions too. Really the only thing x86 does more in one instruction that compilers actually use is memory operands, but they'll readily split them if it reduces the total number of loads - which is quite often (probably a bad trade-off on modern cores, though).
LEA too, if your arithmetic fits its template... but then LEA has issue restrictions... The three-operand instructions and the various conditional stuff in arm64 are worth more overall.
Well, to be fair, LEA does allow for more compact encoding than RISC add/shift/add if the operations actually fit the template. But it's not enough to overcome the waste elsewhere. Most of the benefit of LEA is that it's three-address rather than two and that it doesn't clobber the flags, but those are solutions to problems that RISC doesn't have in the first place.
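As a toy model (plain Python, with register values as ordinary variables), this is the computation a single LEA folds together versus the RISC-style sequence:

```python
def lea(base, index, scale, disp):
    # lea disp(%base,%index,scale), %dst: one three-operand instruction
    # that also leaves the flags alone.
    return base + index * scale + disp

def risc_equiv(base, index, scale_log2, disp):
    # Shift + add + add: three instructions, but each is already
    # three-address, so RISC never had the two-address problem LEA solves.
    t = index << scale_log2
    t += base
    return t + disp

# Both compute the same address.
assert lea(0x1000, 3, 8, 4) == risc_equiv(0x1000, 3, 3, 4)
```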
Not necessarily. For example, except for LEA, x86 does 2-operand arithmetic. The only things where x86 wins hands down are addressing modes and immediate moves.
I'll grant you immediate moves, but not addressing modes. It lacks the auto-increment/update possibilities that were in the PDP-11, VAX, MC68k, and POWER(PC), and are great for cutting down on instruction counts in loops. (Hell, on the CDC 6600, address register updates were the only way you could do loads and stores; many good ideas didn't survive into the impoverished x86 era.)
HHVM's register allocator has a few tricks to minimize code size. Among other things it tends to prefer the old regs to the new x64 ones. As you might imagine code size is a significant challenge for Facebook so these kind of optimizations tended to give nice wins.
"Because Facebook's entire code base is compiled down to a single binary executable, the company's deployment process is quite different from what you'd normally expect in a PHP environment. Rossi told me that the binary, which represents the entire Facebook application, is approximately 1.5GB in size."
N.B. the "Volatile?" column is specific to the Windows calling conventions. Under the Sys V calling conventions (i.e. what the world outside of Redmond uses), RDI and RSI are volatile (and used for passing the first two integer arguments).
The OP perpetuates the mistaken assumption that x86-64 looks as it does because it extends good ol' 32-bit x86 encodings, which one might assume still work so well that one could run 32-bit code in 64-bit mode and have it still work.
Which is not the case at all. Those REX prefix bytes used to be perfectly good 32-bit x86 instructions that now simply don't work in 64-bit mode with their original encodings. So the "compatibility" between 32-bit and 64-bit modes is mythical -- the Opteron could have had a nice shiny new 64-bit programming model that was far less confusing than the dog's breakfast that is x86-64, but just didn't.
This. In today's world it is easy to support multiple instruction sets in the same silicon using the same registers and ALU. So why in the name of all that is holy does the x86 not have a nice clean 64-bit instruction set that you can swap in? It's all RISC under the hood anyway, why not give us access to it? The only explanation I can think of is that Intel wants to keep things complicated as a barrier to entry for competition.
> It's all RISC under the hood anyway, why not give us access to it?
Presumably because it doesn't matter that much, so it isn't worth the investment for Intel.
It feels to me like people have trouble accepting that both of these are true: (1) RISC vs. x86 doesn't matter that much in practice; (2) technically speaking, RISC is a superior design to x86.
And the market also chose ARM's AArch64 instruction set, despite the ARM installed base being wider than x86's and AArch64 being a backwards-incompatible break from the past. The secret? Supporting both ISAs and switching instruction sets on exceptions.
The lesson I take from history is that there is no need to maintain ISA compatibility as long as the old mode is accessible on a process level.
> It's all RISC under the hood anyway, why not give us access to it? The only explanation I can think of is that Intel wants to keep things complicated as a barrier to entry for competition.
It wasn't Intel but AMD that came up with x86-64. Intel (and HP) indeed wanted to give you a clean 64-bit design - Itanium. And yes, it was meant as a barrier to entry among other things. But AMD won because Itanic didn't run existing x86 software and generally turned out to be slower than Intel hyped.
As for the internal RISC, you won't get it because it changes from generation to generation (also between vendors) and these changes are part of why x86 keeps getting faster. Furthermore, AFAIK these internal ops aren't as dense as x86 to make them simpler to decode and fetching them from memory would make bandwidth more of a bottleneck.
Related to this is the PPC615, an IBM project from the mid-90s that was socket-compatible with the original Pentium and ran x86-32, PPC32, and PPC64 natively. (It got scrapped because all the signs suggested Intel was going to push IA-64 heavily and rely on its backwards compatibility for x86-32, hence an expectation that x86-32 would soon be legacy.)
I don't see how that matters in this case, AMD took the lead on AMD64 and Intel had to copy them afterward. They really could have done anything they wanted. As someone else here mentioned it was probably more to re-use the existing investment in x86 decoders.
They couldn't use the shorter code space. So no, they could not do whatever they wanted. (note that making a completely new instruction encoding scheme would fail for obvious reasons)
I disagree; 64-bit mode did not have to have any kind of instruction compatibility. It could have been a completely new instruction encoding scheme. A 64-bit process must run all-64-bit code; there is no invoking 32-bit code from a 64-bit process on any modern OS for AMD64.
And adding to that, they could've probably come up with some sort of 32-bit-per-instruction RISC structure whose makeup overlaps with the CISC x86 encoding. For example, pack the 16 bits of a normal x86 instruction into the first bits of the RISC word, then add a second source register and a bit or two to each operand to increase the number of registers.
Those instructions were hard to optimize because of partial flags stalls. In the end, because of this and because of strength reduction, compilers were not generating that many INC and DEC instructions anyway.
No, the compiler does this when generating code for functions and function calls.
crt0.o is the glue between how the kernel loads a program into memory and how main() expects things to work. It does basic setup tasks that can vary between platforms, but it's generally things like collecting command-line arguments and setting up the stack. It will also invoke exit() if main() returns, since that's how the kernel expects the process to be destroyed.
Thanks, right, if there is a PT_INTERP header in the binary that specifies ELF then load_elf_binary() will be called.
It's easy to sometimes conceptually think or talk about "a loader" as if it's some standalone entity. However, this can be misleading (at least to me, anyway), because in reality it's linked into every dynamically-linked binary, along with crt0.o, which, as the other poster mentioned, takes care of some ABI requirements and setting up the stack.
For anyone else who might be interested, libc-start is also an illuminating source code file to read; it would be part of crt0.o.