LLVM Backend for the VideoCore4, Raspberry Pi 2 VPU (github.com/christinaa)
129 points by DHowett on May 2, 2016 | hide | past | favorite | 53 comments


It bugs me that LLVM targets are not properly 'plug in'.

LLVM backends can be in a plugin, but you have to tell the frontend which target to use (so it can use the correct data-layout etc). It's the frontends that don't support plug-in targets. Clang, for example, hardcodes the supported targets.

A combination of using enums as a 'target triple' and lack of dynamic target registration conspires against easy-to-use out-of-tree targets :(


Out-of-tree targets for LLVM are generally a bad idea given how fast LLVM changes. Every change could break the out-of-tree target, and being out-of-tree, there's no way the upstream developers could notice that.


The other side of this approach is that when the mainline developers do not have much love for a particular platform, it is dropped altogether with little to no remorse - see the fate of Microblaze support, or even the good ol' C backend.

I wish we had a more stable API (maybe with a transformation layer) for the backends that probably do not need many of the new features. If LLVM had a maintained transformation layer from the "current" SelectionDAG to a stable one, it would have helped a lot.


"A combination of using enums as a 'target triple' and lack of dynamic target registration conspires against easy-to-use out-of-tree targets :("

Have you considered this is a deliberate decision, with a goal of having most targets in-tree? :)


Which turns developing a large backend into a lot of pointless busywork, with patchsets touching parts of the codebase that really ought not to need touching at all :(

(I'm someone who was stuck on 3.6 for ages because of the pain of keeping up with the trunk)


Why is this beneficial? It means you need to change several repositories to add one target triple. This may be a win in terms of in-tree stability, but it reduces accessibility to everyone else.


It is beneficial in that it discourages proprietary LLVM backends that are not contributed back to LLVM. It is the same benefit GCC enjoys by using the GPL.


Does Apple have some proprietary backends? I guess they have the resources to maintain such.


Yes, they definitely do.


Like what, nowadays? I know their ARM64 backend used to be, but which ones are now? Do they still maintain forks of them (with more target data for their SoCs)?


Swift is still out-of-tree.


GPUs.


There is a reason proprietary backends will always be necessary.

Hardware companies are very touchy about patents, and not publishing ISA-related work is often the only way to protect themselves from the trolls.

Such companies would often encourage (privately or publicly) clean room 3rd party backends, and I'm pretty sure that the one we're discussing here is exactly of this kind. But they would be scared to death to allow any of their employees to even send a tiny patch to a compiler backend, if it could expose some microarchitecture knowledge that was not published.


Sounds like the LLVM project should be using the GPL.

This constant churn kind of turns the MIT license into the GPL, but only for small companies and individuals who can't keep up with it. I guess that is why Apple likes LLVM.


It depends what you mean by "plugin": targets are configured at compile time. I don't think you can build LLVM and then dynamically load a backend as plugin.


You can load targets as plugins in llc :)

You cannot load targets as plugins in clang :(


What can you do with this?


To elaborate, I made this to develop an open source VPU-side bootloader for the Raspberry Pi because I was unhappy with the state of other C compilers targeting VC4 (I explained why in my blog). I haven't had the time to work on my firmware recently due to IRL events, so I decided to publish the compiler.

I'm not publishing any of the firmware work yet since it can't boot ARM (though SDRAM init works reliably across all boards). Once I get ARM working to some extent, I'll probably clean the code up and publish it too.


Thanks for making this work! I'd given up on the LLVM backend completely due to showstopping issues I couldn't resolve, and had assumed it had vanished into the aether. Glad to know you found it useful!

Regarding the boot loader, have you seen my piface boot loader? http://cowlark.com/piface/ It sounds like what you have is way more sophisticated, as I never got SDRAM init working properly, but you never know --- there might be something useful there.

I was actually considering having another go at a VC4 compiler, this time with pcc; there's an OS I want to port. I'm really happy to know that I no longer have to!


Yeah I looked at it, my firmware mostly just aims at bringing up enough stuff to be able to boot ARM, which is essentially just:

  - Setting up exception vectors and enabling exceptions
  - Reclocking VPU from PLLC
  - UART initialization
  - SDRAM initialization
  - Copying an ARM blinker stub to 0x0
  - ARM power domain initialization
  - PLLB initialization
  - Enabling passthrough mapping for ARM
  - ARM AXI interface initialization
I'm still trying to figure out what I'm missing in order to get it to work but I suspect it's related to not properly setting up the ARM PLL. I was going to port the RPi clock management driver from Linux but I don't have the time at the moment.

As far as the compiler goes, I think it works reasonably well, though I've only tested it by compiling my own code with it; there may be things that cause it to error out (anything that involves the frame pointer, like VLAs or some C++ features). Code quality is also not ideal, since it still doesn't make use of conditional instructions (aside from conditional branches) and doesn't implement AnalyzeBranch to eliminate redundant branches.

I started cleaning up TableGen to turn multi-instruction asm prints into glue DAGs in SelDAGtoDAG, but I still have to do it for about 4 instructions, which is pretty much a requirement for MC code emission.


It's been too long and I've forgotten too much about how LLVM works. The gcc port I linked to, BTW, claims to have VC4 binutils.

Random question: how did you do 64 bit arithmetic? I couldn't find any kind of add-with-carry or subtract-with-carry instruction. I was semi-resigned to having to do a compare-and-test as well as the add, which would have more than doubled the amount of work.

Also, nice work reverse engineering the ARM controller --- how did you get the info?


...hilariously, I've just checked my email, and I see a two-day old message from someone who's just got my previous attempt at a VC4 compiler working. This one was with gcc:

https://github.com/puppeh/vc4-toolchain


This is awesome! It seems like an important series of steps towards having a legitimately-open educational platform, rather than the somewhat asterisk-laden one that the Pi currently represents.


Great work. Very interested to see what you have going on the firmware side.


I outlined what I did in another comment, here's a log from my current firmware: http://crna.cc/vpu_bootlog.txt

I think I'm on the right track but I don't have ARM working yet, most likely due to clock misconfiguration. Can probably fix it when I have more time.

Sidenote, I wish #raspberrypi-internals was more active :(


Will you be getting this merged into LLVM upstream?


Don't think I can without first implementing MC emission and running LLVM unit tests on it.


This has some additional info from the author: http://crna.cc/


I get that it's a compiler backend for the GPU of the Raspi, but what does that mean? I mean, can you compile arbitrary C code to run on the GPU? Is this for making GPU firmware? Drivers? Games? Bitcoin miners?


Any and all of the above.

Although the VideoCore in the Pi is particularly interesting because it's responsible for the early stages of the Pi's boot process (the ARM cores are actually turned off when you initially apply power). Right now, that's all a big Broadcom-proprietary binary chunk; good compilers would be the first step towards freeing that code.


Whatever you want. It runs on each GPU core. You can do a lot of DSP for example.


Lessen the burden on the relatively weak ARM core used as CPU. The VPU seems very potent.


The RPi can handle playing back video (1080p) very well. Raspbian comes with omxplayer which uses the GPU to play video.


I know, I meant for other tasks (people mentioned DSP like logic).


VC4 is used in all the Raspberry Pi boards, not just the RPi 2.


You can run C code on GPUs? I don't know anything about GPUs, but I thought they had a completely different programming model that requires different programming languages and paradigms.


VPU is not a GPU, it was designed for stuff like video decoding. It is just a simple RISC with some SIMD instructions.

QPU is a different beast; in VC4 it cannot even run arbitrary C code. In VC5 it can, but inefficiently.


Yes and no. You can definitely run arbitrary C code on AMD's GCN ISA if you're so inclined (and I expect every other modern GPU, but I know less about those...). There are all the usual assembly instructions, there are pointers, you can implement a stack, and so on. That doesn't mean that arbitrary C code will run fast :)

To fully use the computational power of the GPU, you have to make use of its parallelism. That means dealing with the fact that you can have hundreds or thousands of "waves" (things with a register file and a program counter) in flight simultaneously, and each "wave" corresponds to many (in AMD's case, 64) threads in the conventional sense.

It is the last part that makes the biggest difference compared to regular CPUs, because it changes how you have to think about control flow. If/else-statements must be compiled in such a way that the wave goes through both branches if the threads in the wave branch differently (if all threads branch the same way, you can of course skip the other branch).

The first part makes a big difference as well, of course. GPUs care far less about single-threaded performance, so there is no out-of-order or speculative execution, and the memory latency is high. When a wave has to wait, the latency is made up for by scheduling another wave instead. That is, there is a high level of what is called "hyper-threading" on the CPU.


They do. Or, at least they did. But their capabilities have been improving rapidly for a couple of decades now, to the point that shader compilers are integrating recent C++ features and some degree of straight-up C++ support. You won't be able to magically run classic, single-threaded programs fast. But you can start using advanced features you already know from CPU programming to write programs for the GPU.


Yup. Check out the VideoCore ISA:

http://www.broadcom.com/docs/support/videocore/VideoCoreIV-A...

Note that in this case, the author is writing the bootloader firmware so performance isn't a major concern, though.


This isn't the same; the document in question covers the QPU ISA. The VPU ISA hasn't been officially documented, but there have been many projects that involved reverse engineering it. The VC4 ISA is documented here:

https://github.com/hermanhermitage/videocoreiv

The VPU is basically a general purpose RISC processor with some fancy vector instructions on top. In fact, most of the firmware that runs on it is written in C.


Ah okay, this explains a lot! I somehow thought that the initial boot happened on the QPU.



Yeah, I saw. I haven't looked at it in detail, but I think yours probably works better than mine since you did comprehensive testing. My only tests involved compiling my own firmware code, but from what I can tell it works well; I haven't run into any bugs yet aside from what I outlined in the README.

The assembler/linker I'm using is not ideal; I want to get MC code emission working eventually. I saw that you mentioned limitations on ld/st - why not use lea for data?

For example:

  BB1_12:                               # %sdram_clkman_update_end.exit2
    mov r0, 2114982312 # long
    ld r2, (r0)
    lea r0, .str8(pc) # PCrel load
    lea r1, __FUNCTION__.sdram_init_late(pc) # PCrel load
    bl xprintf


No reason I guess, except it means an extra instruction. I thought about adding a flag for "large model" compilation, but it's not been a priority so far.


How is this different than using gcc like they describe in this book https://jan.newmarch.name/RPi/?


The LLVM backend that has been written generates code that runs directly on the VideoCore GPU, which also handles the early boot process. Your link is concerned with code that runs on the ARM core and interfaces with the GPU via the existing code running on the GPU.


Could this be used to give more direct access to the Raspberry Pi Camera subsystem?

Would appreciate any pointers, no pun intended initially.


Outstanding!

Is it possible to use the LLVMLinux patches to run the kernel directly on the VC4?


VC4 itself doesn't have an MMU per se; it supports only very limited memory remap (like PPC BATs), so running a conventional kernel on it is probably not possible. The best bet would be to port an RTOS to it, but my current plan for my firmware pretty much involves halting the VPU once the ARM is running.


There's much more to porting Linux successfully than just having a working compiler for the target architecture.


Does this have OpenCL support now?


No.



