LLVM Backend for the VideoCore4, Raspberry Pi 2 VPU (github.com/christinaa)
129 points by DHowett on May 2, 2016 | hide | past | favorite | 53 comments


It bugs me that LLVM targets are not properly 'plug in'.

LLVM backends can be in a plugin, but you have to tell the frontend which target to use (so it can use the correct data-layout etc). It's the frontends that don't support plug-in targets. Clang, for example, hardcodes the supported targets.

A combination of using enums as a 'target triple' and lack of dynamic target registration conspires against easy-to-use out-of-tree targets :(


Out-of-tree targets for LLVM are generally a bad idea given how fast LLVM changes. Every change could break the out-of-tree target, and being out-of-tree, there's no way the upstream developers could notice that.


The other side of this approach is that when the mainline developers do not have much love for a particular platform, it is dropped altogether with little to no remorse - see the fate of Microblaze support, or even the good ol' C backend.

I wish we had a more stable API (maybe with a transformation layer) for the backends that probably do not need many of the new features. If LLVM had a maintained transformation layer from the "current" SelectionDAG to a stable one, it would have helped a lot.


"A combination of using enums as a 'target triple' and lack of dynamic target registration conspires against easy-to-use out-of-tree targets :("

Have you considered this is a deliberate decision, with a goal of having most targets in-tree? :)


Which turns developing a large backend into a lot of pointless busywork, with patchsets touching parts of the codebase that really ought not to need touching at all :(

(I'm someone who was stuck on 3.6 for ages because of the pain of keeping up with the trunk)


Why is this beneficial? It means you need to change several repositories to add one target triple. This may be a win in terms of in-tree stability, but it reduces accessibility to everyone else.


It is beneficial in that it discourages proprietary LLVM backends that are not contributed back to LLVM. It is the same benefit GCC enjoys by using the GPL.


Does Apple have some proprietary backends? I guess they have the resources to maintain such.


Yes, they definitely do.


Like what, nowadays? I know their ARM64 backend used to be, but which ones are now? Do they still maintain forks of them (with more target data for their SoCs)?


Swift is still out-of-tree.


GPUs.


There is a reason proprietary backends will always be necessary.

Hardware companies are very touchy about patents, and not publishing ISA-related work is often the only way to protect themselves from the trolls.

Such companies would often encourage (privately or publicly) clean room 3rd party backends, and I'm pretty sure that the one we're discussing here is exactly of this kind. But they would be scared to death to allow any of their employees to even send a tiny patch to a compiler backend, if it could expose some microarchitecture knowledge that was not published.


Sounds like the LLVM project should be using the GPL.

This constant churn kind of turns the MIT license into the GPL, but only for small companies and individuals who can't keep up with it. I guess that is why Apple likes LLVM.


It depends what you mean by "plugin": targets are configured at compile time. I don't think you can build LLVM and then dynamically load a backend as plugin.


You can load targets as plugins in llc :)

You cannot load targets as plugins in clang :(


What can you do with this?


To elaborate, I made this to develop an open source VPU-side bootloader for the Raspberry Pi because I was unhappy with the state of other C compilers targeting VC4 (I explained why in my blog). I haven't had the time to work on my firmware recently due to IRL events, so I decided to publish the compiler.

I'm not publishing any of the firmware work yet since it can't boot ARM (though SDRAM init works reliably across all boards). Once I get ARM working to some extent, I'll probably clean the code up and publish it too.


Thanks for making this work! I'd given up on the LLVM backend completely due to showstopping issues I couldn't resolve, and had assumed it had vanished into the aether. Glad to know you found it useful!

Regarding the boot loader, have you seen my piface boot loader? http://cowlark.com/piface/ It sounds like what you have is way more sophisticated, as I never got SDRAM init working properly, but you never know --- there might be something useful there.

I was actually considering having another go at a VC4 compiler, this time with pcc; there's an OS I want to port. I'm really happy to know that I no longer have to!


Yeah I looked at it, my firmware mostly just aims at bringing up enough stuff to be able to boot ARM, which is essentially just:

  - Setting up exception vectors and enabling exceptions
  - Reclocking VPU from PLLC
  - UART initialization
  - SDRAM initialization
  - Copying an ARM blinker stub to 0x0
  - ARM power domain initialization
  - PLLB initialization
  - Enabling passthrough mapping for ARM
  - ARM AXI interface initialization
I'm still trying to figure out what I'm missing in order to get it to work but I suspect it's related to not properly setting up the ARM PLL. I was going to port the RPi clock management driver from Linux but I don't have the time at the moment.

As far as the compiler goes, I think it works reasonably well, though I've only tested it by compiling my own code with it; there may be things that cause it to error out (anything that involves the frame pointer, like VLAs or some C++ features). Code quality is also not ideal, since it still doesn't make use of conditional instructions (aside from conditional branches) and doesn't implement AnalyzeBranch to eliminate redundant branches.

I started cleaning up TableGen to turn multi-instruction asm prints into glue DAGs in SelDAGtoDAG, but I still have to do it for about 4 instructions, which is pretty much a requirement for MC code emission.


It's been too long and I've forgotten too much about how LLVM works. The gcc port I linked to, BTW, claims to have VC4 binutils.

Random question: how did you do 64 bit arithmetic? I couldn't find any kind of add-with-carry or subtract-with-carry instruction. I was semi-resigned to having to do a compare-and-test as well as the add, which would have more than doubled the amount of work.

Also, nice work reverse engineering the ARM controller --- how did you get the info?


...hilariously, I've just checked my email, and I see a two-day old message from someone who's just got my previous attempt at a VC4 compiler working. This one was with gcc:

https://github.com/puppeh/vc4-toolchain


This is awesome! It seems like an important series of steps towards having a legitimately-open educational platform, rather than the somewhat asterisk-laden one that the Pi currently represents.


Great work. Very interested to see what you have going on the firmware side.


I outlined what I did in another comment, here's a log from my current firmware: http://crna.cc/vpu_bootlog.txt

I think I'm on the right track but I don't have ARM working yet, most likely due to clock misconfiguration. Can probably fix it when I have more time.

Sidenote, I wish #raspberrypi-internals was more active :(


Will you be getting this merged into LLVM upstream?


Don't think I can without first implementing MC emission and running LLVM unit tests on it.


This has some additional info from the author: http://crna.cc/


I get that it's a compiler backend for the GPU of the Raspi, but what does that mean? I mean, can you compile arbitrary C code to run on the GPU? Is this for making GPU firmware? Drivers? Games? Bitcoin miners?


Any and all of the above.

Although the VideoCore in the Pi is particularly interesting because it's responsible for the early stages of the Pi's boot process (the ARM cores are actually turned off when you initially apply power). Right now, that's all a big Broadcom-proprietary binary chunk; good compilers would be the first step towards freeing that code.


Whatever you want. It runs on each GPU core. You can do a lot of DSP for example.


Lessen the burden on the relatively weak ARM core used as CPU. The VPU seems very potent.


The RPi can handle playing back video (1080p) very well. Raspbian comes with omxplayer which uses the GPU to play video.


I know, I meant for other tasks (people mentioned DSP like logic).


VC4 is used in all the Raspberry Pi boards, not just the RPi 2.


You can run C code on GPUs? I don't know anything about GPUs, but I thought they had a completely different programming model that requires different programming languages and paradigms.


VPU is not a GPU, it was designed for stuff like video decoding. It is just a simple RISC with some SIMD instructions.

QPU is a different beast; in VC4 it cannot even run arbitrary C code. In VC5 it can, but inefficiently.


Yes and no. You can definitely run arbitrary C code on AMD's GCN ISA if you're so inclined (and I expect every other modern GPU, but I know less about those...). There are all the usual assembly instructions, there are pointers, you can implement a stack, and so on. That doesn't mean that arbitrary C code will run fast :)

To fully use the computational power of the GPU, you have to make use of its parallelism. That means dealing with the fact that you can have hundreds or thousands of "waves" (things with a register file and a program counter) in flight simultaneously, and each "wave" corresponds to many (in AMD's case, 64) threads in the conventional sense.

It is the last part that makes the biggest difference compared to regular CPUs, because it changes how you have to think about control flow. If/else-statements must be compiled in such a way that the wave goes through both branches if the threads in the wave branch differently (if all threads branch the same way, you can of course skip the other branch).

The first part makes a big difference as well, of course. GPUs care far less about single-threaded performance, so there is no out-of-order or speculative execution, and the memory latency is high. When a wave has to wait, the latency is made up for by scheduling another wave instead. That is, there is a high level of what is called "hyper-threading" on the CPU.


They do. Or, at least they did. But their capabilities have been improving rapidly for a couple of decades now, to the point that shader compilers are integrating recent C++ features and some degree of straight-up C++ support. You won't be able to magically run classic, single-threaded programs fast. But you can start using advanced features you already know from CPU programming to write programs for the GPU.


Yup. Check out the VideoCore ISA:

http://www.broadcom.com/docs/support/videocore/VideoCoreIV-A...

Note that in this case, the author is writing the bootloader firmware so performance isn't a major concern, though.


This isn't the same; the document in question covers the QPU ISA. The VPU ISA hasn't been officially documented, but there have been many projects that involved reverse engineering it. The VC4 ISA is documented here:

https://github.com/hermanhermitage/videocoreiv

The VPU is basically a general purpose RISC processor with some fancy vector instructions on top. In fact, most of the firmware that runs on it is written in C.


Ah okay, this explains a lot! I somehow thought that the initial boot happened on the QPU.



Yeah, I saw. I haven't looked at it in detail, but I think yours probably works better than mine since you did comprehensive testing. My only tests involved compiling my own firmware code, but from what I can tell it works well; I haven't run into any bugs yet aside from what I outlined in the README.

The assembler/linker I'm using is not ideal; I want to get MC code emission working eventually. I saw that you mentioned limitations on ld/st - why not use lea for data?

For example:

  BB1_12:                               # %sdram_clkman_update_end.exit2
    mov r0, 2114982312 # long
    ld r2, (r0)
    lea r0, .str8(pc) # PCrel load
    lea r1, __FUNCTION__.sdram_init_late(pc) # PCrel load
    bl xprintf


No reason I guess, except it means an extra instruction. I thought about adding a flag for "large model" compilation, but it's not been a priority so far.


How is this different than using gcc like they describe in this book https://jan.newmarch.name/RPi/?


The LLVM backend that has been written generates code that runs directly on the VideoCore GPU, which also handles the early boot process. Your link is concerned with code that runs on the ARM core and interfaces with the GPU via the existing code running on the GPU.


Could this be used to give more direct access to the Raspberry Pi Camera subsystem?

Would appreciate any pointers, no pun intended initially.


Outstanding!

Is it possible to use the LLVMLinux patches to run the kernel directly on the VC4?


VC4 itself doesn't have an MMU per se; it supports only very limited memory remap (like PPC BATs), so running a conventional kernel on it is probably not possible. The best bet would be to port an RTOS to it, but my current plan for my firmware pretty much involves halting the VPU once the ARM is running.


There's much more to porting Linux successfully than just having a working compiler for the target architecture.


Does this have OpenCL support now?


No.



