The trouble with SPIR-V, 2022 edition

my123 · on May 23, 2022

> Modern CUDA uses explicit programmer-managed masks, which is powerful and takes advantage of their hardware specifics. But misusing the mask can cause a deadlock, as divergent threads could simply never participate in a subgroup operation that expects them to, leaving the other threads to block forever. I can see why this solution leaves something to be desired, as it just offloads the problem and the risk of misuse to the user.

Note that from Volta onwards (2017; in consumer GPUs since Turing in 2018), Independent Thread Scheduling is present, with a separate instruction pointer per SIMT thread.

This makes it possible to use atomics across different lanes of the same warp, which provides the guarantees assumed by the C++ memory model. Quite a few modern CUDA apps are starting to rely on that, and so will not work on Pascal or earlier, never mind other GPU vendors.

Cooperative Groups are very flexible in CUDA too.

https://docs.nvidia.com/cuda/volta-tuning-guide/index.html#s...

As such, control flow is handled very differently on post-Volta GPUs compared to pre-Volta ones, with pre-Volta more akin to what AMD still does today.

g0b · on May 24, 2022

I'm in fact talking about post-Volta hardware there, but this is not about forward progress. I meant that using __ballot_sync() and getting it wrong (i.e. waiting on the __activemask() taken outside an if, but only in one branch of the if, so that some of the threads never participate in the sync) will deadlock the GPU.

It's a powerful abstraction (since statically _different_ locations can sync with each other), but also a risky one to expose, compared to GLSL, where it's impossible to deadlock anything by using subgroup intrinsics.
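
To make the failure mode concrete, here is a toy host-side C++ model. It is not CUDA and does not claim to describe how the hardware works: it assumes a hypothetical 32-lane warp and treats a masked subgroup operation as completing only once every lane named in the mask has arrived. The names `lanes_in_branch` and `sync_completes` are made up for this sketch.

```cpp
#include <cstdint>

// Toy model, NOT the CUDA API: one bit per lane of a hypothetical 32-lane warp.
using LaneMask = std::uint32_t;

// Lanes that actually enter a branch: the active lanes whose predicate is true.
constexpr LaneMask lanes_in_branch(LaneMask active, LaneMask predicate) {
    return active & predicate;
}

// Assumption of this sketch: a masked sync completes only if every lane named
// in `expected` shows up. Any expected lane that never arrives blocks the rest.
constexpr bool sync_completes(LaneMask expected, LaneMask arrived) {
    return (expected & ~arrived) == 0;
}
```

With a mask of 0xFFFFFFFF captured before the if and only lanes 0..15 taking the branch, `sync_completes(0xFFFFFFFF, lanes_in_branch(0xFFFFFFFF, 0x0000FFFF))` is false, which is the modelled deadlock. Passing the mask of lanes that are really inside the branch as `expected` makes it true.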

my123 · on May 24, 2022

That's indeed quite a raw abstraction, but it is way too powerful performance-wise not to expose...

g0b · on May 24, 2022

Perhaps it makes sense for CUDA to expose it, but it certainly can't make sense for SPIR-V, which has to work on a variety of hardware, most of which doesn't do ITS (Independent Thread Scheduling).

atq2119 · on May 24, 2022

The statement about C++ is somewhat misleading. C++ has multiple notions of forward progress of varying strength, and post-Volta does not satisfy the one that you'd be used to from CPUs, which is concurrent forward progress. However, it does satisfy parallel forward progress.

my123 · on May 24, 2022

The C++ standard encourages, but doesn't require, concurrent forward progress.

However, Volta does provide a guarantee of forward progress for diverged threads in a warp.

yuri91 · on May 24, 2022

The article mentions WebAssembly as having the same issue as SPIR-V with structured control flow, but in Wasm the situation is actually quite a bit better, because you are allowed to break/continue out of an arbitrarily nested block.

This allows you to convert any reducible CFG without losing runtime performance, and only pay a price for irreducible ones (which are somewhat rare).
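
As a rough illustration of why non-local breaks matter, here is a small C++ sketch (my own example, not taken from the article). C++ has no labeled break, so `goto` stands in for a Wasm `br` to an outer block; the second function shows what you are left with when only the innermost loop can be exited: an extra flag that is re-tested on every outer iteration. `Grid` and both function names are hypothetical.

```cpp
#include <array>
#include <cstddef>

using Grid = std::array<std::array<int, 4>, 3>;

// Non-local exit: a single jump leaves both loops at once.
// Returns the row-major index of the first match, or -1 if there is none.
inline int find_with_nonlocal_break(const Grid& g, int needle) {
    int found = -1;
    for (std::size_t r = 0; r < g.size(); ++r) {
        for (std::size_t c = 0; c < g[r].size(); ++c) {
            if (g[r][c] == needle) {
                found = static_cast<int>(r * g[r].size() + c);
                goto done;  // stands in for a break out of the outer loop
            }
        }
    }
done:
    return found;
}

// Innermost-only exit: the break leaves one loop, so a flag carries the
// decision outwards and the outer loop condition has to test it every time.
inline int find_with_flag(const Grid& g, int needle) {
    int found = -1;
    bool done = false;
    for (std::size_t r = 0; r < g.size() && !done; ++r) {
        for (std::size_t c = 0; c < g[r].size(); ++c) {
            if (g[r][c] == needle) {
                found = static_cast<int>(r * g[r].size() + c);
                done = true;
                break;
            }
        }
    }
    return found;
}
```

Both functions return the same result; the difference is the extra variable and comparison in the second one, which is the kind of runtime cost the non-local break avoids.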

Shameless plug: I wrote an article about solving the structured control flow problem in WebAssembly -> https://medium.com/leaningtech/solving-the-structured-contro...

atq2119 · on May 25, 2022

To be somewhat fair to SPIR-V, I suspect that backend compilers either trivially recover nonlocal breaks/continues using a form of jump threading, or are in a situation where they can't do it anyway because of how divergence/reconvergence is implemented (using explicit masking registers).

On the other hand, it is somewhat silly not to allow nonlocal breaks, when they are de facto supported because there are no restrictions on where you can return from a function.

atq2119 · on May 24, 2022

This is a very good introduction to the inherent difficulty that comes from trying to do SIMT execution / whole program vectorization while at the same time giving programmers the power of certain optimization tricks that punch through the SIMT abstraction and expose the underlying vector architecture (via subgroup/wave operations).

The title is somewhat misleading as this trouble isn't specific to SPIR-V. It is inherent to the field, and DXIL has the same problem. (Arguably it's worse there because Microsoft tends to be quite bad at properly specifying semantics of DXIL and DirectX more generally.)

g0b · on May 24, 2022

This grew out of a much rantier (and worse) version of this article, which talked about the hurdles I faced when considering SPIR-V codegen for our research compiler (Thorin).

For a while I wanted to rewrite it, and ultimately to properly discuss what I wanted to discuss, I had to write something introductory in a much broader sense. So it sort of organically grew from there, and that's why the title is now a bit weird, but I plan for the rest of the series to continue with SPIR-V as some sort of central reference point.

It's fair to say DXIL suffers from this class of troubles too. There's a series by themaister in which he describes the awfulness of the story over there, but I don't work with DX, and I wanted to make as few comments on that as possible to avoid saying something wrong and unverified.

https://themaister.net/blog/2022/04/24/my-personal-hell-of-t...