Since the threads rely on each other to fill the SRAM with all the needed data, values would be missing if you didn't wait.
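To make that concrete, here's a minimal sketch, assuming the "wait" is a block-wide __syncthreads() barrier and the SRAM is CUDA shared memory. The kernel name reverse_tile, the tile size, and the helper run_reverse_check are made up for illustration; each thread loads one element and then reads an element that a different thread loaded.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 256;  // threads per block; an illustrative choice

// Each thread writes one slot of the shared tile, then reads a slot written
// by another thread. Launch with exactly TILE threads per block.
__global__ void reverse_tile(const int* in, int* out) {
    __shared__ int tile[TILE];
    int t = threadIdx.x;
    int base = blockIdx.x * TILE;

    tile[t] = in[base + t];
    // Barrier: every thread in the block has finished its write past this point.
    // Without it, tile[TILE - 1 - t] might not have been written yet.
    __syncthreads();
    out[base + t] = tile[TILE - 1 - t];
}

// Hypothetical host helper: runs one block and checks the tile came back reversed.
bool run_reverse_check() {
    int h_in[TILE], h_out[TILE];
    for (int i = 0; i < TILE; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    reverse_tile<<<1, TILE>>>(d_in, d_out);
    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < TILE; ++i) {
        if (h_out[i] != TILE - 1 - i) return false;
    }
    return true;
}
```

Call run_reverse_check() from a main() and build with nvcc. If you remove the __syncthreads() line, the read of another thread's slot becomes a race, so the result is no longer guaranteed.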
There's also explicit warp synchronization, i.e. __syncwarp(). More on warp primitives here: https://developer.nvidia.com/blog/using-cuda-warp-level-prim...
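For the warp-level case, a rough sketch, assuming lanes of one warp only need to see each other's shared-memory writes. The names rotate_within_warp and run_rotate_check are hypothetical, and the warp size is hard-coded to 32, which holds on current NVIDIA GPUs. Each lane reads the value its neighbouring lane in the same warp wrote.

```cuda
#include <cuda_runtime.h>

constexpr int BLOCK = 64;  // two warps per block; an illustrative choice
constexpr int WARP = 32;   // assumed warp size

// Each lane writes one slot, then reads the slot of the next lane in its own warp.
// Launch with exactly BLOCK threads in a single block.
__global__ void rotate_within_warp(const int* in, int* out) {
    __shared__ int buf[BLOCK];
    int t = threadIdx.x;
    int lane = t % WARP;
    int warp_base = t - lane;

    buf[t] = in[t];
    // Synchronizes only the lanes of this warp (default mask = all 32 lanes)
    // and orders their shared-memory accesses; other warps are not waited on.
    __syncwarp();
    out[t] = buf[warp_base + (lane + 1) % WARP];
}

// Hypothetical host helper: checks that each warp's values were rotated by one lane.
bool run_rotate_check() {
    int h_in[BLOCK], h_out[BLOCK];
    for (int i = 0; i < BLOCK; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    rotate_within_warp<<<1, BLOCK>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < BLOCK; ++i) {
        int expected = (i / WARP) * WARP + (i % WARP + 1) % WARP;
        if (h_out[i] != expected) return false;
    }
    return true;
}
```

This only works because each lane reads a slot owned by its own warp; if a thread needed data written by a different warp, you'd be back to __syncthreads().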