Hacker News

My GPU work is not in ML (deep or otherwise); but ...

1. "100 lines of CUDA" + PyTorch; maybe this is useful and maybe it isn't, but counting lines of code on top of a huge codebase is not very meaningful.

2. Launching separate kernels, synchronously, on the default stream, for various operations, is typically not the right way to utilize a GPU.



> maybe this is useful and maybe it isn't, but counting lines of code on top of a huge codebase is not very meaningful.

In this case it's pretty reasonable imo, since the kernel itself is fairly independent - the PyTorch dependency is just bindings for the tensor data structures.
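For context, such a binding is usually only a handful of lines around an otherwise standalone kernel. A minimal sketch using PyTorch's C++ extension API (the kernel, the op, and the module name here are all made up for illustration):

```cpp
// hypothetical_ext.cu - sketch of a thin PyTorch binding around a CUDA kernel.
#include <torch/extension.h>

// The actual CUDA kernel: the interesting code lives here, independent of torch.
__global__ void my_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // placeholder computation
}

// The "torch part": unpack tensors to raw pointers, allocate output, launch.
torch::Tensor my_op(torch::Tensor input) {
    auto output = torch::empty_like(input);
    int n = input.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    my_kernel<<<blocks, threads>>>(
        input.data_ptr<float>(), output.data_ptr<float>(), n);
    return output;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("my_op", &my_op, "hypothetical example op");
}
```

Everything above the `my_op` wrapper is plain CUDA, which is why counting only the kernel's lines is defensible here.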

> Launching separate kernels, synchronously, on the default stream, for various operations, is typically not the right way to utilize a GPU.

This is actually the standard way to do things in ML. If you're coming from an HPC background (where this may seem quite strange), the biggest difference is that more or less everything in ML runs on the GPU, so there are very rarely any device-to-host synchronizations. In addition, each individual kernel typically runs on fairly large chunks of data (a million elements would be on the smaller side), so overlapping work across streams to keep the GPU busy is not as necessary as in HPC.
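Concretely, "separate kernels on the default stream" is less synchronous than it sounds: each launch returns to the host immediately, and the default stream serializes the kernels on the device, so dependent ops chain without any host round trip. A self-contained sketch (kernel and sizes invented for illustration; requires nvcc and a GPU):

```cuda
// sequential_launch.cu - the default-stream pattern discussed above:
// three dependent kernels launched back to back. Each launch is
// asynchronous w.r.t. the host; the default stream orders them on the
// device, and the host only blocks at the explicit sync at the end.
#include <cuda_runtime.h>

__global__ void scale(const float* in, float* out, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * s;
}

int main() {
    const int n = 1 << 20;  // ~1M elements: the "smaller side" for ML
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    int threads = 256, blocks = (n + threads - 1) / threads;
    // Three separate launches: each returns at once, and the default
    // stream guarantees b is complete before the second kernel reads it.
    scale<<<blocks, threads>>>(a, b, n, 2.0f);
    scale<<<blocks, threads>>>(b, c, n, 3.0f);
    scale<<<blocks, threads>>>(c, a, n, 0.5f);

    // The only device-to-host synchronization point:
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

In a framework like PyTorch the sync at the end would instead happen implicitly, e.g. when a result is copied back or printed.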



