
Even from the paper, it's hard to tell what this library actually does; see Section 5 of https://arxiv.org/pdf/1910.02054.pdf

The paper talks about parameter partitioning and overlapped communication, but doesn't actually give many details on how those things happen.

The library appears to be an implementation of some common algos for solving the 'pebble game,' as explained decently here: https://medium.com/tensorflow/fitting-larger-networks-into-m...
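
To make the 'pebble game' idea concrete, here's a toy sketch (mine, not DeepSpeed's code) of the classic sqrt(n) checkpointing trick from the linked article: instead of keeping every layer's activation for the backward pass, keep a checkpoint every sqrt(n) layers and recompute each segment on demand. The function name is illustrative.

```python
import math

def peak_activations(n_layers, checkpointing=True):
    """Count activations held in memory at the peak of backprop."""
    if not checkpointing:
        return n_layers  # naive: store every layer's activation
    seg = max(1, int(math.sqrt(n_layers)))
    n_checkpoints = math.ceil(n_layers / seg)
    # during backward we hold all checkpoints plus one recomputed segment
    return n_checkpoints + seg

print(peak_activations(100, checkpointing=False))  # 100
print(peak_activations(100))  # 20 (10 checkpoints + one 10-layer segment)
```

So activation memory drops from O(n) to O(sqrt(n)) at the cost of one extra forward pass worth of recomputation.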

The essential point is that:

(1) model parallelism is hard to do and has historically been done manually to scale wide models across GPUs

(2) inter-GPU I/O is expensive for vanilla data-parallel jobs, which typically use naive mirroring strategies

(3) researchers have now figured out how to 'compile' a deep model so that layers span GPUs, saving on both memory usage and I/O

(4) so scaling wide models is still hard, but now we have better tools for deep models
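
Point (2) is easy to quantify with the memory analysis from the ZeRO paper: mixed-precision Adam keeps 2 bytes/param of fp16 weights, 2 of fp16 gradients, and K=12 bytes/param of optimizer state (fp32 weights, momentum, variance), all fully mirrored on every GPU under naive data parallelism. A back-of-envelope sketch (function names mine) of what ZeRO's three partitioning stages do to that:

```python
def per_gpu_memory(psi, n_gpus, stage=3, K=12):
    """Model-state bytes per GPU for a psi-parameter model (ZeRO-style)."""
    params, grads, opt = 2 * psi, 2 * psi, K * psi
    if stage >= 1: opt /= n_gpus      # P_os: partition optimizer states
    if stage >= 2: grads /= n_gpus    # P_g: partition gradients
    if stage >= 3: params /= n_gpus   # P_p: partition parameters
    return params + grads + opt

psi = 7_500_000_000  # a 7.5B-parameter model on 64 GPUs
print(per_gpu_memory(psi, 64, stage=0) / 1e9)  # 120.0 GB: mirrored baseline
print(per_gpu_memory(psi, 64, stage=3) / 1e9)  # 1.875 GB: fully partitioned
```

These roughly match the per-GPU numbers reported in the paper for that configuration; residual activation memory is extra.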

Existing all-reduce-based data-parallel approaches have already been well-studied (see e.g. https://people.eecs.berkeley.edu/~jfc/papers/14/Kylix.pdf ), so it's really nice to see gains through new techniques.
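
For reference, the standard ring all-reduce cost that this line of work is measured against: each of N workers sends 2*(N-1)/N * D elements per all-reduce (a reduce-scatter followed by an all-gather), so per-GPU traffic approaches 2*D regardless of N. A quick check:

```python
def ring_allreduce_per_gpu(D, N):
    """Elements sent per GPU for a ring all-reduce over D elements."""
    return 2 * (N - 1) / N * D

print(ring_allreduce_per_gpu(1_000_000, 2))   # 1,000,000
print(ring_allreduce_per_gpu(1_000_000, 64))  # 1,968,750
```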

Definitely like seeing this 'compilation' being wrapped up into a library. Just wish they'd done a better job of communicating the key ideas.



We tried to communicate the key ideas in the video released with the blog post. It shows how DeepSpeed and the ZeRO optimizer save memory, and shows exactly what happens during each iteration of training. It is quite different from standard data or model parallelism.

The ZeRO optimizer helps scale large models regardless of the model topology. It works equally well for wide or deep models. Please let us know if you have specific questions that we can address.
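
As a rough illustration of one such iteration (a toy simulation of my own, not DeepSpeed code): gradients are reduce-scattered so each rank averages only its own shard, each rank updates only the parameter shard whose optimizer state it owns, and the updated shards are all-gathered so every rank again holds the full model.

```python
N_RANKS, LR = 4, 0.1
params = [1.0] * 8                                          # full model, replicated
local_grads = [[float(r + 1)] * 8 for r in range(N_RANKS)]  # per-rank gradients

shard = len(params) // N_RANKS
# reduce-scatter: rank r ends up with the averaged gradients for its shard only
avg = [sum(g[i] for g in local_grads) / N_RANKS for i in range(len(params))]
owned = [avg[r * shard:(r + 1) * shard] for r in range(N_RANKS)]

# each rank runs the optimizer step on its shard (plain SGD here for brevity)
new_shards = [[p - LR * g
               for p, g in zip(params[r * shard:(r + 1) * shard], owned[r])]
              for r in range(N_RANKS)]

# all-gather: concatenate the updated shards back into the full parameter vector
params = [p for s in new_shards for p in s]
print(params)  # all entries 0.75 = 1.0 - 0.1 * 2.5
```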


Oh sorry, I didn't make it to the video because the blog post intro made me bounce straight to the paper. I agree the video is a big help versus what's given in the paper.

It looks like your approach plays the 'pebble counting' game described in the OpenAI article I linked. Or maybe you'd like to explain what's different.

What would really help in the video (and paper) is a grounded example (like ResNet-10 or AlexNet or just a 2-layer MLP) drawing the connection between GPU buffers and layers. I feel the video covers the details of the memory savings in far too much precision, while the intuition behind the method (and how it maps onto a graphical model of a NN) is essentially absent.



