We tried to communicate the key ideas in the video released with the blog post. It shows how DeepSpeed and the ZeRO optimizer save memory, and shows exactly what happens during each iteration of training. It is quite different from standard data or model parallelism.
The ZeRO optimizer helps scale large models regardless of the model topology. It works equally well for wide or deep models. Please let us know if you have specific questions that we can address.
Oh sorry I didn't make it to the video because the blog post intro made me bounce straight to the paper. I agree the video is a big help versus what's given in the paper.
It looks like your approach plays the 'pebble counting' game described in the OpenAI article I linked. Or maybe you'd like to explain what's different.
What would really help in the video (and paper) is a grounded example (like ResNet-10 or AlexNet or just a 2-layer MLP) that draws the connection between GPU buffers and layers. I feel the video covers the details of the memory savings in far too much precision, while the intuition behind the method (and how it maps onto a graphical model of a NN) is essentially absent.
This is great and looks very easy to use! I'd expect it to have a huge impact given how easy it makes it for people to leverage a few, or a few thousand, GPUs. I do have a few questions, of course.
Is it getting a lot of internal use already (beyond the example we just heard about)?
Is it possible to do inference using a CPU and a lot of RAM using a model trained on multiple GPUs via DeepSpeed?
Does it work with TPUs right out of the box? It looks like maybe not - if not, any plans to support them?
Can you use DeepSpeed to train using a lot of CPUs + ram rather than GPUs?
> Is it getting a lot of internal use already (beyond the example we just heard about)?
We have hundreds of internal users of DeepSpeed using it to train production-ready models, many of which have already shipped.
> Is it possible to do inference using a CPU and a lot of RAM using a model trained on multiple GPUs via DeepSpeed?
It is definitely possible to do inference on CPU using a model trained on multiple GPUs via DeepSpeed. For models trained without model parallelism, this is straightforward. The tricky part is when the model was trained using model parallelism, which requires merging the checkpoints corresponding to the different pieces of the model into a single one.
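To make the merging step concrete, here is a minimal PyTorch sketch under an assumed (hypothetical) setup: a single `Linear` layer whose weight was partitioned row-wise across two model-parallel ranks. The shard names and split dimension are illustrative, not DeepSpeed's actual checkpoint format.

```python
import torch

# Hypothetical setup: a Linear layer whose weight was split row-wise
# (along the output dimension) across two model-parallel ranks.
full = torch.nn.Linear(8, 4, bias=False)

# Simulate the two per-rank checkpoints, each holding half the rows.
shard0 = {"weight": full.weight.detach()[:2].clone()}
shard1 = {"weight": full.weight.detach()[2:].clone()}

# Merge: concatenate each partitioned tensor along its split dimension
# to recover a single-device state dict.
merged = {"weight": torch.cat([shard0["weight"], shard1["weight"]], dim=0)}

# Load the merged checkpoint into a CPU copy and run inference.
cpu_model = torch.nn.Linear(8, 4, bias=False)
cpu_model.load_state_dict(merged)
x = torch.randn(3, 8)
assert torch.allclose(cpu_model(x), full(x))
```

Real checkpoints also split along different dimensions per layer (and partition optimizer state), so the actual merge logic is layer-dependent; this only shows the core idea.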
> Does it work with TPUs right out of the box? It looks like maybe not - if not, any plans to support them?
The ZeRO technology is compatible with TPUs or any accelerator in a cluster setting, but we have not tested it with TPUs. It would likely require some small refactoring to get DeepSpeed working with them. We do not have any internal plans to support them yet, but we are of course completely open to contributions from the community.
> Can you use DeepSpeed to train using a lot of CPUs + ram rather than GPUs?
It is possible to use DeepSpeed to train using a lot of CPUs. The major limitation of the approach is that CPUs can be an order of magnitude slower than GPUs in terms of computational performance.
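As a rough illustration of why CPU-only training is possible in principle: DeepSpeed builds on `torch.distributed`, and the `gloo` backend runs collectives entirely on CPU. This is a minimal single-process sketch (world size 1, hypothetical addresses), not a DeepSpeed configuration.

```python
import os
import torch
import torch.distributed as dist

# Assumed single-process setup; a real run would launch many CPU ranks.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# "gloo" is the CPU-capable backend; "nccl" would require GPUs.
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.ones(4)
dist.all_reduce(t)  # the same collective used for gradient averaging
assert t.sum().item() == 4.0  # with world_size=1, all_reduce is an identity sum

dist.destroy_process_group()
```

The collectives work the same as in the GPU case; the order-of-magnitude slowdown mentioned above comes purely from the CPU's lower compute throughput.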
Looks super cool. Does it remove the need for manual gradient checkpointing?
Also curious whether there are expected memory/speed improvements when using it on a single GPU, or if most of the gains come from improved parallelism across devices.
We don't have an exact date, but we plan to share more details in a later submission. If you want access, please send an email to [turing_ AT _microsoft _DOT_ com]. Remove underscores and spaces.