
Even from the paper, it's hard to tell what this library actually does; see Section 5 of https://arxiv.org/pdf/1910.02054.pdf

The paper talks about parameter partitioning and overlapped communication, but doesn't actually give many details on how those things happen.

The library appears to be an implementation of some common algos for solving the 'pebble game,' as explained decently here: https://medium.com/tensorflow/fitting-larger-networks-into-m...
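
To make the 'pebble game' idea concrete, here's a toy sketch (mine, not DeepSpeed's code) of the classic sqrt(n) checkpointing trick from the linked article: instead of keeping every layer's activation for the backward pass, keep a checkpoint every sqrt(n) layers and recompute each segment on demand. The function name is illustrative.

```python
import math

def peak_activations(n_layers, checkpointing=True):
    """Count activations held in memory at the peak of backprop."""
    if not checkpointing:
        return n_layers  # naive: store every layer's activation
    seg = max(1, int(math.sqrt(n_layers)))
    n_checkpoints = math.ceil(n_layers / seg)
    # during backward we hold all checkpoints plus one recomputed segment
    return n_checkpoints + seg

print(peak_activations(100, checkpointing=False))  # 100
print(peak_activations(100))  # 20 (10 checkpoints + one 10-layer segment)
```

So activation memory drops from O(n) to O(sqrt(n)) at the cost of one extra forward pass worth of recomputation.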

The essential point is that:

(1) model parallelism is hard to do and has historically been done manually to scale wide models across GPUs

(2) inter-GPU I/O is expensive for vanilla data-parallel jobs, which typically use naive mirroring strategies

(3) researchers have now figured out how to 'compile' a deep model so that layers span GPUs, saving on both memory usage and I/O

(4) so scaling wide models is still hard, but now we have better tools for deep models
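
Point (2) is easy to quantify with the memory analysis from the ZeRO paper: mixed-precision Adam keeps 2 bytes/param of fp16 weights, 2 of fp16 gradients, and K=12 bytes/param of optimizer state (fp32 weights, momentum, variance), all fully mirrored on every GPU under naive data parallelism. A back-of-envelope sketch (function names mine) of what ZeRO's three partitioning stages do to that:

```python
def per_gpu_memory(psi, n_gpus, stage=3, K=12):
    """Model-state bytes per GPU for a psi-parameter model (ZeRO-style)."""
    params, grads, opt = 2 * psi, 2 * psi, K * psi
    if stage >= 1: opt /= n_gpus      # P_os: partition optimizer states
    if stage >= 2: grads /= n_gpus    # P_g: partition gradients
    if stage >= 3: params /= n_gpus   # P_p: partition parameters
    return params + grads + opt

psi = 7_500_000_000  # a 7.5B-parameter model on 64 GPUs
print(per_gpu_memory(psi, 64, stage=0) / 1e9)  # 120.0 GB: mirrored baseline
print(per_gpu_memory(psi, 64, stage=3) / 1e9)  # 1.875 GB: fully partitioned
```

These roughly match the per-GPU numbers reported in the paper for that configuration; residual activation memory is extra.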

Existing all-reduce-based data-parallel approaches have already been well-studied (see e.g. https://people.eecs.berkeley.edu/~jfc/papers/14/Kylix.pdf ), so it's really nice to see gains through new techniques.
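
For reference, the standard ring all-reduce cost that this line of work is measured against: each of N workers sends 2*(N-1)/N * D elements per all-reduce (a reduce-scatter followed by an all-gather), so per-GPU traffic approaches 2*D regardless of N. A quick check:

```python
def ring_allreduce_per_gpu(D, N):
    """Elements sent per GPU for a ring all-reduce over D elements."""
    return 2 * (N - 1) / N * D

print(ring_allreduce_per_gpu(1_000_000, 2))   # 1,000,000
print(ring_allreduce_per_gpu(1_000_000, 64))  # 1,968,750
```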

Definitely like seeing this 'compilation' being wrapped up into a library. Just wish they'd done a better job of communicating the key ideas.



We tried to communicate the key ideas in the video released with the blog post. It shows how DeepSpeed and the ZeRO optimizer save memory, and shows exactly what happens during each iteration of training. It is quite different from standard data or model parallelism.

The ZeRO optimizer helps scale large models regardless of the model topology. It works equally well for wide or deep models. Please let us know if you have specific questions that we can address.
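
As a rough illustration of one such iteration (a toy simulation of my own, not DeepSpeed code): gradients are reduce-scattered so each rank averages only its own shard, each rank updates only the parameter shard whose optimizer state it owns, and the updated shards are all-gathered so every rank again holds the full model.

```python
N_RANKS, LR = 4, 0.1
params = [1.0] * 8                                          # full model, replicated
local_grads = [[float(r + 1)] * 8 for r in range(N_RANKS)]  # per-rank gradients

shard = len(params) // N_RANKS
# reduce-scatter: rank r ends up with the averaged gradients for its shard only
avg = [sum(g[i] for g in local_grads) / N_RANKS for i in range(len(params))]
owned = [avg[r * shard:(r + 1) * shard] for r in range(N_RANKS)]

# each rank runs the optimizer step on its shard (plain SGD here for brevity)
new_shards = [[p - LR * g
               for p, g in zip(params[r * shard:(r + 1) * shard], owned[r])]
              for r in range(N_RANKS)]

# all-gather: concatenate the updated shards back into the full parameter vector
params = [p for s in new_shards for p in s]
print(params)  # all entries 0.75 = 1.0 - 0.1 * 2.5
```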


Oh sorry, I didn't make it to the video because the blog post intro made me bounce straight to the paper. I agree the video is a big help versus what's given in the paper.

It looks like your approach plays the 'pebble counting' game described in the OpenAI article I linked. Or maybe you'd like to explain what's different.

What would really help in the video (and paper) is a grounded example (like ResNet-10 or AlexNet or just a 2-layer MLP) drawing the connection between GPU buffers and layers. I feel the video covers the details of the memory savings in far too much precision, while the intuition behind the method (and how it maps onto a graphical model of a NN) is essentially absent.



