1. EachPod

Zero Bubble Pipeline Parallelism

Author
Arjun Srivastava
Published
Mon 08 Jul 2024
Episode Link
https://arjunsriva.com/podcast/podcasts/2401.10241/

Core idea: split the backward pass into two flows — one computes the gradient with respect to the parameters, and one computes the gradient with respect to the layer's input (the previous layer's output). By scheduling these two flows separately, each pipeline stage is always working instead of waiting in a bubble.
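A minimal sketch of the split (my own illustration, not code from the paper): for a linear layer, the input gradient (often called the B pass) is what the previous pipeline stage is waiting on, while the weight gradient (the W pass) depends only on saved activations and can be deferred to fill idle time.

```python
import numpy as np

def forward(x, w):
    return x @ w

def backward_input(grad_out, w):
    # B pass: gradient w.r.t. the layer input — send upstream immediately.
    return grad_out @ w.T

def backward_weight(grad_out, x):
    # W pass: gradient w.r.t. the parameters — can be deferred into the
    # pipeline bubble, since no other stage depends on it.
    return x.T @ grad_out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))         # activations from previous stage
w = rng.standard_normal((8, 3))         # this stage's parameters
grad_out = rng.standard_normal((4, 3))  # gradient from the next stage

grad_x = backward_input(grad_out, w)    # computed first, unblocks upstream
grad_w = backward_weight(grad_out, x)   # computed later, fills the bubble
print(grad_x.shape, grad_w.shape)       # (4, 8) (8, 3)
```

The scheduling win comes from this decoupling: a conventional backward computes both gradients together, so downstream stages stall; splitting them lets the B pass propagate as early as possible while W passes soak up the idle slots.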

Read full paper: https://arxiv.org/abs/2401.10241

Tags: Systems and Performance, Deep Learning, Machine Learning
