1. EachPod

Zero Bubble Pipeline Parallelism

Author
Arjun Srivastava
Published
Mon 08 Jul 2024
Episode Link
https://arjunsriva.com/podcast/podcasts/2401.10241/

Core idea: split the backward pass into two flows — one computes the gradient with respect to the parameters, and one computes the gradient with respect to the layer's input (the previous layer's output). By scheduling these two flows separately, each pipeline stage is always working instead of waiting in a bubble.
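A minimal sketch of the split (my own illustration, not code from the paper): for a linear layer, the input gradient (often called the B pass) is what the previous pipeline stage is waiting on, while the weight gradient (the W pass) depends only on saved activations and can be deferred to fill idle time.

```python
import numpy as np

def forward(x, w):
    return x @ w

def backward_input(grad_out, w):
    # B pass: gradient w.r.t. the layer input — send upstream immediately.
    return grad_out @ w.T

def backward_weight(grad_out, x):
    # W pass: gradient w.r.t. the parameters — can be deferred into the
    # pipeline bubble, since no other stage depends on it.
    return x.T @ grad_out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))         # activations from previous stage
w = rng.standard_normal((8, 3))         # this stage's parameters
grad_out = rng.standard_normal((4, 3))  # gradient from the next stage

grad_x = backward_input(grad_out, w)    # computed first, unblocks upstream
grad_w = backward_weight(grad_out, x)   # computed later, fills the bubble
print(grad_x.shape, grad_w.shape)       # (4, 8) (8, 3)
```

The scheduling win comes from this decoupling: a conventional backward computes both gradients together, so downstream stages stall; splitting them lets the B pass propagate as early as possible while W passes soak up the idle slots.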

Read full paper: https://arxiv.org/abs/2401.10241

Tags: Systems and Performance, Deep Learning, Machine Learning
