An exploration of the Dion optimizer (Distributed Orthonormalized updates) and how it tackles the scalability bottlenecks of training giant models. We break down why orthonormal updates matter, why Muon's dense-matrix approach struggles with sharded, multi-GPU deployments, and how Dion uses amortized power iteration with QR and Cholesky decompositions on distributed shards to deliver fast, communication-efficient updates. Learn about integration with PyTorch DDP, FSDP2, and tensor parallelism, rank-fraction compression with error feedback, and the empirical wall-clock gains over AdamW and Muon at scale, plus what this could unlock for the future of AI training.
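
For listeners who want a concrete picture of the "amortized power iteration with QR" and error-feedback ideas mentioned above, here is a minimal single-GPU PyTorch sketch. It is an illustrative approximation only, not the official Dion implementation: the function name dion_like_step, the hyperparameter values, the simple column-normalization step, and the omission of Dion's update scaling and its distributed Cholesky/sharded path are all assumptions made for this example.

```python
# Minimal sketch of an orthonormalized low-rank update with amortized power
# iteration and error feedback, loosely following the ideas described in the
# episode. Not the official Dion algorithm; shapes, scaling, and names are
# assumptions for illustration.
import torch


def dion_like_step(X, G, B, Q, lr=0.01, mu=0.95):
    """One update for a 2-D weight X (m x n) with gradient G.

    B: momentum buffer (m x n); Q: low-rank right factor (n x r).
    Returns the updated (X, B, Q).
    """
    M = B + G                        # fold the fresh gradient into momentum
    P = M @ Q                        # one amortized power-iteration step (m x r)
    P, _ = torch.linalg.qr(P)        # orthonormalize the left factor via QR
    R = M.T @ P                      # refreshed right factor (n x r)
    B = M - (1.0 - mu) * (P @ R.T)   # error feedback: retain what the
                                     # low-rank update did not capture
    Q = R / R.norm(dim=0, keepdim=True).clamp_min(1e-8)  # column-normalize
    X = X - lr * (P @ Q.T)           # apply the orthonormalized update
    return X, B, Q


# Toy usage: a 256 x 128 weight with a rank-32 factor.
m, n, r = 256, 128, 32
X = torch.randn(m, n)
B = torch.zeros(m, n)
Q, _ = torch.linalg.qr(torch.randn(n, r))
G = torch.randn(m, n) * 0.01
X, B, Q = dion_like_step(X, G, B, Q)
```

The rank r relative to n plays the role of the rank-fraction knob discussed in the episode: smaller ranks mean less computation and communication per step, with the error-feedback buffer carrying forward whatever the low-rank update misses.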
Note: This podcast was AI-generated, and sometimes AI can make mistakes. Please double-check any critical information.
Sponsored by Embersilk LLC