An exploration of the Dion optimizer (Distributed Orthonormalized updates) and how it tackles the scalability bottlenecks of training giant models. We break down why orthonormal updates matter, why Muon's dense-matrix approach struggles with sharded, multi-GPU deployments, and how Dion uses amortized power iteration with QR and Cholesky decompositions on distributed shards to deliver fast, communication-efficient updates. Learn about integration with PyTorch DDP, FSDP2, and tensor parallelism, rank-fraction compression with error feedback, and the empirical wall-clock gains over AdamW and Muon at scale, plus what this could unlock for the future of AI training.
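
For listeners who want a concrete picture of the "amortized power iteration with QR" and error-feedback ideas mentioned above, here is a minimal single-GPU PyTorch sketch. It is an illustrative approximation only, not the official Dion implementation: the function name dion_like_step, the hyperparameter values, the simple column-normalization step, and the omission of Dion's update scaling and its distributed Cholesky/sharded path are all assumptions made for this example.

```python
# Minimal sketch of an orthonormalized low-rank update with amortized power
# iteration and error feedback, loosely following the ideas described in the
# episode. Not the official Dion algorithm; shapes, scaling, and names are
# assumptions for illustration.
import torch


def dion_like_step(X, G, B, Q, lr=0.01, mu=0.95):
    """One update for a 2-D weight X (m x n) with gradient G.

    B: momentum buffer (m x n); Q: low-rank right factor (n x r).
    Returns the updated (X, B, Q).
    """
    M = B + G                        # fold the fresh gradient into momentum
    P = M @ Q                        # one amortized power-iteration step (m x r)
    P, _ = torch.linalg.qr(P)        # orthonormalize the left factor via QR
    R = M.T @ P                      # refreshed right factor (n x r)
    B = M - (1.0 - mu) * (P @ R.T)   # error feedback: retain what the
                                     # low-rank update did not capture
    Q = R / R.norm(dim=0, keepdim=True).clamp_min(1e-8)  # column-normalize
    X = X - lr * (P @ Q.T)           # apply the orthonormalized update
    return X, B, Q


# Toy usage: a 256 x 128 weight with a rank-32 factor.
m, n, r = 256, 128, 32
X = torch.randn(m, n)
B = torch.zeros(m, n)
Q, _ = torch.linalg.qr(torch.randn(n, r))
G = torch.randn(m, n) * 0.01
X, B, Q = dion_like_step(X, G, B, Q)
```

The rank r relative to n plays the role of the rank-fraction knob discussed in the episode: smaller ranks mean less computation and communication per step, with the error-feedback buffer carrying forward whatever the low-rank update misses.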
Note: This podcast was AI-generated, and sometimes AI can make mistakes. Please double-check any critical information.
Sponsored by Embersilk LLC