
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Author
Arjun Srivastava
Published
Fri 19 Jul 2024
Episode Link
https://arjunsriva.com/podcast/podcasts/2205.14135/

FlashAttention is an IO-aware exact attention algorithm that makes Transformer models faster and more memory-efficient. Rather than materializing the full attention matrix in slow GPU high-bandwidth memory, it tiles the computation into blocks that fit in fast on-chip SRAM, cutting the number of memory reads and writes, yielding practical speedups, and enabling training on longer sequences. During the backward pass it recomputes attention on the fly instead of storing intermediate values, further reducing memory usage and delivering significant end-to-end training speedups on large models such as BERT and GPT-2.
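For listeners who want to see the idea in code, here is a minimal NumPy sketch of the tiled forward pass with a streaming softmax (the recurrence behind the paper's forward algorithm, without masking or dropout). The function name, block size, and pure-NumPy setting are illustrative; the paper's actual implementation is a fused CUDA kernel.

```python
import numpy as np

def flash_attention_forward(Q, K, V, block_size=64):
    """softmax(Q K^T / sqrt(d)) V, computed one tile at a time with a
    running row-max and row-sum so the full N x N matrix never exists."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    for qs in range(0, N, block_size):
        Qi = Q[qs:qs + block_size]                    # query tile ("kept in fast memory")
        m = np.full(len(Qi), -np.inf)                 # running row maximum
        l = np.zeros(len(Qi))                         # running softmax denominator
        acc = np.zeros((len(Qi), d))                  # unnormalized output accumulator
        for ks in range(0, N, block_size):            # stream key/value tiles through
            S = Qi @ K[ks:ks + block_size].T * scale  # scores for this tile only
            m_new = np.maximum(m, S.max(axis=1))
            P = np.exp(S - m_new[:, None])            # numerically stable partial softmax
            alpha = np.exp(m - m_new)                 # rescales the old accumulator
            l = alpha * l + P.sum(axis=1)
            acc = alpha[:, None] * acc + P @ V[ks:ks + block_size]
            m = m_new
        O[qs:qs + block_size] = acc / l[:, None]      # normalize once per query tile
    return O

# Sanity check against naive attention on random data.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_forward(Q, K, V), ref)
```

Because each tile fits in fast memory, the quadratic score matrix is only ever touched in block-sized pieces, which is where the IO savings come from.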

Read full paper: https://arxiv.org/abs/2205.14135

Tags: Deep Learning, Transformers, Systems and Performance
