The paper introduces Meta CLIP 2, a novel approach to training Contrastive Language-Image Pretraining (CLIP) models on a web-scale, worldwide dataset of image-text pairs. Traditionally, CLIP models have been trained primarily on English-only data, which limits their coverage and exposes a "curse of multilinguality": multilingual models tend to underperform their English-only counterparts. Meta CLIP 2 addresses these challenges with a new recipe for worldwide data curation and metadata scaling, together with a refined training framework in which non-English data mutually benefits both English and non-English performance. The research demonstrates that by increasing model capacity (specifically using ViT-H/14) and scaling the number of seen training pairs, this curse can be broken, achieving state-of-the-art results across English and multilingual benchmarks without relying on machine translation or proprietary data.
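For context on the training objective the paper builds on, the sketch below shows a generic CLIP-style symmetric contrastive loss. It is a minimal illustration, not Meta CLIP 2's actual implementation; the function name, fixed temperature, and tensor shapes are assumptions chosen for clarity (CLIP typically learns the temperature as a parameter).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_embeds, text_embeds: (batch, dim) outputs of the two encoders.
    temperature: illustrative fixed value; CLIP models usually learn it.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds matched pairs.
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image->text and text->image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


# Example: a batch of 8 pairs with 512-dimensional embeddings.
if __name__ == "__main__":
    imgs = torch.randn(8, 512)
    txts = torch.randn(8, 512)
    print(clip_contrastive_loss(imgs, txts).item())
```

The paper's contribution lies less in this objective than in how the worldwide image-text pairs feeding it are curated and scaled.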