The paper presents Long-CLIP, a model designed to overcome CLIP's restrictive text input length (a hard cap of 77 tokens, with an effective length closer to 20), allowing it to process longer descriptions and capture detailed image-text relationships. Long-CLIP introduces two main strategies: knowledge-preserved stretching of the positional embeddings and primary component matching of CLIP features during fine-tuning.
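The knowledge-preserved stretching idea is easy to picture: the leading positions of the text encoder's positional-embedding table (the ones CLIP was actually trained to use well) are kept intact, while the under-trained tail positions are interpolated to fill a longer context window. Below is a minimal PyTorch sketch, assuming the values reported in the paper (roughly the first 20 positions preserved, a stretched length of 248); the function name and interpolation details are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

def stretch_positional_embeddings(pos_emb, keep=20, target_len=248):
    """Knowledge-preserved stretching (illustrative sketch).

    Keeps the first `keep` positional embeddings untouched and
    linearly interpolates the remaining ones along the sequence
    axis so the table covers `target_len` positions in total.
    """
    kept = pos_emb[:keep]                        # (keep, dim), preserved as-is
    rest = pos_emb[keep:]                        # (orig_len - keep, dim)
    # F.interpolate expects (N, C, L), so treat the sequence axis as L.
    rest = rest.T.unsqueeze(0)                   # (1, dim, orig_len - keep)
    rest = F.interpolate(rest, size=target_len - keep,
                         mode="linear", align_corners=True)
    rest = rest.squeeze(0).T                     # (target_len - keep, dim)
    return torch.cat([kept, rest], dim=0)        # (target_len, dim)

# Example: stretch a CLIP-sized 77 x 512 text positional table to 248 positions.
pos = torch.randn(77, 512)
print(stretch_positional_embeddings(pos).shape)  # torch.Size([248, 512])
```

Keeping the leading positions fixed preserves the representations CLIP actually learned, so interpolation only has to fill in the rarely-trained tail rather than distort the whole table.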
Long-CLIP significantly extends the supported text length (from CLIP's 77 tokens to 248) without disrupting the original representation space, improving recall on both long-caption and traditional short-caption retrieval benchmarks. Its plug-and-play nature lets it replace CLIP directly in downstream applications, showing promise for image generation models that must follow long, detailed prompts.
Read full paper: https://arxiv.org/abs/2403.15378
Tags: Multimodal AI, Natural Language Processing, Computer Vision