Paper summary: Large-scale web-crawled datasets are fundamental to the success of pre-training vision-language models such as CLIP. However, the inherent noise and potential irrelevance of web-crawled AltTexts pose challenges in achieving precise image-text alignment. Existing methods that employ large language models (LLMs) for caption rewriting have shown promise on small, curated datasets such as CC3M and CC12M. This study introduces a scalable pipeline for rewriting noisy captions. Unlike recent LLM rewriting techniques, we emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap). To ensure data diversity, we propose a novel mixed training scheme that optimizes the utilization of AltTexts alongside the newly generated VeCap. We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP. Using this cost-effective pipeline, we effortlessly scale our dataset up to 300 million samples, named the VeCap dataset. Our results show significant advantages in image-text alignment and overall model performance. For example, VeCLIP achieves up to a +25.2% gain on COCO and Flickr30k retrieval tasks under the 12M setting. For data efficiency, VeCLIP achieves a +3% gain while using only 14% of the data employed in vanilla CLIP and 11% of that in ALIGN. We also note that the VeCap data is complementary to other well-curated datasets suited for zero-shot classification tasks. When combining VeCap and DFN, our model achieves strong performance on both image-text retrieval and zero-shot classification tasks, e.g., 83.1% accuracy@1 on ImageNet zero-shot for an H/14 model.
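The summary describes a mixed training scheme that draws on both the original AltTexts and the LLM-rewritten VeCap captions. Below is a minimal sketch of one plausible way to realize such per-sample caption mixing; the function name `sample_caption`, the mixing probability `p_alt`, and the toy data are illustrative assumptions, not details taken from the paper.

```python
import random

def sample_caption(alt_text: str, vecap_caption: str, p_alt: float = 0.5) -> str:
    """Pick one caption per image per training step (illustrative only).

    Randomly mixing the original AltText with the visually enriched VeCap
    rewrite keeps the diversity of web text while benefiting from the
    cleaner, concept-rich captions.
    """
    return alt_text if random.random() < p_alt else vecap_caption

# Hypothetical batch: (image, AltText, VeCap caption) triples.
batch = [
    ("img_001.jpg", "IMG_0001 sale now!!", "A tabby cat curled up on a red sofa."),
    ("img_002.jpg", "photo dog", "A golden retriever fetching a ball on the beach."),
]
captions = [sample_caption(alt, vecap) for _, alt, vecap in batch]
```

In this sketch each training example independently selects one of its two caption sources, so a batch contains a mixture of noisy AltTexts and enriched captions rather than relying on either source alone.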