Keywords: Image Captioning, Concepts, Retrieval, RAG, Multilingual
TL;DR: Image captioning with retrieval-augmented generation that retrieves both concepts and captions.
Abstract: Multilingual vision-language models have advanced image captioning but still lag behind English models due to limited multilingual training data and expensive model scaling. Retrieval-augmented generation (RAG) offers a promising alternative by conditioning caption generation on retrieved example captions, reducing the need for large-scale multilingual training. However, RAG models often rely on English-translated captions, which can cause linguistic and cultural bias. We introduce CONCAP, a multilingual image captioning model that combines retrieved captions with image-specific concepts to better contextualize the image and improve cross-lingual grounding. Experiments on XM3600 show that CONCAP achieves strong performance with much less training data. These results highlight the value of concept-based retrieval in multilingual captioning and open avenues for cultural adaptation.
Submission Number: 13