JEEM: Vision-Language Understanding in Four Arabic Dialects

Published: 06 May 2025, Last Modified: 29 May 2025, Venue: VLMs4All 2025 Poster, License: CC BY-SA 4.0
Keywords: Arabic, image captioning, visual question answering
TL;DR: We introduce JEEM, a benchmark designed to evaluate Vision-Language Models on image captioning and visual question answering across four Arabic dialects (Jordanian, Emirati, Egyptian, Moroccan).
Abstract: We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: **J**ordan, The **E**mirates, **E**gypt, and **M**orocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. This dataset aims to assess the ability of VLMs to generalize across dialects and accurately interpret cultural elements in visual contexts. We find that an impediment to this goal is the lack of reliable evaluation metrics.
Submission Number: 12