Keywords: arabic, image captioning, visual question answering
TL;DR: We introduce JEEM, a benchmark designed to evaluate Vision-Language Models on image captioning and visual question answering across four Arabic dialects (Jordanian, Emirati, Egyptian, Moroccan).
Abstract: We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: **J**ordan, The **E**mirates, **E**gypt, and **M**orocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. This dataset aims to assess the ability of VLMs to generalize across dialects and accurately interpret cultural elements in visual contexts. We find that an impediment to this goal is the lack of reliable evaluation metrics.
Submission Number: 12