HyperVLM: Hyperbolic Space Guided Vision Language Modeling for Hierarchical Multi-Modal Understanding

Published: 07 May 2025, Last Modified: 29 May 2025 | VisCon 2025 Poster | CC BY 4.0
Keywords: Multimodal Model, Vision Language Model, Machine Learning, Deep Learning
Abstract: Visuo-lingual multi-modal models have achieved state-of-the-art performance in recent years on tasks such as search, recommendation, and classification. While pretrained vision-language models such as Contrastive Language-Image Pre-training (CLIP) achieve promising zero-shot performance on several generalized tasks by learning vision and language concepts in a common embedding space, the natural hierarchical relationship between the two modalities remains unexplored. In this work we propose HyperVLM, a vision-language model based on hyperbolic Poincaré geometry that learns a joint text-image representation while accounting for the hierarchical relation between the two modalities. We compare HyperVLM with CLIP on zero-shot image classification and retrieval tasks to demonstrate the efficacy of the proposed method, and we further show its effectiveness for retrieval when applied to the image-text contrastive (ITC) loss module of the BLIP architecture. The proposed method holds considerable value for recommendation and search tasks.
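The abstract does not spell out the training objective, but one plausible reading is that CLIP/BLIP-style contrastive logits (cosine similarity) are replaced by negative geodesic distance on the Poincaré ball. The sketch below illustrates that idea under this assumption; the names `exp_map0`, `poincare_dist`, and `hyperbolic_itc_loss`, the curvature c = 1, and the temperature value are all illustrative choices, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): a Poincaré-ball ITC-style loss in
# which negative hyperbolic distance replaces CLIP's cosine-similarity logits.
import torch
import torch.nn.functional as F

EPS = 1e-5

def exp_map0(v: torch.Tensor) -> torch.Tensor:
    """Exponential map at the origin: project Euclidean encoder outputs
    onto the Poincaré ball (curvature c = 1). tanh keeps ||x|| < 1."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(EPS)
    return torch.tanh(norm) * v / norm

def poincare_dist(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Pairwise Poincaré distances between rows of u (B, d) and v (B, d):
    d(u, v) = arccosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))."""
    sq_u = u.pow(2).sum(-1)             # (B,)
    sq_v = v.pow(2).sum(-1)             # (B,)
    sq_diff = torch.cdist(u, v).pow(2)  # (B, B)
    x = 1 + 2 * sq_diff / ((1 - sq_u).clamp_min(EPS)[:, None]
                           * (1 - sq_v).clamp_min(EPS)[None, :])
    return torch.acosh(x.clamp_min(1 + EPS))

def hyperbolic_itc_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch, with -distance as the logit so that
    matched image-text pairs are pulled close on the ball."""
    img_h = exp_map0(img_emb)
    txt_h = exp_map0(txt_emb)
    logits = -poincare_dist(img_h, txt_h) / temperature   # (B, B)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    img = torch.randn(8, 512)  # stand-ins for image/text encoder outputs
    txt = torch.randn(8, 512)
    print(hyperbolic_itc_loss(img, txt))
```

One appeal of this formulation is that distance from the ball's origin can encode hierarchy level (general concepts near the center, specific ones near the boundary), which is the property hyperbolic embeddings are typically chosen for.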
Submission Number: 45