Vision Language Models for Massive MIMO Semantic Communication

24 Nov 2024 (modified: 28 Dec 2024) · AAAI 2025 Workshop AI4WCN Submission · CC BY 4.0
Keywords: Vision Language Models, Massive MIMO, Semantic Communication
TL;DR: Vision Language Models for Semantic Communication
Abstract: This paper presents a semantic communication scheme that utilizes Vision-Language Models (VLMs) to enable efficient image transmission over Massive Multiple-Input Multiple-Output (MIMO) systems. By transmitting textual descriptions instead of raw image data, the proposed approach significantly reduces bandwidth usage while ensuring high-quality image reconstruction at the receiver. At the transmitter, a textual description of the image is generated using Bootstrapping Language-Image Pre-training (BLIP), converted to bits, modulated, and transmitted over the Massive MIMO channel. At the receiver, the received text is used to reconstruct the image through a text-to-image generation model based on Stable Diffusion. We detail the system architecture and semantic communication framework, and evaluate the method's performance in terms of bandwidth efficiency, image reconstruction quality, and semantic similarity. Simulation results demonstrate that while semantic communication achieves excellent bandwidth efficiency, the image reconstruction quality, measured by the structural similarity index (SSIM) and peak signal-to-noise ratio (PSNR), is relatively low. However, the semantic similarity is exceptionally high, aligning with the primary objective of semantic communication.
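The transmit-receive chain described in the abstract (caption to bits, modulation, channel, demodulation, caption) can be sketched at the bit level. This is a minimal illustration, not the paper's implementation: a fixed string stands in for a BLIP-generated caption, and BPSK over an AWGN channel stands in for the Massive MIMO link, both of which are assumptions for the sake of a self-contained example.

```python
import numpy as np

def text_to_bits(text: str) -> np.ndarray:
    """Convert a UTF-8 string to a flat 0/1 bit array."""
    data = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    return np.unpackbits(data)

def bits_to_text(bits: np.ndarray) -> str:
    """Convert a 0/1 bit array back to a UTF-8 string."""
    data = np.packbits(bits.astype(np.uint8))
    return data.tobytes().decode("utf-8", errors="replace")

def transmit_bpsk(bits: np.ndarray, snr_db: float, seed: int = 0) -> np.ndarray:
    """BPSK-modulate bits, pass them through AWGN, and hard-detect.

    A stand-in for the modulation + Massive MIMO channel stage; the
    real system would use the paper's channel model instead of AWGN.
    """
    rng = np.random.default_rng(seed)
    symbols = 1.0 - 2.0 * bits.astype(float)  # bit 0 -> +1, bit 1 -> -1
    noise_std = np.sqrt(0.5 / 10 ** (snr_db / 10.0))
    received = symbols + noise_std * rng.standard_normal(symbols.shape)
    return (received < 0).astype(np.uint8)

# Hypothetical caption standing in for BLIP output; the receiver would
# feed the recovered text to a Stable Diffusion text-to-image model.
caption = "a red car parked on a city street"
bits = text_to_bits(caption)
recovered = bits_to_text(transmit_bpsk(bits, snr_db=12.0))
```

At moderate SNR the short caption (a few hundred bits) typically survives intact, which is the source of the bandwidth savings: the text payload is orders of magnitude smaller than the raw image it describes.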
Submission Number: 12