Translating Vision into Words: Advancing Object Recognition with Visual-Language Models

Keywords

Point Cloud, Visual Language Model, AIoT, CLIP, WiFi, Smart Environments

Haruki Yonekura, Hamada Rizk, Hirozumi Yamaguchi

The 22nd ACM International Conference on Mobile Systems, Applications, and Services (MobiSys'24 Posters)

This study proposes a novel method for automatic object recognition and classification in indoor environments using visual-language models (VLMs). Traditional methods face challenges due to the high cost of manually labeling objects and the ambiguity of textual descriptions. Our system learns from a large-scale dataset combining detailed 3D point cloud data and RGB images, enabling natural language-based object retrieval without predefined labels.
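Label-free retrieval of this kind can be pictured as a nearest-neighbor search in a shared text-image embedding space. The sketch below is only an illustration under stated assumptions: the three-dimensional vectors are hand-written stand-ins for encoder outputs, and the names `cosine`, `retrieve`, `object_vecs`, and `query_vec` are hypothetical, not taken from the poster.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, object_vecs, top_k=1):
    """Return the ids of the top_k objects most similar to the query."""
    ranked = sorted(
        object_vecs,
        key=lambda oid: cosine(query_vec, object_vecs[oid]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy embeddings (assumed values): in a real system these would come from
# an image encoder applied to views of each segmented object.
object_vecs = {
    "segment_a": [0.9, 0.1, 0.0],
    "segment_b": [0.1, 0.8, 0.2],
    "segment_c": [0.0, 0.2, 0.9],
}

# Toy embedding of a free-form text query (assumed value): in a real system
# this would come from the matching text encoder.
query_vec = [0.2, 0.9, 0.1]

best = retrieve(query_vec, object_vecs)
print(best)
```

Because the query is compared against embeddings rather than against a fixed label set, any phrase the text encoder can embed is a valid query.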
Our approach leverages CLIP (Contrastive Language-Image Pretraining) to integrate text and image representations, allowing for flexible, category-independent object recognition. Additionally, the system incorporates LiDAR sensors embedded in smartphones to capture environmental data and integrates WiFi Received Signal Strength Indicator (RSSI) to enhance recognition accuracy by utilizing wireless signal distribution. By embedding RSSI data into the 3D point cloud, the system improves the identification of objects that may not be directly visible.
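One way to picture attaching RSSI to a point cloud is to interpolate the signal strength measured at a few known positions onto every 3D point. The abstract does not say how the embedding is done, so the inverse-distance weighting below is an assumption chosen for simplicity, and `idw_rssi` and `attach_rssi` are hypothetical helper names. Averaging dBm values directly is also a simplification.

```python
import math


def idw_rssi(point, samples, power=2.0, eps=1e-9):
    """Estimate RSSI (dBm) at `point` by inverse-distance weighting.

    `samples` is a list of ((x, y, z), rssi_dbm) measurements.
    """
    num = 0.0
    den = 0.0
    for pos, rssi in samples:
        d = math.dist(point, pos)
        if d < eps:
            # The point coincides with a measurement position.
            return rssi
        w = 1.0 / d ** power
        num += w * rssi
        den += w
    return num / den


def attach_rssi(points, samples):
    """Append an interpolated RSSI channel to each (x, y, z) point."""
    return [(x, y, z, idw_rssi((x, y, z), samples)) for x, y, z in points]


# Assumed toy data: two RSSI readings and two LiDAR points.
samples = [((0.0, 0.0, 0.0), -40.0), ((2.0, 0.0, 0.0), -60.0)]
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

augmented = attach_rssi(points, samples)
print(augmented)
```

Each augmented point carries a fourth channel that a downstream model could use alongside geometry and color, for example as a cue for objects that are occluded in the RGB view.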




Published Papers

  • Haruki Yonekura, Hamada Rizk, Hirozumi Yamaguchi, "Mobile Sensor-Based Indoor Object Searching with Visual-Language Model," Research Report: Mobile Computing and New Social Systems (MBL), 2024-MBL-111, 1-5 (2024-05-08), ISSN 2188-8817, https://ipsj.ixsq.nii.ac.jp/records/233963
  • Yonekura, H., Rizk, H., & Yamaguchi, H. (2024, June). Poster: Translating Vision into Words: Advancing Object Recognition with Visual-Language Models. In Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services (pp. 740-741). https://dl.acm.org/doi/10.1145/3643832.3661407

Environment-Aware Distributed Scheduling for Emergency LoRa Networks

Yuto Inaba, Tatsuya Amano, Akihito Hiromori, Hirozumi Yamaguchi

2026 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), SPT-IoT 2026, pp. 1366–1371

Disaster Communication, LoRa, +4 more

A Lightweight Vision-Language Model for Disaster Image Summarization

Hibiki Yoshizaki, Akira Uchiyama, Akihito Hiromori, Mineo Takai, Hirozumi Yamaguchi

2026 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), PerconAI 2026, pp. 1203–1208

Semantic Communication, Disaster Response, +4 more

Physics-Integrated Deep Learning for Urban Landslide Prediction

Ren Ozeki, Hamada Rizk, Hirozumi Yamaguchi

2026 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), URBSENSE 2026, pp. 1094–1099

Landslide Prediction, Physics-Integrated Learning, +3 more

Ray-Tracing-Driven Pattern-Based Vehicle Recognition in ISAC Radar

Heetae Jin, Akira Uchiyama

2026 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), PerRad 2026, pp. 328–333

ISAC, Beyond 5G, +4 more

A Simulation Framework for Precision Formation Flying of Massive Satellite Swarms

Tatsuya Amano, Akihito Hiromori, Hirozumi Yamaguchi, Sumio Morioka

2026 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), PerVehicle, pp. 230–235

Satellite Formation Flying, Distributed Simulation, +4 more

A Digital Twin Approach for Crowd Flow Modeling on Railway Station Platforms

Yu Yasuda, Tatsuya Amano, Hirozumi Yamaguchi

IEEE International Conference on Smart Computing (SMARTCOMP), pp. 82-89

DOI 10.1109/SMARTCOMP65954.2025.00069

Digital Twin, Crowd Simulation, +1 more