A Lightweight Vision-Language Model for Disaster Image Summarization
2026 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), PerconAI 2026, pp. 1203–1208
Abstract
During disasters, response agencies must rapidly obtain accurate situational awareness. Images of on-site conditions are useful for this purpose, but their large data size makes real-time aggregation from many locations difficult when communication infrastructure is degraded. We address this challenge by combining a disaster-ready wide-area wireless system (DR-IoT) with small edge devices deployed across sites. Each device locally summarizes captured images into concise text and transmits the text as a compressed proxy, enabling objective reporting and efficient multi-site data collection under strict bandwidth limits. We develop a lightweight model that runs on small devices and generates textual summaries of disaster scenes. We evaluate our model against existing lightweight captioning baselines in terms of output quality, model size, and latency. Results show that it achieves practical latency and competitive accuracy for disaster-focused summarization, indicating its suitability for deployment on IoT devices in real disaster settings.
Immediately after a large-scale disaster, response agencies must obtain accurate situational awareness quickly. Images of on-site conditions are a valuable source of information, but their size makes real-time aggregation from many locations impractical when communication infrastructure is degraded. Disaster-ready wide-area wireless systems such as LPWA and DR-IoT provide only tens of kbps, which is enough for text but not for raw images.
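To make the bandwidth gap concrete, the following minimal sketch compares ideal transfer times for an image and for a text summary. The link rate, image size, and summary size are illustrative assumptions, not figures reported in the paper, and the calculation ignores protocol overhead and retransmissions.

```python
def transmit_seconds(payload_bytes: int, link_kbps: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and retransmissions."""
    return payload_bytes * 8 / (link_kbps * 1000)


# Assumed, illustrative values (not taken from the paper).
LINK_KBPS = 50             # one possible reading of "tens of kbps"
IMAGE_BYTES = 2_000_000    # a roughly 2 MB JPEG
SUMMARY_BYTES = 200        # a short text summary

image_s = transmit_seconds(IMAGE_BYTES, LINK_KBPS)
summary_s = transmit_seconds(SUMMARY_BYTES, LINK_KBPS)
print(f"image: {image_s:.0f} s, summary: {summary_s:.3f} s")
# prints "image: 320 s, summary: 0.032 s"
```

Under these assumptions a single image occupies the link for more than five minutes, while a summary takes a few hundredths of a second, which is why text can be aggregated from many sites where raw images cannot.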
We propose a two-stage reporting scheme: each edge device captures local scenes, summarizes them on-device into concise text, and transmits only the text as a compact proxy. The emergency operations center can then selectively request high-resolution images for scenes that require deeper analysis, allocating bandwidth and personnel where they matter most.
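The two-stage flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names, the keyword-based triage rule, and the `fake_summarizer` stand-in for the on-device model are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Report:
    device_id: str
    scene_id: int
    summary: str  # text-only payload sent in stage 1


@dataclass
class EdgeDevice:
    device_id: str
    summarize: Callable[[bytes], str]  # stand-in for the on-device model
    images: Dict[int, bytes] = field(default_factory=dict)

    def capture_and_report(self, scene_id: int, image: bytes) -> Report:
        # Stage 1: keep the image locally and send only its text summary.
        self.images[scene_id] = image
        return Report(self.device_id, scene_id, self.summarize(image))

    def fetch_image(self, scene_id: int) -> bytes:
        # Stage 2: return the full image only when explicitly requested.
        return self.images[scene_id]


def select_for_followup(reports: List[Report], keywords: List[str]) -> List[Report]:
    """Hypothetical triage rule: flag summaries that mention any keyword."""
    return [r for r in reports if any(k in r.summary.lower() for k in keywords)]


def fake_summarizer(image: bytes) -> str:
    # Placeholder logic; a real device would run the vision-language model here.
    if image.startswith(b"A"):
        return "Collapsed building blocking road"
    return "Street clear, no visible damage"


device = EdgeDevice("dev-1", fake_summarizer)
reports = [
    device.capture_and_report(1, b"A..."),
    device.capture_and_report(2, b"B..."),
]
urgent = select_for_followup(reports, ["collapsed", "flood"])
requested = {r.scene_id: device.fetch_image(r.scene_id) for r in urgent}
```

In this sketch only scene 1 is flagged, so only its image would be pulled over the link in stage 2. In practice the selection would more likely be made by operators at the emergency operations center than by a fixed keyword list.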
At the core of this system is a lightweight vision-language model designed to run on small edge devices. It targets disaster scene summarization and is evaluated against existing lightweight captioning baselines in terms of output quality, model size, and latency. Results show that our model achieves practical latency on IoT-class devices while matching or exceeding the baselines on disaster-domain summarization, indicating its suitability for real-world deployment.