Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. Single-snapshot imagery dominates current resources, but it cannot capture how a disaster's impact evolves over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale collection of 62,351 pre-event and post-event remote sensing image pairs (spanning earthquakes, floods, wildfires, and more) paired with detailed change captions. Based on the RSCC dataset, we build a change caption benchmark and evaluate several state-of-the-art temporal MLLMs. The quantitative and qualitative results expose the limitations of these models in complex temporal remote sensing image understanding. Our work aims to facilitate the training and evaluation of vision-language models on temporal remote sensing image understanding tasks.
| Model (#Activated Params) | N-Gram: ROUGE (%) ↑ | N-Gram: METEOR (%) ↑ | Contextual: BERT (%) ↑ | Contextual: ST5-SCS (%) ↑ | Avg_L (#Words) |
|---|---|---|---|---|---|
| BLIP-3 (3B) | 4.53 | 10.85 | 98.83 | 44.05 | *456 |
| + Textual Prompt | 10.07 (+5.54↑) | 20.69 (+9.84↑) | 98.95 (+0.12↑) | 63.67 (+19.62↑) | *302 |
| + Visual Prompt | 8.45 (-1.62↓) | 19.18 (-1.51↓) | 99.01 (+0.06↑) | 68.34 (+4.67↑) | *354 |
| Kimi-VL (3B) | 12.47 | 16.95 | 98.83 | 51.35 | 87 |
| + Textual Prompt | 16.83 (+4.36↑) | 25.47 (+8.52↑) | 99.22 (+0.39↑) | 70.75 (+19.40↑) | 108 |
| + Visual Prompt | 16.83 (+0.00) | 25.39 (-0.08↓) | 99.30 (+0.08↑) | 69.97 (-0.78↓) | 109 |
| Phi-4-Multimodal (4B) | 4.09 | 1.45 | 98.60 | 34.55 | 7 |
| + Textual Prompt | 17.08 (+13.00↑) | 19.70 (+18.25↑) | 98.93 (+0.33↑) | 67.62 (+33.07↑) | 75 |
| + Visual Prompt | 17.05 (-0.03↓) | 19.09 (-0.61↓) | 98.90 (-0.03↓) | 66.69 (-0.93↓) | 70 |
| Qwen2-VL (7B) | 11.02 | 9.95 | 99.11 | 45.55 | 42 |
| + Textual Prompt | 19.04 (+8.02↑) | 25.20 (+15.25↑) | 99.01 (-0.10↓) | 72.65 (+27.10↑) | 84 |
| + Visual Prompt | 18.43 (-0.61↓) | 25.03 (-0.17↓) | 99.03 (+0.02↑) | 72.89 (+0.24↑) | 88 |
| LLaVA-NeXT-Interleave (8B) | 12.51 | 13.29 | 99.11 | 46.99 | 57 |
| + Textual Prompt | 16.09 (+3.58↑) | 20.73 (+7.44↑) | 99.22 (+0.11↑) | 62.60 (+15.61↑) | 75 |
| + Visual Prompt | 15.76 (-0.33↓) | 21.17 (+0.44↑) | 99.24 (+0.02↑) | 65.75 (+3.15↑) | 88 |
| LLaVA-OneVision (8B) | 8.40 | 10.97 | 98.64 | 46.15 | *221 |
| + Textual Prompt | 11.15 (+2.75↑) | 19.09 (+8.12↑) | 98.85 (+0.21↑) | 70.08 (+23.93↑) | *285 |
| + Visual Prompt | 10.68 (-0.47↓) | 18.27 (-0.82↓) | 98.79 (-0.06↓) | 69.34 (-0.74↓) | *290 |
| InternVL 3 (8B) | 12.76 | 15.77 | 99.31 | 51.84 | 64 |
| + Textual Prompt | 19.81 (+7.05↑) | 28.51 (+12.74↑) | 99.55 (+0.24↑) | 78.57 (+26.73↑) | 81 |
| + Visual Prompt | 19.70 (-0.11↓) | 28.46 (-0.05↓) | 99.51 (-0.04↓) | 79.18 (+0.61↑) | 84 |
| Pixtral (12B) | 12.34 | 15.94 | 99.34 | 49.36 | 70 |
| + Textual Prompt | 19.87 (+7.53↑) | 29.01 (+13.07↑) | 99.51 (+0.17↑) | 79.07 (+29.71↑) | 97 |
| + Visual Prompt | 19.03 (-0.84↓) | 28.44 (-0.57↓) | 99.52 (+0.01↑) | 78.71 (-0.36↓) | 102 |
| CCExpert (7B) | 7.61 | 4.32 | 99.17 | 40.81 | 12 |
| + Textual Prompt | 8.71 (+1.10↑) | 5.35 (+1.03↑) | 99.23 (+0.06↑) | 47.13 (+6.32↑) | 14 |
| + Visual Prompt | 8.84 (+0.13↑) | 5.41 (+0.06↑) | 99.23 (+0.00) | 46.58 (-0.55↓) | 14 |
| TEOChat (7B) | 7.86 | 5.77 | 98.99 | 52.64 | 15 |
| + Textual Prompt | 11.81 (+3.95↑) | 10.24 (+4.47↑) | 99.12 (+0.13↑) | 61.73 (+9.09↑) | 22 |
| + Visual Prompt | 11.55 (-0.26↓) | 10.04 (-0.20↓) | 99.09 (-0.03↓) | 62.53 (+0.80↑) | 22 |
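
The parenthesized values in the table appear to be stacked deltas: the "+ Textual Prompt" row is compared against the model's unprompted row, and the "+ Visual Prompt" row is compared against the "+ Textual Prompt" row. This reading is inferred from the numbers, not stated in the text. The sketch below checks it on the InternVL 3 (8B) rows; the `deltas` helper and the variable names are illustrative only.

```python
# Scores copied from the InternVL 3 (8B) rows of the table above.
# Column order: ROUGE, METEOR, BERT, ST5-SCS (all in %).
baseline = [12.76, 15.77, 99.31, 51.84]  # unprompted model
textual = [19.81, 28.51, 99.55, 78.57]   # + Textual Prompt
visual = [19.70, 28.46, 99.51, 79.18]    # + Visual Prompt


def deltas(current, reference):
    """Per-metric change of `current` relative to `reference`, rounded to 2 decimals."""
    return [round(c - r, 2) for c, r in zip(current, reference)]


# Assumed convention: textual prompt vs. baseline, visual prompt vs. textual prompt.
textual_gain = deltas(textual, baseline)
visual_gain = deltas(visual, textual)

print(textual_gain)  # [7.05, 12.74, 0.24, 26.73]
print(visual_gain)   # [-0.11, -0.05, -0.04, 0.61]
```

Under this reading, the visual-prompt deltas measure the marginal effect of adding a visual prompt on top of the textual one, not the combined gain over the unprompted model.
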
@misc{rscc_chen_2025,
title = {RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events},
author = {Zhenyuan Chen and Chenxi Wang and Ningyu Zhang and Feng Zhang},
howpublished = {\url{https://github.com/Bili-Sakura/RSCC}},
year = {2025}
}