RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

Zhenyuan Chen, Chenxi Wang, Ningyu Zhang, Feng Zhang
Zhejiang University
under review 2025

Abstract

Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. Single-snapshot imagery dominates current resources, but it cannot capture how disaster impacts evolve over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale dataset comprising 62,351 pre-event and post-event remote sensing image pairs (spanning earthquakes, floods, wildfires, and more), each paired with a detailed change caption. Using the RSCC dataset, we build a change caption benchmark and evaluate several state-of-the-art temporal multimodal large language models (MLLMs). The quantitative and qualitative results expose the limitations of current models in complex temporal remote sensing image understanding. Our work aims to facilitate the training and evaluation of vision-language models on temporal remote sensing image understanding tasks.

RSCC Example
An example of RSCC.
Construction Pipeline
Construction pipeline.
Model Performance Comparison
| Model (#Activate Params)   | N-Gram: ROUGE(%)↑ | N-Gram: METEOR(%)↑ | Contextual: BERT(%)↑ | Contextual: ST5-SCS(%)↑ | Avg_L (#Words) |
|----------------------------|-------------------|--------------------|----------------------|-------------------------|----------------|
| BLIP-3 (3B)                | 4.53              | 10.85              | 98.83                | 44.05                   | *456           |
|   + Textual Prompt         | 10.07 (+5.54↑)    | 20.69 (+9.84↑)     | 98.95 (+0.12↑)       | 63.67 (+19.62↑)         | *302           |
|       + Visual Prompt      | 8.45 (-1.62↓)     | 19.18 (-1.51↓)     | 99.01 (+0.06↑)       | 68.34 (+4.67↑)          | *354           |
| Kimi-VL (3B)               | 12.47             | 16.95              | 98.83                | 51.35                   | 87             |
|   + Textual Prompt         | 16.83 (+4.36↑)    | 25.47 (+8.52↑)     | 99.22 (+0.39↑)       | 70.75 (+19.40↑)         | 108            |
|       + Visual Prompt      | 16.83 (+0.00)     | 25.39 (-0.08↓)     | 99.30 (+0.08↑)       | 69.97 (-0.78↓)          | 109            |
| Phi-4-Multimodal (4B)      | 4.09              | 1.45               | 98.60                | 34.55                   | 7              |
|   + Textual Prompt         | 17.08 (+13.00↑)   | 19.70 (+18.25↑)    | 98.93 (+0.33↑)       | 67.62 (+33.07↑)         | 75             |
|       + Visual Prompt      | 17.05 (-0.03↓)    | 19.09 (-0.61↓)     | 98.90 (-0.03↓)       | 66.69 (-0.93↓)          | 70             |
| Qwen2-VL (7B)              | 11.02             | 9.95               | 99.11                | 45.55                   | 42             |
|   + Textual Prompt         | 19.04 (+8.02↑)    | 25.20 (+15.25↑)    | 99.01 (-0.10↓)       | 72.65 (+27.10↑)         | 84             |
|       + Visual Prompt      | 18.43 (-0.61↓)    | 25.03 (-0.17↓)     | 99.03 (+0.02↑)       | 72.89 (+0.24↑)          | 88             |
| LLaVA-NeXT-Interleave (8B) | 12.51             | 13.29              | 99.11                | 46.99                   | 57             |
|   + Textual Prompt         | 16.09 (+3.58↑)    | 20.73 (+7.44↑)     | 99.22 (+0.11↑)       | 62.60 (+15.61↑)         | 75             |
|       + Visual Prompt      | 15.76 (-0.33↓)    | 21.17 (+0.44↑)     | 99.24 (+0.02↑)       | 65.75 (+3.15↑)          | 88             |
| LLaVA-OneVision (8B)       | 8.40              | 10.97              | 98.64                | 46.15                   | *221           |
|   + Textual Prompt         | 11.15 (+2.75↑)    | 19.09 (+8.12↑)     | 98.85 (+0.21↑)       | 70.08 (+23.93↑)         | *285           |
|       + Visual Prompt      | 10.68 (-0.47↓)    | 18.27 (-0.82↓)     | 98.79 (-0.06↓)       | 69.34 (-0.74↓)          | *290           |
| InternVL 3 (8B)            | 12.76             | 15.77              | 99.31                | 51.84                   | 64             |
|   + Textual Prompt         | 19.81 (+7.05↑)    | 28.51 (+12.74↑)    | 99.55 (+0.24↑)       | 78.57 (+26.73↑)         | 81             |
|       + Visual Prompt      | 19.70 (-0.11↓)    | 28.46 (-0.05↓)     | 99.51 (-0.04↓)       | 79.18 (+0.61↑)          | 84             |
| Pixtral (12B)              | 12.34             | 15.94              | 99.34                | 49.36                   | 70             |
|   + Textual Prompt         | 19.87 (+7.53↑)    | 29.01 (+13.07↑)    | 99.51 (+0.17↑)       | 79.07 (+29.71↑)         | 97             |
|       + Visual Prompt      | 19.03 (-0.84↓)    | 28.44 (-0.57↓)     | 99.52 (+0.01↑)       | 78.71 (-0.36↓)          | 102            |
| CCExpert (7B)              | 7.61              | 4.32               | 99.17                | 40.81                   | 12             |
|   + Textual Prompt         | 8.71 (+1.10↑)     | 5.35 (+1.03↑)      | 99.23 (+0.06↑)       | 47.13 (+6.32↑)          | 14             |
|       + Visual Prompt      | 8.84 (+0.13↑)     | 5.41 (+0.06↑)      | 99.23 (+0.00)        | 46.58 (-0.55↓)          | 14             |
| TEOChat (7B)               | 7.86              | 5.77               | 98.99                | 52.64                   | 15             |
|   + Textual Prompt         | 11.81 (+3.95↑)    | 10.24 (+4.47↑)     | 99.12 (+0.13↑)       | 61.73 (+9.09↑)          | 22             |
|       + Visual Prompt      | 11.55 (-0.26↓)    | 10.04 (-0.20↓)     | 99.09 (-0.03↓)       | 62.53 (+0.80↑)          | 22             |
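The parenthesized values in the table appear to be differences relative to the row directly above: the textual-prompt row is compared with the plain model, and the visual-prompt row with the textual-prompt row. The short Python sketch below checks that reading on the InternVL 3 (8B) rows. The helper names `deltas` and `fmt` are hypothetical and are not part of any RSCC tooling; the scores are copied from the table.

```python
from decimal import Decimal

# Scores copied from the InternVL 3 (8B) rows of the table:
# (ROUGE, METEOR, BERT, ST5-SCS). Strings keep Decimal arithmetic exact.
ROWS = {
    "baseline": ("12.76", "15.77", "99.31", "51.84"),
    "textual": ("19.81", "28.51", "99.55", "78.57"),
    "visual": ("19.70", "28.46", "99.51", "79.18"),
}


def deltas(current, previous):
    """Return per-metric differences between two table rows."""
    return [Decimal(c) - Decimal(p) for c, p in zip(current, previous)]


def fmt(delta):
    """Format a delta in the table's style, e.g. '+7.05↑' or '-0.11↓'."""
    if delta == 0:
        return "+0.00"
    arrow = "↑" if delta > 0 else "↓"
    return f"{delta:+.2f}{arrow}"


# Assumption: each prompt row is compared with the row directly above it.
textual_gain = [fmt(d) for d in deltas(ROWS["textual"], ROWS["baseline"])]
visual_gain = [fmt(d) for d in deltas(ROWS["visual"], ROWS["textual"])]

print(textual_gain)
print(visual_gain)
```

Under this reading, the computed values reproduce the deltas printed in the InternVL 3 rows, which suggests that the visual prompt is evaluated on top of the textual prompt rather than against the unprompted model.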
Qualitative Results 1
Visualization of qualitative results. Critical descriptions are colored in green while incorrect and hallucinated sentences/words are red.
Qualitative Results 2
Visualization of qualitative results. Critical descriptions are colored in green while incorrect and hallucinated sentences/words are red.
Win Rate Plot
Win rate of the QvQ-Max captions (ground truth) against each baseline model on the RSCC subset.
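For readers unfamiliar with the metric, a win rate is typically the share of pairwise comparisons that one side wins. The page does not state how ties are handled or how judgments were collected, so the sketch below is only an illustration: the function `win_rate`, the outcome labels, the half-credit tie rule, and the sample judgments are all assumptions, not the authors' protocol or data.

```python
from collections import Counter


def win_rate(outcomes):
    """Fraction of pairwise comparisons won.

    Assumption: a tie counts as half a win. The source does not
    specify a tie rule, so this choice is illustrative only.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons to score")
    return (counts["win"] + 0.5 * counts["tie"]) / total


# Made-up judgments for one reference-vs-baseline pairing (not RSCC data).
sample_outcomes = ["win", "win", "tie", "loss"]
print(win_rate(sample_outcomes))
```

Repeating this per baseline would yield one win rate per model, which is the kind of quantity a win-rate plot displays.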

BibTeX

@misc{rscc_chen_2025,
  title = {RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events},
  author = {Zhenyuan Chen and Chenxi Wang and Ningyu Zhang and Feng Zhang},
  howpublished = {\url{https://github.com/Bili-Sakura/RSCC}},
  year = {2025}
}