RSVLM-QA: A Benchmark Dataset for Remote Sensing Vision Language Model-based Question Answering

Xing Zi, University of Technology Sydney
Jinghao Xiao, University of Technology Sydney
Yunxiao Shi, University of Technology Sydney
Xian Tao, Institute of Automation, CAS
Jun Li, University of Technology Sydney
Ali Braytee, University of Technology Sydney
Mukesh Prasad, University of Technology Sydney
Abstract

Visual Question Answering (VQA) in remote sensing (RS) is pivotal for interpreting Earth observation data. However, existing RS VQA datasets are constrained by limitations in annotation richness, question diversity, and the assessment of specific reasoning capabilities. This paper introduces Remote Sensing Vision Language Model Question Answering (RSVLM-QA), a new large-scale, content-rich VQA dataset for the RS domain. RSVLM-QA integrates several well-known RS segmentation and detection datasets, namely WHU, LoveDA, INRIA, and iSAID. We employ an innovative dual-track annotation generation pipeline. Firstly, we leverage Large Language Models (LLMs), specifically GPT-4.1, with meticulously designed prompts to automatically generate a suite of detailed annotations including image captions, spatial relations, and semantic tags, alongside complex caption-based VQA pairs. Secondly, to address the challenging task of object counting in RS imagery, we developed a specialized automated process that extracts object counts directly from the original segmentation data; GPT-4.1 then formulates natural language answers from these counts, which are paired with preset question templates to create counting QA pairs. RSVLM-QA comprises 13,820 images and 162,373 VQA pairs, featuring extensive annotations and diverse question types. We provide a detailed statistical analysis of the dataset and a comparison with existing RS VQA benchmarks, highlighting the superior depth and breadth of RSVLM-QA's annotations. Furthermore, we conduct benchmark experiments on six mainstream Vision Language Models (VLMs), demonstrating that RSVLM-QA effectively evaluates and challenges the understanding and reasoning abilities of current VLMs in the RS domain. We believe RSVLM-QA will serve as a pivotal resource for the RS VQA and VLM research communities, poised to catalyze advancements in the field. The dataset, generation code, and benchmark models are publicly available at https://github.com/StarZi0213/RSVLM-QA.

Dataset Generation Pipeline

Figure: Overview of the RSVLM-QA dataset generation pipeline, illustrating the multi-stage process from image description to VQA pair creation.
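
As a concrete illustration of the counting track, the sketch below tallies per-class object counts from instance annotations (assumed here to be in COCO-style JSON, as distributed for iSAID) and pairs them with preset question templates. The file path, template wording, and function names are illustrative; the released generation code is the authoritative implementation, and in the actual pipeline GPT-4.1 rephrases the counts into natural language answers.

```python
# Hypothetical sketch of the counting track: per-class counts from COCO-style
# instance annotations, paired with preset question templates.
import json
from collections import Counter

QUESTION_TEMPLATES = [
    "How many {category} are visible in this image?",
    "What is the number of {category} present in the scene?",
]

def count_objects(coco_json_path):
    """Return {image_id: Counter({category_name: count})} from a COCO-style file."""
    with open(coco_json_path) as f:
        coco = json.load(f)
    cat_names = {c["id"]: c["name"] for c in coco["categories"]}
    counts = {}
    for ann in coco["annotations"]:
        counts.setdefault(ann["image_id"], Counter())[cat_names[ann["category_id"]]] += 1
    return counts

def build_counting_qa(counts):
    """Pair ground-truth counts with a question template and a draft answer
    (the real pipeline has GPT-4.1 turn the counts into fluent answers)."""
    qa_pairs = []
    for image_id, per_class in counts.items():
        for category, n in per_class.items():
            qa_pairs.append({
                "image_id": image_id,
                "question": QUESTION_TEMPLATES[0].format(category=category),
                "answer": f"There are {n} {category} in the image.",
            })
    return qa_pairs

if __name__ == "__main__":
    counts = count_objects("isaid_train_annotations.json")  # hypothetical path
    print(build_counting_qa(counts)[:3])
```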

Generation & Evaluation Prompts

The prompts used in our pipeline for generating the various annotations and for evaluation are available in Prompts.md on GitHub.
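
To show how such a prompt might be issued to GPT-4.1, the sketch below assumes the OpenAI Python client and the Chat Completions API with an image passed as a base64 data URL; the caption prompt shown is a paraphrase for illustration, not the exact text from Prompts.md.

```python
# Illustrative call to GPT-4.1 with a caption-generation prompt.
# Assumes the OpenAI Python client; the prompt text below is NOT the exact
# prompt used in the pipeline (see Prompts.md for the real prompts).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_caption(image_path: str, prompt: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

caption_prompt = ("Describe this remote sensing image in a detailed paragraph, "
                  "covering the main land-cover types, objects, and their spatial layout.")
print(generate_caption("example_tile.png", caption_prompt))  # hypothetical image file
```
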
Comparison of Remote Sensing VQA and Image Captioning Datasets

| Dataset Name | # Images (Approx.) | # VQA/Caps (Approx.) | Annotation Method | Caption | Caption Length | Object Recog. | Feature Underst. | Spatial Rel. | Precise Count | Existence Judg. | Answer Style (NL) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RSVLM-QA (Ours) | 13.8k | 162k VQA | GPT-4.1 + GT¹ + Manual Review | ✓ | Paragraph | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| RSVQA-HR | 10.7k | 1M+ VQA | Algorithm | | N/A Cap | | | | | | |
| RSVQA-LR | 0.8k | 77k VQA | Algorithm | | N/A Cap | | | | | | |
| CRSVQA | 1k | 10k VQA | Manual | | N/A Cap | | | | | | |
| RSVQAxBEN² | 159k | 1.5M+ VQA | Algorithm | | N/A Cap | | | | | | |
| TAMMI | 2.2k | 22k VQA | Hybrid (Auto + Manual) | | N/A Cap | | | | | | |
| LRS-VQA³ | 1.7k | 7k VQA | Manual | | N/A Cap | | | | | | |
| VRSBench⁴ | 29.6k | 123k VQA + 30k Cap. | GPT-4 + Manual Review | | Paragraph | | | | | | |
| Captioning-focused datasets (primarily for caption generation, not VQA tasks like Object Recog., Count, etc.) | | | | | | | | | | | |
| RSICD | 10.9k | 55k Captions | Manual | | Short | | | | | | |
| UCM-Captions | 2.1k | 10.5k Captions | CNN+RNN | | Short | | | | | | |
| Sydney-Captions | 0.6k | 3k Captions | CNN+RNN | | Short | | | | | | |
| RSICap⁵ | 2.6k | 0.9k Caption Pairs | Manual | | Paragraph | | | | | | |
| ChatEarthNet⁶ | 173k | 173k Captions | ChatGPT + Manual Review | | Paragraph | | | | | | |
| BRSIC | 10.9k | 55k Bilingual Captions | Manual + GPT + Manual | | Short | | | | | | |
| RSICC/LEVIR-CC⁷ | 5.1k | 10k Change Captions | Manual | | Short | | | | | | |

Table Notes:
Symbols: ✓ = fully supports/present; ✗ = does not support/absent; ◑ = partially supports/limited extent; N/A Cap = no caption or not applicable for caption length; N/A Feat = not a primary focus for this dataset type.
NL: Natural Language. GT: Ground Truth.
(Class.): Counting presented as a classification task.
(Short): Answer style is short phrases/words, or the caption is a short sentence.
(Cap.): Answer style refers to caption generation, not VQA responses.
(Bilingual): Captions are provided in two languages.
(Change): Captions describe changes between bi-temporal images.
¹ LLM for rich annotations & VQA; ground truth for precise counting.
² Question types extremely limited (land cover presence/name only).
³ Focus on large-size remote sensing imagery perception.
⁴ High-quality, human-verified; multi-task (captioning, VQA, referring expressions).
⁵ High-quality detailed captions, but small scale.
⁶ Large scale; focus on land cover description, not discrete objects.
⁷ Focus on describing changes between bi-temporal images.

Dataset Statistics

Overall Statistics

| Metric | Value |
| --- | --- |
| Total Images | 13,820 |
| Total VQA Pairs | 162,373 |
| Question Types | 6 |
| Vocabulary Size (Unique Words) | ~5,700 |
| Avg. Relations per Image | 5.63 |
| Avg. Tags (Entities) per Image | 10.62 |
| Avg. Question Length (words) | 9.23 |
| Avg. Answer Length (words) | 18.80 |
| Total Caption Sentences | 62,539 |
| Avg. Sentences per Caption | 4.67 |
| Avg. Caption Length (words) | 124.25 |

Structured Annotations (Relations and Tags)

| Type | Total | Avg. per Image | Median | Max | Min |
| --- | --- | --- | --- | --- | --- |
| Relations | 77,823 | 5.63 | 6 | 19 | 1 |
| Tags (Entities) | 146,814 | 10.62 | 11 | 29 | 1 |
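
To make these per-image figures concrete, the sketch below shows one plausible shape for a per-image annotation record (the field names are illustrative, not the dataset's exact schema) and the aggregation behind the averages and medians reported above.

```python
# Illustrative per-image annotation record and the aggregation behind
# "Avg. per Image" and "Median" for relations and tags (field names assumed).
from statistics import mean, median

annotations = [
    {
        "image_id": "whu_000123",  # hypothetical identifier
        "relations": [("building", "adjacent to", "road"),
                      ("parking lot", "north of", "building")],
        "tags": ["building", "road", "parking lot", "tree", "car"],
    },
    # ... one record per image (13,820 in total)
]

rel_counts = [len(a["relations"]) for a in annotations]
tag_counts = [len(a["tags"]) for a in annotations]
print(f"Relations: avg {mean(rel_counts):.2f}, median {median(rel_counts)}")
print(f"Tags:      avg {mean(tag_counts):.2f}, median {median(tag_counts)}")
```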

VQA Pair Distribution by Type

| Category | Total VQA Pairs | Avg. Q Length | Avg. A Length |
| --- | --- | --- | --- |
| Spatial | 39,467 | 10.58 | 12.35 |
| Quantity | 40,914 | 9.25 | 10.45 |
| Presence | 27,608 | 7.17 | 5.17 |
| Features | 26,634 | 9.95 | 13.58 |
| Objects | 13,930 | 10.30 | 10.81 |
| Captions | 13,820 | 7.00 | 107.82 |
| Total | 162,373 | 9.23 | 10.58* |

* The 'Total' for 'Avg. A Length' excludes the 'Captions' category, whose answer lengths differ markedly from the other types, giving a more representative average for the remaining VQA categories.
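
As a worked check of the starred note, the snippet below recomputes the overall average answer length as a count-weighted mean of the per-category averages, excluding Captions; the small gap to the reported 10.58 is due to rounding in the per-category figures.

```python
# Count-weighted mean of Avg. A Length over all categories except Captions,
# using the per-category figures from the table above.
categories = {
    # name: (total VQA pairs, avg. answer length in words)
    "Spatial":  (39_467, 12.35),
    "Quantity": (40_914, 10.45),
    "Presence": (27_608, 5.17),
    "Features": (26_634, 13.58),
    "Objects":  (13_930, 10.81),
}

total_pairs = sum(n for n, _ in categories.values())
weighted_avg = sum(n * a for n, a in categories.values()) / total_pairs
print(f"{weighted_avg:.2f}")  # ~10.57, matching the reported 10.58 up to rounding
```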

Question Type Distribution Visualized

Figure: Pie and bar charts illustrating the distribution of VQA pairs across the different question types.

Model Performance Benchmark

Average Scores (All Categories):

🥇 Ovis2: 70.2%
🥈 InternVL3: 61.7%
🥉 Qwen2.5-VL: 58.1%
Gemma3: 58.0%
LLaVA: 52.1%
BLIP-2: 29.0%

Average Scores (Excluding Caption):

🥇 Ovis2: 66.8%
🥈 Gemma3: 59.4%
🥉 InternVL3: 57.7%
Qwen2.5-VL: 53.2%
LLaVA: 50.3%
BLIP-2: 29.5%

Performance of various Vision Language Models (VLMs) on the RSVLM-QA dataset, categorized by different reasoning abilities, including captioning. Scores represent accuracy (%) or an evaluation score (0-100).
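
Since the repository also provides evaluation prompts, free-form answers (e.g., captions and feature descriptions) are presumably graded by an LLM judge on the 0-100 scale; the sketch below shows one plausible way such a score could be obtained with GPT-4.1. The rubric wording is illustrative only; the actual evaluation prompts are in Prompts.md.

```python
# Illustrative LLM-judge scoring of a free-form VQA answer on a 0-100 scale.
# Assumes the OpenAI Python client; the rubric below is NOT the evaluation
# prompt used for the reported benchmark (see Prompts.md).
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are grading an answer to a remote sensing VQA question.\n"
    "Question: {question}\nReference answer: {reference}\nModel answer: {prediction}\n"
    "Reply with only an integer from 0 to 100 reflecting factual agreement."
)

def judge_score(question: str, reference: str, prediction: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, prediction=prediction)}],
    )
    return int(response.choices[0].message.content.strip())

print(judge_score(
    "How many buildings are visible in this image?",
    "There are 14 buildings in the image.",
    "I can see roughly a dozen buildings.",
))
```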