MVT: MASK-GROUNDED VISION-LANGUAGE MODELS FOR TAXONOMY-ALIGNED LAND-COVER TAGGING
Abstract
Land-cover understanding in remote sensing increasingly demands taxonomy-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under cross-dataset domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than on unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene-description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (training on OpenEarthMap, evaluating on LoveDA), domain-adapted SAM2 improves mask quality, and dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.
Method Overview
Overview of the MVT pipeline: mask-grounded region discovery, dual-step MLLM tuning, and evaluation.
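As a minimal sketch of how the three stages compose, the snippet below treats each stage as a pluggable callable. All names here (`run_pipeline`, `segment`, `interpret`, `judge`) are illustrative placeholders, not the released MVT interface.

```python
# Minimal sketch of the three-stage flow. Every callable here is a
# placeholder named for illustration, not the released MVT interface.
from typing import Any, Callable, Dict, List

def run_pipeline(
    image: Any,
    segment: Callable,    # stage 1: domain-adapted SAM2 mask generator
    interpret: Callable,  # stage 2: dual-step LoRA-tuned MLLM -> (tag, description)
    judge: Callable,      # stage 3: LLM-as-judge, calibrated on expert ratings
) -> List[Dict]:
    results = []
    for mask in segment(image):  # class-agnostic, boundary-faithful region masks
        tag, description = interpret(image, mask)     # taxonomy-aligned tag + text
        score = judge(image, mask, tag, description)  # scalar quality score
        results.append({"mask": mask, "tag": tag,
                        "description": description, "score": score})
    return results
```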
Results
Tagging accuracy and description quality (LoveDA subset)
| Method | Level-1 Correct | Level-1 Acc. | Level-2 Correct | Level-2 Acc. | Description Score | OES |
|---|---|---|---|---|---|---|
| QW-S | 439 | 0.978 | 378 | 0.842 | 384.5 | 8.028 |
| QW-D | 433 | 0.964 | 400 | 0.891 | 406 | 8.278 |
| IVL-S | 405 | 0.902 | 331 | 0.737 | 357.5 | 7.306 |
| IVL-D | 442 | 0.984 | 379 | 0.844 | 408 | 8.212 |
| PIX-S | 204 | 0.454 | 174 | 0.388 | 185.5 | 3.765 |
| PIX-D | 247 | 0.550 | 204 | 0.454 | 218 | 4.470 |
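The count and accuracy columns above are mutually consistent with an evaluation set of roughly 449 regions. Note that 449 is back-calculated from the correct/accuracy pairs, not a number reported on this page; a quick consistency check:

```python
# Consistency check for the Level-1 columns above. N = 449 is inferred
# from the correct/accuracy pairs, not stated anywhere on this page.
N = 449
rows = [("QW-S", 439, 0.978), ("IVL-D", 442, 0.984), ("PIX-D", 247, 0.550)]
for method, correct, acc in rows:
    assert abs(correct / N - acc) < 5e-3, method  # matches to rounding
print("Level-1 counts and accuracies agree with N =", N)
```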
Manual scoring statistics
| Method | Mean | Min | Max | Q1 | Median | Q3 | Var | Std |
|---|---|---|---|---|---|---|---|---|
| QW-D | 92.82 | 69 | 100 | 88.5 | 94.0 | 96.5 | 28.68 | 5.36 |
| IVL-D | 89.76 | 0 | 100 | 88.0 | 92.5 | 96.5 | 269.97 | 16.43 |
| PIX-D | 92.47 | 61 | 100 | 88.0 | 92.5 | 96.5 | 30.31 | 5.51 |
GPT-4o naturalness evaluation
| Method | Mean | Min | Max | Q1 | Median | Q3 | Var | Std |
|---|---|---|---|---|---|---|---|---|
| QW-D | 68.36 | 42.0 | 87.0 | 62.5 | 68.5 | 74.5 | 53.51 | 7.28 |
| IVL-D | 66.60 | 0.0 | 89.0 | 61.5 | 69.5 | 75.0 | 187.13 | 13.68 |
| PIX-D | 64.86 | 41.5 | 84.5 | 59.5 | 64.0 | 70.5 | 54.48 | 7.38 |
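Both score tables report the same summary statistics. For reference, a sketch of how such columns are conventionally computed with numpy; the `scores` array is placeholder data, and whether Var/Std use the population or sample convention is not stated above:

```python
# Standard definitions of the summary columns. `scores` is placeholder
# data standing in for one method's per-sample ratings.
import numpy as np

scores = np.array([69.0, 88.5, 92.5, 94.0, 96.5, 100.0])  # dummy values
summary = {
    "Mean": scores.mean(),
    "Min": scores.min(),
    "Max": scores.max(),
    "Q1": np.percentile(scores, 25),
    "Median": np.percentile(scores, 50),
    "Q3": np.percentile(scores, 75),
    "Var": scores.var(),  # population variance; pass ddof=1 for sample variance
    "Std": scores.std(),
}
print(summary)
```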
Paper
arXiv preprint: https://arxiv.org/abs/2509.18693
BibTeX
@article{chen2025osda,
title={OSDA: A Framework for Open-Set Discovery and Automatic Interpretation of Land-cover in Remote Sensing Imagery},
author={Chen, Siyi and Wang, Kai and Pang, Weicong and Yang, Ruiming and Chen, Ziru and Gao, Renjun and Lau, Alexis Kai Hon and Gu, Dasa and Zhang, Chenchen and Li, Cheng},
journal={arXiv preprint arXiv:2509.18693},
year={2025}
}