MVT: MASK-GROUNDED VISION-LANGUAGE MODELS FOR TAXONOMY-ALIGNED LAND-COVER TAGGING

HKUST · JHU · CUHK(SZ) · NUS · MUST

Abstract

Land-cover understanding in remote sensing increasingly demands taxonomy-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under cross-dataset domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.

Method Overview


Overview of the MVT pipeline: mask-grounded region discovery, dual-step MLLM tuning, and evaluation.
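The three stages can be sketched as a minimal Python skeleton. All function names here (extract_masks, tag_region, judge_output) are hypothetical placeholders for illustration, not the released API; each stage body is stubbed where the paper uses domain-adapted SAM2, a dual-step LoRA-tuned MLLM, and an LLM judge respectively.

```python
def extract_masks(image):
    """Stage 1: class-agnostic, boundary-faithful region masks
    (domain-adapted SAM2 in the paper; stubbed here)."""
    return [{"id": f"region_{i}", "mask": None} for i in range(3)]

def tag_region(region, image):
    """Stage 2: mask-grounded semantic tag + scene description from a
    dual-step LoRA fine-tuned multimodal LLM (stubbed here)."""
    return {"id": region["id"], "tag": "<anonymized-id>", "description": "..."}

def judge_output(result):
    """Stage 3: LLM-as-judge score, calibrated against stratified
    expert ratings in the paper (stubbed here)."""
    return {"id": result["id"], "score": 0.0}

def run_pipeline(image):
    """Chain the three stages: masks -> mask-grounded tags -> judged scores."""
    regions = extract_masks(image)
    tagged = [tag_region(r, image) for r in regions]
    return [judge_output(t) for t in tagged]
```

The key design point the sketch reflects is that tagging consumes mask-level evidence from stage 1 rather than whole-image inputs, and evaluation operates on the tagged outputs rather than raw masks.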

Results

Tagging and description accuracy (LoveDA subset)

Method   Level-1   Level-1 Acc   Level-2   Level-2 Acc   Description   OES
QW-S     439       0.978         378       0.842         384.5         8.028
QW-D     433       0.964         400       0.891         406           8.278
IVL-S    405       0.902         331       0.737         357.5         7.306
IVL-D    442       0.984         379       0.844         408           8.212
PIX-S    204       0.454         174       0.388         185.5         3.765
PIX-D    247       0.550         204       0.454         218           4.470
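The accuracy columns are consistent with a fixed evaluation set of 449 regions; this total is inferred from the count/accuracy ratios, not stated in the table. A quick consistency check:

```python
# Level-1 correct counts and reported accuracies from the table above.
# TOTAL = 449 is an inference (count / accuracy for every row), not a
# figure stated on this page.
level1 = {"QW-S": (439, 0.978), "QW-D": (433, 0.964), "IVL-S": (405, 0.902),
          "IVL-D": (442, 0.984), "PIX-S": (204, 0.454), "PIX-D": (247, 0.550)}
TOTAL = 449
for method, (count, acc) in level1.items():
    # Each reported accuracy matches count / TOTAL rounded to 3 decimals.
    assert round(count / TOTAL, 3) == acc, method
```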

Manual scoring statistics

Method   Mean    Min   Max   Q1     Q3     Med    Var      Std
QW-D     92.82   69    100   88.5   96.5   94.0   28.68    5.36
IVL-D    89.76   0     100   88.0   96.5   92.5   269.97   16.43
PIX-D    92.47   61    100   88.0   96.5   92.5   30.31    5.51

GPT-4o naturalness evaluation

Method   Mean    Min    Max    Q1     Q3     Med    Var      Std
QW-D     68.36   42.0   87.0   62.5   74.5   68.5   53.51    7.28
IVL-D    66.60   0.0    89.0   61.5   75.0   69.5   187.13   13.68
PIX-D    64.86   41.5   84.5   59.5   70.5   64.0   54.48    7.38
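The summary columns in both scoring tables follow the usual descriptive statistics, with Std consistent with the square root of Var. A sketch of how such a row can be computed from raw per-item ratings (the scores below are hypothetical; the paper's raw ratings are not published here):

```python
import statistics

# Hypothetical per-item ratings, for illustration only.
scores = [68.5, 74.5, 62.5, 42.0, 87.0]

summary = {
    "Mean": statistics.mean(scores),
    "Min": min(scores),
    "Max": max(scores),
    "Med": statistics.median(scores),
    "Var": statistics.pvariance(scores),  # population variance
    "Std": statistics.pstdev(scores),     # Std = sqrt(Var)
}
```

Whether the tables use population or sample variance cannot be determined from the rounded values alone; the snippet uses the population form as one plausible convention.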

Paper

BibTeX

@article{chen2025osda,
  title={OSDA: A Framework for Open-Set Discovery and Automatic Interpretation of Land-cover in Remote Sensing Imagery},
  author={Chen, Siyi and Wang, Kai and Pang, Weicong and Yang, Ruiming and Chen, Ziru and Gao, Renjun and Lau, Alexis Kai Hon and Gu, Dasa and Zhang, Chenchen and Li, Cheng},
  journal={arXiv preprint arXiv:2509.18693},
  year={2025}
}