OVS results comparing three settings: text-only (zero-shot), visual-only, and the full RnS combining both. Visual support starts from a small set A of pixel-annotated images and is later expanded to a larger set B (A ⊆ B), covering more classes. Text-only support yields ambiguous predictions (e.g. rider misclassified as motorcycle); visual-only struggles when some classes lack support and can confuse similar objects. By retrieving support images relevant to each test image and fusing them with textual features, RnS remains robust under partial support and achieves accurate segmentation.
Abstract
Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
At a Glance
Zero-shot Open-Vocabulary Segmentation (OVS) falls significantly short of fully supervised approaches, held back by the coarse image-level supervision of vision-language models and the inherent ambiguity of text prompts alone. Retrieve and Segment (RnS) closes this gap by augmenting class names with a small set of pixel-annotated visual examples. At test time, RnS retrieves the support examples most relevant to each query image and uses them, together with textual features, to train a lightweight linear classifier directly on VLM features in just a few gradient steps. This learned, per-query fusion of both modalities improves zero-shot baselines by up to 34% on average, while all support information is stored in a memory-efficient index that scales gracefully as the support set grows.
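The retrieve-then-adapt idea described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`retrieve`, `fit_classifier`, `segment`), the mean-pooled retrieval query, the text-initialized weights, the cross-entropy objective, and all hyperparameters and toy 2-D features are assumptions chosen only to make the example small and runnable.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve(query_feats, support, k):
    """Return the k support entries closest (cosine) to the mean query feature.
    Mean pooling as the retrieval query is an assumption of this sketch."""
    dim = len(query_feats[0])
    mean_q = normalize([sum(f[d] for f in query_feats) / len(query_feats)
                        for d in range(dim)])
    ranked = sorted(support, key=lambda s: dot(mean_q, normalize(s[0])),
                    reverse=True)
    return ranked[:k]

def fit_classifier(text_feats, retrieved, steps=20, lr=0.5, scale=10.0):
    """Fit one linear classifier for a single query image.
    Fusion here (an assumption): weights start at the text embeddings, and the
    training set mixes retrieved visual features with the text embeddings."""
    W = [normalize(t) for t in text_feats]
    data = [(normalize(f), y) for f, y in retrieved]
    data += [(normalize(t), c) for c, t in enumerate(text_feats)]
    for _ in range(steps):
        grad = [[0.0] * len(W[0]) for _ in W]
        for x, y in data:
            p = softmax([scale * dot(w, x) for w in W])
            for c in range(len(W)):
                # Gradient of mean softmax cross-entropy w.r.t. row c of W.
                g = scale * (p[c] - (1.0 if c == y else 0.0)) / len(data)
                for d in range(len(x)):
                    grad[c][d] += g * x[d]
        W = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W, grad)]
    return W

def segment(query_feats, W):
    """Assign each query patch feature to the class with the highest score."""
    return [max(range(len(W)), key=lambda c: dot(W[c], normalize(f)))
            for f in query_feats]

# Toy data: class 0 = "rider", class 1 = "motorcycle" (made-up 2-D features).
text_feats = [[1.0, 0.0], [0.0, 1.0]]
support = [([0.6, 0.8], 0), ([1.0, -1.0], 0), ([-0.3, 1.0], 1)]
query_feats = [[0.58, 0.81], [-0.2, 1.0]]  # a rider patch, a motorcycle patch

retrieved = retrieve(query_feats, support, k=2)
text_only = segment(query_feats, [normalize(t) for t in text_feats])
adapted = segment(query_feats, fit_classifier(text_feats, retrieved))
print("text-only:", text_only)
print("adapted:  ", adapted)
```

In this toy setup the text-only classifier labels the rider patch as motorcycle, while the classifier adapted on the retrieved support features separates the two. A real system would operate on high-dimensional VLM patch features and would likely use an approximate nearest-neighbor index in place of the exhaustive sort.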
RnS is designed to be plug-and-play: it is compatible with multiple VLM backbones and operates seamlessly on plain patch-level features as well as on mask proposals. The method handles dynamic few-shot settings where visual or textual information may be absent for certain test classes, and improves over diverse OVS approaches by at least 14.1% on average, all while retaining full open-vocabulary generalization.
BibTeX
@article{aravanis2026retrieve,
title = {Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?},
author = {Aravanis, Tilemachos and Stojni{\'c}, Vladan and Psomas, Bill and Komodakis, Nikos and Tolias, Giorgos},
journal = {arXiv preprint arXiv:2602.23339},
year = {2026}
}