i-CIR: Instance-Level Composed Image Retrieval
Bill Psomas* George Retsinas* Nikos Efthymiadis Panagiotis Filntisis Yannis Avrithis Petros Maragos Ondrej Chum Giorgos Tolias
*Bill Psomas and George Retsinas contributed equally.
1 VRG CTU in Prague · 2 Athena Research Center · 3 National Technical University of Athens · 4 Hellenic Robotics Center of Excellence · 5 IARAI
Dataset
i-CIR is a benchmark that targets instance-level composed image retrieval. Each example pairs an object-centric image query with a free-form text modifier and asks for images showing that very instance under the requested change. The collection stays compact for reproducibility, while careful negative mining makes it as challenging as searching against 40M distractors.
The benchmark stresses both modalities with:
- 202 object instances drawn from landmarks, consumer goods, fictional characters, and tech devices.
- 1,883 composed queries mixing the image prompt with modifiers that touch appearance, context, attributes, and viewpoint.
- 750K database images comprising positives and visual, textual, and composed hard negatives that emulate web-scale distractors.
- BASIC, an accompanying training-free method that fuses frozen text and image similarities to set a new state of the art without extra training.
Method
BASIC is a training-free composed retrieval method that treats the image and text cues as a logical AND. We start from VLM embeddings, clean them with modality-specific centering, project images into an object-focused subspace, and contextualize the text to match the VLM's caption distribution. The two refined views are scored independently and fused multiplicatively, so that only candidates satisfying both cues surface.
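A minimal numpy sketch of this multiplicative fusion, assuming precomputed, L2-normalized VLM embeddings; the helper names (`center`, `minmax`, `basic_scores`), the min-max normalization, and the omission of the projection and conditioning steps are simplifications for illustration, not the released BASIC pipeline. The individual steps are detailed in the list that follows.

```python
import numpy as np

def center(x, mu):
    """Subtract a modality-specific mean, then renormalize to unit norm."""
    x = x - mu
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def minmax(s):
    """Rescale scores to [0, 1] so the two modalities are comparable."""
    return (s - s.min()) / (s.max() - s.min() + 1e-8)

def basic_scores(db, q_img, q_txt, mu_img, mu_txt):
    """db: (N, d) database image embeddings; q_img, q_txt: (d,) queries."""
    db_c = center(db, mu_img)
    s_img = minmax(db_c @ center(q_img[None], mu_img)[0])  # visual cue
    s_txt = minmax(db_c @ center(q_txt[None], mu_txt)[0])  # textual cue
    return s_img * s_txt  # high only where BOTH cues fire: a soft logical AND
```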
- Centering & projection. Subtracting global means estimated on LAION and projecting with object vs. style corpora (C+, C−) suppresses background and stylistic bias while keeping instance content.
- Query conditioning. Optional visual query expansion and caption-like text prompts align each modality with CLIP's training regime, stabilizing the representations of short phrases.
- Normalized fusion. Min-based score scaling plus a Harris-inspired penalty favors candidates activated by both modalities and tames dominant single-modality hits.
- Deployment ready. Runs directly on frozen CLIP features and FAISS indexes; no finetuning, no backprop, and all query-side work happens on the fly (see the sketch after this list).
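To ground the deployment claim, here is a sketch of serving BASIC-style retrieval from a FAISS index over frozen CLIP features. The file name, the shortlist-then-rescore flow, the penalty constant `k`, and the Harris-inspired form `s1*s2 - k*(s1 + s2)**2` (by analogy with the Harris response det − k·trace²) are assumptions based on the description above, not released code.

```python
import faiss
import numpy as np

# Assumed: precomputed, L2-normalized CLIP image features for the database.
db = np.load("clip_db_features.npy").astype("float32")  # hypothetical file, (N, d)
index = faiss.IndexFlatIP(db.shape[1])                  # exact inner-product search
index.add(db)

def retrieve(q_img, q_txt, k_short=1000, k_final=100, k=0.05):
    """q_img, q_txt: (d,) unit-norm query-side embeddings, computed on the fly."""
    # Shortlist by the visual cue alone, then re-score with both modalities.
    _, ids = index.search(q_img[None].astype("float32"), k_short)
    cand = db[ids[0]]
    s_img = cand @ q_img.astype("float32")
    s_txt = cand @ q_txt.astype("float32")
    # One reading of min-based scaling: shift each modality to start at zero.
    s_img -= s_img.min()
    s_txt -= s_txt.min()
    # Harris-style fusion: the product rewards joint activation, the squared-sum
    # penalty tames candidates driven by a single dominant modality.
    fused = s_img * s_txt - k * (s_img + s_txt) ** 2
    order = np.argsort(-fused)[:k_final]
    return ids[0][order], fused[order]
```

Because everything runs on frozen features, the index is built once and every query-side step (centering, conditioning, fusion) costs only a few vector operations per request.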
Benchmark
Get the full benchmark package including annotations, queries, and hard negatives.
Download i-CIR · Get in touch
Citation
If you find our project useful, please consider citing us:
@inproceedings{icir2025,
  title={Instance-Level Composed Image Retrieval},
  author={Psomas, Bill and Retsinas, George and Efthymiadis, Nikos and Filntisis, Panagiotis and Avrithis, Yannis and Maragos, Petros and Chum, Ondrej and Tolias, Giorgos},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
If you have any further questions, please reach out to vasileios.psomas@fel.cvut.cz