ILIAS: Instance-Level Image retrieval At Scale
Giorgos Kordopatis-Zilos Vladan Stojnić Anna Manko Pavel Šuma Nikolaos-Antonios Ypsilantis Nikos Efthymiadis Zakaria Laskar Jiří Matas Ondřej Chum Giorgos Tolias
Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague
Dataset intro
ILIAS is a large-scale test dataset for evaluation on Instance-Level Image retrieval At Scale. It is designed to support future research in image-to-image and text-to-image retrieval for particular objects and serves as a benchmark for evaluating representations of foundation or customized vision and vision-language models, as well as specialized retrieval techniques.
The dataset includes 1,000 object instances across diverse domains, with:
- 1,232 image queries, depicting the query objects on clean or uniform backgrounds.
- 4,715 positive images, featuring the query objects in real-world conditions with clutter, occlusions, scale variations, and partial views.
- 1,000 text queries, providing fine-grained textual descriptions of the query objects.
- 100M distractors from YFCC100M to evaluate retrieval performance in large-scale settings, while guaranteeing noise-free ground truth.
- All 5,947 images -- 1,232 query images and 4,715 positives -- and the 1,000 text queries were collected manually, one text query per object instance.
- Retrieval in large-scale settings is evaluated using 100M distractors from the YFCC100M dataset.
- Noise-free ground truth is guaranteed by collecting only objects that were not publicly available before 2014, the YFCC100M compilation date.
- Foundation and legacy models are evaluated, in combination with linear adaptation and various re-ranking techniques.
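All benchmark tables below report mAP@1k, the mean average precision over the top-1k ranked images. As a reference for the metric, here is a minimal sketch of its computation over binary relevance lists, assuming the common definition in which average precision is truncated at rank k and normalized by the number of positives (capped at k); the function names are illustrative, not the benchmark's evaluation code.

```python
import numpy as np

def average_precision_at_k(ranked_relevance, num_positives, k=1000):
    """AP@k for one query: `ranked_relevance` is a binary list over the
    ranked database (1 = positive), `num_positives` is the total number
    of positives for this query in the database."""
    rel = np.asarray(ranked_relevance[:k], dtype=float)
    if num_positives == 0:
        return 0.0
    hits = np.cumsum(rel)                           # running count of positives
    precision_at_i = hits / (np.arange(len(rel)) + 1)
    # average precision over retrieved positives, normalized by min(P, k)
    return float((precision_at_i * rel).sum() / min(num_positives, k))

def map_at_k(all_relevance, all_num_positives, k=1000):
    """Mean AP@k over all queries."""
    return float(np.mean([average_precision_at_k(r, n, k)
                          for r, n in zip(all_relevance, all_num_positives)]))
```

For example, a query whose two positives land at ranks 1 and 3 scores AP@1k = (1/1 + 2/3) / 2 ≈ 0.83.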
Benchmark
Performance on image-to-image retrieval. Evaluation is based on cosine similarity between global image representations extracted from frozen backbones.
Columns:
- name: name of the model
- year: year of the model's release
- repo: source repository for model weights and code (timm: pytorch-image-models library; torchvision: torchvision library; github: official GitHub repository)
- arch: architecture of the model (ViT-S|B|L: small, base, or large Vision Transformer; CN-B|L: base or large ConvNeXt; R50|R101: ResNet-50 or ResNet-101; CNN: other convolutional neural network)
- train: training scheme used for model learning (sup: supervised learning; ssl: self-supervised learning; dist: distillation; vla: vision-language alignment)
- dims: dimensionality of descriptors
- dataset: dataset used to train the model
- data size: size of the training dataset
- train res: size of the images used during training
- test res: size of the largest image side used during testing
- 5M: mAP@1k on mini-ILIAS
- 100M: mAP@1k on ILIAS

name | year | repo | arch | train | dims | dataset | data size | train res | test res | 5M | 100M |
---|---|---|---|---|---|---|---|---|---|---|---|
AlexNet | 2012 | torchvision | CNN | sup | 256 | in1k | 1M | 224 | 384 | 2.0 | 1.5 |
VGG16 | 2015 | torchvision | CNN | sup | 512 | in1k | 1M | 224 | 384 | 3.0 | 2.3 |
ResNet50 | 2016 | torchvision | R50 | sup | 2048 | in1k | 1M | 224 | 384 | 2.3 | 1.7 |
ResNet101 | 2016 | torchvision | R101 | sup | 2048 | in1k | 1M | 224 | 384 | 2.7 | 1.9 |
DenseNet169 | 2016 | torchvision | CNN | sup | 2048 | in1k | 1M | 224 | 384 | 3.2 | 2.4 |
Inception-v4 | 2017 | torchvision | CNN | sup | 1536 | in1k | 1M | 299 | 512 | 1.7 | 1.1 |
NASNet | 2018 | torchvision | CNN | sup | 4032 | in1k | 1M | 331 | 512 | 1.7 | 1.0 |
EffNet | 2019 | timm | CNN | sup+dist | 1792 | in1k | 1M | 380 | 512 | 3.8 | 2.6 |
SWAV | 2020 | github | R50 | ssl | 2048 | in1k | 1M | 224 | 384 | 2.2 | 1.7 |
ViT-B | 2021 | timm | ViT-B | sup | 768 | in1k | 1M | 224 | 384 | 1.4 | 1.0 |
ViT-B-in22k | 2021 | timm | ViT-B | sup | 768 | in21k | 14M | 224 | 384 | 4.2 | 3.0 |
ViT-L-in22k | 2021 | timm | ViT-L | sup | 1024 | in21k | 14M | 224 | 384 | 6.0 | 4.6 |
ViT-L | 2021 | timm | ViT-L | sup | 1024 | in1k | 1M | 224 | 384 | 5.1 | 3.6 |
ViT-L@384 | 2021 | timm | ViT-L | sup | 1024 | in1k | 1M | 384 | 512 | 7.2 | 5.3 |
OAI-CLIP-R50 | 2021 | github | R50 | vla | 1024 | openai | 400M | 224 | 384 | 4.4 | 3.2 |
OAI-CLIP-B | 2021 | timm | ViT-B | vla | 512 | openai | 400M | 224 | 384 | 5.9 | 4.2 |
OAI-CLIP-L | 2021 | timm | ViT-L | vla | 768 | openai | 400M | 224 | 384 | 9.0 | 7.0 |
OAI-CLIP-L@336 | 2021 | timm | ViT-L | vla | 768 | openai | 400M | 336 | 512 | 12.1 | 9.4 |
DINO-R50 | 2021 | github | R50 | ssl | 2048 | in1k | 1M | 224 | 384 | 3.8 | 2.9 |
DINO-ViT-B | 2021 | github | ViT-B | ssl | 768 | in1k | 1M | 224 | 384 | 5.0 | 3.7 |
MoCov3-R50 | 2021 | github | R50 | ssl | 2048 | in1k | 1M | 224 | 384 | 3.3 | 2.6 |
MoCov3-ViT-B | 2021 | github | ViT-B | ssl | 768 | in1k | 1M | 224 | 384 | 2.5 | 1.9 |
OpenCLIP-ViT-L | 2022 | timm | ViT-L | vla | 768 | laion2b | 2B | 224 | 384 | 11.8 | 9.4 |
ConvNext-B | 2022 | timm | CN-B | sup | 1024 | in1k | 1M | 288 | 384 | 2.8 | 2.0 |
ConvNext-B-in22k | 2022 | timm | CN-B | sup | 1536 | in22k | 14M | 224 | 384 | 3.2 | 6.4 |
ConvNext-L | 2022 | timm | CN-L | sup | 1024 | in1k | 1M | 288 | 384 | 8.9 | 2.2 |
ConvNext-L-in22k | 2022 | timm | CN-L | sup | 1536 | in22k | 14M | 288 | 384 | 8.6 | 6.6 |
OpenCLIP-CN-B | 2022 | timm | CN-B | vla | 640 | laion2b | 2B | 256 | 384 | 10.7 | 7.9 |
OpenCLIP-CN-L@320 | 2022 | timm | CN-L | vla | 768 | laion2b | 2B | 320 | 512 | 12.7 | 9.6 |
Recall@k-R50-SOP | 2022 | github | R50 | sup | 512 | sop | 60k | 224 | 384 | 2.3 | 1.6 |
Recall@k-ViT-B-SOP | 2022 | github | ViT-B | sup | 512 | sop | 60k | 224 | 384 | 6.8 | 5.0 |
CVNet-R50 | 2022 | github | R50 | sup | 2048 | gldv2 | 1M | 512 | 724 | 3.7 | 2.9 |
CVNet-R101 | 2022 | github | R101 | sup | 2048 | gldv2 | 1M | 512 | 724 | 3.9 | 3.0 |
DeiT3-B | 2022 | timm | ViT-B | sup+dist | 768 | in1k | 1M | 224 | 384 | 1.9 | 1.2 |
DeiT3-L | 2022 | timm | ViT-L | sup+dist | 1024 | in1k | 1M | 224 | 384 | 2.0 | 1.5 |
EVA-MIM-B | 2023 | timm | ViT-B | ssl | 768 | in22k | 14M | 224 | 384 | 3.1 | 2.1 |
EVA-MIM-L | 2023 | timm | ViT-L | ssl | 1024 | in22k | 14M | 224 | 384 | 2.5 | 1.5 |
EVA-MIM-L | 2023 | timm | ViT-L | ssl | 1024 | merged38m | 38M | 224 | 384 | 6.7 | 4.7 |
EVA-CLIP-B | 2023 | timm | ViT-B | vla | 512 | merged2b | 2B | 224 | 384 | 7.8 | 5.9 |
EVA-CLIP-L | 2023 | timm | ViT-L | vla | 768 | merged2b | 2B | 336 | 512 | 13.6 | 10.9 |
HIER-ViT-S-SOP | 2023 | github | ViT-S | sup | 384 | sop | 60k | 224 | 384 | 4.6 | 3.3 |
Unicom-B | 2023 | github | ViT-B | dist | 768 | laion400m | 400M | 224 | 384 | 13.8 | 11.0 |
Unicom-L | 2023 | github | ViT-L | dist | 768 | laion400m | 400M | 224 | 384 | 18.0 | 13.8 |
Unicom-L@336 | 2023 | github | ViT-L | dist | 768 | laion400m | 400M | 336 | 512 | 17.8 | 13.9 |
Unicom-B-GLDv2 | 2023 | github | ViT-B | sup | 768 | gldv2 | 400M | 512 | 724 | 3.7 | 3.0 |
Unicom-B-SOP | 2023 | github | ViT-B | sup | 768 | sop | 400M | 224 | 384 | 12.2 | 9.1 |
SG-R50 | 2023 | github | R50 | sup | 2048 | gldv2 | 1M | 512 | 724 | 4.3 | 3.4 |
SG-R101 | 2023 | github | R101 | sup | 2048 | gldv2 | 1M | 512 | 724 | 4.5 | 3.4 |
USCRR-CLIP | 2023 | github | ViT-B | sup | 768 | uned | 2.8M | 224 | 384 | 5.7 | 3.8 |
SigLIP-B | 2023 | timm | ViT-B | vla | 768 | webli | 10B | 224 | 384 | 14.1 | 11.2 |
SigLIP-B@256 | 2023 | timm | ViT-B | vla | 768 | webli | 10B | 256 | 384 | 14.6 | 11.5 |
SigLIP-B@384 | 2023 | timm | ViT-B | vla | 768 | webli | 10B | 384 | 512 | 19.3 | 15.6 |
SigLIP-B@512 | 2023 | timm | ViT-B | vla | 768 | webli | 10B | 512 | 724 | 20.1 | 16.6 |
SigLIP-L@256 | 2023 | timm | ViT-L | vla | 1024 | webli | 10B | 256 | 384 | 18.8 | 15.2 |
SigLIP-L@384 | 2023 | timm | ViT-L | vla | 1024 | webli | 10B | 384 | 512 | 24.2 | 19.6 |
DINOv2-B | 2024 | github | ViT-B | ssl | 768 | lvd142m | 142M | 518 | 724 | 14.3 | 11.5 |
DINOv2-L | 2024 | github | ViT-L | ssl | 1024 | lvd142m | 142M | 518 | 724 | 18.5 | 15.3 |
MetaCLIP-B | 2024 | timm | ViT-B | vla | 768 | 2pt5b | 2.5B | 224 | 384 | 8.8 | 6.6 |
MetaCLIP-L | 2024 | timm | ViT-L | vla | 1024 | 2pt5b | 2.5B | 224 | 384 | 14.4 | 11.7 |
DINOv2-B-reg | 2024 | github | ViT-B | ssl | 768 | lvd142m | 142M | 518 | 724 | 11.8 | 9.4 |
DINOv2-L-reg | 2024 | github | ViT-L | ssl | 1024 | lvd142m | 142M | 518 | 724 | 15.9 | 12.7 |
UNIC-L | 2024 | github | ViT-L | dist | 1024 | in1k | 1M | 518 | 512 | 11.4 | 8.9 |
UDON-ViT-B | 2024 | github | ViT-B | sup | 768 | uned | 2.8M | 224 | 384 | 7.5 | 5.5 |
UDON-CLIP | 2024 | github | ViT-B | sup | 768 | uned | 2.8M | 224 | 384 | 8.3 | 5.9 |
SigLIP2-B@384 | 2025 | timm | ViT-B | vla | 768 | webli | 10B | 384 | 512 | 18.4 | 15.0 |
SigLIP2-B@512 | 2025 | timm | ViT-B | vla | 768 | webli | 10B | 512 | 724 | 18.6 | 15.4 |
SigLIP2-L@384 | 2025 | timm | ViT-L | vla | 1024 | webli | 10B | 384 | 512 | 24.6 | 19.9 |
SigLIP2-L@512 | 2025 | timm | ViT-L | vla | 1024 | webli | 10B | 512 | 724 | 25.3 | 20.8 |
PE-B | 2025 | timm | ViT-B | vla | 1024 | meta | 2.3B | 224 | 384 | 20.2 | 9.7 |
PE-L@336 | 2025 | timm | ViT-L | vla | 1024 | meta | 2.3B | 336 | 512 | 27.1 | 22.0 |
DINOv3-B | 2025 | github | ViT-B | ssl | 768 | lvd1689m | 1.7B | 768 | 768 | 26.4 | 22.0 |
DINOv3-L | 2025 | github | ViT-L | ssl | 1024 | lvd1689m | 1.7B | 768 | 768 | 31.1 | 26.5 |
Franca-L | 2025 | github | ViT-L | ssl | 1024 | laion600m | 600M | 224 | 384 | 9.7 | 7.6 |
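The protocol above ranks the database by cosine similarity of global descriptors via exhaustive search. A minimal NumPy sketch of that ranking step, assuming descriptors have already been extracted from a frozen backbone (e.g. via timm); at 100M scale the search is done in batches or with a vector index rather than a single matrix product.

```python
import numpy as np

def rank_by_cosine(query_desc, db_desc, topk=1000):
    """Rank database images for each query by cosine similarity.
    `query_desc` (Q x D) and `db_desc` (N x D) are global descriptors,
    e.g. CLS tokens or pooled features from a frozen backbone."""
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    x = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    sims = q @ x.T                                  # cosine similarity matrix
    order = np.argsort(-sims, axis=1)[:, :topk]     # descending similarity
    return order, np.take_along_axis(sims, order, axis=1)
```

The returned ranking is what mAP@1k is computed on, per query.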
Performance on image-to-image retrieval with linear adaptation. Evaluation is based on cosine similarity between adapted image representations. Adaptation is performed via a linear layer learned on top of frozen backbones using supervised multi-domain learning on UnED.
Columns as in the image-to-image retrieval table above.

name | year | repo | arch | train | dims | dataset | data size | train res | test res | 5M | 100M |
---|---|---|---|---|---|---|---|---|---|---|---|
AlexNet | 2012 | torchvision | CNN | sup | 256 | in1k | 1M | 224 | 384 | 1.9 | 1.3 |
VGG16 | 2015 | torchvision | CNN | sup | 512 | in1k | 1M | 224 | 384 | 2.3 | 1.6 |
ResNet50 | 2016 | torchvision | R50 | sup | 2048 | in1k | 1M | 224 | 384 | 2.5 | 1.8 |
ResNet101 | 2016 | torchvision | R101 | sup | 2048 | in1k | 1M | 224 | 384 | 2.7 | 1.8 |
DenseNet169 | 2016 | torchvision | CNN | sup | 2048 | in1k | 1M | 224 | 384 | 2.9 | 2.0 |
Inception-v4 | 2017 | torchvision | CNN | sup | 1536 | in1k | 1M | 299 | 512 | 1.5 | 1.0 |
NASNet | 2018 | torchvision | CNN | sup | 4032 | in1k | 1M | 331 | 512 | 1.6 | 1.0 |
EffNet | 2019 | timm | CNN | sup+dist | 1792 | in1k | 1M | 380 | 512 | 4.3 | 2.9 |
SWAV | 2020 | github | R50 | ssl | 2048 | in1k | 1M | 224 | 384 | 2.9 | 2.1 |
ViT-B | 2021 | timm | ViT-B | sup | 768 | in1k | 1M | 224 | 384 | 1.9 | 1.3 |
ViT-B-in22k | 2021 | timm | ViT-B | sup | 768 | in21k | 14M | 224 | 384 | 6.2 | 4.4 |
ViT-L-in22k | 2021 | timm | ViT-L | sup | 1024 | in21k | 14M | 224 | 384 | 7.3 | 5.3 |
ViT-L | 2021 | timm | ViT-L | sup | 1024 | in1k | 1M | 224 | 384 | 6.6 | 4.7 |
ViT-L@384 | 2021 | timm | ViT-L | sup | 1024 | in1k | 1M | 384 | 512 | 8.7 | 6.4 |
OAI-CLIP-R50 | 2021 | github | R50 | vla | 1024 | openai | 400M | 224 | 384 | 8.5 | 6.0 |
OAI-CLIP-B | 2021 | timm | ViT-B | vla | 512 | openai | 400M | 224 | 384 | 10.7 | 7.9 |
OAI-CLIP-L | 2021 | timm | ViT-L | vla | 768 | openai | 400M | 224 | 384 | 15.8 | 11.9 |
OAI-CLIP-L@336 | 2021 | timm | ViT-L | vla | 768 | openai | 400M | 336 | 512 | 19.9 | 15.2 |
DINO-R50 | 2021 | github | R50 | ssl | 2048 | in1k | 1M | 224 | 384 | 4.1 | 2.9 |
DINO-ViT-B | 2021 | github | ViT-B | ssl | 768 | in1k | 1M | 224 | 384 | 6.6 | 4.8 |
MoCov3-R50 | 2021 | github | R50 | ssl | 2048 | in1k | 1M | 224 | 384 | 3.4 | 2.6 |
MoCov3-ViT-B | 2021 | github | ViT-B | ssl | 768 | in1k | 1M | 224 | 384 | 3.2 | 2.3 |
OpenCLIP-ViT-L | 2022 | timm | ViT-L | vla | 768 | laion2b | 2B | 224 | 384 | 17.5 | 13.7 |
ConvNext-B | 2022 | timm | CN-B | sup | 1024 | in1k | 1M | 288 | 384 | 3.9 | 2.7 |
ConvNext-B-in22k | 2022 | timm | CN-B | sup | 1536 | in22k | 14M | 224 | 384 | 9.9 | 7.6 |
ConvNext-L | 2022 | timm | CN-L | sup | 1024 | in1k | 1M | 288 | 384 | 4.2 | 2.9 |
ConvNext-L-in22k | 2022 | timm | CN-L | sup | 1536 | in22k | 14M | 288 | 384 | 9.1 | 6.9 |
OpenCLIP-CN-B | 2022 | timm | CN-B | vla | 640 | laion2b | 2B | 256 | 384 | 18.1 | 14.0 |
OpenCLIP-CN-L@320 | 2022 | timm | CN-L | vla | 768 | laion2b | 2B | 320 | 512 | 22.9 | 18.3 |
Recall@k-R50-SOP | 2022 | github | R50 | sup | 512 | sop | 60k | 224 | 384 | 3.1 | 2.1 |
Recall@k-ViT-B-SOP | 2022 | github | ViT-B | sup | 512 | sop | 60k | 224 | 384 | 7.3 | 5.3 |
CVNet-R50 | 2022 | github | R50 | sup | 2048 | gldv2 | 1M | 512 | 724 | 3.5 | 2.6 |
CVNet-R101 | 2022 | github | R101 | sup | 2048 | gldv2 | 1M | 512 | 724 | 4.2 | 3.1 |
DeiT3-B | 2022 | timm | ViT-B | sup+dist | 768 | in1k | 1M | 224 | 384 | 2.7 | 1.8 |
DeiT3-L | 2022 | timm | ViT-L | sup+dist | 1024 | in1k | 1M | 224 | 384 | 3.3 | 2.4 |
EVA-MIM-B | 2023 | timm | ViT-B | ssl | 768 | in22k | 14M | 224 | 384 | 4.7 | 3.2 |
EVA-MIM-L | 2023 | timm | ViT-L | ssl | 1024 | in22k | 14M | 224 | 384 | 3.9 | 2.7 |
EVA-MIM-L | 2023 | timm | ViT-L | ssl | 1024 | merged38m | 38M | 224 | 384 | 8.8 | 6.1 |
EVA-CLIP-B | 2023 | timm | ViT-B | vla | 512 | merged2b | 2B | 224 | 384 | 11.7 | 8.7 |
EVA-CLIP-L | 2023 | timm | ViT-L | vla | 768 | merged2b | 2B | 336 | 512 | 20.9 | 16.0 |
HIER-ViT-S-SOP | 2023 | github | ViT-S | sup | 384 | sop | 60k | 224 | 384 | 5.1 | 3.6 |
Unicom-B | 2023 | github | ViT-B | dist | 768 | laion400m | 400M | 224 | 384 | 13.8 | 11.1 |
Unicom-L | 2023 | github | ViT-L | dist | 768 | laion400m | 400M | 224 | 384 | 17.7 | 13.8 |
Unicom-L@336 | 2023 | github | ViT-L | dist | 768 | laion400m | 400M | 336 | 512 | 18.6 | 14.6 |
Unicom-B-GLDv2 | 2023 | github | ViT-B | sup | 768 | gldv2 | 400M | 512 | 724 | 4.1 | 3.3 |
Unicom-B-SOP | 2023 | github | ViT-B | sup | 768 | sop | 400M | 224 | 384 | 12.8 | 9.9 |
SG-R50 | 2023 | github | R50 | sup | 2048 | gldv2 | 1M | 512 | 724 | 3.8 | 2.8 |
SG-R101 | 2023 | github | R101 | sup | 2048 | gldv2 | 1M | 512 | 724 | 4.5 | 3.2 |
USCRR-CLIP | 2023 | github | ViT-B | sup | 768 | uned | 2.8M | 224 | 384 | 6.4 | 4.3 |
SigLIP-B | 2023 | timm | ViT-B | vla | 768 | webli | 10B | 224 | 384 | 19.4 | 15.7 |
SigLIP-B@256 | 2023 | timm | ViT-B | vla | 768 | webli | 10B | 256 | 384 | 20.6 | 16.7 |
SigLIP-B@384 | 2023 | timm | ViT-B | vla | 768 | webli | 10B | 384 | 512 | 26.2 | 21.5 |
SigLIP-B@512 | 2023 | timm | ViT-B | vla | 768 | webli | 10B | 512 | 724 | 27.5 | 23.0 |
SigLIP-L@256 | 2023 | timm | ViT-L | vla | 1024 | webli | 10B | 256 | 384 | 26.3 | 21.8 |
SigLIP-L@384 | 2023 | timm | ViT-L | vla | 1024 | webli | 10B | 384 | 512 | 34.3 | 28.9 |
DINOv2-B | 2024 | github | ViT-B | ssl | 768 | lvd142m | 142M | 518 | 724 | 15.0 | 12.1 |
DINOv2-L | 2024 | github | ViT-L | ssl | 1024 | lvd142m | 142M | 518 | 724 | 18.8 | 15.3 |
MetaCLIP-B | 2024 | timm | ViT-B | vla | 768 | 2pt5b | 2.5B | 224 | 384 | 12.7 | 9.4 |
MetaCLIP-L | 2024 | timm | ViT-L | vla | 1024 | 2pt5b | 2.5B | 224 | 384 | 21.7 | 16.9 |
DINOv2-B-reg | 2024 | github | ViT-B | ssl | 768 | lvd142m | 142M | 518 | 724 | 13.5 | 10.7 |
DINOv2-L-reg | 2024 | github | ViT-L | ssl | 1024 | lvd142m | 142M | 518 | 724 | 17.1 | 13.6 |
UNIC-L | 2024 | github | ViT-L | dist | 1024 | in1k | 1M | 518 | 512 | 15.3 | 11.7 |
UDON-ViT-B | 2024 | github | ViT-B | sup | 768 | uned | 2.8M | 224 | 384 | 7.3 | 5.3 |
UDON-CLIP | 2024 | github | ViT-B | sup | 768 | uned | 2.8M | 224 | 384 | 9.2 | 6.7 |
SigLIP2-B@384 | 2025 | timm | ViT-B | vla | 768 | webli | 10B | 384 | 512 | 27.5 | 22.6 |
SigLIP2-B@512 | 2025 | timm | ViT-B | vla | 768 | webli | 10B | 512 | 724 | 28.6 | 23.5 |
SigLIP2-L@384 | 2025 | timm | ViT-L | vla | 1024 | webli | 10B | 384 | 512 | 36.3 | 30.3 |
SigLIP2-L@512 | 2025 | timm | ViT-L | vla | 1024 | webli | 10B | 512 | 724 | 37.3 | 31.3 |
PE-B | 2025 | timm | ViT-B | vla | 1024 | meta | 2.3B | 224 | 384 | 20.2 | 16.1 |
PE-L@336 | 2025 | timm | ViT-L | vla | 1024 | meta | 2.3B | 336 | 512 | 39.6 | 33.4 |
DINOv3-B | 2025 | github | ViT-B | ssl | 768 | lvd1689m | 1.7B | 768 | 768 | 26.4 | 22.5 |
DINOv3-L | 2025 | github | ViT-L | ssl | 1024 | lvd1689m | 1.7B | 768 | 768 | 32.9 | 28.3 |
Franca-L | 2025 | github | ViT-L | ssl | 1024 | laion600m | 600M | 224 | 384 | 12.0 | 9.0 |
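Linear adaptation trains only a single linear layer on top of frozen descriptors. A rough PyTorch sketch under simplified assumptions: a plain cross-entropy classifier head stands in for the supervised multi-domain learning on UnED used in the benchmark, and `train_linear_adaptation` is a hypothetical helper, not the paper's training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_linear_adaptation(feats, labels, out_dim=512, epochs=10, lr=1e-3):
    """Learn a linear projection on top of frozen descriptors.
    `feats` (N x D) are precomputed global descriptors; `labels` are
    instance/class ids (a stand-in for the UnED supervision)."""
    n_classes = int(labels.max()) + 1
    proj = nn.Linear(feats.shape[1], out_dim)   # the adaptation layer
    head = nn.Linear(out_dim, n_classes)        # discarded after training
    opt = torch.optim.Adam([*proj.parameters(), *head.parameters()], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        z = F.normalize(proj(feats), dim=1)     # adapted, L2-normalized
        loss = F.cross_entropy(head(z), labels)
        loss.backward()
        opt.step()
    return proj  # apply to queries and database, then rank by cosine similarity
```

Only `proj` is kept at test time; the backbone stays frozen throughout.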
Performance on image-to-image retrieval with re-ranking. An initial ranking is obtained with global image representations via exhaustive search, and image similarities are then refined with methods relying on local or refined global descriptors. Evaluation is based on the refined similarities.
Columns:
- name: name of the combination (adapt: linearly adapted representations)
- year: year of the re-ranking method's publication
- type: type of re-ranking
- global: global descriptors used for the initial ranking (adapt: linearly adapted representations)
- local: local descriptors used for re-ranking, with their number in parentheses
- top-NN: top nearest neighbors used for re-ranking
- 100M: mAP@1k on full ILIAS
- oracle: mAP@1k with oracle re-ranking on the top-1k

name | year | type | global | local | top-NN | 100M | oracle |
---|---|---|---|---|---|---|---|
AMES + SigLIP (adapt) | 2024 | local | SigLIP-L@384 (adapt) | AMES-bin-dist (600) | 10k | 38.9 | 56.0 |
AMES + SigLIP2 (adapt) | 2024 | local | SigLIP2-L@512 (adapt) | AMES-bin-dist (100) | 1k | 38.4 | 62.7 |
AMES + SigLIP (adapt) | 2024 | local | SigLIP-L@384 (adapt) | AMES-bin-dist (100) | 10k | 36.7 | 56.0 |
AMES + SigLIP (adapt) | 2024 | local | SigLIP-L@384 (adapt) | AMES-bin-dist (100) | 1k | 35.6 | 56.0 |
AMES + DINOv2 (adapt) | 2024 | local | DINOv2-L (adapt) | AMES-bin-dist (100) | 1k | 21.8 | 34.0 |
AMES + OpenCLIP (adapt) | 2024 | local | OpenCLIP-CN-L@320 (adapt) | AMES-bin-dist (100) | 1k | 27.1 | 48.0 |
AMES + SigLIP | 2024 | local | SigLIP-L@384 | AMES-bin-dist (100) | 1k | 26.4 | 48.7 |
SP + SigLIP (adapt) | 2007 | local | SigLIP-L@384 (adapt) | DINOv2-B-reg + ITQ (100) | 1k | 30.5 | 56.0 |
SP + SigLIP | 2007 | local | SigLIP-L@384 | DINOv2-B-reg + ITQ (100) | 1k | 21.8 | 56.0 |
CS + SigLIP (adapt) | 2014 | local | SigLIP-L@384 (adapt) | DINOv2-B-reg + ITQ (100) | 1k | 32.5 | 56.0 |
CS + SigLIP | 2014 | local | SigLIP-L@384 | DINOv2-B-reg + ITQ (100) | 1k | 22.9 | 48.7 |
αQE1 + SigLIP (adapt) | 2019 | global | SigLIP-L@384 (adapt) | -- | full | 33.7 | 56.9 |
αQE2 + SigLIP (adapt) | 2019 | global | SigLIP-L@384 (adapt) | -- | full | 31.5 | 54.4 |
αQE5 + SigLIP (adapt) | 2019 | global | SigLIP-L@384 (adapt) | -- | full | 23.5 | 49.3 |
αQE1 + SigLIP | 2019 | global | SigLIP-L@384 | -- | full | 22.1 | 44.7 |
αQE2 + SigLIP | 2019 | global | SigLIP-L@384 | -- | full | 20.4 | 40.8 |
αQE5 + SigLIP | 2019 | global | SigLIP-L@384 | -- | full | 14.3 | 34.9 |
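The αQE1/αQE2/αQE5 rows apply alpha query expansion with 1, 2, or 5 neighbors: each query is re-issued as a weighted average of itself and its top-ranked neighbors, weighted by similarity raised to the power α. A minimal NumPy sketch of the standard technique; the α value and implementation details here are illustrative, not the benchmark's exact configuration.

```python
import numpy as np

def alpha_qe(query_desc, db_desc, n_neighbors=1, alpha=3.0):
    """Alpha query expansion over L2-normalized global descriptors.
    Returns expanded, re-normalized queries to use for a second ranking."""
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    x = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    sims = q @ x.T
    nn_idx = np.argsort(-sims, axis=1)[:, :n_neighbors]
    expanded = []
    for i in range(q.shape[0]):
        w = np.clip(sims[i, nn_idx[i]], 0, None) ** alpha   # similarity weights
        new_q = q[i] + (w[:, None] * x[nn_idx[i]]).sum(axis=0)
        expanded.append(new_q / np.linalg.norm(new_q))
    return np.stack(expanded)  # re-rank the full database with these queries
```

The "full" top-NN entries in the table correspond to re-ranking the entire database with the expanded queries.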
Performance on text-to-image retrieval. Evaluation is based on cosine similarity between the text query and database global image representations, extracted using the textual and visual encoders of VLMs.
Columns:
- name: name of the model
- year: year of the model's release
- repo: source repository for model weights and code (timm: pytorch-image-models library; hf: huggingface library; oc: open-clip library)
- arch: architecture of the model (ViT-S|B|L: small, base, or large Vision Transformer; CN-B|L: base or large ConvNeXt; R50|R101: ResNet-50 or ResNet-101; CNN: other convolutional neural network)
- dims: dimensionality of descriptors
- dataset: dataset used to train the model
- data size: size of the training dataset
- train res: size of the images used during training
- test res: size of the largest image side used during testing
- 5M: mAP@1k on mini-ILIAS
- 100M: mAP@1k on ILIAS

name | year | repo | arch | dims | dataset | data size | train res | test res | 5M | 100M |
---|---|---|---|---|---|---|---|---|---|---|
OAI-CLIP-R50 | 2021 | oc | R50 | 1024 | openai | 400M | 224 | 384 | 2.3 | 1.5 |
OAI-CLIP-B | 2021 | timm+oc | ViT-B | 512 | openai | 400M | 224 | 384 | 2.7 | 1.6 |
OAI-CLIP-L | 2021 | timm+oc | ViT-L | 768 | openai | 400M | 224 | 384 | 6.7 | 4.6 |
OAI-CLIP-L@336 | 2021 | timm+oc | ViT-L | 768 | openai | 400M | 336 | 512 | 8.4 | 5.8 |
OpenCLIP-ViT-L | 2022 | timm+oc | ViT-L | 768 | laion2b | 2B | 224 | 384 | 9.4 | 7.0 |
OpenCLIP-CN-B | 2022 | timm+oc | CN-B | 640 | laion2b | 2B | 256 | 384 | 7.0 | 4.6 |
OpenCLIP-CN-L@320 | 2022 | timm+oc | CN-L | 768 | laion2b | 2B | 320 | 512 | 11.5 | 8.1 |
EVA-CLIP-B | 2023 | timm+oc | ViT-B | 512 | merged2b | 2B | 224 | 384 | 4.4 | 2.5 |
EVA-CLIP-L | 2023 | timm+oc | ViT-L | 768 | merged2b | 2B | 336 | 512 | 10.6 | 7.2 |
SigLIP-B | 2023 | timm+hf | ViT-B | 768 | webli | 10B | 224 | 384 | 10.1 | 7.1 |
SigLIP-B@256 | 2023 | timm+hf | ViT-B | 768 | webli | 10B | 256 | 384 | 10.3 | 7.5 |
SigLIP-B@384 | 2023 | timm+hf | ViT-B | 768 | webli | 10B | 384 | 512 | 14.4 | 11.0 |
SigLIP-B@512 | 2023 | timm+hf | ViT-B | 768 | webli | 10B | 512 | 724 | 14.6 | 11.1 |
SigLIP-L@256 | 2023 | timm+hf | ViT-L | 1024 | webli | 10B | 256 | 384 | 16.4 | 12.8 |
SigLIP-L@384 | 2023 | timm+hf | ViT-L | 1024 | webli | 10B | 384 | 512 | 22.2 | 18.1 |
MetaCLIP-B | 2024 | timm+oc | ViT-B | 768 | 2pt5b | 2.5B | 224 | 384 | 7.6 | 4.9 |
MetaCLIP-L | 2024 | timm+oc | ViT-L | 1024 | 2pt5b | 2.5B | 224 | 384 | 13.1 | 9.2 |
SigLIP2-B@384 | 2025 | timm+hf | ViT-B | 768 | webli | 10B | 384 | 512 | 15.1 | 11.1 |
SigLIP2-B@512 | 2025 | timm+hf | ViT-B | 768 | webli | 10B | 512 | 724 | 14.6 | 10.4 |
SigLIP2-L@384 | 2025 | timm+hf | ViT-L | 1024 | webli | 10B | 384 | 512 | 23.7 | 18.6 |
SigLIP2-L@512 | 2025 | timm+hf | ViT-L | 1024 | webli | 10B | 512 | 724 | 24.7 | 19.8 |
PE-B | 2025 | timm+oc | ViT-B | 1024 | meta | 2.3B | 224 | 384 | 7.9 | 5.5 |
PE-L@336 | 2025 | timm+oc | ViT-L | 1024 | meta | 2.3B | 336 | 512 | 19.5 | 14.6 |
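Text-to-image retrieval embeds the text query and the database images in the shared space of a VLM and ranks by cosine similarity. A minimal sketch of the ranking step; the open_clip calls in the docstring illustrate how such embeddings are typically obtained and do not reproduce the exact models of the table.

```python
import numpy as np

def text_to_image_rank(text_emb, image_embs, topk=1000):
    """Rank database images for one text query by cosine similarity.
    The embeddings come from the paired text/visual encoders of a VLM,
    e.g. with open_clip (model tag shown for illustration only):
        model, _, preprocess = open_clip.create_model_and_transforms(
            'ViT-L-14', pretrained='laion2b_s32b_b82k')
        tokenizer = open_clip.get_tokenizer('ViT-L-14')
        text_emb = model.encode_text(tokenizer(['a red vintage teapot']))
    """
    t = text_emb / np.linalg.norm(text_emb)
    x = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = x @ t                       # cosine similarity to each image
    order = np.argsort(-sims)[:topk]   # descending similarity
    return order, sims[order]
```

mAP@1k is then computed on this ranking, with one text query per object instance.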
Explore the collected data for your instance-level research!
Citation
If you find our project useful, please consider citing us:
@inproceedings{ilias2025,
  title={{ILIAS}: Instance-Level Image retrieval At Scale},
  author={Kordopatis-Zilos, Giorgos and Stojnić, Vladan and Manko, Anna and Šuma, Pavel and Ypsilantis, Nikolaos-Antonios and Efthymiadis, Nikos and Laskar, Zakaria and Matas, Jiří and Chum, Ondřej and Tolias, Giorgos},
  booktitle={Computer Vision and Pattern Recognition (CVPR)},
  year={2025},
}
Results
Submit your results here:
If you have any further questions, please don't hesitate to reach out to kordogeo@fel.cvut.cz