ILIAS: Instance-Level Image retrieval At Scale
Giorgos Kordopatis-Zilos Vladan Stojnić Anna Manko Pavel Šuma Nikolaos-Antonios Ypsilantis Nikos Efthymiadis Zakaria Laskar Jiří Matas Ondřej Chum Giorgos Tolias
Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague
Dataset intro
ILIAS is a large-scale test dataset for evaluation on Instance-Level Image retrieval At Scale. It is designed to support future research in image-to-image and text-to-image retrieval for particular objects and serves as a benchmark for evaluating representations of foundation or customized vision and vision-language models, as well as specialized retrieval techniques.
The dataset includes 1,000 object instances across diverse domains, with:
- 1,232 image queries, depicting query objects on a clean or uniform background.
- 4,715 positive images, featuring the query objects in real-world conditions with clutter, occlusions, scale variations, and partial views.
- 1,000 text queries, providing fine-grained textual descriptions of the query objects.
- 100M distractors from YFCC100M to evaluate retrieval performance in large-scale settings while ensuring noise-free ground truth.
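As a rough illustration of how queries, positives, and distractors interact in a retrieval run, the sketch below ranks a toy database by cosine similarity and scores the ranking with average precision. Both the similarity measure and the metric are assumptions made for illustration (this page does not name the metric behind the reported numbers), and the function names and toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_database(query, database):
    """Return database indices sorted by decreasing cosine similarity to the query."""
    q = l2_normalize(query)
    scored = []
    for idx, vec in enumerate(database):
        d = l2_normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, d)), idx))
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored]


def average_precision(ranking, positives):
    """Average precision of one ranked list given the set of positive indices."""
    positives = set(positives)
    hits, precision_sum = 0, 0.0
    for rank, idx in enumerate(ranking, start=1):
        if idx in positives:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(positives) if positives else 0.0


# Toy database: indices 0 and 1 play the role of positives, the rest are distractors.
database = [[1.0, 0.1], [0.6, 0.8], [0.0, 1.0], [0.9, 0.5], [-1.0, 0.0]]
query = [1.0, 0.0]

ranking = rank_database(query, database)
ap = average_precision(ranking, positives=[0, 1])
print(ranking, round(ap, 4))
```

In this toy case one distractor (index 3) is ranked between the two positives, which lowers the score. A larger distractor set makes such intrusions more likely.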
Manually collected object instances: 5,947 images (1,232 query images and 4,715 positives) and 1,000 text queries.
Large-scale retrieval is evaluated using 100M distractors from the YFCC100M dataset.
Noise-free ground truth is guaranteed by collecting only objects that were not publicly available before 2014, the YFCC100M compilation date.
Foundation and legacy models are evaluated in combination with linear adaptation and various re-ranking techniques.
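Linear adaptation can be pictured as a single learned matrix applied on top of frozen descriptors. The sketch below shows only the inference step. The `linear_adapt` helper, the toy weights, and the final L2 normalization are assumptions for illustration. How the matrix is trained is not covered here.

```python
import math


def linear_adapt(descriptor, weight, bias=None):
    """Project a descriptor with a linear layer, then L2-normalize the result.

    `weight` is a list of rows (output_dim x input_dim); `bias` is optional.
    """
    out = [sum(w * x for w, x in zip(row, descriptor)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    norm = math.sqrt(sum(o * o for o in out))
    return [o / norm for o in out] if norm > 0 else out


# Hypothetical 2x2 weight matrix standing in for a learned adaptation layer.
W = [[1.0, 0.0],
     [0.0, 2.0]]

adapted = linear_adapt([3.0, 2.0], W)
print(adapted)
```

Because the backbone stays frozen, the same matrix can be applied to precomputed query and database descriptors without re-extracting features.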
Benchmark
Performance on image-to-image retrieval
name | year | arch | train | dims | dataset | data size | train res | test res | 5M | 100M |
---|---|---|---|---|---|---|---|---|---|---|
alexnet.tv_in1k (adapt) | 2012 | CNN | sup | 256 | in1k | 1M | 224 | 384 | 1.9 | 1.3 |
convnext_base.clip_laion2b_augreg (adapt) | 2022 | CN-B | vla | 640 | laion2b | 2B | 256 | 384 | 18.1 | 14 |
convnext_base.fb_in1k (adapt) | 2022 | CN-B | sup | 1024 | in1k | 1M | 288 | 384 | 3.9 | 2.7 |
convnext_base.fb_in22k (adapt) | 2022 | CN-B | sup | 1536 | in22k | 14M | 224 | 384 | 9.9 | 7.6 |
convnext_large_mlp.clip_laion2b_ft_soup_320 (adapt) | 2022 | CN-L | vla | 768 | laion2b | 2B | 320 | 512 | 22.9 | 18.3 |
convnext_large.fb_in1k (adapt) | 2022 | CN-L | sup | 1024 | in1k | 1M | 288 | 384 | 4.2 | 2.9 |
convnext_large.fb_in22k (adapt) | 2022 | CN-L | sup | 1536 | in22k | 14M | 288 | 384 | 9.1 | 6.9 |
cvnet_resnet101 (adapt) | 2022 | R101 | sup | 2048 | gldv2 | 1M | 512 | 724 | 4.2 | 3.1 |
cvnet_resnet50 (adapt) | 2022 | R50 | sup | 2048 | gldv2 | 1M | 512 | 724 | 3.5 | 2.6 |
deit3_base_patch16_224.fb_in1k (adapt) | 2021 | ViT-B | sup+dist | 768 | in1k | 1M | 224 | 384 | 2.7 | 1.8 |
deit3_large_patch16_224.fb_in1k (adapt) | 2021 | ViT-L | sup+dist | 1024 | in1k | 1M | 224 | 384 | 3.3 | 2.4 |
densenet169.tv_in1k (adapt) | 2016 | CNN | sup | 2048 | in1k | 1M | 224 | 384 | 2.9 | 2 |
dino_resnet50 (adapt) | 2021 | R50 | ssl | 2048 | in1k | 1M | 224 | 384 | 4.1 | 2.9 |
dino_vitb16 (adapt) | 2021 | ViT-B | ssl | 768 | in1k | 1M | 224 | 384 | 6.6 | 4.8 |
dinov2_vitb14 (adapt) | 2023 | ViT-B | ssl | 768 | lvd142m | 142M | 518 | 724 | 15 | 12.1 |
dinov2_vitb14_reg (adapt) | 2024 | ViT-B | ssl | 768 | lvd142m | 142M | 518 | 724 | 13.5 | 10.7 |
dinov2_vitl14 (adapt) | 2023 | ViT-L | ssl | 1024 | lvd142m | 142M | 518 | 724 | 18.8 | 15.3 |
dinov2_vitl14_reg (adapt) | 2024 | ViT-L | ssl | 1024 | lvd142m | 142M | 518 | 724 | 17.1 | 13.6 |
eva02_base_patch14_224.mim_in22k (adapt) | 2023 | ViT-B | ssl | 768 | in22k | 14M | 224 | 384 | 4.7 | 3.2 |
eva02_base_patch16_clip_224.merged2b (adapt) | 2023 | ViT-B | vla | 512 | merged2b | 2B | 224 | 384 | 11.7 | 8.7 |
eva02_large_patch14_224.mim_in22k (adapt) | 2023 | ViT-L | ssl | 1024 | in22k | 14M | 224 | 384 | 3.9 | 2.7 |
eva02_large_patch14_224.mim_m38m (adapt) | 2023 | ViT-L | ssl | 1024 | merged38m | 38M | 224 | 384 | 8.8 | 6.1 |
eva02_large_patch14_clip_336.merged2b (adapt) | 2023 | ViT-L | vla | 768 | merged2b | 2B | 336 | 512 | 20.9 | 16 |
hier_dino_vits16_sop (adapt) | 2023 | ViT-S | sup | 384 | sop | 60k | 224 | 384 | 5.1 | 3.6 |
inception_v4.tf_in1k (adapt) | 2017 | CNN | sup | 1536 | in1k | 1M | 299 | 512 | 1.5 | 1 |
moco_v3_resnet50 (adapt) | 2021 | R50 | ssl | 2048 | in1k | 1M | 224 | 384 | 3.4 | 2.6 |
moco_v3_vitb (adapt) | 2021 | ViT-B | ssl | 768 | in1k | 1M | 224 | 384 | 3.2 | 2.3 |
nasnetalarge.tf_in1k (adapt) | 2018 | CNN | sup | 4032 | in1k | 1M | 331 | 512 | 1.6 | 1 |
recall_512-resnet50 (adapt) | 2022 | R50 | sup | 512 | sop | 60k | 224 | 384 | 3.1 | 2.1 |
recall_512-vit_base_patch16_224_in21k (adapt) | 2022 | ViT-B | sup | 512 | sop | 60k | 224 | 384 | 7.3 | 5.3 |
resnet101.tv_in1k (adapt) | 2015 | R101 | sup | 2048 | in1k | 1M | 224 | 384 | 2.7 | 1.8 |
resnet50.tv_in1k (adapt) | 2015 | R50 | sup | 2048 | in1k | 1M | 224 | 384 | 2.5 | 1.8 |
RN50.openai (adapt) | 2021 | R50 | vla | 1024 | openai | 400M | 224 | 384 | 8.5 | 6 |
superglobal_resnet101 (adapt) | 2023 | R101 | sup | 2048 | gldv2 | 1M | 512 | 724 | 4.5 | 3.2 |
superglobal_resnet50 (adapt) | 2023 | R50 | sup | 2048 | in1k | 1M | 224 | 384 | 3 | 2.4 |
swav_resnet50 (adapt) | 2021 | R50 | ssl | 2048 | in1k | 1M | 224 | 384 | 2.9 | 2.1 |
tf_efficientnet_b4.ns_jft_in1k (adapt) | 2019 | CNN | sup+dist | 1792 | in1k | 1M | 380 | 512 | 4.3 | 2.9 |
udon_64-vitb_clip_openai (adapt) | 2024 | ViT-B | sup | 768 | uned | 2.8M | 224 | 384 | 9.2 | 6.7 |
udon_64-vitb_in21k_ft_in1k (adapt) | 2024 | ViT-B | sup | 768 | uned | 2.8M | 224 | 384 | 7.3 | 5.3 |
unic_l (adapt) | 2024 | ViT-L | dist | 1024 | in1k | 1M | 518 | 512 | 15.3 | 11.7 |
unicom_vit_base_patch16_224 (adapt) | 2023 | ViT-B | dist | 768 | laion400m | 400M | 224 | 384 | 13.8 | 11.1 |
unicom_vit_base_patch16_gldv2 (adapt) | 2023 | ViT-B | sup | 768 | gldv2 | 400M | 512 | 724 | 4.1 | 3.3 |
unicom_vit_base_patch16_sop (adapt) | 2023 | ViT-B | sup | 768 | sop | 400M | 224 | 384 | 12.8 | 9.9 |
unicom_vit_large_patch14_224 (adapt) | 2023 | ViT-L | dist | 768 | laion400m | 400M | 224 | 384 | 17.7 | 13.8 |
unicom_vit_large_patch14_336 (adapt) | 2023 | ViT-L | dist | 768 | laion400m | 400M | 336 | 512 | 18.6 | 14.6 |
uscrr_64-vit_base_patch16_clip_224.openai (adapt) | 2023 | ViT-B | sup | 768 | uned | 2.8M | 224 | 724 | 6.4 | 4.3 |
vgg16.tv_in1k (adapt) | 2014 | CNN | sup | 512 | in1k | 1M | 224 | 384 | 2.3 | 1.6 |
vit_base_patch16_224.augreg_in1k (adapt) | 2020 | ViT-B | sup | 768 | in1k | 1M | 224 | 384 | 1.9 | 1.3 |
vit_base_patch16_224.augreg_in21k (adapt) | 2020 | ViT-B | sup | 768 | in21k | 14M | 224 | 384 | 6.2 | 4.4 |
vit_base_patch16_clip_224.metaclip_2pt5b (adapt) | 2024 | ViT-B | vla | 768 | 2pt5b | 2.5B | 224 | 384 | 12.7 | 9.4 |
vit_base_patch16_clip_224.openai (adapt) | 2021 | ViT-B | vla | 512 | openai | 400M | 224 | 384 | 10.7 | 7.9 |
vit_base_patch16_siglip_224.webli (adapt) | 2023 | ViT-B | vla | 768 | webli | 10B | 224 | 384 | 19.4 | 15.7 |
vit_base_patch16_siglip_256.webli (adapt) | 2023 | ViT-B | vla | 768 | webli | 10B | 256 | 384 | 20.6 | 16.7 |
vit_base_patch16_siglip_384.webli (adapt) | 2023 | ViT-B | vla | 768 | webli | 10B | 384 | 512 | 26.2 | 21.5 |
vit_base_patch16_siglip_512.webli (adapt) | 2023 | ViT-B | vla | 768 | webli | 10B | 512 | 724 | 27.5 | 23 |
vit_large_patch14_clip_224.laion2b (adapt) | 2021 | ViT-L | vla | 768 | laion2b | 2B | 224 | 384 | 17.5 | 13.7 |
vit_large_patch14_clip_224.metaclip_2pt5b (adapt) | 2024 | ViT-L | vla | 1024 | 2pt5b | 2.5B | 224 | 384 | 21.7 | 16.9 |
vit_large_patch14_clip_224.openai (adapt) | 2021 | ViT-L | vla | 768 | openai | 400M | 224 | 384 | 15.8 | 11.9 |
vit_large_patch14_clip_336.openai (adapt) | 2021 | ViT-L | vla | 768 | openai | 400M | 336 | 512 | 19.9 | 15.2 |
vit_large_patch16_224.augreg_in21k (adapt) | 2020 | ViT-L | sup | 1024 | in21k | 14M | 224 | 384 | 7.3 | 5.3 |
vit_large_patch16_224.augreg_in21k_ft_in1k (adapt) | 2020 | ViT-L | sup | 1024 | in1k | 1M | 224 | 384 | 6.6 | 4.7 |
vit_large_patch16_384.augreg_in21k_ft_in1k (adapt) | 2020 | ViT-L | sup | 1024 | in1k | 1M | 384 | 512 | 8.7 | 6.4 |
vit_large_patch16_siglip_256.webli (adapt) | 2023 | ViT-L | vla | 1024 | webli | 10B | 256 | 384 | 26.3 | 21.8 |
vit_large_patch16_siglip_384.webli (adapt) | 2023 | ViT-L | vla | 1024 | webli | 10B | 384 | 512 | 34.3 | 28.9 |
vit_base_patch16_siglip_384.v2_webli (adapt) | 2025 | ViT-B | vla | 768 | webli | 10B | 384 | 512 | 27.5 | 22.6 |
vit_base_patch16_siglip_512.v2_webli (adapt) | 2025 | ViT-B | vla | 768 | webli | 10B | 512 | 724 | 28.6 | 23.5 |
vit_large_patch16_siglip_384.v2_webli (adapt) | 2025 | ViT-L | vla | 1024 | webli | 10B | 384 | 512 | 36.3 | 30.3 |
vit_large_patch16_siglip_512.v2_webli (adapt) | 2025 | ViT-L | vla | 1024 | webli | 10B | 512 | 724 | 37.3 | 31.3 |
alexnet.tv_in1k | 2012 | CNN | sup | 256 | in1k | 1M | 224 | 384 | 2 | 1.5 |
convnext_base.clip_laion2b_augreg | 2022 | CN-B | vla | 640 | laion2b | 2B | 256 | 384 | 10.7 | 7.9 |
convnext_base.fb_in1k | 2022 | CN-B | sup | 1024 | in1k | 1M | 288 | 384 | 2.8 | 2 |
convnext_base.fb_in22k | 2022 | CN-B | sup | 1536 | in22k | 14M | 224 | 384 | 8.9 | 6.4 |
convnext_large_mlp.clip_laion2b_ft_soup_320 | 2022 | CN-L | vla | 768 | laion2b | 2B | 320 | 512 | 12.7 | 9.6 |
convnext_large.fb_in1k | 2022 | CN-L | sup | 1024 | in1k | 1M | 288 | 384 | 3.2 | 2.2 |
convnext_large.fb_in22k | 2022 | CN-L | sup | 1536 | in22k | 14M | 288 | 384 | 8.6 | 6.6 |
cvnet_resnet101 | 2022 | R101 | sup | 2048 | gldv2 | 1M | 512 | 724 | 3.9 | 3 |
cvnet_resnet50 | 2022 | R50 | sup | 2048 | gldv2 | 1M | 512 | 724 | 3.7 | 2.9 |
deit3_base_patch16_224.fb_in1k | 2021 | ViT-B | sup+dist | 768 | in1k | 1M | 224 | 384 | 1.9 | 1.2 |
deit3_large_patch16_224.fb_in1k | 2021 | ViT-L | sup+dist | 1024 | in1k | 1M | 224 | 384 | 2 | 1.5 |
densenet169.tv_in1k | 2016 | CNN | sup | 2048 | in1k | 1M | 224 | 384 | 3.2 | 2.4 |
dino_resnet50 | 2021 | R50 | ssl | 2048 | in1k | 1M | 224 | 384 | 3.8 | 2.9 |
dino_vitb16 | 2021 | ViT-B | ssl | 768 | in1k | 1M | 224 | 384 | 5 | 3.7 |
dinov2_vitb14 | 2023 | ViT-B | ssl | 768 | lvd142m | 142M | 518 | 724 | 14.3 | 11.5 |
dinov2_vitb14_reg | 2024 | ViT-B | ssl | 768 | lvd142m | 142M | 518 | 724 | 11.8 | 9.4 |
dinov2_vitl14 | 2023 | ViT-L | ssl | 1024 | lvd142m | 142M | 518 | 724 | 18.5 | 15.3 |
dinov2_vitl14_reg | 2024 | ViT-L | ssl | 1024 | lvd142m | 142M | 518 | 724 | 15.9 | 12.7 |
eva02_base_patch14_224.mim_in22k | 2023 | ViT-B | ssl | 768 | in22k | 14M | 224 | 384 | 3.1 | 2.1 |
eva02_base_patch16_clip_224.merged2b | 2023 | ViT-B | vla | 512 | merged2b | 2B | 224 | 384 | 7.8 | 5.9 |
eva02_large_patch14_224.mim_in22k | 2023 | ViT-L | ssl | 1024 | in22k | 14M | 224 | 384 | 2.5 | 1.5 |
eva02_large_patch14_224.mim_m38m | 2023 | ViT-L | ssl | 1024 | merged38m | 38M | 224 | 384 | 6.7 | 4.7 |
eva02_large_patch14_clip_336.merged2b | 2023 | ViT-L | vla | 768 | merged2b | 2B | 336 | 512 | 13.6 | 10.9 |
hier_dino_vits16_sop | 2023 | ViT-S | sup | 384 | sop | 60k | 224 | 384 | 1.7 | 3.3 |
inception_v4.tf_in1k | 2017 | CNN | sup | 1536 | in1k | 1M | 299 | 512 | 3.3 | 1.1 |
moco_v3_resnet50 | 2021 | R50 | ssl | 2048 | in1k | 1M | 224 | 384 | 2.5 | 2.6 |
moco_v3_vitb | 2021 | ViT-B | ssl | 768 | in1k | 1M | 224 | 384 | 1.7 | 1.9 |
nasnetalarge.tf_in1k | 2018 | CNN | sup | 4032 | in1k | 1M | 331 | 512 | 2.3 | 1 |
recall_512-resnet50 | 2022 | R50 | sup | 512 | sop | 60k | 224 | 384 | 6.8 | 1.6 |
recall_512-vit_base_patch16_224_in21k | 2022 | ViT-B | sup | 512 | sop | 60k | 224 | 384 | 2.7 | 5 |
resnet101.tv_in1k | 2015 | R101 | sup | 2048 | in1k | 1M | 224 | 384 | 2.3 | 1.9 |
resnet50.tv_in1k | 2015 | R50 | sup | 2048 | in1k | 1M | 224 | 384 | 4.4 | 1.7 |
RN50.openai | 2021 | R50 | vla | 1024 | openai | 400M | 224 | 384 | 4.5 | 3.2 |
superglobal_resnet101 | 2023 | R101 | sup | 2048 | gldv2 | 1M | 512 | 724 | 4.3 | 3.4 |
superglobal_resnet50 | 2023 | R50 | sup | 2048 | in1k | 1M | 224 | 384 | 2.2 | 2 |
swav_resnet50 | 2021 | R50 | ssl | 2048 | in1k | 1M | 224 | 384 | 3.8 | 1.7 |
tf_efficientnet_b4.ns_jft_in1k | 2019 | CNN | sup+dist | 1792 | in1k | 1M | 380 | 512 | 7.5 | 2.6 |
udon_64-vitb_clip_openai | 2024 | ViT-B | sup | 768 | uned | 2.8M | 224 | 384 | 8.3 | 5.9 |
udon_64-vitb_in21k_ft_in1k | 2024 | ViT-B | sup | 768 | uned | 2.8M | 224 | 384 | 11.4 | 5.5 |
unic_l | 2024 | ViT-L | dist | 1024 | in1k | 1M | 518 | 512 | 13.8 | 8.9 |
unicom_vit_base_patch16_224 | 2023 | ViT-B | dist | 768 | laion400m | 400M | 224 | 384 | 3.7 | 11 |
unicom_vit_base_patch16_gldv2 | 2023 | ViT-B | sup | 768 | gldv2 | 400M | 512 | 724 | 12.2 | 3 |
unicom_vit_base_patch16_sop | 2023 | ViT-B | sup | 768 | sop | 400M | 224 | 384 | 18 | 9.1 |
unicom_vit_large_patch14_224 | 2023 | ViT-L | dist | 768 | laion400m | 400M | 224 | 384 | 17.8 | 13.8 |
unicom_vit_large_patch14_336 | 2023 | ViT-L | dist | 768 | laion400m | 400M | 336 | 512 | 5.7 | 13.9 |
uscrr_64-vit_base_patch16_clip_224.openai | 2023 | ViT-B | sup | 768 | uned | 2.8M | 224 | 724 | 3 | 3.8 |
vgg16.tv_in1k | 2014 | CNN | sup | 512 | in1k | 1M | 224 | 384 | 1.4 | 2.3 |
vit_base_patch16_224.augreg_in1k | 2020 | ViT-B | sup | 768 | in1k | 1M | 224 | 384 | 4.2 | 1 |
vit_base_patch16_224.augreg_in21k | 2020 | ViT-B | sup | 768 | in21k | 14M | 224 | 384 | 8.8 | 3 |
vit_base_patch16_clip_224.metaclip_2pt5b | 2024 | ViT-B | vla | 768 | 2pt5b | 2.5B | 224 | 384 | 5.9 | 6.6 |
vit_base_patch16_clip_224.openai | 2021 | ViT-B | vla | 512 | openai | 400M | 224 | 384 | 14.1 | 4.2 |
vit_base_patch16_siglip_224.webli | 2023 | ViT-B | vla | 768 | webli | 10B | 224 | 384 | 14.6 | 11.2 |
vit_base_patch16_siglip_256.webli | 2023 | ViT-B | vla | 768 | webli | 10B | 256 | 384 | 19.3 | 11.5 |
vit_base_patch16_siglip_384.webli | 2023 | ViT-B | vla | 768 | webli | 10B | 384 | 512 | 20.1 | 15.6 |
vit_base_patch16_siglip_512.webli | 2023 | ViT-B | vla | 768 | webli | 10B | 512 | 724 | 11.8 | 16.6 |
vit_large_patch14_clip_224.laion2b | 2021 | ViT-L | vla | 768 | laion2b | 2B | 224 | 384 | 14.4 | 9.4 |
vit_large_patch14_clip_224.metaclip_2pt5b | 2024 | ViT-L | vla | 1024 | 2pt5b | 2.5B | 224 | 384 | 9 | 11.7 |
vit_large_patch14_clip_224.openai | 2021 | ViT-L | vla | 768 | openai | 400M | 224 | 384 | 12.1 | 7 |
vit_large_patch14_clip_336.openai | 2021 | ViT-L | vla | 768 | openai | 400M | 336 | 512 | 6 | 9.4 |
vit_large_patch16_224.augreg_in21k | 2020 | ViT-L | sup | 1024 | in21k | 14M | 224 | 384 | 5.1 | 4.6 |
vit_large_patch16_224.augreg_in21k_ft_in1k | 2020 | ViT-L | sup | 1024 | in1k | 1M | 224 | 384 | 7.2 | 3.6 |
vit_large_patch16_384.augreg_in21k_ft_in1k | 2020 | ViT-L | sup | 1024 | in1k | 1M | 384 | 512 | 18.8 | 5.3 |
vit_large_patch16_siglip_256.webli | 2023 | ViT-L | vla | 1024 | webli | 10B | 256 | 384 | 24.2 | 15.2 |
vit_large_patch16_siglip_384.webli | 2023 | ViT-L | vla | 1024 | webli | 10B | 384 | 512 | 24.4 | 19.6 |
vit_base_patch16_siglip_384.v2_webli | 2025 | ViT-B | vla | 768 | webli | 10B | 384 | 512 | 18.4 | 15.0 |
vit_base_patch16_siglip_512.v2_webli | 2025 | ViT-B | vla | 768 | webli | 10B | 512 | 724 | 18.6 | 15.4 |
vit_large_patch16_siglip_384.v2_webli | 2025 | ViT-L | vla | 1024 | webli | 10B | 384 | 512 | 24.6 | 19.9 |
vit_large_patch16_siglip_512.v2_webli | 2025 | ViT-L | vla | 1024 | webli | 10B | 512 | 724 | 25.3 | 20.8 |
Performance on text-to-image retrieval
name | year | arch | dims | dataset | data size | train res | test res | 5M | 100M |
---|---|---|---|---|---|---|---|---|---|
RN50.openai | 2021 | R50 | 1024 | openai | 400M | 224 | 384 | 2.3 | 1.5 |
vit_base_patch16_clip_224.openai | 2021 | ViT-B | 512 | openai | 400M | 224 | 384 | 2.7 | 1.6 |
vit_large_patch14_clip_224.openai | 2021 | ViT-L | 768 | openai | 400M | 224 | 384 | 6.7 | 4.6 |
vit_large_patch14_clip_336.openai | 2021 | ViT-L | 768 | openai | 400M | 336 | 512 | 8.4 | 5.8 |
vit_large_patch14_clip_224.laion2b | 2021 | ViT-L | 768 | laion2b | 2B | 224 | 384 | 9.4 | 7.0 |
convnext_base.clip_laion2b_augreg | 2022 | CN-B | 640 | laion2b | 2B | 256 | 384 | 7.0 | 4.6 |
convnext_large_mlp.clip_laion2b_ft_soup_320 | 2022 | CN-L | 768 | laion2b | 2B | 320 | 512 | 11.5 | 8.1 |
eva02_base_patch16_clip_224.merged2b | 2023 | ViT-B | 512 | merged2b | 2B | 224 | 384 | 4.4 | 2.5 |
eva02_large_patch14_clip_336.merged2b | 2023 | ViT-L | 768 | merged2b | 2B | 336 | 512 | 10.6 | 7.2 |
vit_base_patch16_siglip_224.webli | 2023 | ViT-B | 768 | webli | 10B | 224 | 384 | 10.1 | 7.1 |
vit_base_patch16_siglip_256.webli | 2023 | ViT-B | 768 | webli | 10B | 256 | 384 | 10.3 | 7.5 |
vit_base_patch16_siglip_384.webli | 2023 | ViT-B | 768 | webli | 10B | 384 | 512 | 14.4 | 11.0 |
vit_base_patch16_siglip_512.webli | 2023 | ViT-B | 768 | webli | 10B | 512 | 724 | 14.6 | 11.1 |
vit_large_patch16_siglip_256.webli | 2023 | ViT-L | 1024 | webli | 10B | 256 | 384 | 16.4 | 12.8 |
vit_large_patch16_siglip_384.webli | 2023 | ViT-L | 1024 | webli | 10B | 384 | 512 | 22.2 | 18.1 |
vit_base_patch16_clip_224.metaclip_2pt5b | 2024 | ViT-B | 768 | 2pt5b | 2.5B | 224 | 384 | 7.6 | 4.9 |
vit_large_patch14_clip_224.metaclip_2pt5b | 2024 | ViT-L | 1024 | 2pt5b | 2.5B | 224 | 384 | 13.1 | 9.2 |
vit_base_patch16_siglip_384.v2_webli | 2025 | ViT-B | 768 | webli | 10B | 384 | 512 | 15.1 | 11.1 |
vit_base_patch16_siglip_512.v2_webli | 2025 | ViT-B | 768 | webli | 10B | 512 | 724 | 14.6 | 10.4 |
vit_large_patch16_siglip_384.v2_webli | 2025 | ViT-L | 1024 | webli | 10B | 384 | 512 | 23.7 | 18.6 |
vit_large_patch16_siglip_512.v2_webli | 2025 | ViT-L | 1024 | webli | 10B | 512 | 724 | 24.7 | 19.8 |
Performance on image-to-image retrieval with re-ranking
name | year | type | global repr. | local repr. | top-NN | 100M | oracle |
---|---|---|---|---|---|---|---|
AMES + SigLIP (adapt) | 2024 | local | SigLIP-L@384 (adapt) | AMES-bin-dist (600) | 10k | 38.9 | 56.0 |
AMES + SigLIP2 (adapt) | 2024 | local | SigLIP2-L@512 (adapt) | AMES-bin-dist (100) | 1k | 38.4 | 62.7 |
AMES + SigLIP (adapt) | 2024 | local | SigLIP-L@384 (adapt) | AMES-bin-dist (100) | 10k | 36.7 | 56.0 |
AMES + SigLIP (adapt) | 2024 | local | SigLIP-L@384 (adapt) | AMES-bin-dist (100) | 1k | 35.6 | 56.0 |
AMES + DINOv2 (adapt) | 2024 | local | DINOv2-L (adapt) | AMES-bin-dist (100) | 1k | 21.8 | 34.0 |
AMES + OpenCLIP (adapt) | 2024 | local | OpenCLIP-CN-L@320 (adapt) | AMES-bin-dist (100) | 1k | 27.1 | 48.0 |
AMES + SigLIP | 2024 | local | SigLIP-L@384 | AMES-bin-dist (100) | 1k | 26.4 | 48.7 |
SP + SigLIP (adapt) | 2007 | local | SigLIP-L@384 (adapt) | DINOv2-B-reg + ITQ (100) | 1k | 30.5 | 56.0 |
SP + SigLIP | 2007 | local | SigLIP-L@384 | DINOv2-B-reg + ITQ (100) | 1k | 21.8 | 56.0 |
CS + SigLIP (adapt) | 2014 | local | SigLIP-L@384 (adapt) | DINOv2-B-reg + ITQ (100) | 1k | 32.5 | 56.0 |
CS + SigLIP | 2014 | local | SigLIP-L@384 | DINOv2-B-reg + ITQ (100) | 1k | 22.9 | 48.7 |
αQE1 + SigLIP (adapt) | 2019 | global | SigLIP-L@384 (adapt) | -- | full | 33.7 | 56.9 |
αQE2 + SigLIP (adapt) | 2019 | global | SigLIP-L@384 (adapt) | -- | full | 31.5 | 54.4 |
αQE5 + SigLIP (adapt) | 2019 | global | SigLIP-L@384 (adapt) | -- | full | 23.5 | 49.3 |
αQE1 + SigLIP | 2019 | global | SigLIP-L@384 | -- | full | 22.1 | 44.7 |
αQE2 + SigLIP | 2019 | global | SigLIP-L@384 | -- | full | 20.4 | 40.8 |
αQE5 + SigLIP | 2019 | global | SigLIP-L@384 | -- | full | 14.3 | 34.9 |
*(adapt) - Representations are linearly adapted via multi-domain learning on UnED
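One plausible reading of the "top-NN" column is that only a shortlist of the highest-ranked images from the global search is re-scored, while the rest of the ranking is left untouched. The sketch below illustrates that two-stage idea under this assumption. `rerank_top_nn` and the toy scores are hypothetical, and the stand-in scorer does not reproduce any of the re-ranking methods listed in the table.

```python
def rerank_top_nn(initial_ranking, rerank_score, top_nn):
    """Re-order the first `top_nn` items by a second-stage score; keep the tail as is.

    `initial_ranking` is a list of database ids from the global search.
    `rerank_score` maps an id to a similarity (higher is better).
    """
    shortlist = initial_ranking[:top_nn]
    tail = initial_ranking[top_nn:]
    # sorted() is stable, so ties keep their order from the global search.
    reranked = sorted(shortlist, key=lambda idx: -rerank_score(idx))
    return reranked + tail


# Ranking produced by a global descriptor (ids are arbitrary).
initial = [7, 3, 9, 1, 4]

# Hypothetical second-stage similarities, e.g. from a local-descriptor matcher.
local_scores = {7: 0.2, 3: 0.9, 9: 0.5, 1: 0.99, 4: 0.8}

final = rerank_top_nn(initial, local_scores.get, top_nn=3)
print(final)
```

Note that image 1 has the highest second-stage score but stays at rank 4 because it falls outside the shortlist. A larger shortlist gives the re-ranker more candidates to promote, at a higher computational cost.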
Citation
If you find our project useful, please consider citing us:
@inproceedings{ilias2025,
  title={{ILIAS}: Instance-Level Image retrieval At Scale},
  author={Kordopatis-Zilos, Giorgos and Stojnić, Vladan and Manko, Anna and Šuma, Pavel and Ypsilantis, Nikolaos-Antonios and Efthymiadis, Nikos and Laskar, Zakaria and Matas, Jiří and Chum, Ondřej and Tolias, Giorgos},
  booktitle={Computer Vision and Pattern Recognition (CVPR)},
  year={2025},
}
Results
Submit your results here:
If you have any further questions, please don't hesitate to reach out to kordogeo@fel.cvut.cz