ILIAS:
Instance-Level Image retrieval At Scale

Giorgos Kordopatis-Zilos Vladan Stojnić Anna Manko Pavel Šuma Nikolaos-Antonios Ypsilantis Nikos Efthymiadis Zakaria Laskar Jiří Matas Ondřej Chum Giorgos Tolias
Visual Recognition Group, Faculty of Electrical Engineering, Czech Technical University in Prague

Dataset intro

ILIAS is a large-scale test dataset for evaluation on Instance-Level Image retrieval At Scale.
It is designed to support future research in image-to-image and text-to-image retrieval for particular objects and serves as a benchmark for evaluating representations of foundation or customized vision and vision-language models, as well as specialized retrieval techniques.

The dataset includes 1,000 object instances across diverse domains, with:

  • 1,232 image queries, depicting the query objects on a clean or uniform background.
  • 4,715 positive images, featuring the query objects in real-world conditions with clutter, occlusions, scale variations, and partial views.
  • 1,000 text queries, providing fine-grained textual descriptions of the query objects.
  • 100M distractors from YFCC100M to evaluate retrieval performance in large-scale settings, while guaranteeing noise-free ground truth.
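To make the large-scale evaluation concrete, the sketch below shows how instance retrieval is typically scored: descriptors are L2-normalized, the database is ranked by cosine similarity to the query, and average precision is computed over the ranking. This is an illustration with synthetic vectors and a toy distractor pool, not the official ILIAS evaluation code.

```python
import numpy as np

rng = np.random.default_rng(0)

def average_precision(ranked_relevance):
    """AP of a single query, given binary relevance in rank order."""
    hits = np.flatnonzero(ranked_relevance)
    if hits.size == 0:
        return 0.0
    precision_at_hits = (np.arange(hits.size) + 1) / (hits + 1)
    return float(precision_at_hits.mean())

# Toy stand-in: a few positives hidden among many random distractors
# (ILIAS itself uses 4,715 positives and up to 100M YFCC100M distractors).
dim = 64
query = rng.normal(size=dim)
positives = query + 0.1 * rng.normal(size=(4, dim))   # noisy views of the query object
distractors = rng.normal(size=(10_000, dim))

database = np.vstack([positives, distractors])
relevant = np.zeros(len(database), dtype=bool)
relevant[:4] = True

# L2-normalize and rank the database by cosine similarity to the query.
q = query / np.linalg.norm(query)
db = database / np.linalg.norm(database, axis=1, keepdims=True)
ranking = np.argsort(-(db @ q))

ap = average_precision(relevant[ranking])
```

Growing the distractor pool (5M, then 100M in the benchmark) makes the same ranking problem progressively harder, which is why the tables below report both scales.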

1,000 object instances with 5,947 manually collected images -- 1,232 query images and 4,715 positives -- and 1,000 text queries

Retrieval in large-scale settings is achieved using 100M distractors from the YFCC100M dataset

Noise-free ground truth is guaranteed by collecting only objects that were not publicly available before 2014, the compilation date of YFCC100M

Foundation and legacy models are evaluated in combination with linear adaptation and various re-ranking techniques
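The linear adaptation amounts to a single linear layer applied on top of frozen backbone descriptors (trained in ILIAS via multi-domain learning on UnED). As a rough, self-contained stand-in for such an adapter, the sketch below fits a PCA-whitening projection on held-out features and applies it with L2-normalization before retrieval; the actual learned adapter is trained differently, so treat this only as an illustration of the pipeline shape.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen backbone descriptors of an adaptation set
# (made anisotropic on purpose, so the linear map has something to fix).
train_feats = rng.normal(size=(1_000, 128)) * np.linspace(1.0, 5.0, 128)

# Fit a linear map: here PCA whitening; ILIAS instead learns the linear
# layer with a multi-domain objective on UnED.
mean = train_feats.mean(axis=0)
cov = np.cov(train_feats - mean, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
W = eigvecs / np.sqrt(eigvals + 1e-8)   # column j scaled by 1/sqrt(lambda_j)

def adapt(x):
    """Apply the linear adapter, then L2-normalize for cosine retrieval."""
    z = (x - mean) @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

embeddings = adapt(rng.normal(size=(4, 128)))
```

Rows marked (adapt) in the tables below report performance after such a linear layer; unmarked rows use the raw backbone descriptors.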

Benchmark

performance on image-to-image retrieval

name year arch train dims dataset data size train res test res 5M 100M
(train: sup = supervised, ssl = self-supervised, vla = vision-language alignment, dist = distillation; 5M / 100M: performance with 5M and 100M distractors)
alexnet.tv_in1k (adapt) 2012 CNN sup 256 in1k 1M 224 384 1.9 1.3
convnext_base.clip_laion2b_augreg (adapt) 2022 CN-B vla 640 laion2b 2B 256 384 18.1 14
convnext_base.fb_in1k (adapt) 2022 CN-B sup 1024 in1k 1M 288 384 3.9 2.7
convnext_base.fb_in22k (adapt) 2022 CN-B sup 1536 in22k 14M 224 384 9.9 7.6
convnext_large_mlp.clip_laion2b_ft_soup_320 (adapt) 2022 CN-L vla 768 laion2b 2B 320 512 22.9 18.3
convnext_large.fb_in1k (adapt) 2022 CN-L sup 1024 in1k 1M 288 384 4.2 2.9
convnext_large.fb_in22k (adapt) 2022 CN-L sup 1536 in22k 14M 288 384 9.1 6.9
cvnet_resnet101 (adapt) 2022 R101 sup 2048 gldv2 1M 512 724 4.2 3.1
cvnet_resnet50 (adapt) 2022 R50 sup 2048 gldv2 1M 512 724 3.5 2.6
deit3_base_patch16_224.fb_in1k (adapt) 2021 ViT-B sup+dist 768 in1k 1M 224 384 2.7 1.8
deit3_large_patch16_224.fb_in1k (adapt) 2021 ViT-L sup+dist 1024 in1k 1M 224 384 3.3 2.4
densenet169.tv_in1k (adapt) 2016 CNN sup 2048 in1k 1M 224 384 2.9 2
dino_resnet50 (adapt) 2021 R50 ssl 2048 in1k 1M 224 384 4.1 2.9
dino_vitb16 (adapt) 2021 ViT-B ssl 768 in1k 1M 224 384 6.6 4.8
dinov2_vitb14 (adapt) 2023 ViT-B ssl 768 lvd142m 142M 518 724 15 12.1
dinov2_vitb14_reg (adapt) 2024 ViT-B ssl 768 lvd142m 142M 518 724 13.5 10.7
dinov2_vitl14 (adapt) 2023 ViT-L ssl 1024 lvd142m 142M 518 724 18.8 15.3
dinov2_vitl14_reg (adapt) 2024 ViT-L ssl 1024 lvd142m 142M 518 724 17.1 13.6
eva02_base_patch14_224.mim_in22k (adapt) 2023 ViT-B ssl 768 in22k 14M 224 384 4.7 3.2
eva02_base_patch16_clip_224.merged2b (adapt) 2023 ViT-B vla 512 merged2b 2B 224 384 11.7 8.7
eva02_large_patch14_224.mim_in22k (adapt) 2023 ViT-L ssl 1024 in22k 14M 224 384 3.9 2.7
eva02_large_patch14_224.mim_m38m (adapt) 2023 ViT-L ssl 1024 merged38m 38M 224 384 8.8 6.1
eva02_large_patch14_clip_336.merged2b (adapt) 2023 ViT-L vla 768 merged2b 2B 336 512 20.9 16
hier_dino_vits16_sop (adapt) 2023 ViT-S sup 384 sop 60k 224 384 5.1 3.6
inception_v4.tf_in1k (adapt) 2017 CNN sup 1536 in1k 1M 299 512 1.5 1
moco_v3_resnet50 (adapt) 2021 R50 ssl 2048 in1k 1M 224 384 3.4 2.6
moco_v3_vitb (adapt) 2021 ViT-B ssl 768 in1k 1M 224 384 3.2 2.3
nasnetalarge.tf_in1k (adapt) 2018 CNN sup 4032 in1k 1M 331 512 1.6 1
recall_512-resnet50 (adapt) 2022 R50 sup 512 sop 60k 224 384 3.1 2.1
recall_512-vit_base_patch16_224_in21k (adapt) 2022 ViT-B sup 512 sop 60k 224 384 7.3 5.3
resnet101.tv_in1k (adapt) 2015 R101 sup 2048 in1k 1M 224 384 2.7 1.8
resnet50.tv_in1k (adapt) 2015 R50 sup 2048 in1k 1M 224 384 2.5 1.8
RN50.openai (adapt) 2021 R50 vla 1024 openai 400M 224 384 8.5 6
superglobal_resnet101 (adapt) 2023 R101 sup 2048 gldv2 1M 512 724 4.5 3.2
superglobal_resnet50 (adapt) 2023 R50 sup 2048 in1k 1M 224 384 3 2.4
swav_resnet50 (adapt) 2021 R50 ssl 2048 in1k 1M 224 384 2.9 2.1
tf_efficientnet_b4.ns_jft_in1k (adapt) 2019 CNN sup+dist 1792 in1k 1M 380 512 4.3 2.9
udon_64-vitb_clip_openai (adapt) 2024 ViT-B sup 768 uned 2.8M 224 384 9.2 6.7
udon_64-vitb_in21k_ft_in1k (adapt) 2024 ViT-B sup 768 uned 2.8M 224 384 7.3 5.3
unic_l (adapt) 2024 ViT-L dist 1024 in1k 1M 518 512 15.3 11.7
unicom_vit_base_patch16_224 (adapt) 2023 ViT-B dist 768 laion400m 400M 224 384 13.8 11.1
unicom_vit_base_patch16_gldv2 (adapt) 2023 ViT-B sup 768 gldv2 400M 512 724 4.1 3.3
unicom_vit_base_patch16_sop (adapt) 2023 ViT-B sup 768 sop 400M 224 384 12.8 9.9
unicom_vit_large_patch14_224 (adapt) 2023 ViT-L dist 768 laion400m 400M 224 384 17.7 13.8
unicom_vit_large_patch14_336 (adapt) 2023 ViT-L dist 768 laion400m 400M 336 512 18.6 14.6
uscrr_64-vit_base_patch16_clip_224.openai (adapt) 2023 ViT-B sup 768 uned 2.8M 224 724 6.4 4.3
vgg16.tv_in1k (adapt) 2014 CNN sup 512 in1k 1M 224 384 2.3 1.6
vit_base_patch16_224.augreg_in1k (adapt) 2020 ViT-B sup 768 in1k 1M 224 384 1.9 1.3
vit_base_patch16_224.augreg_in21k (adapt) 2020 ViT-B sup 768 in21k 14M 224 384 6.2 4.4
vit_base_patch16_clip_224.metaclip_2pt5b (adapt) 2024 ViT-B vla 768 2pt5b 2.5B 224 384 12.7 9.4
vit_base_patch16_clip_224.openai (adapt) 2021 ViT-B vla 512 openai 400M 224 384 10.7 7.9
vit_base_patch16_siglip_224.webli (adapt) 2023 ViT-B vla 768 webli 10B 224 384 19.4 15.7
vit_base_patch16_siglip_256.webli (adapt) 2023 ViT-B vla 768 webli 10B 256 384 20.6 16.7
vit_base_patch16_siglip_384.webli (adapt) 2023 ViT-B vla 768 webli 10B 384 512 26.2 21.5
vit_base_patch16_siglip_512.webli (adapt) 2023 ViT-B vla 768 webli 10B 512 724 27.5 23
vit_large_patch14_clip_224.laion2b (adapt) 2021 ViT-L vla 768 laion2b 2B 224 384 17.5 13.7
vit_large_patch14_clip_224.metaclip_2pt5b (adapt) 2024 ViT-L vla 1024 2pt5b 2.5B 224 384 21.7 16.9
vit_large_patch14_clip_224.openai (adapt) 2021 ViT-L vla 768 openai 400M 224 384 15.8 11.9
vit_large_patch14_clip_336.openai (adapt) 2021 ViT-L vla 768 openai 400M 336 512 19.9 15.2
vit_large_patch16_224.augreg_in21k (adapt) 2020 ViT-L sup 1024 in21k 14M 224 384 7.3 5.3
vit_large_patch16_224.augreg_in21k_ft_in1k (adapt) 2020 ViT-L sup 1024 in1k 1M 224 384 6.6 4.7
vit_large_patch16_384.augreg_in21k_ft_in1k (adapt) 2020 ViT-L sup 1024 in1k 1M 384 512 8.7 6.4
vit_large_patch16_siglip_256.webli (adapt) 2023 ViT-L vla 1024 webli 10B 256 384 26.3 21.8
vit_large_patch16_siglip_384.webli (adapt) 2023 ViT-L vla 1024 webli 10B 384 512 34.3 28.9
vit_base_patch16_siglip_384.v2_webli (adapt) 2025 ViT-B vla 768 webli 10B 384 512 27.5 22.6
vit_base_patch16_siglip_512.v2_webli (adapt) 2025 ViT-B vla 768 webli 10B 512 724 28.6 23.5
vit_large_patch16_siglip_384.v2_webli (adapt) 2025 ViT-L vla 1024 webli 10B 384 512 36.3 30.3
vit_large_patch16_siglip_512.v2_webli (adapt) 2025 ViT-L vla 1024 webli 10B 512 724 37.3 31.3
alexnet.tv_in1k 2012 CNN sup 256 in1k 1M 224 384 2 1.5
convnext_base.clip_laion2b_augreg 2022 CN-B vla 640 laion2b 2B 256 384 10.7 7.9
convnext_base.fb_in1k 2022 CN-B sup 1024 in1k 1M 288 384 2.8 2
convnext_base.fb_in22k 2022 CN-B sup 1536 in22k 14M 224 384 8.9 6.4
convnext_large_mlp.clip_laion2b_ft_soup_320 2022 CN-L vla 768 laion2b 2B 320 512 12.7 9.6
convnext_large.fb_in1k 2022 CN-L sup 1024 in1k 1M 288 384 3.2 2.2
convnext_large.fb_in22k 2022 CN-L sup 1536 in22k 14M 288 384 8.6 6.6
cvnet_resnet101 2022 R101 sup 2048 gldv2 1M 512 724 3.9 3
cvnet_resnet50 2022 R50 sup 2048 gldv2 1M 512 724 3.7 2.9
deit3_base_patch16_224.fb_in1k 2021 ViT-B sup+dist 768 in1k 1M 224 384 1.9 1.2
deit3_large_patch16_224.fb_in1k 2021 ViT-L sup+dist 1024 in1k 1M 224 384 2 1.5
densenet169.tv_in1k 2016 CNN sup 2048 in1k 1M 224 384 3.2 2.4
dino_resnet50 2021 R50 ssl 2048 in1k 1M 224 384 3.8 2.9
dino_vitb16 2021 ViT-B ssl 768 in1k 1M 224 384 5 3.7
dinov2_vitb14 2023 ViT-B ssl 768 lvd142m 142M 518 724 14.3 11.5
dinov2_vitb14_reg 2024 ViT-B ssl 768 lvd142m 142M 518 724 11.8 9.4
dinov2_vitl14 2023 ViT-L ssl 1024 lvd142m 142M 518 724 18.5 15.3
dinov2_vitl14_reg 2024 ViT-L ssl 1024 lvd142m 142M 518 724 15.9 12.7
eva02_base_patch14_224.mim_in22k 2023 ViT-B ssl 768 in22k 14M 224 384 3.1 2.1
eva02_base_patch16_clip_224.merged2b 2023 ViT-B vla 512 merged2b 2B 224 384 7.8 5.9
eva02_large_patch14_224.mim_in22k 2023 ViT-L ssl 1024 in22k 14M 224 384 2.5 1.5
eva02_large_patch14_224.mim_m38m 2023 ViT-L ssl 1024 merged38m 38M 224 384 6.7 4.7
eva02_large_patch14_clip_336.merged2b 2023 ViT-L vla 768 merged2b 2B 336 512 13.6 10.9
hier_dino_vits16_sop 2023 ViT-S sup 384 sop 60k 224 384 1.7 3.3
inception_v4.tf_in1k 2017 CNN sup 1536 in1k 1M 299 512 3.3 1.1
moco_v3_resnet50 2021 R50 ssl 2048 in1k 1M 224 384 2.5 2.6
moco_v3_vitb 2021 ViT-B ssl 768 in1k 1M 224 384 1.7 1.9
nasnetalarge.tf_in1k 2018 CNN sup 4032 in1k 1M 331 512 2.3 1
recall_512-resnet50 2022 R50 sup 512 sop 60k 224 384 6.8 1.6
recall_512-vit_base_patch16_224_in21k 2022 ViT-B sup 512 sop 60k 224 384 2.7 5
resnet101.tv_in1k 2015 R101 sup 2048 in1k 1M 224 384 2.3 1.9
resnet50.tv_in1k 2015 R50 sup 2048 in1k 1M 224 384 4.4 1.7
RN50.openai 2021 R50 vla 1024 openai 400M 224 384 4.5 3.2
superglobal_resnet101 2023 R101 sup 2048 gldv2 1M 512 724 4.3 3.4
superglobal_resnet50 2023 R50 sup 2048 in1k 1M 224 384 2.2 2
swav_resnet50 2021 R50 ssl 2048 in1k 1M 224 384 3.8 1.7
tf_efficientnet_b4.ns_jft_in1k 2019 CNN sup+dist 1792 in1k 1M 380 512 7.5 2.6
udon_64-vitb_clip_openai 2024 ViT-B sup 768 uned 2.8M 224 384 8.3 5.9
udon_64-vitb_in21k_ft_in1k 2024 ViT-B sup 768 uned 2.8M 224 384 11.4 5.5
unic_l 2024 ViT-L dist 1024 in1k 1M 518 512 13.8 8.9
unicom_vit_base_patch16_224 2023 ViT-B dist 768 laion400m 400M 224 384 3.7 11
unicom_vit_base_patch16_gldv2 2023 ViT-B sup 768 gldv2 400M 512 724 12.2 3
unicom_vit_base_patch16_sop 2023 ViT-B sup 768 sop 400M 224 384 18 9.1
unicom_vit_large_patch14_224 2023 ViT-L dist 768 laion400m 400M 224 384 17.8 13.8
unicom_vit_large_patch14_336 2023 ViT-L dist 768 laion400m 400M 336 512 5.7 13.9
uscrr_64-vit_base_patch16_clip_224.openai 2023 ViT-B sup 768 uned 2.8M 224 724 3 3.8
vgg16.tv_in1k 2014 CNN sup 512 in1k 1M 224 384 1.4 2.3
vit_base_patch16_224.augreg_in1k 2020 ViT-B sup 768 in1k 1M 224 384 4.2 1
vit_base_patch16_224.augreg_in21k 2020 ViT-B sup 768 in21k 14M 224 384 8.8 3
vit_base_patch16_clip_224.metaclip_2pt5b 2024 ViT-B vla 768 2pt5b 2.5B 224 384 5.9 6.6
vit_base_patch16_clip_224.openai 2021 ViT-B vla 512 openai 400M 224 384 14.1 4.2
vit_base_patch16_siglip_224.webli 2023 ViT-B vla 768 webli 10B 224 384 14.6 11.2
vit_base_patch16_siglip_256.webli 2023 ViT-B vla 768 webli 10B 256 384 19.3 11.5
vit_base_patch16_siglip_384.webli 2023 ViT-B vla 768 webli 10B 384 512 20.1 15.6
vit_base_patch16_siglip_512.webli 2023 ViT-B vla 768 webli 10B 512 724 11.8 16.6
vit_large_patch14_clip_224.laion2b 2021 ViT-L vla 768 laion2b 2B 224 384 14.4 9.4
vit_large_patch14_clip_224.metaclip_2pt5b 2024 ViT-L vla 1024 2pt5b 2.5B 224 384 9 11.7
vit_large_patch14_clip_224.openai 2021 ViT-L vla 768 openai 400M 224 384 12.1 7
vit_large_patch14_clip_336.openai 2021 ViT-L vla 768 opanai 400M 336 512 6 9.4
vit_large_patch16_224.augreg_in21k 2020 ViT-L sup 1024 in21k 14M 224 384 5.1 4.6
vit_large_patch16_224.augreg_in21k_ft_in1k 2020 ViT-L sup 1024 in1k 1M 224 384 7.2 3.6
vit_large_patch16_384.augreg_in21k_ft_in1k 2020 ViT-L sup 1024 in1k 1M 384 512 18.8 5.3
vit_large_patch16_siglip_256.webli 2023 ViT-L vla 1024 webli 10B 256 384 24.2 15.2
vit_large_patch16_siglip_384.webli 2023 ViT-L vla 1024 webli 10B 384 512 24.4 19.6
vit_base_patch16_siglip_384.v2_webli 2025 ViT-B vla 768 webli 10B 384 512 18.4 15.0
vit_base_patch16_siglip_512.v2_webli 2025 ViT-B vla 768 webli 10B 512 724 18.6 15.4
vit_large_patch16_siglip_384.v2_webli 2025 ViT-L vla 1024 webli 10B 384 512 24.6 19.9
vit_large_patch16_siglip_512.v2_webli 2025 ViT-L vla 1024 webli 10B 512 724 25.3 20.8

performance on text-to-image retrieval

name year arch dims dataset data size train res test res 5M 100M
(5M / 100M: performance with 5M and 100M distractors)
RN50.openai 2021 R50 1024 openai 400M 224 384 2.3 1.5
vit_base_patch16_clip_224.openai 2021 ViT-B 512 openai 400M 224 384 2.7 1.6
vit_large_patch14_clip_224.openai 2021 ViT-L 768 openai 400M 224 384 6.7 4.6
vit_large_patch14_clip_336.openai 2021 ViT-L 768 openai 400M 336 512 8.4 5.8
vit_large_patch14_clip_224.laion2b 2021 ViT-L 768 laion2b 2B 224 384 9.4 7.0
convnext_base.clip_laion2b_augreg 2022 CN-B 640 laion2b 2B 256 384 7.0 4.6
convnext_large_mlp.clip_laion2b_ft_soup_320 2022 CN-L 768 laion2b 2B 320 512 11.5 8.1
eva02_base_patch16_clip_224.merged2b 2023 ViT-B 512 merged2b 2B 224 384 4.4 2.5
eva02_large_patch14_clip_336.merged2b 2023 ViT-L 768 merged2b 2B 336 512 10.6 7.2
vit_base_patch16_siglip_224.webli 2023 ViT-B 768 webli 10B 224 384 10.1 7.1
vit_base_patch16_siglip_256.webli 2023 ViT-B 768 webli 10B 256 384 10.3 7.5
vit_base_patch16_siglip_384.webli 2023 ViT-B 768 webli 10B 384 512 14.4 11.0
vit_base_patch16_siglip_512.webli 2023 ViT-B 768 webli 10B 512 724 14.6 11.1
vit_large_patch16_siglip_256.webli 2023 ViT-L 1024 webli 10B 256 384 16.4 12.8
vit_large_patch16_siglip_384.webli 2023 ViT-L 1024 webli 10B 384 512 22.2 18.1
vit_base_patch16_clip_224.metaclip_2pt5b 2024 ViT-B 768 2pt5b 2.5B 224 384 7.6 4.9
vit_large_patch14_clip_224.metaclip_2pt5b 2024 ViT-L 1024 2pt5b 2.5B 224 384 13.1 9.2
vit_base_patch16_siglip_384.v2_webli 2025 ViT-B 768 webli 10B 384 512 15.1 11.1
vit_base_patch16_siglip_512.v2_webli 2025 ViT-B 768 webli 10B 512 724 14.6 10.4
vit_large_patch16_siglip_384.v2_webli 2025 ViT-L 1024 webli 10B 384 512 23.7 18.6
vit_large_patch16_siglip_512.v2_webli 2025 ViT-L 1024 webli 10B 512 724 24.7 19.8

performance on image-to-image retrieval with re-ranking

name year type global repr. local repr. top-NN 100M oracle
(top-NN: size of the shortlist that is re-ranked; 100M: performance with 100M distractors; oracle: upper bound with perfect re-ranking of the shortlist)
AMES + SigLIP (adapt) 2024 local SigLIP-L@384 (adapt) AMES-bin-dist (600) 10k 38.9 56.0
AMES + SigLIP2 (adapt) 2024 local SigLIP2-L@512 (adapt) AMES-bin-dist (100) 1k 38.4 62.7
AMES + SigLIP (adapt) 2024 local SigLIP-L@384 (adapt) AMES-bin-dist (100) 10k 36.7 56.0
AMES + SigLIP (adapt) 2024 local SigLIP-L@384 (adapt) AMES-bin-dist (100) 1k 35.6 56.0
AMES + DINOv2 (adapt) 2024 local DINOv2-L (adapt) AMES-bin-dist (100) 1k 21.8 34.0
AMES + OpenCLIP (adapt) 2024 local OpenCLIP-CN-L@320 (adapt) AMES-bin-dist (100) 1k 27.1 48.0
AMES + SigLIP 2024 local SigLIP-L@384 AMES-bin-dist (100) 1k 26.4 48.7
SP + SigLIP (adapt) 2007 local SigLIP-L@384 (adapt) DINOv2-B-reg + ITQ (100) 1k 30.5 56.0
SP + SigLIP 2007 local SigLIP-L@384 DINOv2-B-reg + ITQ (100) 1k 21.8 56.0
CS + SigLIP (adapt) 2014 local SigLIP-L@384 (adapt) DINOv2-B-reg + ITQ (100) 1k 32.5 56.0
CS + SigLIP 2014 local SigLIP-L@384 DINOv2-B-reg + ITQ (100) 1k 22.9 48.7
αQE1 + SigLIP (adapt) 2019 global SigLIP-L@384 (adapt) -- full 33.7 56.9
αQE2 + SigLIP (adapt) 2019 global SigLIP-L@384 (adapt) -- full 31.5 54.4
αQE5 + SigLIP (adapt) 2019 global SigLIP-L@384 (adapt) -- full 23.5 49.3
αQE1 + SigLIP 2019 global SigLIP-L@384 -- full 22.1 44.7
αQE2 + SigLIP 2019 global SigLIP-L@384 -- full 20.4 40.8
αQE5 + SigLIP 2019 global SigLIP-L@384 -- full 14.3 34.9

*(adapt) - Representations are linearly adapted via multi-domain learning on UnED
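The αQE rows above correspond to alpha query expansion: the query descriptor is re-issued as a similarity-weighted average of itself and its top-k neighbors, with each neighbor weighted by its similarity raised to the power α. A minimal numpy sketch under the assumption of L2-normalized global descriptors (illustrative only, not the benchmark implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha_qe(query, db, k=5, alpha=2.0):
    """Alpha query expansion: average the query with its top-k database
    neighbors, each weighted by its cosine similarity raised to alpha."""
    sims = db @ query                       # all inputs are L2-normalized
    top = np.argsort(-sims)[:k]
    weights = np.concatenate([[1.0], sims[top] ** alpha])
    expanded = weights @ np.vstack([query, db[top]])
    return expanded / np.linalg.norm(expanded)

# Synthetic database of unit-norm global descriptors.
db = rng.normal(size=(1_000, 64))
db /= np.linalg.norm(db, axis=1, keepdims=True)

# A query that is a noisy view of database item 0.
q = db[0] + 0.05 * rng.normal(size=64)
q /= np.linalg.norm(q)

q_exp = alpha_qe(q, db, k=3, alpha=2.0)     # alpha=2 corresponds to the αQE2 rows
ranking = np.argsort(-(db @ q_exp))
```

Larger α concentrates the expanded query on the most similar neighbors; the αQE1 / αQE2 / αQE5 rows in the table differ only in this exponent.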

Explore the collected data for your instance-level research!


Citation

If you find our project useful, please consider citing us:

@inproceedings{ilias2025,
title={{ILIAS}: Instance-Level Image retrieval At Scale},
author={Kordopatis-Zilos, Giorgos and Stojnić, Vladan and Manko, Anna and Šuma, Pavel and Ypsilantis, Nikolaos-Antonios and Efthymiadis, Nikos and Laskar, Zakaria and Matas, Jiří and Chum, Ondřej and Tolias, Giorgos},
booktitle={Computer Vision and Pattern Recognition (CVPR)},
year={2025},
}

Results

Submit your results here:

If you have any further questions, please don't hesitate to reach out to kordogeo@fel.cvut.cz