Instance detection (InsDet) aims to localize specific object instances within novel scene imagery based on given visual references. We elaborate on the open-world challenges of InsDet: (1) the testing data distribution is unknown during training, and (2) there are domain gaps between visual references and detected proposals.
Motivated by InsDet's open-world nature, we exploit diverse open data and foundation models (FMs) to solve InsDet in the open world. To better adapt FMs for instance-level feature matching, we introduce distractor sampling, which samples patches of random background images as universal negative data for all object instances, and novel-view synthesis, which generates more visual references used not only for training but also for testing. Our IDOW outperforms prior works by >10 AP in both conventional and novel instance detection settings.
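The distractor-sampling idea above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the function name `sample_distractors` and its parameters are assumptions, and the feature-extraction and training steps that consume these patches are omitted. It simply crops random patches from background images so they can serve as universal negatives shared by all object instances.

```python
import random
from PIL import Image

def sample_distractors(background_paths, num_patches=100, patch_size=224, seed=0):
    """Crop random square patches from background images as universal negatives."""
    rng = random.Random(seed)
    patches = []
    for _ in range(num_patches):
        path = rng.choice(background_paths)
        img = Image.open(path).convert("RGB")
        w, h = img.size
        # Upscale images that are smaller than the patch size.
        if w < patch_size or h < patch_size:
            img = img.resize((max(w, patch_size), max(h, patch_size)))
            w, h = img.size
        left = rng.randint(0, w - patch_size)
        top = rng.randint(0, h - patch_size)
        patches.append(img.crop((left, top, left + patch_size, top + patch_size)))
    return patches
```

During adaptation, these patches would be labeled as "no instance" so the matcher learns to reject background proposals regardless of which object it is matching.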
Existing InsDet methods leverage open-world information in different aspects: (a) background image sampling (from the open world) to synthesize training data, (b) object image sampling (from the open world) to learn feature representations, and (c) foundation model utilization (pretrained in the open world) for proposal detection and instance-level feature matching.
As FMs are not specifically designed for instance-level feature matching required by InsDet, we propose to adapt them by leveraging rich data sampled from the open world. We gather data from multiple sources:
We use the off-the-shelf foundation model (OTS-FM) GroundingDINO as a baseline over which we incrementally add each strategy. Train denotes foundation model adaptation through finetuning on the available data. DA denotes Data Augmentation with NeRF-generated novel views; DS denotes Distractor Sampling. Results clearly demonstrate that all four strategies help achieve better InsDet performance.
We compare our finetuned DINOv2 against a finetuned ImageNet-pretrained ResNet101 and the baseline instance detector CPL, which is built on a Faster R-CNN architecture. Visually, the finetuned DINOv2 extracts more discriminative features.
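The instance-level feature matching described above can be sketched as a cosine-similarity assignment. This is a hedged sketch under assumptions: `match_proposals` is a hypothetical helper, the feature extraction (e.g., by a finetuned DINOv2) is omitted, and the inputs are assumed to be per-proposal and per-instance embedding matrices.

```python
import numpy as np

def match_proposals(proposal_feats, reference_feats, threshold=0.5):
    """Assign each proposal to the most cosine-similar instance reference.

    proposal_feats:  (P, D) array, one feature per detected proposal.
    reference_feats: (N, D) array, one feature per object instance.
    Returns (ids, scores); ids[i] is -1 when the best score falls below threshold.
    """
    # L2-normalize so the dot product equals cosine similarity.
    p = proposal_feats / np.linalg.norm(proposal_feats, axis=1, keepdims=True)
    r = reference_feats / np.linalg.norm(reference_feats, axis=1, keepdims=True)
    sims = p @ r.T                 # (P, N) cosine similarities
    ids = sims.argmax(axis=1)
    scores = sims.max(axis=1)
    ids[scores < threshold] = -1   # reject low-confidence matches as background
    return ids, scores
```

More discriminative features, as produced by the finetuned DINOv2, widen the similarity gap between the correct instance and the rest, making this nearest-reference assignment more reliable.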
If you find our work useful, please consider citing our paper:
@inproceedings{shen2025solving,
title={Solving Instance Detection from an Open-World Perspective},
author={Shen, Qianqian and Zhao, Yunhan and Kwon, Nahyun and Kim, Jeeeun and Li, Yanan and Kong, Shu},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025}
}