Solving Instance Detection from an Open-World Perspective

1Zhejiang University
2UC Irvine
3Texas A&M University
4Zhejiang Lab
5University of Macau
6Institute of Collaborative Innovation

CVPR 2025

Overview

Instance detection (InsDet) aims to localize specific object instances in novel scene imagery based on given visual references. We elaborate on the open-world nature of InsDet: (1) the testing data distribution is unknown during training, and (2) there are domain gaps between the visual references and the detected proposals.

Motivated by InsDet's open-world nature, we exploit diverse open data and foundation models (FMs) to solve InsDet in the open world. To better adapt FMs for instance-level feature matching, we introduce distractor sampling, which samples patches of random background images as universal negative data for all object instances, and novel-view synthesis, which generates additional visual references for use not only in training but also in testing. Our method, IDOW, outperforms prior works by >10 AP in both the conventional and novel instance detection settings.
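To make this concrete, below is a minimal sketch of the instance-level feature matching step, assuming proposal and reference embeddings have already been extracted by a foundation model (e.g., GroundingDINO/SAM for proposals and DINOv2 for features); the function name and threshold are illustrative, not our released implementation.

import torch
import torch.nn.functional as F

def match_instances(proposal_feats, reference_feats, threshold=0.5):
    # proposal_feats: (P, D) embeddings of detected proposals
    # reference_feats: (N, D) embeddings of instance visual references
    p = F.normalize(proposal_feats, dim=-1)
    r = F.normalize(reference_feats, dim=-1)
    sim = p @ r.t()                    # (P, N) cosine similarity matrix
    scores, labels = sim.max(dim=-1)   # best-matching instance per proposal
    keep = scores > threshold          # reject low-similarity proposals
    return labels[keep], scores[keep]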

Our Findings


Existing InsDet methods leverage open-world information in different ways: (a) background image sampling (from the open world) to synthesize training data, (b) object image sampling (from the open world) to learn feature representations, and (c) foundation model utilization (pretrained in the open world) for proposal detection and instance-level feature matching.

Solution


As FMs are not specifically designed for the instance-level feature matching required by InsDet, we propose to adapt them by leveraging rich data sampled from the open world. We gather data from multiple sources:

  1. Any available visual references of instances in the conventional instance detection (CID) setting;
  2. Abundant multi-view object images sampled from the open world, similar to object image sampling;
  3. Synthetic data generated by training a NeRF to render novel-view images of the given instances;
  4. Distractors obtained by running FMs (esp. SAM) on random open-world imagery to generate random object-like proposals.
We use the data above to adapt FMs through metric learning. The technical novelty of our work lies in the last two sources, as well as in the design choice of metric learning to adapt FMs for InsDet.
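As an illustration, here is a hedged sketch of such metric-learning adaptation with distractors as universal negatives; the InfoNCE-style loss, batching, and names are our assumptions for exposition, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def adaptation_loss(anchor, positive, distractors, tau=0.07):
    # anchor, positive: (B, D) embeddings of two views of the same instances
    # distractors: (M, D) embeddings of object-like background proposals
    # (e.g., SAM crops) that act as negatives for every instance
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    d = F.normalize(distractors, dim=-1)
    pos = (a * p).sum(dim=-1, keepdim=True) / tau   # (B, 1) positive logits
    neg = (a @ d.t()) / tau                         # (B, M) negative logits
    logits = torch.cat([pos, neg], dim=1)           # positive sits at index 0
    target = torch.zeros(a.size(0), dtype=torch.long, device=a.device)
    return F.cross_entropy(logits, target)

Each instance's embedding is pulled toward its other views and pushed away from the shared distractor pool, which is what makes the distractors universal negatives.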

Results

IDOW achieves state-of-the-art performance


Remarkably, we show that:

  • IDOW significantly outperforms previous methods, e.g., IDOW-GroundingDINO (57.01 AP) > OTS-FM-GroundingDINO (51.68 AP) > CPL-DINO (27.99 AP). This confirms the importance of addressing InsDet from the open-world perspective.
  • Adapting FMs with our IDOW further boosts performance by 5-7 AP, e.g., IDOW-SAM (48.75 AP) > OTS-FM-SAM (41.61 AP).
  • Both IDOW and OTS-FM are applicable to different pretrained FMs, and adopting stronger FMs yields better performance, e.g., using GroundingDINO achieves >8 AP higher than using SAM in IDOW.

IDOW adapts FMs with diverse open data sources


We use OTS-FM-GroundingDINO as the baseline over which we incrementally add each strategy. Train denotes foundation model adaptation through finetuning on the available data; DA denotes data augmentation with NeRF-generated novel views; DS denotes distractor sampling. The results clearly demonstrate that all four strategies help achieve better InsDet performance.

  • Finetuning FMs on the given visual references enhances detection performance, cf. Train (53.94 AP) > baseline (51.68 AP).
  • Moreover, NeRF-based data augmentation improves the final detection performance, particularly when used in testing (see the sketch after this list), cf. Train+DA@Test (56.44 AP) > Train+DA@Train (54.48 AP) > baseline (51.68 AP).
  • Lastly, applying distractor sampling (DS) further improves the final performance, cf. Train+DS (54.10 AP) > Train (53.94 AP).
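A minimal sketch of how NeRF-synthesized views can be used at test time, under our illustrative assumptions: rendered novel views simply expand the reference gallery, so each proposal is matched against more viewpoints per instance (embed is a hypothetical feature extractor, e.g., the adapted DINOv2).

import torch

def build_reference_gallery(ref_views, nvs_views, embed):
    # ref_views[i], nvs_views[i]: lists of image tensors for instance i
    # embed: maps an image tensor to a (D,) feature vector
    feats, ids = [], []
    for inst_id, views in enumerate(ref_views):
        for img in views + nvs_views[inst_id]:  # originals + rendered views
            feats.append(embed(img))
            ids.append(inst_id)
    return torch.stack(feats), torch.tensor(ids)  # (N, D) gallery, (N,) instance ids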

Comparison between finetuned FMs and other backbones


We compare our finetuned DINOv2 against a finetuned ImageNet-pretrained ResNet101 and the baseline instance detector CPL built on a Faster R-CNN architecture. Visually, the finetuned DINOv2 extracts more discriminative features.
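For reference, a minimal sketch of how such a backbone comparison can be set up, assuming torch.hub access to DINOv2 and an ImageNet-pretrained ResNet101 from torchvision (the setup details are illustrative):

import torch
import torchvision.models as tvm

dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14').eval()
resnet = tvm.resnet101(weights=tvm.ResNet101_Weights.IMAGENET1K_V2).eval()
resnet.fc = torch.nn.Identity()  # expose 2048-d pooled features instead of class logits

@torch.no_grad()
def embed(model, img):
    # img: (1, 3, H, W), ImageNet-normalized; for DINOv2, H and W should be
    # multiples of the patch size (14)
    return torch.nn.functional.normalize(model(img), dim=-1)

Comparing cosine similarities of same-instance versus different-instance crops under each backbone then makes the discriminability gap quantitative.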

More visualizations


BibTeX

If you find our work useful, please consider citing our paper:

@inproceedings{shen2025solving,
  title={Solving Instance Detection from an Open-World Perspective},
  author={Shen, Qianqian and Zhao, Yunhan and Kwon, Nahyun and Kim, Jeeeun and Li, Yanan and Kong, Shu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}