Open Vocabulary Object Detection with LLMs
What is Open Vocabulary Object Detection?
- Open vocabulary object detection aims to build object detectors that can recognize a large number of object categories, including both 'base' categories seen during training and 'novel' categories never seen before.
- Traditionally, object detectors have been limited to a fixed set of categories defined during training. Open vocabulary detection removes this limitation by leveraging advances in vision-language models (VLMs) to detect a much broader set of object classes.
- This is crucial for real-world applications where the set of objects to be detected is not known a priori and can change over time. (Kim et al., 2024), (Cho et al., 2023)
Role of Large Language Models (LLMs)
- Large language models (LLMs) like GPT have shown impressive capabilities in understanding and generating natural language. They capture rich semantic knowledge about the world, which open vocabulary detection can exploit.
- LLMs provide powerful language understanding that complements the visual recognition capabilities of object detectors. By aligning language and visual representations, they enable detectors to recognize a much broader set of object categories, including novel ones (a minimal sketch of this alignment follows this list). (Lei et al., 2024), (Minderer et al., 2023)
- LLMs can also supply rich contextual information beyond bare category names, helping detectors understand and localize objects in complex scenes rather than merely recognizing them. (Lei et al., 2024), (Wu et al., 2023)
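To make the alignment idea concrete, here is a minimal sketch: a detector's region embeddings are classified by cosine similarity against text embeddings of category names, so the vocabulary can be extended at inference time simply by adding more text embeddings. All tensors, dimensions, and the temperature value below are placeholder assumptions for illustration, not details from any cited paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical dimensions: a detector proposes regions and embeds each one
# into the same space as the text encoder's category embeddings.
embed_dim = 512
num_regions = 100

# Placeholder region embeddings (in practice, produced by the detection head).
region_feats = F.normalize(torch.randn(num_regions, embed_dim), dim=-1)

# Placeholder text embeddings, one per category name (in practice, produced by
# a frozen text encoder such as CLIP's). Novel categories are handled by simply
# adding their embeddings at inference time.
categories = ["cat", "dog", "unicycle"]  # "unicycle" stands in for a novel class
text_feats = F.normalize(torch.randn(len(categories), embed_dim), dim=-1)

# Classification is cosine similarity scaled by a temperature, followed by a
# softmax over whatever vocabulary is supplied.
temperature = 0.01
logits = region_feats @ text_feats.T / temperature
probs = logits.softmax(dim=-1)

best = probs.argmax(dim=-1)
print(f"region 0 -> {categories[best[0].item()]} (p={probs[0, best[0]].item():.2f})")
```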
Key Approaches
Vision-Language Knowledge Distillation
- One approach is to leverage pre-trained VLMs like CLIP by distilling their knowledge into an object detector, so the detector inherits the VLM's open-vocabulary capabilities (a sketch of the distillation loss follows this list). (Kim et al., 2024), (Cho et al., 2023)
- Challenges include preserving the VLM's open-vocabulary knowledge during detector training and effectively aligning the visual and language representations. (Minderer et al., 2023)
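A minimal sketch of one common distillation recipe (in the spirit of ViLD): the detector's region embeddings are regressed toward the frozen VLM's image embeddings of the corresponding cropped proposals. The tensors below are random placeholders standing in for real detector and teacher outputs.

```python
import torch
import torch.nn.functional as F

def distillation_loss(region_embeds: torch.Tensor,
                      teacher_embeds: torch.Tensor) -> torch.Tensor:
    """L1 distillation between detector region embeddings and frozen
    VLM image embeddings of the corresponding cropped proposals."""
    region_embeds = F.normalize(region_embeds, dim=-1)
    teacher_embeds = F.normalize(teacher_embeds, dim=-1)
    return F.l1_loss(region_embeds, teacher_embeds)

# Placeholder tensors; in practice teacher embeddings come from passing each
# cropped proposal through the frozen VLM image encoder (e.g., CLIP's).
num_proposals, embed_dim = 32, 512
student = torch.randn(num_proposals, embed_dim, requires_grad=True)
teacher = torch.randn(num_proposals, embed_dim)  # no grad: teacher is frozen

loss = distillation_loss(student, teacher)
loss.backward()  # gradients flow only into the detector's embeddings
print(loss.item())
```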
Pseudo-Labeling with Weak Supervision
- Another approach is self-training: an existing detector generates pseudo-labels on weakly supervised data (e.g., image-text pairs) to expand the training set (see the filtering sketch after this list). (Minderer et al., 2023), (Cho et al., 2023)
- Challenges include choosing an appropriate label space, filtering pseudo-labels effectively, and training efficiently on large-scale weakly supervised data. (Minderer et al., 2023)
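The sketch below illustrates one plausible filtering step: keep a detection as a pseudo-label only if it is confident and its predicted category is grounded in the paired caption. The `Detection` class, threshold, and substring match are illustrative assumptions; real pipelines typically use noun-phrase extraction and more careful score calibration.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple    # (x1, y1, x2, y2)
    label: str    # category predicted against a caption-derived vocabulary
    score: float  # detector confidence

def pseudo_label(detections, caption, score_thresh=0.5):
    """Keep detections that are confident AND whose label appears in the
    paired caption; everything else is discarded as noise. A substring
    check stands in for the noun-phrase matching used in practice."""
    caption = caption.lower()
    return [d for d in detections
            if d.score >= score_thresh and d.label.lower() in caption]

dets = [Detection((10, 10, 80, 90), "dog", 0.91),
        Detection((5, 5, 40, 40), "frisbee", 0.34),  # below the threshold
        Detection((0, 0, 30, 30), "cat", 0.88)]      # not in the caption
print(pseudo_label(dets, "A dog catching a frisbee in the park"))
```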
Prompt-based Approaches
- Some methods use prompting techniques to tap the language understanding of LLMs, for example using prompts to steer the detector toward novel object categories (a prompt-ensembling sketch follows this list). (Wu et al., 2023), (Li et al., 2024)
- Challenges include designing effective prompts and ensuring the detector can exploit the implicit knowledge captured by LLMs. (Li et al., 2024)
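A minimal sketch of prompt ensembling, one common way prompts feed into an open-vocabulary classifier: each category name is inserted into several templates, each prompt is embedded, and the averaged embedding becomes that category's classification weight. The templates and `encode_text` function here are placeholder assumptions; a real system would call an actual text encoder (e.g., CLIP's), or ask an LLM to generate richer category descriptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical prompt templates; real systems often ensemble dozens of these.
TEMPLATES = [
    "a photo of a {}.",
    "a close-up photo of a {}.",
    "a photo of a small {}.",
]

def encode_text(prompt: str) -> torch.Tensor:
    """Seeded placeholder for a real text encoder; returns a unit vector."""
    torch.manual_seed(hash(prompt) % (2**31))
    return F.normalize(torch.randn(512), dim=-1)

def classifier_weight(category: str) -> torch.Tensor:
    """Ensemble over templates: embed each prompt, average, renormalize.
    The result serves as the classification weight for this category."""
    embeds = torch.stack([encode_text(t.format(category)) for t in TEMPLATES])
    return F.normalize(embeds.mean(dim=0), dim=-1)

weights = torch.stack([classifier_weight(c) for c in ["zebra", "segway"]])
print(weights.shape)  # (num_categories, embed_dim)
```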
Evaluation and Benchmarks
- Open vocabulary object detection is typically evaluated on datasets like COCO and LVIS, which span a large number of object categories. Average precision (AP) is reported separately on base, novel, and all categories (see the evaluation sketch below). (Kim et al., 2024), (Cho et al., 2023)
- Benchmarks like OV-COCO and OV-LVIS have been proposed specifically for open vocabulary detection, with a focus on novel-category performance. (Li et al., 2024), (Zhao et al., 2022)
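The base/novel split can be evaluated with standard COCO tooling by restricting the evaluator to a subset of category ids, as in the sketch below. The file paths and the id lists are placeholder assumptions; the `pycocotools` calls themselves are the library's standard API.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: COCO-format ground truth and detector outputs.
coco_gt = COCO("annotations/instances_val.json")
coco_dt = coco_gt.loadRes("detections.json")

# Hypothetical split of category ids into base (seen) and novel (unseen).
base_ids = [1, 2, 3]
novel_ids = [4, 5]

def ap_for(cat_ids):
    """Standard COCO AP@[.50:.95], restricted to a subset of categories."""
    ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
    ev.params.catIds = cat_ids
    ev.evaluate()
    ev.accumulate()
    ev.summarize()
    return ev.stats[0]  # AP averaged over IoU thresholds

print("AP_base :", ap_for(base_ids))
print("AP_novel:", ap_for(novel_ids))  # the headline open-vocabulary metric
```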
Challenges and Future Directions
- Key challenges include effectively leveraging the rich semantic knowledge in LLMs, dealing with the long-tailed distribution of object categories, and scaling up training data and model size to achieve robust open vocabulary detection performance. (Minderer et al., 2023), (Ge et al., 2023)