How to augment open-vocabulary object detection with large language models?
Insight from top 10 papers
Augmenting Open Vocabulary Object Detection with Large Language Models
Overview of Open Vocabulary Object Detection
- Open vocabulary object detection (OVOD) aims to detect objects beyond a fixed set of predefined categories
- Conventional object detectors are limited to a closed vocabulary: they can detect only the fixed set of classes seen during training
- OVOD leverages large-scale vision-language models (VLMs) such as CLIP to enable detection of novel, unseen object classes (Minderer et al., 2023)
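The matching step behind this idea can be sketched in a few lines. The example below is a minimal illustration, not the method of any cited paper: it assumes each candidate region and each class name has already been mapped into a shared embedding space, and it uses hand-written 3-dimensional toy vectors in place of real VLM outputs. The names `classify_region`, `text_embs`, and the temperature value are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_region(region_emb, text_embs, temperature=0.01):
    """Score one region embedding against every class-name embedding.

    Returns the best label and a softmax distribution over all labels.
    The temperature is an assumed value chosen for illustration.
    """
    labels = list(text_embs)
    logits = [cosine(region_emb, text_embs[name]) / temperature for name in labels]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs

# Toy embeddings standing in for text-encoder outputs (assumption).
text_embs = {
    "cat": [1.0, 0.1, 0.0],
    "dog": [0.1, 1.0, 0.0],
    "unicycle": [0.0, 0.1, 1.0],  # a class added only at inference time
}
region_emb = [0.05, 0.2, 0.9]  # toy embedding of one detected region

label, probs = classify_region(region_emb, text_embs)
print(label, round(probs[label], 3))
```

Because the vocabulary is just a dictionary of text embeddings, adding a new class at inference time means adding one more entry, with no retraining of the classifier head.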
Challenges in OVOD
- Limited training data for novel object classes
- Difficulty in learning robust visual-linguistic associations for diverse object concepts
- Tendency for OVOD models to overfit to seen object classes and struggle with novel classes
Leveraging Large Language Models for OVOD
- Large pre-trained language models such as BERT and GPT can provide rich semantic knowledge about object concepts
- This knowledge can be used to augment OVOD models and improve their performance on novel object classes
Approaches to Leverage Language Models
Knowledge Distillation:
- Distill knowledge from pre-trained language models into OVOD models
- Language model embeddings can provide additional semantic information to improve object detection
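One common way to express such a distillation objective is to pull the detector's region embeddings toward teacher embeddings. The sketch below is an assumption-laden illustration rather than the loss used in any cited paper: it takes the teacher vectors to be language-model embeddings of the matching class names, and it uses a mean (1 - cosine similarity) penalty. The function names and toy vectors are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def distillation_loss(student_embs, teacher_embs):
    """Mean (1 - cosine similarity) over paired student/teacher vectors.

    student_embs: region embeddings produced by the detector (student).
    teacher_embs: embeddings of the matching concepts from a frozen
                  language model (teacher); assumed to be precomputed.
    """
    if len(student_embs) != len(teacher_embs):
        raise ValueError("student and teacher batches must be paired")
    total = 0.0
    for s, t in zip(student_embs, teacher_embs):
        s_hat, t_hat = l2_normalize(s), l2_normalize(t)
        total += 1.0 - sum(a * b for a, b in zip(s_hat, t_hat))
    return total / len(student_embs)

teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 0.5]]      # same directions as the teacher
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # orthogonal to the teacher

loss_aligned = distillation_loss(aligned, teacher)
loss_misaligned = distillation_loss(misaligned, teacher)
print(loss_aligned, loss_misaligned)
```

The loss depends only on direction, not magnitude, so the student is free to scale its embeddings; in a real training loop this term would typically be added to the usual detection losses.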
Retrieval-Augmented Generation:
- Use language models to retrieve textual information relevant to each object category
- The retrieved text gives the OVOD model additional context for object detection beyond the bare class name
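The retrieval step can be sketched as follows. This is a toy illustration under stated assumptions: the `corpus` list is a hypothetical stand-in for text obtained from a language model, and the retriever ranks snippets by simple word overlap (Jaccard similarity), where a real system would more likely compare learned embeddings. The prompt format in `augmented_query` is also an assumption.

```python
def tokenize(text):
    """Lowercase a string and split it into a set of words."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return set(cleaned.split())

def retrieve(query, corpus, k=2):
    """Return the k snippets with the highest Jaccard word overlap with the query."""
    query_tokens = tokenize(query)

    def score(snippet):
        snippet_tokens = tokenize(snippet)
        return len(query_tokens & snippet_tokens) / len(query_tokens | snippet_tokens)

    return sorted(corpus, key=score, reverse=True)[:k]

def augmented_query(class_name, corpus, k=1):
    """Append retrieved context to a plain class-name prompt."""
    context = retrieve(class_name, corpus, k=k)
    return f"a photo of a {class_name}. " + " ".join(context)

# Hypothetical knowledge snippets (assumption: produced offline by a language model).
corpus = [
    "A unicycle is a vehicle with a single wheel and a saddle.",
    "A bicycle has two wheels, pedals, and handlebars.",
    "A kettle is a metal container used to boil water.",
]

query = augmented_query("unicycle", corpus)
print(query)
```

The augmented string would then be passed to the detector's text encoder in place of the bare class name, so that the description of parts and attributes can inform the match.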
Prompt Engineering:
- Design prompts that leverage language model knowledge to improve OVOD performance
- Carefully crafted prompts can help OVOD models discover and leverage implicit background knowledge
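A simple form of prompt engineering is to embed each class name under several templates, average the results, and add an explicit background entry. The sketch below makes several assumptions: the templates are illustrative, `toy_encode` is a hypothetical letter-frequency encoder used only so the example runs without a real text model, and the background prompt is a fixed string rather than a learned one.

```python
import math

# Illustrative prompt templates (assumption, not taken from a specific paper).
TEMPLATES = [
    "a photo of a {}.",
    "a cropped photo of a {}.",
    "a {} in the scene.",
]

def toy_encode(text):
    """Hypothetical stand-in for a text encoder: unit-norm letter frequencies."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def class_embedding(name, templates=TEMPLATES):
    """Average the embeddings of all prompts for one class, then renormalize."""
    vecs = [toy_encode(t.format(name)) for t in templates]
    mean = [sum(col) / len(vecs) for col in zip(*vecs)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]

def build_classifier(class_names, background_prompt="background"):
    """Build one weight vector per class plus an explicit background entry."""
    weights = {name: class_embedding(name) for name in class_names}
    weights["__background__"] = toy_encode(background_prompt)
    return weights

classifier = build_classifier(["zebra", "kettle"])
print(sorted(classifier))
```

Averaging over templates reduces sensitivity to the wording of any single prompt, and the separate background entry gives regions that match no listed class somewhere to go.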
Source Papers (10)
DetCLIPv3: Towards Versatile Generative Open-Vocabulary Object Detection
Scaling Open-Vocabulary Object Detection
Localized Vision-Language Matching for Open-vocabulary Object Detection
YOLO-World: Real-Time Open-Vocabulary Object Detection
DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment
Open-Vocabulary Object Detection using Pseudo Caption Labels
Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection
Multi-Modal Classifiers for Open-Vocabulary Object Detection
Retrieval-Augmented Open-Vocabulary Object Detection
Aligning Bag of Regions for Open-Vocabulary Object Detection