End-to-End Semi-Supervised Object Detection

Limitations of Multi-Stage Semi-Supervised Object Detection

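Previous semi-supervised object detection methods use a multi-stage training scheme: an initial detector is trained on labeled data, used to generate pseudo-labels for the unlabeled data, and then the detector is retrained on those pseudo-labels (Xu et al., 2021)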

The final performance is limited by the quality of the pseudo-labels generated by the initial, inaccurate detector (Xu et al., 2021)

End-to-end semi-supervised learning can gradually improve the quality of pseudo-labels during training, and the more accurate pseudo-labels in turn benefit the object detection training (Xu et al., 2021)
End-to-end training allows the object detection model and pseudo-label generation to reinforce each other, leading to better performance compared to multi-stage approaches (Kallempudi et al., 2022)
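
The mutual reinforcement described above can be sketched with a teacher whose weights track the student. The exponential-moving-average update, the `ema_update` name, and the `momentum` value are assumptions for illustration. The text above does not state how the teacher is updated.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.999):
    """Move each teacher parameter a small step toward the student's value.

    `teacher` and `student` are dicts of parameter name -> numpy array.
    The EMA rule and the momentum value are illustrative assumptions.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }

# Toy parameters: the teacher starts at 0, the student has moved to 1.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student, momentum=0.9)
```

As the student improves on labeled and pseudo-labeled data, the teacher slowly inherits those improvements, so the pseudo-labels it produces in later iterations can become more accurate.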

The classification loss for each unlabeled bounding box is weighted by the classification score produced by the teacher network (Xu et al., 2021)
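
A minimal sketch of such a score-weighted classification loss follows. The function name, the cross-entropy form, and the normalisation by the sum of weights are assumptions made for illustration, not details taken from the cited paper.

```python
import numpy as np

def weighted_cls_loss(student_probs, pseudo_labels, teacher_scores, eps=1e-12):
    """Cross-entropy per unlabeled box, weighted by the teacher's score.

    student_probs:  (N, C) class probabilities predicted by the student
    pseudo_labels:  (N,)   class index assigned to each box
    teacher_scores: (N,)   teacher classification score used as the weight
    """
    student_probs = np.asarray(student_probs, dtype=float)
    pseudo_labels = np.asarray(pseudo_labels, dtype=int)
    weights = np.asarray(teacher_scores, dtype=float)

    # Probability the student assigns to each box's pseudo-label.
    picked = student_probs[np.arange(len(pseudo_labels)), pseudo_labels]
    per_box = -np.log(picked + eps)

    # Assumed normalisation: divide by the total weight so the loss scale
    # does not depend on how many boxes the teacher trusts.
    return float((weights * per_box).sum() / max(weights.sum(), eps))

loss = weighted_cls_loss(
    student_probs=[[0.9, 0.1], [0.5, 0.5]],
    pseudo_labels=[0, 1],
    teacher_scores=[1.0, 0.0],  # the second box is ignored entirely
)
```

Boxes the teacher scores low contribute little to the loss, so an unreliable pseudo-label does less damage than it would under a hard threshold.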

Box jittering is used to select reliable pseudo boxes for the learning of box regression (Xu et al., 2021)
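
One way to picture this selection is to jitter a pseudo box several times, let the teacher refine each jittered copy, and keep the box only if the refined results agree. The sketch below assumes this reading. `refine_fn` is a hypothetical stand-in for the teacher's box refinement, and the jitter scale, the normalisation, and the threshold are illustrative values.

```python
import numpy as np

def regression_spread(box, refine_fn, n_jitter=10, scale=0.06, rng=None):
    """Spread of the refined boxes obtained from jittered copies of `box`.

    box:       (x1, y1, x2, y2)
    refine_fn: hypothetical callable mapping a jittered box to a refined box
    Note: with rng=None a fixed seed is used, so every call sees the same offsets.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    box = np.asarray(box, dtype=float)
    w, h = box[2] - box[0], box[3] - box[1]
    size = np.maximum(np.array([w, h, w, h]), 1e-6)  # guard degenerate boxes

    # Shift each coordinate by a random offset proportional to the box size.
    offsets = rng.uniform(-scale, scale, size=(n_jitter, 4)) * size
    refined = np.stack([np.asarray(refine_fn(box + o), dtype=float) for o in offsets])

    # Standard deviation per coordinate, normalised by half the box size,
    # then averaged over the four coordinates.
    return float((refined.std(axis=0) / (0.5 * size)).mean())

def select_reliable(boxes, refine_fn, max_spread=0.02):
    """Keep only the boxes whose refined copies agree closely."""
    return [b for b in boxes if regression_spread(b, refine_fn) < max_spread]

# Two toy refiners: one always snaps to the same box, one leaves the jitter in.
stable = lambda b: np.array([10.0, 10.0, 50.0, 50.0])
unstable = lambda b: b
box = [10.0, 10.0, 50.0, 50.0]
kept = select_reliable([box], stable)
```

A box whose refinements converge regardless of the jitter is treated as well localised, and only such boxes are used as regression targets.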

Adaptive thresholding mechanisms help the network select high-quality pseudo bounding boxes, addressing issues such as high false-negative rates and low precision (Kar et al., 2023)
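
The text above does not specify how the threshold adapts, so the following is only a generic sketch of the idea and not the mechanism of the cited work. It assumes a per-class threshold taken from a quantile of the teacher's scores for that class. The function names and the `quantile`, `floor`, and `ceil` parameters are hypothetical.

```python
import numpy as np

def adaptive_thresholds(scores, labels, num_classes, quantile=0.5, floor=0.3, ceil=0.9):
    """Per-class score thresholds derived from the teacher's own predictions.

    Classes with no predictions fall back to the strictest threshold (`ceil`).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.full(num_classes, ceil, dtype=float)
    for c in range(num_classes):
        cls_scores = scores[labels == c]
        if cls_scores.size:
            thresholds[c] = np.clip(np.quantile(cls_scores, quantile), floor, ceil)
    return thresholds

def filter_boxes(scores, labels, thresholds):
    """Boolean mask of boxes whose score reaches their class threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    return scores >= thresholds[labels]

scores = [0.95, 0.85, 0.4, 0.2]
labels = [0, 0, 1, 1]  # class 1 is predicted with lower confidence
thresholds = adaptive_thresholds(scores, labels, num_classes=2)
keep = filter_boxes(scores, labels, thresholds)
```

In this toy case a single fixed threshold of 0.9 would discard every class 1 box, whereas the per-class threshold keeps the stronger of the two, which is the kind of false-negative reduction the adaptive scheme is aiming for.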

Jitter-Bagging provides accurate information on localization to help refine the bounding boxes (Kar et al., 2023)

Feeding both strongly and weakly augmented data to the teacher network generates more robust pseudo-labels, which helps in detecting small and complex objects (Kar et al., 2023)

End-to-end semi-supervised object detection approaches outperform previous multi-stage methods by a large margin under various labeling ratios (1%, 5%, 10%) on the COCO benchmark (Xu et al., 2021)
The proposed end-to-end approach can improve a 40.9 mAP baseline detector trained using the full COCO training set by +3.6 mAP, reaching 44.5 mAP, by leveraging the 123K unlabeled images of COCO (Xu et al., 2021)
On the state-of-the-art Swin Transformer-based object detector (58.9 mAP), the end-to-end semi-supervised approach still improves detection accuracy by +1.5 mAP, reaching 60.4 mAP (Xu et al., 2021)

SSVOD: Semi-Supervised Video Object Detection with Sparse Annotations

Revisiting Class Imbalance for End-to-end Semi-Supervised Object Detection