Long-Tailed 3D Detection via Multi-Modal Late-Fusion

1Zhejiang University
2Carnegie Mellon University
3Zhejiang Lab
4Anthropic
5University of Macau

*Indicates Equal Contribution

Abstract

Contemporary autonomous vehicle (AV) benchmarks have advanced techniques for training 3D detectors, particularly on large-scale multi-modal (LiDAR + RGB) data. Surprisingly, although semantic class labels naturally follow a long-tailed distribution, existing benchmarks focus only on a few common classes (e.g., pedestrian and car) and neglect many rare but crucial classes (e.g., emergency vehicle and stroller). However, AVs must reliably detect both common and rare classes for safe operation in the open world. We address this challenge by formally studying the problem of Long-Tailed 3D Detection (LT3D), which evaluates all annotated classes, including those in the tail. We address LT3D with hierarchical losses that promote feature sharing across classes, and introduce diagnostic metrics that award partial credit to "reasonable" mistakes with respect to the semantic hierarchy (e.g., mistaking a child for an adult). Further, we point out that rare-class accuracy is particularly improved via multi-modal late fusion (MMLF) of independently trained uni-modal LiDAR and RGB detectors. Importantly, such an MMLF framework allows us to leverage large-scale uni-modal datasets (with more examples for rare classes) to train better uni-modal detectors, unlike prevailing end-to-end trained multi-modal detectors that require paired multi-modal data. Finally, we examine three critical components of our simple MMLF approach from first principles, investigating whether to train 2D or 3D RGB detectors for fusion, whether to match RGB and LiDAR detections in 3D or on the projected 2D image plane, and how to fuse matched detections. Extensive experiments reveal that 2D RGB detectors achieve better recognition accuracy for rare classes than 3D RGB detectors, and that matching on the 2D image plane mitigates depth estimation errors. Our proposed MMLF approach significantly improves LT3D performance over prior work, particularly improving rare-class performance from 12.8 to 20.0 mAP! Our code and models are available on our project page.

Long-Tailed 3D Benchmarking Protocol

According to the histogram of per-class object counts (on the left), the nuScenes benchmark focuses on the common classes in cyan (e.g., car and barrier) but ignores rare ones in red (e.g., stroller and debris). In fact, the benchmark creates a superclass pedestrian by grouping multiple classes in dark green, including the common class adult and several rare classes (e.g., child and police-officer); this complicates the analysis of detection performance, as pedestrian performance is dominated by adult. Moreover, the ignored superclass pushable-pullable also contains diverse objects such as shopping-cart, dolly, luggage, and trash-can, as shown in the top row (on the right). We argue that AVs should also detect rare classes, as they can affect AV behavior. Following "Large-scale long-tailed recognition in an open world", we report performance for three groups of classes based on their cardinality (split by dotted lines): Many, Medium, and Few.
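Concretely, the Many/Medium/Few split can be reproduced by binning classes by their annotation counts. The sketch below illustrates the idea; the thresholds and function name are placeholders, not the benchmark's exact cutoffs.

from collections import Counter

def split_by_cardinality(class_names, many_thresh=50_000, medium_thresh=5_000):
    """Bin classes into Many / Medium / Few groups by annotation count.
    Thresholds here are illustrative placeholders."""
    counts = Counter(class_names)  # one class name per ground-truth annotation
    groups = {"Many": [], "Medium": [], "Few": []}
    for cls, n in counts.items():
        if n >= many_thresh:
            groups["Many"].append(cls)
        elif n >= medium_thresh:
            groups["Medium"].append(cls)
        else:
            groups["Few"].append(cls)
    return groups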

We highlight common classes in white and rare classes in gold. The standard nuScenes benchmark handles rare classes in one of two ways: (1) ignoring them (e.g., stroller and pushable-pullable), or (2) grouping them into coarse-grained classes (e.g., adult, child, construction-worker, and police-officer are grouped as pedestrian). Since the pedestrian class is dominated by adult, the standard benchmarking protocol masks the challenge of detecting rare classes like child and police-officer.

LT3D Methods: The Devil Is In The Details

Grouping-Free Detector Head

We leverage the semantic hierarchy defined in the nuScenes dataset to train LT3D detectors by predicting class labels at multiple levels of the hierarchy for an object: its fine-grained label (e.g., child), its coarse class (e.g., pedestrian), and the root-level class object. This means that the final vocabulary of classes is no longer mutually exclusive, so we use a sigmoid focal loss that learns separate spatial heatmaps for each class.
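As a minimal sketch (not the released training code), the multi-level targets can be built as multi-hot vectors over the fine, coarse, and root labels and scored with a sigmoid focal loss. The hierarchy dict below is an illustrative subset, and per-object logits stand in for the dense spatial heatmaps used by the actual detector head.

import torch
from torchvision.ops import sigmoid_focal_loss

# Fine-grained class -> (coarse class, root class); illustrative subset of the hierarchy.
HIERARCHY = {
    "adult":          ("pedestrian", "object"),
    "child":          ("pedestrian", "object"),
    "police-officer": ("pedestrian", "object"),
    "car":            ("vehicle", "object"),
    "truck":          ("vehicle", "object"),
}
CLASSES = sorted({c for fine, (coarse, root) in HIERARCHY.items() for c in (fine, coarse, root)})
IDX = {c: i for i, c in enumerate(CLASSES)}

def multihot_target(fine_label):
    """Activate the fine, coarse, and root labels for one ground-truth object."""
    target = torch.zeros(len(CLASSES))
    coarse, root = HIERARCHY[fine_label]
    for c in (fine_label, coarse, root):
        target[IDX[c]] = 1.0
    return target

# Per-object class logits stand in for the detector's per-class spatial heatmaps.
logits = torch.randn(2, len(CLASSES))
targets = torch.stack([multihot_target("child"), multihot_target("truck")])
loss = sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0, reduction="mean")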

Multimodal Filtering in 3D for Detection Fusion

Multi-Modal Filtering (MMF) effectively removes high-scoring false-positive LiDAR detections. The green boxes are ground-truth strollers, while the blue boxes are stroller detections from the LiDAR-based detector CenterPoint (left) and RGB-based detector FCOS3D (mid). The final filtered result removes LiDAR detections not within m meters of any RGB detection, shown in the red region, and keeps all other LiDAR detections, shown in the white region (right).
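A minimal sketch of this filtering step is shown below, assuming detections carry bird's-eye-view centers (in meters) and class names. The function name, data layout, restriction to same-class RGB detections, and default radius are illustrative assumptions standing in for the paper's distance threshold m.

import numpy as np

def multimodal_filter(lidar_dets, rgb_dets, radius_m=4.0):
    """Keep a LiDAR detection only if an RGB detection of the same class lies
    within radius_m meters of its bird's-eye-view (BEV) center."""
    kept = []
    for det in lidar_dets:  # det: {"xy": np.ndarray of shape (2,), "name": str, ...}
        centers = [r["xy"] for r in rgb_dets if r["name"] == det["name"]]
        if centers:
            dists = np.linalg.norm(np.stack(centers) - det["xy"], axis=1)
            if dists.min() <= radius_m:
                kept.append(det)
    return kept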

Delving into Multi-Modal Late-Fusion for LT3D

We examine three key components in the multi-modal late-fusion (MMLF) of uni-modal RGB and LiDAR detectors from first principles: A. whether to train 2D or 3D RGB detectors, B. whether to match uni-modal detections on the 2D image plane or in the 3D bird's-eye-view (BEV), and C. how to best fuse matched detections. Our exploration reveals that using 2D RGB detectors, matching on the 2D image plane, and combining calibrated scores with Bayesian fusion yields state-of-the-art LT3D performance.
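For component C, one standard form of probabilistic (Bayesian) fusion, assuming the calibrated per-modality class posteriors are conditionally independent given the class $y$, is

$$p(y \mid x_{\text{LiDAR}}, x_{\text{RGB}}) \;\propto\; \frac{p(y \mid x_{\text{LiDAR}})\, p(y \mid x_{\text{RGB}})}{p(y)},$$

where $p(y)$ is the class prior; with a uniform prior, this reduces to multiplying calibrated scores and renormalizing. This is a sketch of the general form, and the exact formulation used in the paper may differ (e.g., in its choice of prior).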

Our multi-modal late-fusion approach takes 3D LiDAR and 2D RGB detections as input, matches 2D RGB and (projected) 3D LiDAR detections on the image plane, and fuses matched predictions with score calibration and probabilistic ensembling to produce 3D detections.
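The sketch below illustrates the matching step under simplifying assumptions: 3D LiDAR boxes are projected into the image with a placeholder project_box_to_image helper (a hypothetical function, not part of our released code), greedily paired with 2D RGB boxes by image-plane IoU, and matched pairs are then handed to the score-calibration and probabilistic-ensembling step above.

import numpy as np

def iou_2d(a, b):
    """IoU of two axis-aligned image boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def match_on_image_plane(lidar_dets, rgb_dets, camera, iou_thresh=0.5):
    """Greedily pair projected 3D LiDAR boxes with 2D RGB boxes by image-plane IoU.
    Matching is class-agnostic here so that RGB evidence can relabel a LiDAR detection."""
    matches, used = [], set()
    for i, ldet in enumerate(lidar_dets):
        box2d = project_box_to_image(ldet["box3d"], camera)  # placeholder projection helper
        cands = [(iou_2d(box2d, r["box2d"]), j)
                 for j, r in enumerate(rgb_dets) if j not in used]
        if cands:
            best_iou, j = max(cands)
            if best_iou >= iou_thresh:
                matches.append((i, j))
                used.add(j)
    return matches  # unmatched detections are handled by the fusion rule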

Visualizations

Three examples demonstrate how our multi-modal late-fusion (MMLF) approach improves LT3D by ensembling 2D RGB detections (from DINO) and 3D LiDAR detections (from CenterPoint). In all examples, MMLF correctly relabels detections that are geometrically similar (w.r.t. size and shape) in LiDAR but visually distinct in RGB, such as bus-vs-truck, adult-vs-stroller, and adult-vs-child.

The video demo shows that our method performs significantly better than CMT.

Benchmarking Results

Benchmarking results on nuScenes.

Comparison with the Argoverse 2 state-of-the-art.

BibTeX

@article{yechi2024long-tailed,
  title={Long-Tailed 3D Detection via Multi-Modal Late-Fusion},
  author={Ma, Yechi and Peri, Neehar and Wei, Shuoquan and Dave, Achal and Hua, Wei and Li, Yanan and Ramanan, Deva and Kong, Shu},
  journal={arXiv preprint arXiv:2312.10986},
  year={2024}
}

@inproceedings{peri2022towards,
  title={Towards Long Tailed 3D Detection},
  author={Peri, Neehar and Dave, Achal and Ramanan, Deva and Kong, Shu},
  booktitle={The Conference on Robot Learning (CoRL)},
  year={2022}
}