Uni3DL: A unified model for 3D vision-language understanding

Text - Accepted Version
Restricted to Repository staff only until 31 October 2025.

It is advisable to refer to the publisher's version if you intend to cite from this work. See Guidance on citing.

Li, X. (ORCID: https://orcid.org/0000-0002-9946-7000), Ding, J., Chen, Z. and Elhoseiny, M. (2024) Uni3DL: A unified model for 3D vision-language understanding. In: ECCV 2024, 29 Sep - 4 Oct 2024, Milan, Italy, pp. 74-92. doi: 10.1007/978-3-031-73337-6_5

Abstract/Summary

We present Uni3DL, a unified model for 3D Vision-Language understanding. Distinct from existing unified 3D vision-language models that mostly rely on projected multi-view images and support limited tasks, Uni3DL operates directly on point clouds and significantly broadens the spectrum of tasks in the 3D domain, encompassing both vision and vision-language tasks. At the core of Uni3DL, a query transformer is designed to learn task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router is employed to selectively produce task-specific outputs required for diverse tasks. With a unified architecture, our Uni3DL model enjoys seamless task decomposition and substantial parameter sharing across tasks. Uni3DL has been rigorously evaluated across diverse 3D vision-language understanding tasks, including semantic segmentation, object detection, instance segmentation, visual grounding, 3D captioning, and text-3D cross-modal retrieval. It demonstrates performance on par with or surpassing state-of-the-art (SOTA) task-specific models. We hope our benchmark and Uni3DL model will serve as a solid step to ease future research in unified models in the realm of 3D vision-language understanding. Project page: https://uni3dl.github.io/.

Item Type: Conference or Workshop Item (Paper)
URI: https://reading-clone.eprints-hosting.org/id/eprint/119818
Identification Number/DOI: 10.1007/978-3-031-73337-6_5
Refereed: Yes
Divisions: No Reading authors. Back catalogue items; Science > School of Mathematical, Physical and Computational Sciences > Department of Computer Science
Publisher: Springer Nature Switzerland