I am a third-year Master’s student in the College of Computer Science and Technology at Zhejiang University, under the supervision of Professor Zhou Zhao. I also completed my Bachelor’s degree at Zhejiang University.

My research focuses on Multi-modal Learning, 3D Scene Understanding, and Embodied AI. Currently, I am an intern at OpenRobotLab, advised by Yilun Chen and Jiangmiao Pang, where I work on enhancing the multi-modal perception and reasoning capabilities of robot policies.

📝 Publications

NeurIPS 2024

Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers.

Haifeng Huang, Yilun Chen, Zehan Wang, Rongjie Huang, Runsen Xu, Tai Wang, Luping Liu, Xize Cheng, Yang Zhao, Jiangmiao Pang, Zhou Zhao

  • Chat-Scene is a 3D LLM that processes both point clouds and multi-view images for 3D scene understanding, excelling in tasks such as 3D grounding, captioning, and question answering.
  • Ranked 1st (as of Sep. 2024) on the ScanRefer grounding benchmark and the Scan2Cap captioning benchmark.
arXiv 2024

Grounded 3D-LLM with Referent Tokens.

Yilun Chen*, Shuai Yang*, Haifeng Huang*, Tai Wang, Ruiyuan Lyu, Runsen Xu, Dahua Lin, Jiangmiao Pang

  • Grounded 3D-LLM establishes correspondences between 3D scenes and language phrases through referent tokens.
  • Introduces a large-scale, phrase-level grounded scene caption dataset.
arXiv 2023

Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes.

Zehan Wang*, Haifeng Huang*, Yang Zhao, Ziang Zhang, Zhou Zhao

  • Chat-3D is one of the first 3D LLMs.
ICCV 2023

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding.

Zehan Wang*, Haifeng Huang*, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao

  • The first weakly-supervised 3D visual grounding method.
NeurIPS 2022

Towards Effective Multi-modal Interchanges in Zero-resource Sounding Object Localization.

Yang Zhao*, Chen Zhang*, Haifeng Huang*, Haoyuan Li, Zhou Zhao

  • A zero-resource approach to sounding object localization that requires no training data from this domain.