Matrix3D: Large Photogrammetry Model All-in-One

arXiv 2025

1Nanjing University, 2Apple, 3The Hong Kong University of Science and Technology
*Equal Contribution  Corresponding Author


TL;DR: We present Matrix3D, a single unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis.

Abstract

We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis, within a single model. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The key to Matrix3D's large-scale multi-modal training lies in the incorporation of a mask learning strategy. This enables full-modality model training even with partially complete data, such as bi-modality image-pose and image-depth pairs, thus significantly increasing the pool of available training data. Matrix3D demonstrates state-of-the-art performance in pose estimation and novel view synthesis tasks. Additionally, it offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation.

Compositional Inference Pipeline for Hybrid Tasks


An example of using Matrix3D for single- or few-shot reconstruction. Before 3DGS optimization, we complete the input set via pose estimation, depth estimation, and novel view synthesis, all performed by the same model.

Example: Unposed 3D Reconstruction

Matrix3D enables hybrid tasks such as unposed sparse-view 3D reconstruction by composing several sub-tasks.
Users can generate novel-view RGB and depth observations along splined camera trajectories, which can then be passed to a 3DGS reconstruction pipeline for the final reconstruction, as sketched below.
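
A rough sketch of how these sub-tasks compose is given below. Here `infer` and `reconstruct_3dgs` are hypothetical placeholder callables standing in for Matrix3D inference and a 3DGS optimizer, not the released API:

```python
# Hypothetical composition sketch; `infer` and `reconstruct_3dgs` are
# illustrative placeholders, not the released Matrix3D interface.
def unposed_reconstruction(images, trajectory, infer, reconstruct_3dgs):
    # 1) Pose estimation: mask the poses, condition on the input images.
    poses = infer(images=images, predict="pose")
    # 2) Depth prediction: mask the depths, condition on images + poses.
    depths = infer(images=images, poses=poses, predict="depth")
    # 3) Novel view synthesis: RGB-D at each pose of the splined trajectory.
    novel = [infer(images=images, poses=poses, target_pose=p, predict="rgbd")
             for p in trajectory]
    # 4) Optimize 3D Gaussians on the completed, fully posed observation set.
    return reconstruct_3dgs(
        rgbs=images + [n["rgb"] for n in novel],
        poses=poses + list(trajectory),
        depths=depths + [n["depth"] for n in novel],
    )
```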


How does it work?


We train Matrix3D with masked learning: multi-modal data are randomly masked via noise corruption, clean observations are fed into the encoder, and noisy maps into the decoder. The model learns to denoise the corrupted maps regardless of their modality, so different tasks can be expressed simply as different masking patterns at inference time.
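
As a concrete reading of this scheme, below is a minimal PyTorch sketch of one masked-denoising training step. The batch layout, mask probability, and model interface are our illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn.functional as F

def masked_training_step(model, batch, alphas_cumprod, p_mask=0.5):
    """One simplified masked-denoising step.

    batch: dict of modality tensors, e.g. {"rgb": ..., "pose": ..., "depth": ...}.
    alphas_cumprod: 1-D tensor of DDPM cumulative noise-schedule products.
    Masked modalities are noise-corrupted and become decoder targets;
    unmasked ones stay clean and serve as encoder observations.
    """
    t = torch.randint(len(alphas_cumprod), ())
    a_bar = alphas_cumprod[t]
    observations, noisy, eps_true = {}, {}, {}
    for name, x in batch.items():
        if torch.rand(()) < p_mask:            # masked -> denoising target
            eps = torch.randn_like(x)
            noisy[name] = a_bar.sqrt() * x + (1 - a_bar).sqrt() * eps
            eps_true[name] = eps
        else:                                  # unmasked -> clean observation
            observations[name] = x
    if not noisy:                              # ensure at least one target exists
        return torch.zeros(())
    eps_pred = model(observations, noisy, t)   # DiT predicts noise per target
    return sum(F.mse_loss(eps_pred[n], eps_true[n]) for n in noisy)
```

Because the loss is agnostic to which modalities were masked, the same weights can be trained on image-pose pairs, image-depth pairs, or fully multi-modal data.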

Sub-Task Examples

Depth Prediction

Matrix3D enables depth prediction from images and poses, and the predicted depth maps can be unprojected into point clouds, as sketched below.
Here we show depth prediction results from 3-view posed images.
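
For reference, a minimal NumPy sketch of the unprojection step, assuming a pinhole intrinsics matrix K and a camera-to-world pose; these conventions are our assumption, not necessarily the paper's:

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """depth: (H, W) metric depth map; K: (3, 3) pinhole intrinsics;
    cam_to_world: (4, 4) camera pose. Returns (H*W, 3) world-space points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                  # camera-space rays
    pts_cam = rays * depth.reshape(-1, 1)                            # scale rays by depth
    pts_h = np.concatenate([pts_cam, np.ones((H * W, 1))], axis=1)   # homogeneous points
    return (pts_h @ cam_to_world.T)[:, :3]                           # transform to world
```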


Pose Estimation

Matrix3D predicts camera poses from sparse-view images. Colored poses denote predictions, while black denotes ground-truth poses.
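
As one way to quantify the gap between the colored predictions and the black ground truth, here is a sketch of the pairwise relative-rotation error, a common convention for sparse-view pose evaluation (not necessarily the paper's exact protocol):

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def relative_rotation_errors(Rs_pred, Rs_gt):
    """Pairwise relative-rotation errors, invariant to a global rotation."""
    errs = []
    n = len(Rs_gt)
    for i in range(n):
        for j in range(i + 1, n):
            rel_pred = Rs_pred[i].T @ Rs_pred[j]
            rel_gt = Rs_gt[i].T @ Rs_gt[j]
            errs.append(rotation_error_deg(rel_pred, rel_gt))
    return errs
```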

Novel View Synthesis

Given any input views, our method is able to synthesize novel views at arbitrary poses.
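
In the masked-inference framing above, novel view synthesis keeps the input images and poses fixed as clean observations and iteratively denoises the target image. A simplified deterministic (DDIM-style) sampling loop might look like this; the model interface and conditioning layout are assumptions for illustration:

```python
import torch

@torch.no_grad()
def synthesize_novel_view(model, observations, target_pose, shape, alphas_cumprod):
    """Simplified DDIM-style (eta=0) sampling: the input images and poses stay
    fixed as clean observations; only the target image is denoised."""
    x = torch.randn(shape)                                  # start from pure noise
    cond = {**observations, "target_pose": target_pose}     # target camera as condition
    for t in reversed(range(len(alphas_cumprod))):
        a_bar = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        eps = model(cond, {"rgb": x}, t)["rgb"]             # predicted noise
        x0 = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()  # predicted clean image
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic update
    return x
```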