M³: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM

1 Shanghai Jiao Tong University, 2 Shanghai Artificial Intelligence Laboratory, 3 Fudan University, 4 Shanghai Innovation Institute, 5 Zhejiang University, 6 Beijing Institute of Technology, 7 The Chinese University of Hong Kong, 8 The University of Hong Kong
† denotes corresponding authors.

TL;DR: M³ is a Monocular Gaussian Splatting SLAM with a Multi-view foundation model for dense Matching.

Our framework achieves efficient pose estimation and reconstruction on both indoor and outdoor real-world videos.



Method Overview

Pipeline of M³. Our framework consists of joint tracking and global optimization for pose estimation, and a mapper for scene reconstruction. For monocular sequences, Pi3X processes retrieved historical keyframes together with new frames in a single inference pass to facilitate factor-graph construction and keyframe selection, as sketched below. Following the Neural Gaussian and LOD architecture of ARTDECO, Gaussians are initialized via the Laplacian norm and optimized jointly with camera poses.
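To make the single-pass tracking step concrete, here is a minimal sketch of how retrieved keyframes and a new frame could be batched into one multi-view inference, with factor-graph edges and the keyframe decision derived from the result. The `MultiViewModel` interface, `track_step`, the covisibility proxy, and all thresholds are hypothetical stand-ins for illustration, not the released M³ or Pi3X API.

```python
import torch

# Hypothetical stand-in for a Pi3X-style multi-view foundation model:
# maps a batch of images to per-view point maps, confidences, and poses.
# This is an assumed interface, NOT the actual Pi3X signature.
class MultiViewModel(torch.nn.Module):
    def forward(self, images):
        # images: (V, 3, H, W)
        V, _, H, W = images.shape
        points = torch.zeros(V, H, W, 3)      # per-pixel 3D points (dummy)
        conf = torch.rand(V, H, W)            # per-pixel confidence (dummy)
        poses = torch.eye(4).repeat(V, 1, 1)  # camera-to-world 4x4 (dummy)
        return points, conf, poses


def track_step(model, keyframes, new_frame, covis_thresh=0.3):
    """Run retrieved historical keyframes and the new frame through the
    model in one inference pass, then derive factor-graph edges and a
    keyframe decision from a simple covisibility proxy."""
    batch = torch.stack(keyframes + [new_frame])  # (V, 3, H, W)
    points, conf, poses = model(batch)

    new_idx = len(keyframes)
    edges = []
    for i in range(new_idx):
        # Covisibility proxy: mean joint confidence between keyframe i and
        # the new frame (a simplification of dense-matching overlap).
        covis = torch.minimum(conf[i], conf[new_idx]).mean().item()
        if covis > covis_thresh:
            edges.append((i, new_idx))  # connect the pair in the factor graph

    # Promote to keyframe when the new frame overlaps few existing ones.
    is_keyframe = len(edges) <= len(keyframes) // 2
    return poses[new_idx], edges, is_keyframe


if __name__ == "__main__":
    model = MultiViewModel()
    kfs = [torch.rand(3, 64, 64) for _ in range(4)]
    pose, edges, is_kf = track_step(model, kfs, torch.rand(3, 64, 64))
    print(pose.shape, edges, is_kf)
```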

Main Results

Qualitative rendering comparisons against on-the-fly reconstruction baselines across diverse datasets. M³ preserves high-fidelity rendering details in challenging environments, particularly in the regions highlighted by white rectangles.

Motion-aware Reconstruction

By detecting dynamic regions, the Motion Map enables moving objects to be excluded from the static reconstruction, thereby improving structural consistency in dynamic scenes.
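As a rough illustration of the idea behind the Motion Map, the sketch below flags dynamic pixels via a photometric-residual test between the static-map rendering and the observed frame, then drops them from the static reconstruction. The residual test, threshold, and dilation cleanup are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def motion_map(rendered, observed, thresh=0.1, kernel=5):
    """Flag pixels whose photometric residual between the static-map
    rendering and the observed frame is large; those regions are treated
    as moving objects and excluded from the static reconstruction.
    Both inputs are (3, H, W) tensors with values in [0, 1]."""
    residual = (rendered - observed).abs().mean(dim=0)  # (H, W) error
    mask = (residual > thresh).float()
    # Dilate so object boundaries are fully covered (illustrative cleanup).
    mask = F.max_pool2d(mask[None, None], kernel, stride=1,
                        padding=kernel // 2)[0, 0]
    return mask.bool()  # True = dynamic pixel

if __name__ == "__main__":
    rendered = torch.rand(3, 64, 64)
    observed = rendered.clone()
    observed[:, 20:40, 20:40] += 0.5          # simulate a moving object
    mask = motion_map(rendered, observed.clamp(0, 1))
    static = observed[:, ~mask]               # keep static pixels, (3, N)
    print(mask.float().mean().item(), static.shape)
```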

Comparison with Feed-forward Methods

Qualitative rendering comparisons against feed-forward Gaussian Splatting methods. M³ consistently preserves fine-grained visual details.