M³: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM

1 Shanghai Jiao Tong University, 2 Shanghai Artificial Intelligence Laboratory, 3 Fudan University, 4 Shanghai Innovation Institute, 5 Zhejiang University, 6 Beijing Institute of Technology, 7 The Chinese University of Hong Kong, 8 The University of Hong Kong
† denotes corresponding authors.

TL;DR: M³ is a Monocular Gaussian Splatting SLAM with a Multi-view foundation model for dense Matching.

Our framework achieves efficient pose estimation and reconstruction on both indoor and outdoor real-world videos.



Method Overview

Pipeline of M³. Our framework consists of joint tracking and global optimization for pose estimation, and a mapper for scene reconstruction. For monocular sequences, Pi3X processes retrieved historical keyframes together with new frames in a single inference pass to facilitate factor-graph construction and keyframe selection, as sketched below. Following the Neural Gaussian and LOD architecture of ARTDECO, Gaussians are initialized via the Laplacian norm and optimized jointly with camera poses.
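To make the single-pass tracking step concrete, here is a minimal sketch of how retrieved keyframes and a new frame could be batched into one multi-view inference, with factor-graph edges and the keyframe decision derived from the result. The `MultiViewModel` interface, `track_step`, the covisibility proxy, and all thresholds are hypothetical stand-ins for illustration, not the released M³ or Pi3X API.

```python
import torch

# Hypothetical stand-in for a Pi3X-style multi-view foundation model:
# maps a batch of images to per-view point maps, confidences, and poses.
# This is an assumed interface, NOT the actual Pi3X signature.
class MultiViewModel(torch.nn.Module):
    def forward(self, images):
        # images: (V, 3, H, W)
        V, _, H, W = images.shape
        points = torch.zeros(V, H, W, 3)      # per-pixel 3D points (dummy)
        conf = torch.rand(V, H, W)            # per-pixel confidence (dummy)
        poses = torch.eye(4).repeat(V, 1, 1)  # camera-to-world 4x4 (dummy)
        return points, conf, poses


def track_step(model, keyframes, new_frame, covis_thresh=0.3):
    """Run retrieved historical keyframes and the new frame through the
    model in one inference pass, then derive factor-graph edges and a
    keyframe decision from a simple covisibility proxy."""
    batch = torch.stack(keyframes + [new_frame])  # (V, 3, H, W)
    points, conf, poses = model(batch)

    new_idx = len(keyframes)
    edges = []
    for i in range(new_idx):
        # Covisibility proxy: mean joint confidence between keyframe i and
        # the new frame (a simplification of dense-matching overlap).
        covis = torch.minimum(conf[i], conf[new_idx]).mean().item()
        if covis > covis_thresh:
            edges.append((i, new_idx))  # connect the pair in the factor graph

    # Promote to keyframe when the new frame overlaps few existing ones.
    is_keyframe = len(edges) <= len(keyframes) // 2
    return poses[new_idx], edges, is_keyframe


if __name__ == "__main__":
    model = MultiViewModel()
    kfs = [torch.rand(3, 64, 64) for _ in range(4)]
    pose, edges, is_kf = track_step(model, kfs, torch.rand(3, 64, 64))
    print(pose.shape, edges, is_kf)
```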

Main Results

Qualitative rendering comparisons against on-the-fly reconstruction baselines across diverse datasets. M³ preserves high-fidelity rendering details in challenging environments, particularly in the regions highlighted by white rectangles.

Motion-aware Reconstruction

By detecting dynamic regions, the Motion Map enables moving objects to be excluded from the static reconstruction, thereby improving structural consistency in dynamic scenes.
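As a rough illustration of the idea behind the Motion Map, the sketch below flags dynamic pixels via a photometric-residual test between the static-map rendering and the observed frame, then drops them from the static reconstruction. The residual test, threshold, and dilation cleanup are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def motion_map(rendered, observed, thresh=0.1, kernel=5):
    """Flag pixels whose photometric residual between the static-map
    rendering and the observed frame is large; those regions are treated
    as moving objects and excluded from the static reconstruction.
    Both inputs are (3, H, W) tensors with values in [0, 1]."""
    residual = (rendered - observed).abs().mean(dim=0)  # (H, W) error
    mask = (residual > thresh).float()
    # Dilate so object boundaries are fully covered (illustrative cleanup).
    mask = F.max_pool2d(mask[None, None], kernel, stride=1,
                        padding=kernel // 2)[0, 0]
    return mask.bool()  # True = dynamic pixel

if __name__ == "__main__":
    rendered = torch.rand(3, 64, 64)
    observed = rendered.clone()
    observed[:, 20:40, 20:40] += 0.5          # simulate a moving object
    mask = motion_map(rendered, observed.clamp(0, 1))
    static = observed[:, ~mask]               # keep static pixels, (3, N)
    print(mask.float().mean().item(), static.shape)
```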

Comparison with Feed-forward Methods

Qualitative rendering comparisons against feed-forward Gaussian Splatting methods. M³ consistently preserves fine-grained visual details.