AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views

1The University of Science and Technology of China, 2Shanghai Artificial Intelligence Laboratory, 3The Chinese University of Hong Kong, 4Brown University, 5Shanghai Jiao Tong University, 6The University of Hong Kong

*Denotes Equal Contribution, Alphabetical Order

TL;DR: We introduce AnySplat, a feed‑forward network for novel‑view synthesis from uncalibrated image collections in both sparse‑ and dense‑view scenarios.


Abstract


We introduce AnySplat, a feed‑forward network for novel‑view synthesis from uncalibrated image collections. In contrast to traditional neural‑rendering pipelines that demand known camera poses and per‑scene optimization, and to recent feed‑forward methods that buckle under the computational weight of dense views, our model predicts everything in one shot. A single forward pass yields a set of 3D Gaussian primitives encoding both scene geometry and appearance, together with the corresponding camera intrinsics and extrinsics for each input image. This unified design scales effortlessly to casually captured, multi‑view datasets without any pose annotations. In extensive zero‑shot evaluations, AnySplat matches the quality of pose‑aware baselines in both sparse‑ and dense‑view scenarios while surpassing existing pose‑free approaches. Moreover, it greatly reduces rendering latency compared to optimization‑based neural fields, bringing real‑time novel‑view synthesis within reach for unconstrained capture settings.

Overview of AnySplat


Starting from a set of uncalibrated images, a transformer-based geometry encoder is followed by three decoder heads, \(\mathrm{F}_G\), \(\mathrm{F}_D\), and \(\mathrm{F}_C\), which respectively predict the Gaussian parameters (\(\boldsymbol{\mu}, \sigma, \boldsymbol{r}, \boldsymbol{s}, \boldsymbol{c}\)), the depth map \(D\), and the camera poses \(p\). These outputs are used to construct a set of pixel-wise 3D Gaussians, which is then voxelized into per-voxel 3D Gaussians with the proposed Differentiable Voxelization module (see the sketch below). From the voxelized 3D Gaussians, multi-view images and depth maps are rendered. The rendered images are supervised with an RGB loss against the ground-truth images, while the rendered depth maps, together with the decoded depth \(D\) and camera poses \(p\), are used to compute geometry losses. These geometry terms are supervised by pseudo-geometry priors (\(\tilde{D}, \tilde{p}\)) obtained from the pretrained VGGT.
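
To make the voxelization step concrete, here is a minimal PyTorch-style sketch (not the authors' implementation) of how pixel-wise Gaussians can be merged into per-voxel Gaussians with a differentiable scatter-mean, so that gradients still reach every input Gaussian. The function name `voxelize_gaussians`, the `voxel_size` value, and all tensor shapes are illustrative assumptions.

```python
import torch

def voxelize_gaussians(means, feats, voxel_size=0.05):
    """Merge pixel-wise Gaussians that fall into the same voxel.

    means: (N, 3) Gaussian centers; feats: (N, C) remaining parameters
    (e.g. opacity, rotation, scale, color) concatenated per Gaussian.
    Returns per-voxel means and features, one entry per occupied voxel.
    """
    # Integer voxel coordinate for every Gaussian center.
    coords = torch.floor(means / voxel_size).long()                     # (N, 3)
    # Collapse the 3D coordinates into one id per occupied voxel.
    uniq, inverse = torch.unique(coords, dim=0, return_inverse=True)
    num_voxels = uniq.shape[0]

    # Scatter-mean of centers and features into their voxels; index_add_
    # is differentiable w.r.t. the source tensors, so gradients flow back
    # to the pixel-wise Gaussians.
    ones = torch.ones(means.shape[0], 1, device=means.device)
    counts = torch.zeros(num_voxels, 1, device=means.device).index_add_(0, inverse, ones)
    voxel_means = torch.zeros(num_voxels, 3, device=means.device).index_add_(0, inverse, means) / counts
    voxel_feats = torch.zeros(num_voxels, feats.shape[1], device=means.device).index_add_(0, inverse, feats) / counts
    return voxel_means, voxel_feats

if __name__ == "__main__":
    # Toy example: 4 views of 64x64 pixels, 11 extra parameters per Gaussian.
    n = 4 * 64 * 64
    means = torch.randn(n, 3, requires_grad=True)
    feats = torch.randn(n, 11, requires_grad=True)
    v_means, v_feats = voxelize_gaussians(means, feats)
    print(v_means.shape, v_feats.shape)        # far fewer Gaussians than n
    # Gradients propagate through the scatter-mean to every input Gaussian.
    (v_means.sum() + v_feats.sum()).backward()
    print(means.grad.shape, feats.grad.shape)
```

In practice the aggregation rule per parameter (for example, opacity-weighted averaging of colors) may differ from a plain mean; the sketch only illustrates that grouping pixel-wise Gaussians into voxels and aggregating them can be kept differentiable end-to-end.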

Results


Example visualization of AnySplat reconstruction and novel-view synthesis. From top to bottom, the number of input images increases from extremely sparse to medium and dense captures, while the scene scale grows from object-centric setups through mid-scale trajectories to large-scale indoor and outdoor environments. For each setting, we display the input views, the reconstructed 3D Gaussians, the corresponding ground-truth renderings, and example novel-view renderings.


Qualitative comparisons against baseline methods: for sparse-view inputs, we benchmark against the state-of-the-art FLARE and NoPoSplat; for dense-view inputs, we include 3DGS and Mip-Splatting as representative comparisons.

Zero-Shot Inference Results

Re10K (2 views)

BungeeNeRF (8 views)

DTU (8 views)

LLFF (8 views)

MatrixCity / LERF / fillbusters / Tanks and Temples (32 views)

Eyeful Tower (64 views)

KITTI-360 / Horizon-GS / Zip-NeRF (64 views)