Grid-guided Neural Radiance Fields for Large Urban Scenes

The Chinese University of Hong Kong1           Shanghai Artificial Intelligence Laboratory2          
Max Planck Institute for Informatics3           Zhejiang University4           Adobe Research5          
*denotes equal contribution



Target Scenes. In this work, we perform large urban scene rendering with novel grid-guided neural radiance fields. An example of a large urban scene is shown on the left, spanning over 2.7 km² of ground area and captured by over 5k drone images. We show that the rendering results from NeRF-based methods are blurry and overly smooth due to limited model capacity, while feature grid-based methods tend to display noisy artifacts when scaled to large scenes with high-resolution feature grids. Our proposed two-branch model combines the merits of both approaches and achieves photorealistic novel view renderings with remarkable improvements over existing methods. Both branches gain significant enhancements over their individual baselines.

Urban Roaming Experience

Example Results on Real-world Urban Scenes. The long trajectory of rendered novel views from our model delivers an immersive city-roaming experience.



Overview of GridNeRF. Our method consists of two branches, namely the grid branch and the NeRF branch, highlighted in the right boxes. 1) At the pre-train stage, we quickly capture the scene with a pyramid of feature planes: we coarsely sample ray points and predict their radiance values through a shallow MLP renderer (grid branch), supervised by an MSE loss on the volumetrically integrated pixel colors. This step yields a set of informative multi-resolution density/appearance feature plane pyramids, shown in the middle. 2) At the joint learning stage, we perform a finer sampling and use the pre-trained feature grid to guide the NeRF branch to concentrate samples near the scene surface. Grid features for the sampled points are obtained by bilinear interpolation on the feature planes, concatenated with the positional encoding, and fed to the NeRF branch to predict volume density and color. Note that the grid branch remains supervised by the ground-truth images, alongside NeRF's fine-rendering results.
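The grid-feature lookup feeding the NeRF branch can be illustrated with a minimal numpy sketch. This is an assumption-laden toy version: a single random feature plane stands in for the learned multi-resolution pyramid, `positional_encoding`, `bilinear_sample`, and all shapes are hypothetical choices for illustration, not the paper's implementation.

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    # Standard NeRF-style encoding: [sin(2^k x), cos(2^k x)] per frequency band.
    freqs = 2.0 ** np.arange(num_freqs)              # (num_freqs,)
    angles = x[..., None] * freqs                    # (..., dims, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)            # (..., dims * 2 * num_freqs)

def bilinear_sample(plane, uv):
    # plane: (H, W, C) feature plane; uv: (N, 2) coordinates in [0, 1].
    H, W, _ = plane.shape
    x, y = uv[:, 0] * (W - 1), uv[:, 1] * (H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = (x - x0)[:, None], (y - y0)[:, None]
    return (plane[y0, x0] * (1 - wx) * (1 - wy) +    # weighted mix of the
            plane[y0, x1] * wx * (1 - wy) +          # four surrounding cells
            plane[y1, x0] * (1 - wx) * wy +
            plane[y1, x1] * wx * wy)                 # (N, C)

# Hypothetical single-level plane; the method uses multi-resolution pyramids.
rng = np.random.default_rng(0)
plane = rng.standard_normal((64, 64, 8))
pts = rng.uniform(size=(16, 3))                      # sampled ray points in [0,1]^3
grid_feat = bilinear_sample(plane, pts[:, :2])       # project onto the ground plane
nerf_input = np.concatenate([grid_feat, positional_encoding(pts)], axis=-1)
print(nerf_input.shape)                              # -> (16, 32)
```

The concatenated vector (grid feature plus positional encoding) is what the NeRF branch's MLP would consume per sample point; in the full method this lookup is repeated across every level of the density and appearance pyramids.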

Ground Feature Maps


Refined grid feature maps. Visualization of one feature component in the (a) density and (b) appearance feature planes (Residential scene). Compared to the pre-trained feature planes, the rectified ones are less noisy; sharper edges and the regular shapes of grouped objects can also be clearly identified. Since density and appearance features are learned independently, they encode different information describing the scene; the appearance features can capture environmental effects such as shadows, as shown in (b).


Grid Branch Outputs. Qualitative comparison of rendering results using features learned (a) at a moderate grid resolution (2048^2), (b) at a high grid resolution (4096^2), and (c) from the rectified grid branch at resolution 4096^2. Although a higher grid resolution leads to better visual quality, adding NeRF supervision pushes the quality one step further toward photorealism.

Two-Branch Outputs

Rendering from the two branches. Without a global continuity prior, renderings from the grid branch tend to contain noisy floaters in the air and lack 3D consistency.