We introduce MV-CoLight, a two-stage framework for illumination-consistent object compositing in both 2D images and 3D scenes. Our novel feed-forward architecture models lighting and shadows directly, avoiding the iterative biases of diffusion-based methods. We employ a Hilbert curve-based mapping to seamlessly align 2D image inputs with 3D Gaussian scene representations. To facilitate training and evaluation, we further introduce a large-scale 3D compositing dataset. Experiments demonstrate state-of-the-art harmonization results on standard benchmarks and our dataset, while casually captured real-world scenes further demonstrate the framework's robustness and broad generalization.
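For intuition, a Hilbert curve-based mapping can be realized with the standard distance-to-coordinate conversion. The sketch below is our illustrative reading, not the paper's released code: it assigns each Gaussian, ordered along the curve, a locality-preserving pixel in a 2D grid, so that 2D network layers can process per-Gaussian features (the grid side `n` is an assumed parameter).

```python
def hilbert_d2xy(n, d):
    """Map curve distance d to (x, y) on an n x n grid (n a power of two)."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate/flip the quadrant if needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Nearby curve distances map to nearby pixels, so serializing Gaussians
# along the curve preserves spatial locality in the resulting 2D layout.
n = 64                                    # grid side (assumed; must be 2**k)
pixel_of_gaussian = [hilbert_d2xy(n, d) for d in range(n * n)]
```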
Overview of MV-CoLight. (a) We insert a white puppy as the composite object onto the table between the basketballs, and render multi-view inharmonious images, background-only images, and depth maps along a camera trajectory moving from distant to close-up positions. (b) We feed single-view data into the 2D object compositing model, which processes it through multiple Swin Transformer blocks to output the harmonized result. (c) We project the multi-view features from the 2D model into Gaussian space via \(\Phi(\cdot)\), combine them with the original inharmonious Gaussian colors projected into 2D Gaussian color space through \(\Psi(\cdot)\), and then feed them into the 3D object compositing model. The model outputs harmonized Gaussian colors and computes a rendering loss by incorporating the Gaussian shape attributes.
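As a rough illustration of the 3D stage's data flow, a minimal PyTorch sketch follows. All shapes are assumptions, the sampling-based \(\Phi\) and grid-scatter \(\Psi\) are our stand-ins for the paper's operators, and a linear head substitutes for the actual compositing model:

```python
import torch
import torch.nn.functional as F

G, C, H, W = 4096, 32, 64, 64        # Gaussians, feature dim, grid size (assumed)
feat_2d = torch.randn(C, H, W)       # 2D-model features for one view
uv = torch.rand(G, 2) * 2 - 1        # Gaussians' projected pixel coords in [-1, 1]
inharm_rgb = torch.rand(G, 3)        # original inharmonious Gaussian colors

# Phi stand-in: lift 2D features to per-Gaussian features by bilinearly
# sampling each feature map at the Gaussians' projected locations.
gauss_feat = F.grid_sample(feat_2d[None], uv.view(1, 1, G, 2),
                           align_corners=True)     # (1, C, 1, G)
gauss_feat = gauss_feat.view(C, G).T               # (G, C)

# Psi stand-in: scatter per-Gaussian colors onto a 2D grid (e.g., in Hilbert
# order) so 2D layers can consume them alongside the lifted features.
order = torch.randperm(H * W)[:G]                  # stand-in for the Hilbert order
color_grid = torch.zeros(3, H * W)
color_grid[:, order] = inharm_rgb.T
color_grid = color_grid.view(3, H, W)

# The 3D compositing model maps the combined inputs to harmonized Gaussian
# colors; a linear head is used here purely as a placeholder.
head = torch.nn.Linear(C + 3, 3)
harmonized_rgb = head(torch.cat([gauss_feat, inharm_rgb], dim=1)).sigmoid()

# Training would then splat harmonized_rgb with the (fixed) Gaussian shape
# attributes and apply a photometric rendering loss against target views.
```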
Visualization of the DTC-MultiLight dataset. We showcase rendered results of diverse scenes created using objects from the DTC dataset within the Blender engine, highlighting multi-view perspectives and varying lighting conditions.
Single-view qualitative comparisons. Compared to the baselines, our method generates coherent illumination and plausible shadows while decoupling highlights from the inserted objects.
Multi-view qualitative comparisons. Our approach synthesizes multi-view consistent illumination and shadows while strictly preserving the original scene geometry, scale, and object placement.
Real-world scene visualization. We evaluate our method on real-world scenes and achieve both color harmonization and realistic lighting/shadow generation.
Visual results of inserting a luminous object. Our method simulates the illumination effects of luminous spheres within the scene environment.
Multi-view visualization. Our method meticulously simulates the emission effects of inserted light sources, their illumination on surrounding objects, and shadows.