AssetField: Assets Mining and Reconfiguration in
Ground Feature Plane Representation
Both indoor and outdoor environments are inherently structured and repetitive. Traditional modeling pipelines keep an asset library storing unique object templates, which is both versatile and memory efficient in practice. Inspired by this observation, we propose AssetField, a novel neural scene representation that learns a set of object-aware ground feature planes to represent the scene, where an asset library storing template feature patches can be constructed in an unsupervised manner. Unlike existing methods that require object masks to query spatial points for object editing, our ground feature plane representation offers a natural visualization of the scene in the bird's-eye view, allowing a variety of operations (e.g. translation, duplication, deformation) on objects to configure a new scene. With the template feature patches, group editing is enabled for scenes with many recurring items, avoiding repetitive work on individual objects. We show that AssetField not only achieves competitive performance for novel-view synthesis but also generates realistic renderings for new scene configurations.
Use AssetField to Create your own city!
Editing two city scenes collected from Google Earth ©2023 Google. AssetField is versatile: users can operate directly on the ground feature plane, which supports both within-scene and cross-scene editing and produces realistic rendering results. Left: Novel view rendering of the original Colosseum (Rome) and Sagrada Familia (Barcelona) scenes. Right: Two Colosseums of different sizes are inserted into a park scene; the Colosseum and Sagrada Familia are inserted into a park scene.
Overview of AssetField. (a) We demonstrate on a scene without background for clearer visuals. (b) The proposed ground feature plane representation factorizes a neural field into a horizontal feature plane and a vertical feature axis. (c) We further integrate the color and semantic fields into a 2D neural plane, which is decoded into 3D-aware features with geometry guidance from the scene density. The inferred RGB-DINO plane is rich in object appearance and semantic clues whilst being less sensitive to vertical displacement between objects, so on it we can (d) detect assets and group them into categories. (e) For each category, we select a template object and store its density and color ground feature patches in the asset library. A cross-scene asset library can be constructed by letting different scenes fit their own ground feature planes whilst sharing the same vertical feature axes and decoders/renderers.
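The factorization in (b) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the grid sizes, and in particular the elementwise product used to combine the plane feature with the axis feature are assumptions made for illustration only.

```python
import numpy as np

def bilinear_sample(plane, x, y):
    """Bilinearly sample an (H, W, C) feature plane at continuous grid coords (x, y)."""
    h, w, _ = plane.shape
    x = float(np.clip(x, 0, w - 1))
    y = float(np.clip(y, 0, h - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    tx, ty = x - x0, y - y0
    top = (1 - tx) * plane[y0, x0] + tx * plane[y0, x1]
    bottom = (1 - tx) * plane[y1, x0] + tx * plane[y1, x1]
    return (1 - ty) * top + ty * bottom

def linear_sample(axis, z):
    """Linearly sample a (D, C) vertical feature axis at continuous grid coord z."""
    d, _ = axis.shape
    z = float(np.clip(z, 0, d - 1))
    z0 = int(np.floor(z))
    z1 = min(z0 + 1, d - 1)
    tz = z - z0
    return (1 - tz) * axis[z0] + tz * axis[z1]

def query_feature(plane, axis, x, y, z):
    """Feature of a 3D point from a horizontal plane and a vertical axis.

    Assumption: the two factors are combined by an elementwise product;
    the actual combination rule is not stated in this page.
    """
    return bilinear_sample(plane, x, y) * linear_sample(axis, z)

# Hypothetical sizes: an 8x8 ground plane and a 16-sample vertical axis, 4 channels.
rng = np.random.default_rng(0)
plane = rng.normal(size=(8, 8, 4))
axis = rng.normal(size=(16, 4))
feat = query_feature(plane, axis, 2.5, 3.0, 7.25)
```

In this sketch the per-point feature `feat` would then be passed to a decoder; because the vertical axis is a separate array, several scenes could each hold their own `plane` while reusing one `axis`.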
Example Results on Synthetic Scenes
# Delete Instances
# Move Instances
# Manipulation (rescaling & translation)
# Manipulation (rotation & deletion)
Results of asset mining and scene editing with AssetField. (a) Our approach learns informative density and RGB-DINO ground feature planes that support object detection and categorization. (b) With joint training, an asset library can be constructed by storing ground feature plane patches of the radiance field (we show label patches here for easy visualization). (c) The proposed ground plane representation provides an explicit visualization of the scene configuration, which can be directly manipulated by users. The altered ground feature plane is then fed to the global MLP renderer along with the shared z-axis feature to render views of the novel scenes. Basic operations such as object removal, translation, rotation and rescaling are demonstrated on the right.
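Because the scene layout is explicit on the ground plane, the basic edits above reduce to array operations on feature patches. The sketch below shows deletion, translation and duplication on a toy plane; the helper names, the box convention `(row0, col0, row1, col1)` and the constant background fill are assumptions for illustration, not part of AssetField's code.

```python
import numpy as np

def crop(plane, box):
    """Copy the feature patch inside box = (row0, col0, row1, col1)."""
    r0, c0, r1, c1 = box
    return plane[r0:r1, c0:c1].copy()

def paste(plane, patch, top_left):
    """Return a new plane with `patch` written at `top_left` = (row, col)."""
    r, c = top_left
    h, w = patch.shape[:2]
    out = plane.copy()
    out[r:r + h, c:c + w] = patch
    return out

def delete(plane, box, fill):
    """Return a new plane with the box overwritten by a background feature."""
    r0, c0, r1, c1 = box
    out = plane.copy()
    out[r0:r1, c0:c1] = fill
    return out

def translate(plane, box, new_top_left, fill):
    """Move an object patch: erase it at the old location, paste at the new one."""
    patch = crop(plane, box)
    return paste(delete(plane, box, fill), patch, new_top_left)

def duplicate(plane, box, new_top_left):
    """Copy an object patch to a second location, keeping the original."""
    return paste(plane, crop(plane, box), new_top_left)

# Toy 6x6 plane with 2 feature channels; a 2x2 "object" has feature value 1.
plane = np.zeros((6, 6, 2))
plane[1:3, 1:3] = 1.0
box = (1, 1, 3, 3)

moved = translate(plane, box, (3, 3), fill=0.0)
copied = duplicate(plane, box, (3, 3))
```

In a full pipeline the same edit would presumably be applied to every ground feature plane that describes the object (for example both the density and the color plane) before rendering.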
Example Results on Real-world Scenes
Example edits on the Mip-NeRF360 and DoNeRF datasets. Objects are identified on the RGB-DINO plane and then replicated in the scene. Video left: Novel view rendering of the original scene; Video right: We replicate the metal ball in the back.
Object removal on the toydesk scene from Object-NeRF. Objects are first identified on each ground feature plane and then substituted by table feature patches. We simply ‘crop’ a feature patch from the table region and ‘paste’ it onto the object regions. Note that our method can also remove the shadow along with the object, whereas Object-NeRF cannot.
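The crop-and-paste removal can be sketched as follows. Tiling the table patch so that it covers an object region larger than the patch is an assumption added here for the sketch, as are the function name and box convention `(row0, col0, row1, col1)`; the page only states that a table patch is cropped and pasted.

```python
import numpy as np

def remove_object(plane, object_box, table_box):
    """Overwrite the object region with features cropped from the table region."""
    r0, c0, r1, c1 = object_box
    tr0, tc0, tr1, tc1 = table_box
    patch = plane[tr0:tr1, tc0:tc1]
    h, w = r1 - r0, c1 - c0
    # Assumption: repeat the table patch (ceil division) until it covers the region.
    reps = (-(-h // patch.shape[0]), -(-w // patch.shape[1]), 1)
    tiled = np.tile(patch, reps)[:h, :w]
    out = plane.copy()
    out[r0:r1, c0:c1] = tiled
    return out

# Toy 8x8 plane, 3 channels: table features are 0.5, the object is 2.0.
plane = np.full((8, 8, 3), 0.5)
plane[2:5, 2:6] = 2.0
edited = remove_object(plane, object_box=(2, 2, 5, 6), table_box=(0, 0, 2, 2))
```

If the object box is drawn to include the shadow footprint on the plane, the shadow features are overwritten together with the object.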
Cross-scene object insertion. We insert the plastic bowl from toydesk1 into toydesk2. Vertical translation of the bowl is realized by jointly modeling the aligned/lifted/sunken-toydesk1 with toydesk2, and composing the new scene with feature patches inferred at different scene elevations.