Simulating deformable objects under rich interactions remains a fundamental challenge for real-to-sim robot manipulation, with dynamics jointly driven by environmental effects and robot actions. Existing simulators rely on predefined physics or data-driven dynamics without robot-conditioned control, limiting accuracy, stability, and generalization. This paper presents SoMA, a 3D Gaussian Splat simulator for soft-body manipulation. SoMA couples deformable dynamics, environmental forces, and robot joint actions in a unified latent neural space for end-to-end real-to-sim simulation. Modeling interactions over learned Gaussian splats enables controllable, stable long-horizon manipulation and generalization beyond observed trajectories without predefined physical models. SoMA improves resimulation accuracy and generalization on real-world robot manipulation by 20%, enabling stable simulation of complex tasks such as long-horizon cloth folding.
SoMA takes RGB observations and robot joint-space actions collected from real-world manipulation as input (Left). It reconstructs deformable objects as hierarchical Gaussian splats, and propagates them through a neural simulator with supervision from rendering and dynamics (Middle). Object motion is driven by force-based interactions, where environmental and robot-induced forces act on splats to produce deformation (Right). A two-stage multi-resolution training strategy first captures global motion with large temporal gaps and then refines fine-grained dynamics under occlusion and contact using small gaps.
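The two-stage multi-resolution strategy can be illustrated with a minimal sketch of how training frame pairs might be sampled: large temporal gaps in the first stage, small gaps in the second. This is not the authors' implementation. The function names, the stage labels `"coarse"` and `"fine"`, and the gap values `(8, 16)` and `(1, 2)` are all assumptions chosen for illustration.

```python
import random


def sample_training_pair(num_frames, stage, rng,
                         coarse_gaps=(8, 16), fine_gaps=(1, 2)):
    """Return (t, t + gap) frame indices for one training step.

    The gap values are hypothetical defaults, not values from the paper.
    """
    if stage == "coarse":
        gaps = coarse_gaps  # stage 1: large gaps to capture global motion
    elif stage == "fine":
        gaps = fine_gaps    # stage 2: small gaps to refine contact dynamics
    else:
        raise ValueError(f"unknown stage: {stage!r}")

    # Keep only gaps that fit inside the trajectory.
    valid = [g for g in gaps if g < num_frames]
    if not valid:
        raise ValueError("trajectory too short for the requested gaps")

    gap = rng.choice(valid)
    start = rng.randrange(num_frames - gap)  # guarantees start + gap < num_frames
    return start, start + gap


def build_schedule(num_frames, coarse_steps, fine_steps, seed=0):
    """List (stage, t_start, t_end) tuples: all coarse steps, then all fine steps."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(coarse_steps):
        schedule.append(("coarse",) + sample_training_pair(num_frames, "coarse", rng))
    for _ in range(fine_steps):
        schedule.append(("fine",) + sample_training_pair(num_frames, "fine", rng))
    return schedule
```

In such a setup, each `(t_start, t_end)` pair would select the frames between which the simulator is rolled out and supervised; the rollout and loss themselves are outside the scope of this sketch.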
Qualitative resimulation and generalization under robot manipulation. Left: resimulation on training trajectories. Right: generalization to unseen robot actions and contact configurations. Across diverse soft-body objects, including near-linear (rope), near-planar (cloth), and volumetric (doll) objects, SoMA produces stable, long-horizon simulations that closely match the observed dynamics. PhysTwin deviates under complex or unseen interactions due to real-to-sim mismatch, while GausSim often remains static or becomes unstable in challenging scenarios.
Multi-view results. Our method maintains consistent 3D geometric structure and visually accurate, physically plausible dynamics across both the main and side views, demonstrating strong viewpoint-consistent simulation.
Quantitative evaluation on resimulation and generalization under robot manipulation. We report performance comparisons with PhysTwin and GausSim across image-based and depth-based metrics. Our method achieves the best results across all metrics, demonstrating robust and accurate real-to-sim simulation.
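The caption does not name the individual metrics. As a hedged illustration of what one image-based and one depth-based measure could look like, the sketch below computes PSNR over flattened pixel values and a masked mean absolute depth error. Both choices, and the function names `psnr` and `mean_abs_depth_error`, are assumptions and may differ from the metrics actually reported.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio (dB) between two flat lists of pixel values."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_abs_depth_error(pred_depth, target_depth, valid_mask):
    """Mean absolute depth difference over pixels where valid_mask is true."""
    pairs = [(p, t)
             for p, t, m in zip(pred_depth, target_depth, valid_mask) if m]
    if not pairs:
        raise ValueError("no valid depth pixels")
    return sum(abs(p - t) for p, t in pairs) / len(pairs)
```

The mask is there because real depth sensors typically return invalid readings for some pixels, which should be excluded from the average.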