Monocular Real-to-Sim Scene Programming

Programming Interactive Scenes from Monocular Images for Embodied Simulation

NeoWorld-Pro transforms a single RGB image into executable, simulation-ready interactive scenes with programmable geometry, articulation, physical properties, and scene layout.

Yumeng He, Yichen Song, Xiaotian Yang, Weijia Zhang, Zanwei Zhou, Junru Gong, Xiaokang Yang, Yunbo Wang

Shanghai Jiao Tong University

Abstract

The advancement of Embodied AI requires high-quality simulation assets that faithfully mirror the physical world. NeoWorld-Pro reformulates monocular scene reconstruction as a procedural programming task for interactive 3D environments. It leverages multimodal large language models to translate a single RGB image into executable programs defining object geometry, articulation, physical properties, and scene layout.

To ensure simulation readiness, NeoWorld-Pro introduces a physics-in-the-loop mechanism that executes generated programs in a physics engine and iteratively refines them using simulation feedback. The resulting scenes support stable stacking, fine-grained manipulation, and articulated object interactions, which open-loop reconstruction pipelines struggle to deliver.
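The closed-loop idea described above can be illustrated with a minimal sketch. Everything here is hypothetical and only for illustration: `Feedback`, `refine_until_stable`, `toy_simulate`, and `toy_refine` are stand-in names, the "program" is reduced to a dictionary of box heights, and the toy checks replace a real physics engine and a real multimodal model. It is not the NeoWorld-Pro implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical scene "program": object name -> (center height z, half-height h).
Program = Dict[str, Tuple[float, float]]


@dataclass
class Feedback:
    """Hypothetical simulation report: stability flag plus readable issues."""
    stable: bool
    issues: List[str]


def refine_until_stable(
    program: Program,
    simulate: Callable[[Program], Feedback],
    refine: Callable[[Program, Feedback], Program],
    max_rounds: int = 5,
) -> Tuple[Program, Feedback, int]:
    """Execute the program, feed issues back to the refiner, and stop
    once the simulation reports a stable scene or the budget runs out."""
    feedback = simulate(program)
    rounds = 0
    while not feedback.stable and rounds < max_rounds:
        program = refine(program, feedback)
        feedback = simulate(program)
        rounds += 1
    return program, feedback, rounds


def toy_simulate(program: Program) -> Feedback:
    # Toy stand-in for a physics engine: a box penetrates the ground
    # plane (z = 0) when its center is lower than its half-height.
    issues = [f"{name} penetrates ground" for name, (z, h) in program.items() if z < h]
    return Feedback(stable=not issues, issues=issues)


def toy_refine(program: Program, feedback: Feedback) -> Program:
    # Toy stand-in for model-driven refinement: lift each box so that
    # it rests on the ground instead of intersecting it.
    return {name: (max(z, h), h) for name, (z, h) in program.items()}


scene = {"mug": (0.02, 0.05), "table": (0.40, 0.40)}
final_scene, final_feedback, rounds_used = refine_until_stable(
    scene, toy_simulate, toy_refine
)
print(final_feedback.stable, rounds_used, final_scene["mug"])
```

In this toy case the mug starts below the ground plane, one refinement round lifts it, and the loop exits. An open-loop pipeline would correspond to returning `program` without ever calling `simulate`.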

Interactive Demo

Inspect generated assets and actuate their articulated parts in the browser.

We are actively adding more demos.

Dataset and Benchmark

A multi-object simulation benchmark for physically executable scenes.

NeoWorld-Pro is evaluated on PartNet-Mobility and a newly constructed synthetic scene benchmark designed to stress-test monocular reconstruction in physically realistic multi-object environments. The benchmark tasks cover placement, assembly, articulation, and interaction with task-relevant affordances.

Across 84 downstream manipulation tasks, NeoWorld-Pro achieves a 92.85% task success rate.

100 object categories
80 articulated categories
30 USD-format scenes
84 downstream tasks

Figure: Examples of NeoWorld-Pro benchmark scenes with their corresponding manipulation tasks.

Results

Closed-loop programming improves reconstruction, articulation, and downstream interaction.

Object-Level Appearance and Geometry

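Column groups: Appearance Evaluation (Image) = SSIM, LPIPS, FID, KID×100; Similarity = CLIP, Uni3D; Geometry Evaluation = CD×10, F@0.01, F@0.05, F@0.1.

| Method | SSIM↑ | LPIPS↓ | FID↓ | KID×100↓ | CLIP↑ | Uni3D↑ | CD×10↓ | F@0.01↑ | F@0.05↑ | F@0.1↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Articulate-Anything | 0.7646 | 0.2930 | 142.46 | 2.3116 | 0.7056 | 0.2218 | 0.2588 | 12.94 | 49.14 | 70.69 |
| PhysX-Anything | 0.7657 | 0.2991 | 93.42 | 0.4032 | 0.7937 | 0.1921 | 0.3744 | 10.14 | 41.16 | 62.65 |
| NeoWorld-Pro | 0.8398 | 0.1864 | 72.59 | 0.2878 | 0.8125 | 0.3522 | 0.2065 | 25.77 | 58.31 | 75.47 |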
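For readers unfamiliar with the geometry columns, the sketch below shows one common way to compute a Chamfer distance (CD) and an F-score at threshold τ (F@τ) between two point sets. The exact convention used on this page is not stated, so this is an assumption: variants differ in whether distances are squared, whether the two directions are summed or averaged, and how points are sampled. The function names and the tiny point sets are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def nearest_distances(src: List[Point], dst: List[Point]) -> List[float]:
    """For each point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(pred: List[Point], gt: List[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both
    directions, summed (assumed convention, unsquared distances)."""
    d_pred_to_gt = nearest_distances(pred, gt)
    d_gt_to_pred = nearest_distances(gt, pred)
    return sum(d_pred_to_gt) / len(pred) + sum(d_gt_to_pred) / len(gt)


def f_score(pred: List[Point], gt: List[Point], tau: float) -> float:
    """Harmonic mean of precision and recall, where a point counts as
    matched if its nearest neighbor in the other set lies within tau."""
    precision = sum(d <= tau for d in nearest_distances(pred, gt)) / len(pred)
    recall = sum(d <= tau for d in nearest_distances(gt, pred)) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Tiny hypothetical point sets: one exact match, one point offset by 0.2.
pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.2)]

cd = chamfer_distance(pred_points, gt_points)
f_tight = f_score(pred_points, gt_points, tau=0.1)
f_loose = f_score(pred_points, gt_points, tau=0.25)
print(cd, f_tight, f_loose)
```

A tighter threshold only rewards near-exact surface agreement, which is why F@0.01 values in the table are much lower than F@0.1 for every method. The table appears to report F-scores as percentages.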

Articulation and Kinematics

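| Method | Total | #Pred | #Hit | Miss↓ | Axis↓ | Pivot↓ | Type↓ |
|---|---|---|---|---|---|---|---|
| Articulate-Anything | 256 | 175 | 153 | 40% | 47.70 | 1.54 | 23.62% |
| PhysX-Anything | 256 | 188 | 136 | 47% | 26.91 | 1.49 | 26.17% |
| NeoWorld-Pro | 256 | 305 | 238 | 7% | 16.50 | 1.09 | 5.63% |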
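The Miss column is not defined on this page, but its values are consistent with the fraction of ground-truth joints that were not hit, (Total - #Hit) / Total. The short check below works under that assumption. The `miss_rate` helper is hypothetical, and the numbers are copied from the table above.

```python
def miss_rate(total: int, hits: int) -> float:
    """Fraction of ground-truth joints without a matching prediction
    (assumed definition: (total - hits) / total)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - hits) / total


# (Total, #Hit) pairs taken from the articulation table.
rows = {
    "Articulate-Anything": (256, 153),
    "PhysX-Anything": (256, 136),
    "NeoWorld-Pro": (256, 238),
}

for method, (total, hits) in rows.items():
    print(f"{method}: {miss_rate(total, hits):.0%}")
```

Rounded to whole percentages, this gives 40%, 47%, and 7%, matching the reported column.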

Citation

BibTeX

@misc{he2026NeoWorld-Pro,
  title  = {NeoWorld-Pro: Programming Interactive Scenes from Monocular Images for Embodied Simulation},
  author = {He, Yumeng and Song, Yichen and Yang, Xiaotian and Zhang, Weijia and Zhou, Zanwei and Gong, Junru and Yang, Xiaokang and Wang, Yunbo},
  year   = {2026},
  note   = {Project page}
}