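RoboDream
Compositional World Models for Scalable Robot Data Synthesis

RoboDream generates robot demonstrations with novel objects, in novel scenes, and from novel views via compositional synthesis.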

Abstract

Scaling robot learning requires large-scale, diverse demonstrations, yet real-world data collection via teleoperation remains prohibitively expensive and time-consuming. While video diffusion models offer a promising avenue for data scaling, existing generative approaches are often limited to superficial visual augmentation, or suffer from embodiment hallucinations that yield physically infeasible motions. We present RoboDream, a generalizable embodiment-centric world model that achieves scalable data generation by synthesizing photorealistic demonstrations with novel objects, in novel scenes, and from novel viewpoints. Our approach anchors generation to rendered robot motion while conditioning on explicit scene and object priors, effectively decoupling trajectory execution from environment synthesis.

This formulation unlocks two data scaling capabilities: (1) retrieval and rebirth, which repurposes existing trajectories into entirely new contexts without new motion data; and (2) prop-free teleoperation, where operators pantomime the task with no physical objects and the model synthesizes the target objects and scene afterwards. Real-world experiments show that our generated data consistently improves downstream policy performance and significantly reduces real-world data requirements across diverse manipulation tasks.

Compositional Generation

Interpret the same trajectory in multiple ways with different prior inputs.

Novel Objects

Change the object prior to swap in unseen target objects while keeping the same motion.

Novel Scenes

Replace the scene prior to place the robot in entirely new environments.

Novel Views

Re-render robot motion from new camera angles with matching scene priors.

Method Overview

RoboDream conditions video generation on three decoupled inputs: (1) a rendered robot-only trajectory that anchors the embodiment; (2) a scene prior (background image) that defines the environment; and (3) an object prior (cropped object image) that specifies the target objects. The rendered motion and scene prior are concatenated with the noisy latent frames along the channel dimension, while object prior tokens are injected via extended self-attention. The task instruction and the global trajectory are injected via cross-attention to ensure semantic and kinematic consistency.
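As a rough illustration of the conditioning layout described above, the following sketch tracks tensor shapes only. Every size, the patch size, the object token count, and the choice to extend only the keys and values with object tokens are assumptions made for illustration; the page does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentShape:
    """Shape of a video latent: (frames, channels, height, width)."""
    frames: int
    channels: int
    height: int
    width: int


def channel_concat(*parts: LatentShape) -> LatentShape:
    """Shape after concatenating latents along the channel axis."""
    first = parts[0]
    for part in parts[1:]:
        same_grid = (part.frames, part.height, part.width) == (
            first.frames, first.height, first.width)
        if not same_grid:
            raise ValueError("channel concat needs matching frames, height, width")
    total = sum(part.channels for part in parts)
    return LatentShape(first.frames, total, first.height, first.width)


def token_count(shape: LatentShape, patch: int) -> int:
    """Number of spatio-temporal tokens after patchifying each frame."""
    return shape.frames * (shape.height // patch) * (shape.width // patch)


# Hypothetical sizes, chosen only to make the arithmetic concrete.
noisy = LatentShape(frames=8, channels=16, height=32, width=32)
motion = LatentShape(frames=8, channels=16, height=32, width=32)  # rendered robot-only video
scene = LatentShape(frames=8, channels=16, height=32, width=32)   # scene prior, assumed repeated per frame

model_input = channel_concat(noisy, motion, scene)
video_tokens = token_count(model_input, patch=2)
object_tokens = 64  # assumed token count for the encoded object prior

# Assumed form of extended self-attention: queries come from the video tokens,
# while keys and values also include the object prior tokens.
attention_shape = (video_tokens, video_tokens + object_tokens)

print(model_input.channels)  # 48
print(attention_shape)       # (2048, 2112)
```

Under these assumptions, channel concatenation changes only the input width of the first layer, while the object prior lengthens the key/value sequence without adding any output tokens.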

Automated Prior Extraction

To train RoboDream, we construct training pairs from existing robot datasets without manual annotation. Given an initial observation and task instruction, we use GPT-5-nano to identify task-relevant objects, Grounded-SAM to segment them into an object prior, and OmniPaint to inpaint the background into a clean scene prior.
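The three extraction steps can be sketched as a small pipeline with pluggable stages. The sketch below does not call GPT-5-nano, Grounded-SAM, or OmniPaint; the callables `name_objects`, `segment`, and `inpaint` are hypothetical stand-ins for them, and the toy image type and stub stages exist only so the example runs.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]   # toy grayscale image as rows of pixel values
Mask = List[List[bool]]   # True where a task-relevant object is present


@dataclass
class TrainingPair:
    instruction: str
    object_names: List[str]
    object_prior: Image  # crop around the segmented objects
    scene_prior: Image   # observation with the objects inpainted away


def crop_to_mask(image: Image, mask: Mask) -> Image:
    """Crop to the bounding box of the mask, zeroing pixels outside the mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows or not cols:
        raise ValueError("mask is empty; no object to crop")
    return [
        [image[r][c] if mask[r][c] else 0 for c in range(cols[0], cols[-1] + 1)]
        for r in range(rows[0], rows[-1] + 1)
    ]


def extract_priors(
    image: Image,
    instruction: str,
    name_objects: Callable[[Image, str], List[str]],  # stand-in for the language model step
    segment: Callable[[Image, List[str]], Mask],      # stand-in for the segmentation step
    inpaint: Callable[[Image, Mask], Image],          # stand-in for the inpainting step
) -> TrainingPair:
    names = name_objects(image, instruction)
    mask = segment(image, names)
    return TrainingPair(instruction, names, crop_to_mask(image, mask), inpaint(image, mask))


# Toy stages so the sketch runs end to end.
def toy_name_objects(image: Image, instruction: str) -> List[str]:
    return ["cube"]


def toy_segment(image: Image, names: List[str]) -> Mask:
    return [[px > 100 for px in row] for row in image]


def toy_inpaint(image: Image, mask: Mask) -> Image:
    return [[10 if m else px for px, m in zip(row, mrow)] for row, mrow in zip(image, mask)]


observation = [
    [10, 10, 10, 10],
    [10, 200, 220, 10],
    [10, 210, 230, 10],
]
pair = extract_priors(observation, "put cube into cup",
                      toy_name_objects, toy_segment, toy_inpaint)
print(pair.object_prior)  # [[200, 220], [210, 230]]
print(pair.scene_prior)   # [[10, 10, 10, 10], [10, 10, 10, 10], [10, 10, 10, 10]]
```

Keeping the stages as injected callables means each one can be swapped or mocked independently, which is convenient when processing a large dataset without manual annotation.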

Deployment Modes

Retrieval and Rebirth

Given a new task, we retrieve semantically similar trajectories from an existing dataset (e.g., DROID). These are replayed in a simulator to render robot-only motion videos from novel camera viewpoints. Combined with new scene and object priors, RoboDream synthesizes demonstrations that are reborn in entirely new contexts, with no new motion data needed.
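The retrieval step can be illustrated with a minimal top-k search over instruction embeddings. The page does not say how semantic similarity is computed, so the bag-of-words `embed` function and the cosine score below are assumptions standing in for whatever text encoder is actually used; the instruction strings are made up for the example.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a stand-in for a real text encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, instructions: List[str], k: int) -> List[Tuple[int, float]]:
    """Return (index, score) for the k instructions most similar to the query."""
    q = embed(query)
    scored = [(i, cosine(q, embed(text))) for i, text in enumerate(instructions)]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical instructions attached to stored trajectories.
dataset = [
    "put the marker in the bowl",
    "open the drawer",
    "wipe the table with a towel",
]
top = retrieve("place the marker into the bowl", dataset, k=2)
print([index for index, _ in top])  # [0, 1]
```

The retrieved indices would then select which stored trajectories to replay in the simulator.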

Prop-Free Teleoperation

Operators control the robot to perform task motions with imaginary objects (pantomime), either in an empty workspace or in a simulator. The recorded trajectory is rendered, and RoboDream paints in arbitrary objects and scenes. One motion, multiple tasks: a single prop-free trajectory can be reused across different tasks by simply changing the object and scene priors.
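To make the reuse of one motion concrete, the sketch below expands a single recorded trajectory into one generation request per combination of object and scene prior. The `GenerationRequest` type, the identifiers, and the instruction template are hypothetical and only illustrate the bookkeeping; they are not part of RoboDream's interface.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class GenerationRequest:
    trajectory_id: str  # the recorded prop-free motion
    object_prior: str   # name of the object image to condition on
    scene_prior: str    # name of the background image to condition on
    instruction: str    # task instruction paired with the generated video


def expand_requests(trajectory_id: str, objects: List[str], scenes: List[str],
                    template: str) -> List[GenerationRequest]:
    """One request per (object, scene) pair, all sharing the same motion."""
    return [
        GenerationRequest(trajectory_id, obj, scene, template.format(object=obj))
        for obj, scene in product(objects, scenes)
    ]


requests = expand_requests(
    "pantomime_pick_001",              # hypothetical trajectory identifier
    objects=["marker", "cube"],
    scenes=["kitchen", "office", "lab"],
    template="pick up the {object}",
)
print(len(requests))            # 6
print(requests[0].instruction)  # pick up the marker
```

The number of candidate demonstrations grows with the product of the two prior sets, while the teleoperation effort stays fixed at one recording.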

Experimental Results

Retrieval and Rebirth for Policy Learning

We evaluate on four real-world manipulation tasks using a Franka Panda robot (DROID platform). All results are success rates in percent. Gen-Mix (50 real + 100 generated demos) achieves a 62.5% average success rate, significantly outperforming Real-50 (36.3%) and Orig-Mix (45.0%). Raw DROID data (Orig-100, not shown in the table) fails completely (0%) due to domain shift, while RoboDream successfully bridges this gap.

Task	Real-50	Orig-Mix	Gen-100	Gen-Mix
Put Cube into Cup	35	55	20	65
Put Marker into Bowl	30	35	15	55
Remove Marker from Bowl	20	20	5	35
Wipe Table with Towel	60	70	20	95
Average	36.3	45.0	15.0	62.5
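The Average row can be reproduced from the per-task numbers with a few lines of arithmetic. The sketch below copies the values from the table above and also computes the gap between Gen-Mix and the two baselines; nothing in it comes from outside the table.

```python
# Per-task success rates (%) in table order: cube into cup, marker into bowl,
# remove marker from bowl, wipe table with towel.
success = {
    "Real-50": [35, 30, 20, 60],
    "Orig-Mix": [55, 35, 20, 70],
    "Gen-100": [20, 15, 5, 20],
    "Gen-Mix": [65, 55, 35, 95],
}

averages = {name: sum(values) / len(values) for name, values in success.items()}
print(averages)
# {'Real-50': 36.25, 'Orig-Mix': 45.0, 'Gen-100': 15.0, 'Gen-Mix': 62.5}

# Gain of Gen-Mix over each baseline, in percentage points.
gain_over_real = averages["Gen-Mix"] - averages["Real-50"]
gain_over_orig = averages["Gen-Mix"] - averages["Orig-Mix"]
print(gain_over_real, gain_over_orig)  # 26.25 17.5
```

The unrounded Real-50 average is 36.25, which the table reports as 36.3.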

Prop-Free vs. Real Collection

Prop-free teleoperation reaches a 32.5% average success rate, close to the 36.3% of real data collection (Real-50), while collecting data roughly 2.2x faster.

Task	Real-50	Real w/ Gen	Prop-Free
Put Cube into Cup	35	25	30
Put Marker into Bowl	30	20	20
Remove Marker from Bowl	20	15	20
Wipe Table with Towel	60	60	60
Average	36.3	30.0	32.5

Scaling with Generated Data

Average success rate rises as more generated data is added, from 36.3% (Real-50) to 72.5% (Mix-200), and then saturates at 73.75% for Mix-300 and Mix-400.

Task	Real-50	Mix-100	Mix-200	Mix-300	Mix-400
Put Cube into Cup	35	65	75	80	75
Put Marker into Bowl	30	55	70	70	70
Remove Marker from Bowl	20	35	45	50	50
Wipe Table with Towel	60	95	100	95	100
Average	36.3	62.5	72.5	73.75	73.75
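The saturation in the scaling results can be made explicit by looking at the marginal gain between consecutive settings. The sketch below uses the per-task values from the scaling table; the 2-point threshold used to call a gain negligible is an assumption chosen for illustration, not a criterion stated on the page.

```python
from typing import List

# Per-task success rates (%) for each data setting, copied from the scaling table.
scaling = {
    "Real-50": [35, 30, 20, 60],
    "Mix-100": [65, 55, 35, 95],
    "Mix-200": [75, 70, 45, 100],
    "Mix-300": [80, 70, 50, 95],
    "Mix-400": [75, 70, 50, 100],
}

names = list(scaling)
averages = [sum(values) / len(values) for values in scaling.values()]
gains = [later - earlier for earlier, later in zip(averages, averages[1:])]
print(averages)  # [36.25, 62.5, 72.5, 73.75, 73.75]
print(gains)     # [26.25, 10.0, 1.25, 0.0]


def saturation_point(names: List[str], averages: List[float], min_gain: float) -> str:
    """First setting after which the next step improves by less than min_gain points."""
    for i in range(len(averages) - 1):
        if averages[i + 1] - averages[i] < min_gain:
            return names[i]
    return names[-1]


print(saturation_point(names, averages, min_gain=2.0))  # Mix-200
```

Most of the improvement comes from the first two increments; beyond Mix-200 the average changes by at most 1.25 points.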

Takeaways

Compositional Control: By conditioning on explicit scene and object priors, RoboDream enables controllable generation across arbitrary environments and objects without task-specific fine-tuning.
Zero-Shot Generalization: Trained on large-scale multi-environment data (DROID), RoboDream synthesizes valid interactions in scenarios unseen during training, with novel objects, scenes, and viewpoints.
Scalable Data Collection: Two deployment modes, retrieval and rebirth and prop-free teleoperation, address key bottlenecks in real-world data collection, reducing cost while maintaining quality.
Stronger Policies: Generated data consistently improves downstream policy performance. With 50 real demos augmented by 100 RoboDream-generated demos, average success reaches 62.5%, compared with 36.3% for the 50 real demos alone.

BibTeX

@article{ye2026robodream,
  title={RoboDream: Compositional World Models for Scalable Robot Data Synthesis},
  author={Ye, Junjie and Xue, Rong and Van Hoorick, Basile and Li, Runhao and Rajaprakash, Harshitha and Tokmakov, Pavel and Irshad, Muhammad Zubair and Guizilini, Vitor and Wang, Yue},
  journal={arXiv preprint arXiv:2606.02577},
  year={2026}
}
