Embodied Tree of Thoughts:
Deliberate Robot Manipulation Planning with World Models


Wenjiang Xu1,5, Jiayu Wang2, Rui Fang2, Mingkang Zhang2, Lusong Li3, Jiayuan Gu4, Zecui Zeng3†, Rui Chen2†
1 University of Chinese Academy of Sciences (UCAS); 2 Tsinghua University; 3 JD Explore Academy; 4 ShanghaiTech University; 5 Nanjing University
† Corresponding author

Abstract

World models have emerged as a pivotal component in robot manipulation planning, enabling agents to predict future environmental states and reason about the consequences of actions before execution. While video-generation models are increasingly adopted, they often lack rigorous physical grounding, which leads to hallucinations and to violations of physical constraints over long horizons. To address these limitations, we propose Embodied Tree of Thoughts (EToT), a novel Real2Sim2Real planning framework that leverages a physics-based interactive digital twin as an embodied world model. EToT formulates manipulation planning as a tree search expanded through two synergistic mechanisms: (1) Priori Branching, which generates diverse candidate execution paths based on semantic and spatial analysis; and (2) Reflective Branching, which uses VLMs to diagnose execution failures within the simulator and iteratively refine the planning tree with corrective actions. By grounding high-level reasoning in a physics simulator, our framework ensures that generated plans adhere to rigid-body dynamics and collision constraints. We validate EToT on a suite of short- and long-horizon manipulation tasks, where it consistently outperforms baselines by effectively predicting physical dynamics and adapting to potential failures.
Framework of the Embodied Tree of Thoughts

Given a task instruction, the system first reconstructs the real scene into an interactive 3D digital twin. It then constructs a world-model-grounded planning tree through Priori Branching and Reflective Branching. Priori Branching proposes initial candidate branches, while Reflective Branching analyzes simulated execution failures to expand the tree with revised branches. Through iterative searching and expansion of the planning tree, the system identifies a feasible plan, which is finally executed on the real robot in a closed-loop manner with visual feedback and re-planning.
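The search loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: `simulate`, `priori_branching`, and `reflective_branching` are hypothetical stand-ins for the digital twin, the semantic and spatial proposal step, and the VLM-based failure diagnosis, and the toy scene (a tennis ball blocking a microwave door) is hard-coded for illustration only. The breadth-first expansion order and the expansion budget are also assumptions.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Plan = Tuple[str, ...]  # an ordered sequence of high-level actions


@dataclass
class Node:
    plan: Plan                     # actions from the initial state to this node
    failure: Optional[str] = None  # failure reported by the simulator, if any
    children: List["Node"] = field(default_factory=list)


def simulate(plan: Plan) -> Optional[str]:
    """Hypothetical stand-in for the physics-based digital twin.

    Returns None if the plan succeeds, otherwise a failure description.
    """
    ball_blocks_door = True  # assumed toy scene: a ball sits in front of the door
    door_open = False
    for action in plan:
        if action == "pick up the tennis ball":
            ball_blocks_door = False
        elif action == "open the door":
            if ball_blocks_door:
                return "door collides with the tennis ball"
            door_open = True
    return None if door_open else "door is still closed"


def priori_branching(task: str) -> List[Plan]:
    """Hypothetical proposer of initial candidate plans for the task."""
    return [("open the door",)]


def reflective_branching(plan: Plan, failure: str) -> List[Plan]:
    """Hypothetical stand-in for VLM failure diagnosis: returns revised plans."""
    if "tennis ball" in failure:
        return [("pick up the tennis ball", "put on the desk") + plan]
    return []


def search(task: str, max_expansions: int = 10) -> Optional[Plan]:
    """Expand the planning tree until a simulated plan succeeds or the budget ends."""
    root = Node(plan=())
    frontier = deque()
    for plan in priori_branching(task):
        child = Node(plan=plan)
        root.children.append(child)
        frontier.append(child)
    expansions = 0
    while frontier and expansions < max_expansions:
        node = frontier.popleft()
        expansions += 1
        node.failure = simulate(node.plan)
        if node.failure is None:
            return node.plan  # feasible plan found in simulation
        # Reflective step: attach revised branches derived from the failure.
        for plan in reflective_branching(node.plan, node.failure):
            child = Node(plan=plan)
            node.children.append(child)
            frontier.append(child)
    return None


if __name__ == "__main__":
    print(search("Open the door of the microwave oven"))
```

In this toy run, the initial branch fails in simulation and the reflective step adds a branch that first clears the obstacle. Closed-loop execution on the real robot, with visual feedback and re-planning, is outside the scope of the sketch.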

Task 1
Instruction: Open the door of the microwave oven.
Views: simulator, camera, third-person.
Planning tree: Initial State, Open the door.
New branch: Pick up the tennis ball, Put on the desk, Open the door.
Task 2
Instruction: Reorient the pen and drop it into a pen holder.
Views: simulator, camera, third-person.
Planning tree: Initial State, Pick up the pen, Put into holder 1, Put into holder 2.
Task 3
Instruction: Pick up the holder horizontally or vertically.
Views: simulator, camera, third-person.
Planning tree: Initial State, Pick up holder (horizontally), Pick up holder (vertically).
Task 4
Instruction: Close the drawer.
Views: simulator, camera, third-person.
Planning tree: Initial State, Close the drawer.
New branch: Pick up the toy, Put on the drawer, Close the drawer.
Disturbance
Instruction: Pick up a tennis ball.
Views: simulator, camera, third-person.
Planning tree: Initial State, Pick up tennis ball 1, Pick up tennis ball 2.
Disturbance: Pick up tennis ball 1, Pick up tennis ball 2.
Task 5
Instruction: Reorient the pen and drop it into a pen holder.
Views: simulator, camera, third-person.
Planning tree: Initial State, Pick up the pen, Put into holder 1, Put into holder 2.
New branch: Pick up the apple, Put on the desk, Pick up the pen, Put into holder 2.
Task 6
Instruction: Put the apple and pen holder on the drawer, apple in the holder.
Views: simulator, camera, third-person.
Planning tree: Initial State, Pick up the apple, Put in the holder, Pick up the holder (3).
New branch: Pick up the holder (4), Put on the drawer (5), Pick up the apple (6), Put into holder (7).
Task 7
Instruction: Put the apple and tennis ball in drawer or holder.
Views: simulator, camera, third-person.
Planning tree (node numbers as shown in the figure): Initial State (0), Pick apple (1), Put in holder (3), Pick tennis ball (5), Put into holder 1 (8), Open drawer (6), Pick tennis ball (9).
New branch: Put on drawer (11), Open drawer (12), Pick tennis ball (13), Put in drawer (14), Close drawer (15).
Further nodes: Open drawer (2), Pick apple (4), Put in drawer (7), Close drawer (10).

Citation

  @misc{xu2025embodiedtreethoughtsdeliberate,
        title={Embodied Tree of Thoughts: Deliberate Manipulation Planning with Embodied World Model}, 
        author={Wenjiang Xu and Cindy Wang and Rui Fang and Mingkang Zhang and Lusong Li and Jing Xu and Jiayuan Gu and Zecui Zeng and Rui Chen},
        year={2025},
        eprint={2512.08188},
        archivePrefix={arXiv},
        primaryClass={cs.RO},
        url={https://arxiv.org/abs/2512.08188},
  }