Needle Lift
Behavior cloning can facilitate learning of dexterous manipulation skills, yet the complexity of surgical environments, the difficulty and expense of obtaining patient data, and robot calibration errors present unique challenges for surgical robot learning. We provide an enhanced surgical digital twin with photorealistic human anatomical organs, integrated into a comprehensive simulator designed to generate high-quality synthetic data for solving fundamental tasks in surgical autonomy. We present SuFIA-BC: visual Behavior Cloning policies for Surgical First Interactive Autonomy Assistants. We investigate visual observation spaces including multi-view cameras and 3D visual representations extracted from a single endoscopic camera view. Through systematic evaluation, we find that the diverse set of photorealistic surgical tasks introduced in this work enables nuanced evaluation of prospective behavior cloning models against the unique challenges posed by surgical environments. We observe that current state-of-the-art behavior cloning techniques struggle to solve the contact-rich and complex tasks evaluated in this work, regardless of their underlying perception or control architectures. These findings underscore the importance of tailoring perception pipelines, control architectures, and larger-scale synthetic datasets to the specific demands of surgical tasks.
This workflow illustrates the full pipeline for creating photorealistic anatomical models, from raw CT volume data to the final OpenUSD asset in NVIDIA Omniverse. The process includes organ segmentation, mesh conversion, mesh cleaning and refinement, and photorealistic texturing, culminating in a unified OpenUSD scene.
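The segmentation-to-mesh conversion step of this workflow can be sketched with an iso-surface extraction, as below. This is a minimal illustration, not the authors' actual pipeline: the synthetic sphere volume stands in for a real binary organ segmentation from CT, and `scikit-image`'s marching cubes stands in for the mesh-conversion tool used in practice.

```python
import numpy as np
from skimage import measure

# Synthetic stand-in for a binary organ segmentation from CT:
# a 64^3 voxel grid containing a sphere of radius 20 voxels.
grid = np.mgrid[:64, :64, :64]
dist = np.sqrt(((grid - 32.0) ** 2).sum(axis=0))
volume = (dist < 20).astype(np.float32)

# Marching cubes extracts a triangle mesh at the 0.5 iso-surface,
# analogous to the segmentation-to-mesh conversion step; the mesh
# would then be cleaned, refined, and textured before USD export.
verts, faces, normals, values = measure.marching_cubes(volume, level=0.5)

print(verts.shape, faces.shape)  # (N, 3) vertices and (M, 3) triangle indices
```

The resulting raw mesh is typically dense and noisy, which is why the workflow includes explicit cleaning and refinement stages before texturing.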
The photorealistic human organ models are available on GitHub.
3D Diffusion Policy rollout with a point cloud derived from the primary task camera.
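A camera-frame point cloud like the one consumed by this policy can be obtained by back-projecting the camera's depth image through a pinhole intrinsics model. The sketch below makes illustrative assumptions: the intrinsic values (`fx`, `fy`, `cx`, `cy`) and image size are placeholders, not the simulator's actual camera parameters, and the helper name `depth_to_pointcloud` is ours.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth image into an (N, 3) point cloud
    in the camera frame, using a pinhole intrinsics model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels

# Illustrative intrinsics and a flat synthetic depth map 1 m away.
depth = np.ones((480, 640), dtype=np.float32)
cloud = depth_to_pointcloud(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(cloud.shape)  # (307200, 3)
```

In practice the cloud would also be cropped to the workspace and downsampled before being fed to the policy.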
Tissue Retraction
Needle Lift
Needle Handover
Suture Pad
Block Transfer
Examples of ACT Multi-Camera models: trained on the primary camera view (train) and evaluated under two viewpoint perturbations: a minor shift in camera position (view 1) and a significant change in viewpoint (view 2). Note that in the multi-camera visual input, the wrist cameras maintain a consistent viewpoint throughout the evaluation.
(train)
(view 1)
(view 2)
Examples of ACT Multi-Camera models for needle instance generalization: We assess the effectiveness of policies trained only on the primary suture needle (Needle N1) at lifting previously unseen, irregularly shaped suture needles (Needles N2 - N5) at test time.
(Needle N1)
(Needle N2)
(Needle N3)
(Needle N4)
(Needle N5)
We would like to thank Miguel Guerrero, Vanni Brighella, and Ernesto Pacheco for their assistance in creating photorealistic human organ models.
For any questions, please feel free to contact Masoud Moghani and Animesh Garg.