2nd Workshop on Semantic Reasoning and Goal Understanding in Robotics (SemRob)
Robotics Science and Systems Conference (RSS 2025)
Olin Hall of Engineering (OHE) #122, USC Campus, June 21, Los Angeles, USA
Schedule
08:30 | Organizers' Introductory Remarks
08:40 | Keynote 1: Jesse Thomason, "Embracing Language as Grounded Communication"
Abstract:
Language is not text data; it is a human medium for communication. Much of the natural language processing (NLP) community has doubled down on treating digital text as a sufficient approximation of language, scaling datasets and corresponding models to fit that text. I have argued that experience in the world grounds language, tying it to objects, actions, and concepts. In fact, I believe that language carries meaning only when considered alongside that world, and that the current zeitgeist in NLP research misses the mark on the truly interesting questions at the intersection of human language and machine computation. In this talk, I'll highlight some of the ways my lab enables agents and robots to better understand and respond to human communication by considering the grounded context in which that communication occurs, including neurosymbolic multimodal reasoning, natural language dialogue and interaction for lifelong learning, and applying NLP technologies to non-text communication.
Keynote references: PSALM, ProgPrompt
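For readers unfamiliar with ProgPrompt (linked above), its core idea is to phrase task planning as code completion: the prompt lists the robot's action primitives and scene objects as Python-like code, and the LLM completes a plan function one primitive action per line. Below is a minimal sketch of that prompting pattern, not the paper's code; the `query_llm` helper is a hypothetical stand-in for whatever LLM completion API is available.

```python
# ProgPrompt-style prompt construction (a minimal sketch, not the paper's code).
ACTIONS = ["grab(obj)", "putin(obj, container)", "open(obj)", "close(obj)"]
OBJECTS = ["apple", "fridge", "cabinet"]

def build_prompt(task: str) -> str:
    """Express the environment as Python-like code, then ask the LLM to
    complete a plan function, one primitive action per line."""
    header = "\n".join(f"# action: {a}" for a in ACTIONS)
    return f"{header}\nobjects = {OBJECTS}\n\ndef {task.replace(' ', '_')}():\n"

prompt = build_prompt("put apple in fridge")
# plan = query_llm(prompt)  # hypothetical LLM call; the returned lines are
#                           # then executed against the robot's primitives
```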
09:00 | Keynote 2: Manolis Savva, "Towards Realistic & Interactive 3D Simulation for Embodied AI"
Abstract:
3D simulators are increasingly being used to develop and evaluate "embodied AI": agents that perceive and act in realistic environments. Much of the prior work in this space has treated simulators as "black boxes" within which learning algorithms are deployed. However, the system characteristics of the simulation platforms themselves, and the datasets used with these platforms, both greatly impact the feasibility and the outcomes of simulation-based experiments. In this talk, I will describe several recent projects that outline emerging challenges and opportunities in the development of 3D simulation for embodied AI.
Bio: Manolis Savva is an Associate Professor at Simon Fraser University and a Canada Research Chair in Computer Graphics. His research focuses on the analysis, organization, and generation of 3D content. Prior to his current position, he was a visiting researcher at Facebook AI Research and a postdoctoral researcher at Princeton University. He received his Ph.D. from Stanford University under the supervision of Pat Hanrahan. His work has been recognized through several awards, including an ACM UIST notable paper award (ReVision), an ICCV best paper nomination (Habitat), two SGP dataset awards (ShapeNet, SGP 2018; ScanNet, SGP 2020), the 2022 Graphics Interface early career researcher award, and an ICLR 2023 outstanding paper award (Emergence of Maps).
Keynote references: Habitat Synthetic Scenes Dataset (HSSD), SceneMotifCoder, S2O: Static to Openable, CAGE: Controllable Articulation GEneration, SINGAPO
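One way the abstract's point about simulator "system characteristics" becomes concrete is raw stepping throughput: how fast a simulator advances bounds how much experience a learning algorithm can collect. Here is a small sketch for measuring steps per second of any Gymnasium-style environment; the `make_env` factory in the usage comment is a placeholder, not an API from the talk.

```python
import time

def steps_per_second(env, n_steps: int = 1000) -> float:
    """Measure raw stepping throughput of a Gymnasium-style environment.
    Throughput, and how it scales with scene complexity and sensor
    configuration, is one 'system characteristic' that determines which
    experiments are feasible."""
    env.reset()
    start = time.perf_counter()
    for _ in range(n_steps):
        _, _, terminated, truncated, _ = env.step(env.action_space.sample())
        if terminated or truncated:
            env.reset()
    return n_steps / (time.perf_counter() - start)

# Usage (make_env is a placeholder for the simulator under test):
# env = make_env("some_scene")
# print(f"{steps_per_second(env):.0f} steps/s")
```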
09:20 | Keynote 3: Dorsa Sadigh, "Human-Aligned Robot Learning: Manipulation Policies via Preferences, RLHF, and VLM Feedback"
Abstract: TBD
09:40 | Spotlight Talks: Papers #1, #4, #13, #17
10:00 | Keynote 4: Yonatan Bisk, "Semantics? Reasoning? Can we define either of those terms?"
Abstract:
In this talk I'll discuss some recent work on language-conditioned robotics, but I might also spend time questioning the basic assumptions of all of our work, and asking whether we're all misguided about the important questions in robotics.
10:20 | Keynote 5: Ted Xiao, "Full-stack Robotics Foundation Models: From Embodied Reasoning to Dexterity"
Abstract:
Advances in data-driven robot learning have accelerated progress towards general-purpose robotic control. While improvements in Vision-Language-Action (VLA) models and large-scale imitation learning have enabled early multipurpose robotic foundation models, progress is often a direct reflection of the robot training dataset distribution or of bespoke algorithmic adjustments. This stands in stark contrast to trends in multimodal frontier models, where capability improvements come not only from nuanced small-scale design decisions but from properly harnessing the fundamental intelligence scaling laws of the underlying frontier model. In this talk, I will discuss how perspectives from frontier modeling can inspire and guide robotics research. In particular, I will cover how Gemini Robotics tackles robotics with a truly full-stack approach: improving multimodal frontier model capabilities like embodied reasoning yields a generalizable, steerable, and dexterous robot foundation model.
Keynote references: https://deepmind.google/models/gemini-robotics/
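As background for the abstract's VLA terminology: a Vision-Language-Action model is queried in a closed loop, mapping the current camera image plus the language instruction to a short chunk of low-level actions. The sketch below is schematic only; the `VLAPolicy` class and the `robot` interface are illustrative placeholders, not the Gemini Robotics API.

```python
import numpy as np

class VLAPolicy:
    """Schematic Vision-Language-Action policy: maps (image, instruction)
    to a short chunk of low-level actions. Illustrative only."""

    def __init__(self, action_dim: int = 7, chunk: int = 8):
        self.action_dim, self.chunk = action_dim, chunk

    def predict(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A real VLA tokenizes image and text, runs a multimodal
        # transformer, and decodes action tokens; zeros stand in here.
        return np.zeros((self.chunk, self.action_dim))

def control_loop(policy: VLAPolicy, robot, instruction: str, steps: int = 100):
    """Closed-loop execution: predict an action chunk, execute it, repeat.
    `robot` is a hypothetical interface with get_image()/apply_action()."""
    for _ in range(0, steps, policy.chunk):
        actions = policy.predict(robot.get_image(), instruction)
        for action in actions:
            robot.apply_action(action)
```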
10:40 | Keynote 6: Benjamin Alt, "Semantic Digital Twins for Robust and Flexible Robot Behavior"
10:50 | Coffee Break, Socializing, Posters
11:30 | Keynote 7: Lerrel Pinto, "On Building General-Purpose Home Robots"
Abstract:
The "generalist machine" in homes, a domestic assistant that can adapt to and learn from our needs while remaining cost-effective, has been a goal steadily pursued in robotics for decades. In this talk, I will present our recent efforts towards building such capable home robots. First, I will discuss how large, pretrained vision-language models can induce strong priors for mobile manipulation tasks like pick-and-drop. But pretrained models can only take us so far. To scale beyond basic picking, we will need systems and algorithms that rapidly learn new skills. This requires creating new tools to collect data, improving representations of the visual world, and enabling trial-and-error learning during deployment. While much of the work presented focuses on two-fingered hands, I will briefly introduce learning approaches for multi-fingered hands, which support more dexterous behaviors and rich touch sensing combined with vision. Finally, I will outline unsolved problems that were not obvious at the outset and which, when solved, will bring us closer to general-purpose home robots.
Keynote references: Robot Utility Models, EgoZero: Robot Learning from Smart Glasses, DynaMem, Point Policy
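The abstract's first claim, that pretrained vision-language models induce strong priors for pick-and-drop, reduces in its simplest form to open-vocabulary detection feeding a scripted manipulation primitive. A minimal sketch of that pattern follows; `open_vocab_detect` and the robot primitives are hypothetical stand-ins, not code from the referenced papers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Box:
    center: Tuple[int, int]  # pixel (u, v) of the detected object

def open_vocab_detect(rgb, query: str) -> Optional[Box]:
    """Hypothetical stand-in for a VLM-based open-vocabulary detector."""
    raise NotImplementedError("plug in a real detector here")

def pick_and_drop(robot, pick_query: str, drop_query: str) -> bool:
    """Localize both objects by language query, then hand pixel targets to
    scripted grasp/place primitives (hypothetical robot interface)."""
    rgb, depth = robot.get_rgbd()
    pick = open_vocab_detect(rgb, pick_query)   # e.g. "red mug"
    drop = open_vocab_detect(rgb, drop_query)   # e.g. "blue bin"
    if pick is None or drop is None:
        return False  # the VLM prior failed; caller can explore instead
    robot.grasp_at(pick.center, depth)          # pixel + depth -> grasp pose
    robot.place_at(drop.center, depth)
    return True
```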
11:50 | Debate: Implicit/Data-Emergent Reasoning Capabilities versus Explicit Reasoning Mechanisms?
Panelists: Jesse Thomason, Ted Xiao, Manolis Savva, Lerrel Pinto, Yonatan Bisk, Benjamin Alt
12:30 | Organizers' Closing Remarks
(Required acknowledgement: the Microsoft CMT service was used for managing the peer-reviewing process for this conference. This service was provided for free by Microsoft and they bore all expenses, including costs for Azure cloud services as well as for software development and support.)
Congratulations to Paper #13 (WoMAP: World Models For Embodied Open-Vocabulary Object Localization) for winning the Best Paper Award, and to Paper #1 (Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models) for winning the Best Paper Runner-up Award!
Organizers:
Andrew Melnik*
Bremen University
Jonathan Francis*
Bosch Center for AI; Carnegie Mellon University
Michelle Zhao
Carnegie Mellon University
Ishika Singh
University of Southern California
Siddhant Haldar
New York University
Mehreen Naeem
Bremen University
Krishan Rana
QUT Centre for Robotics
* — co-leads