RobiButler: Remote Multimodal Interactions with Household Robot Assistant


School of Computing & Smart Systems Institute
National University of Singapore

Abstract

In this paper, we introduce RobiButler, a novel household robotic system that enables multimodal interactions with remote users. Building on advanced communication interfaces, RobiButler allows users to monitor the robot's status, send text or voice instructions, and select target objects by hand pointing. At the core of our system is a high-level behavior module, powered by Large Language Models (LLMs), that interprets multimodal instructions to generate action plans. These plans are composed of open-vocabulary primitives supported by Vision Language Models (VLMs) that handle both text and pointing queries. Integrating these components allows RobiButler to ground remote multimodal instructions in real-world home environments in a zero-shot manner. We demonstrate the effectiveness and efficiency of the system on a variety of daily household tasks in which remote users give multimodal instructions. In addition, we conduct an extensive user study to analyze how multimodal interaction affects efficiency and user experience during remote human-robot interaction.
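To make the interaction concrete, the following is a minimal sketch, under our own assumptions, of how a remote multimodal instruction (text or transcribed voice plus an optional pointing selection) could be represented before it reaches the behavior module; the class and field names are illustrative, not RobiButler's actual interface.

# Minimal sketch (not the authors' implementation) of a remote multimodal
# instruction. All names here are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class RemoteInstruction:
    text: Optional[str] = None                          # typed or speech-transcribed command
    pointing_uv: Optional[Tuple[float, float]] = None   # normalized image coords of a hand-pointing selection

def describe(instr: RemoteInstruction) -> str:
    """Summarize an instruction for logging on the remote monitoring interface."""
    parts = []
    if instr.text is not None:
        parts.append(f"text={instr.text!r}")
    if instr.pointing_uv is not None:
        parts.append(f"pointing at image coords {instr.pointing_uv}")
    return "RemoteInstruction(" + ", ".join(parts) + ")"

if __name__ == "__main__":
    print(describe(RemoteInstruction(text="bring me that cup",
                                     pointing_uv=(0.62, 0.41))))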




Overall Framework

The robot system consists of three components: the Communication Interfaces, the High-level Behavior Module, and the Fundamental Skills. The Communication Interfaces transmit the inputs received from the remote user to the High-level Behavior Module, which composes the Fundamental Skills to interact with the environment, fulfilling instructions or answering questions.
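The following is a minimal sketch, under our own assumptions, of the control flow described above: an LLM-based planner decomposes an instruction into a sequence of open-vocabulary primitives, and each primitive resolves its target through a VLM-style grounding query. call_llm_planner, ground_object, and the primitive set are hypothetical placeholders rather than the system's actual APIs.

# Hedged sketch of the plan-then-execute loop; placeholder functions stand in
# for the real LLM planner and VLM grounding calls.
from typing import Callable, Dict, List, Optional, Tuple

def call_llm_planner(instruction: str) -> List[Tuple[str, str]]:
    # Placeholder for an LLM call that returns (primitive, argument) pairs.
    return [("goto", "kitchen table"), ("pick", "the red cup"), ("place", "sink")]

def ground_object(query: str, pointing_uv: Optional[Tuple[float, float]] = None) -> str:
    # Placeholder for a VLM grounding call that accepts a text query and,
    # optionally, a pointing location supplied by the remote user.
    return f"<region for '{query}'>"

def goto(target: str) -> None:
    print(f"navigating to {ground_object(target)}")

def pick(target: str) -> None:
    print(f"grasping {ground_object(target)}")

def place(target: str) -> None:
    print(f"placing held object at {ground_object(target)}")

PRIMITIVES: Dict[str, Callable[[str], None]] = {"goto": goto, "pick": pick, "place": place}

def run_instruction(instruction: str) -> None:
    """Plan with the LLM, then execute each open-vocabulary primitive in order."""
    for name, argument in call_llm_planner(instruction):
        PRIMITIVES[name](argument)

if __name__ == "__main__":
    run_instruction("put the red cup on the table into the sink")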

Framework for RobiButler



Tasks for Experiments



Baselines in Experiment II


Gesture Only

This baseline uses only gestures to interact with the robot. We implement several buttons that represent the available actions.
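As an illustration only, the sketch below shows one way such a button-per-action interface could dispatch clicks to fixed robot actions; the button labels and the execute callback are assumptions, not the study's actual implementation.

# Hypothetical button-to-action mapping for a gesture-only interface.
from typing import Callable, Dict

def make_button_handlers(execute: Callable[[str], None]) -> Dict[str, Callable[[], None]]:
    """Map each on-screen button to a fixed robot action so the user can
    drive the robot by pointing and clicking alone (no text or voice)."""
    actions = ["move forward", "turn left", "turn right", "pick", "place"]
    # Bind each label via a default argument to avoid late-binding issues.
    return {label: (lambda a=label: execute(a)) for label in actions}

if __name__ == "__main__":
    handlers = make_button_handlers(lambda a: print(f"robot executes: {a}"))
    handlers["pick"]()  # simulate the user clicking the "pick" button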

Voice Only

This baseline uses only voice to interact with the robot. We adopt TIO as the visual grounding module.
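The sketch below illustrates, under our own assumptions, the voice-only pipeline: speech is transcribed to text and the referred object is resolved by a visual grounding module. transcribe_speech and tio_ground are placeholders standing in for the speech recognizer and for TIO; we do not reproduce TIO's real interface here.

# Hedged sketch of a voice-only command pipeline with placeholder components.
from typing import Tuple

def transcribe_speech(audio_path: str) -> str:
    # Placeholder for a speech-to-text call.
    return "pick up the mug next to the laptop"

def tio_ground(image_path: str, query: str) -> Tuple[int, int, int, int]:
    # Placeholder for the visual grounding model; returns a box (x1, y1, x2, y2).
    return (120, 80, 220, 200)

def handle_voice_command(audio_path: str, image_path: str) -> None:
    text = transcribe_speech(audio_path)
    box = tio_ground(image_path, text)
    print(f"command: {text!r} -> grounded box {box}")

if __name__ == "__main__":
    handle_voice_command("command.wav", "robot_view.png")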