Gesture Only
This baseline uses only gestures to interact with the robot. Here, we implement several buttons, each representing an action.
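As a rough illustration of this design, the baseline can be viewed as a fixed mapping from on-screen buttons to action primitives. The button labels, primitive names, and the send_action helper below are hypothetical placeholders introduced for illustration, not the system's actual interface.

```python
# Minimal sketch of the gesture-only baseline: each on-screen button is bound
# to one fixed action primitive. Button labels, primitive names, and the
# send_action helper are illustrative placeholders, not the real interface.

BUTTON_TO_ACTION = {
    "go_to_kitchen": {"primitive": "navigate", "args": {"location": "kitchen"}},
    "pick_object":   {"primitive": "pick",     "args": {"target": "selected_object"}},
    "place_object":  {"primitive": "place",    "args": {"target": "selected_surface"}},
    "open_gripper":  {"primitive": "open_gripper", "args": {}},
}

def send_action(primitive: str, **kwargs) -> None:
    # Placeholder: a real system would forward this to the robot's skill executor.
    print(f"Executing {primitive} with {kwargs}")

def on_button_pressed(button_name: str) -> None:
    """Dispatch the primitive bound to the pressed button."""
    action = BUTTON_TO_ACTION[button_name]
    send_action(action["primitive"], **action["args"])

on_button_pressed("go_to_kitchen")
```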
In this paper, we introduce RobiButler, a novel household robotic system that enables multimodal interactions with remote users. Building on advanced communication interfaces, RobiButler allows users to monitor the robot's status, send text or voice instructions, and select target objects by hand pointing. At the core of our system is a high-level behavior module, powered by Large Language Models (LLMs), that interprets multimodal instructions to generate action plans. These plans are composed of a set of open-vocabulary primitives supported by Vision Language Models (VLMs) that handle both text and pointing queries. The integration of these components allows RobiButler to ground remote multimodal instructions in real-world home environments in a zero-shot manner. We demonstrate the effectiveness and efficiency of this system on a variety of daily household tasks in which remote users give multimodal instructions. In addition, we conducted an extensive user study to analyze how multimodal interaction affects efficiency and user experience during remote human-robot interaction.
The robot system consists of three components: the Communication Interfaces, the High-level Behavior Manager, and the Fundamental Skills. The Communication Interfaces transmit the inputs received from the remote user to the High-level Behavior Manager, which composes the Fundamental Skills to interact with the environment, fulfilling the instructions or answering questions.
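To make the data flow across the three components concrete, the sketch below traces one remote instruction from the communication layer to skill execution. The UserInput structure, plan_with_llm stub, and primitive names are assumptions introduced for illustration and do not reproduce the system's actual API.

```python
# Minimal sketch of the three-component pipeline under assumed interfaces.
# UserInput, plan_with_llm, and the primitive names are illustrative stubs.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class UserInput:
    """Communication Interfaces: one message from the remote user."""
    text: str                                         # typed or transcribed instruction
    pointing: Optional[Tuple[float, float]] = None    # normalized image coordinates, if any

def plan_with_llm(user_input: UserInput) -> List[Tuple[str, Dict]]:
    """High-level Behavior Manager: compose primitives for the instruction (stubbed).

    A real implementation would prompt an LLM with the instruction and the list
    of available primitives; here we return a canned plan for illustration.
    """
    return [
        ("navigate", {"location": "kitchen"}),
        ("pick", {"query": user_input.text, "pointing": user_input.pointing}),
    ]

# Fundamental Skills: open-vocabulary primitives (stubbed with print statements).
PRIMITIVES: Dict[str, Callable[..., None]] = {
    "navigate": lambda **kw: print("navigate", kw),
    "pick":     lambda **kw: print("pick", kw),
}

def handle_request(user_input: UserInput) -> None:
    """Run the plan produced by the behavior manager, one primitive at a time."""
    for name, args in plan_with_llm(user_input):
        PRIMITIVES[name](**args)

handle_request(UserInput(text="bring me the red mug", pointing=(0.42, 0.57)))
```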
Voice Only
This baseline uses only voice to interact with the robot. We adopt TIO as the visual grounding module.
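The sketch below shows one way a spoken instruction could be transcribed and passed as a text query to a visual grounding module in this baseline. The transcribe and ground wrappers are hypothetical placeholders; they do not reproduce TIO's actual interface.

```python
# Minimal sketch of the voice-only baseline: transcribe speech, then ground the
# resulting text query in the camera image. transcribe() and ground() are
# hypothetical placeholders, not TIO's real API.
from typing import Tuple

def transcribe(audio_path: str) -> str:
    # Placeholder for an ASR call (e.g. a speech-to-text service).
    return "pick up the blue cup on the table"

def ground(image_path: str, query: str) -> Tuple[int, int, int, int]:
    # Placeholder wrapper around the visual grounding module, returning a
    # bounding box (x1, y1, x2, y2) for the referred object.
    return (120, 80, 220, 190)

def handle_voice_command(audio_path: str, image_path: str) -> None:
    query = transcribe(audio_path)
    box = ground(image_path, query)
    print(f"Grounded '{query}' to box {box}")  # then hand off to manipulation skills

handle_voice_command("command.wav", "camera_frame.jpg")
```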