Imagine being lost in a sprawling office building, a massive warehouse, or even a bustling department store. Now, imagine asking the nearest robot for directions. This futuristic scenario is quickly becoming reality, thanks to a revolutionary new navigation system developed by Google researchers.
Their approach combines natural language processing and computer vision, allowing robots to understand and respond to both verbal and visual instructions. Previously, robotic navigation required meticulous mapping and specific physical coordinates to guide the machine. Recent advances in what’s called vision-language navigation (VLN), however, let users simply give robots natural-language commands like “go to the workbench.”
Google’s researchers have taken this concept a step further by incorporating multimodal capabilities: users can now direct the robot with both natural language and visual cues. In a warehouse, for instance, a user could point to an item and ask, “What shelf does this go on?” Leveraging the processing power of Gemini 1.5 Pro, the AI interprets both the spoken question and the visual information, allowing it not only to understand the request but also to formulate a navigation path that leads the user directly to the correct shelf.
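The researchers haven’t released their pipeline, but the basic pattern of handing Gemini an image and a question together is easy to illustrate. The sketch below assumes the public google-generativeai Python SDK; the model name string, prompt wording, and image handling are illustrative stand-ins, not what actually runs on the robot.

```python
# Minimal sketch of a multimodal query, assuming the public
# google-generativeai SDK (pip install google-generativeai).
# The prompt and file names are illustrative placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
model = genai.GenerativeModel("gemini-1.5-pro")

# A frame from the robot's camera showing the item the user pointed at.
camera_frame = Image.open("camera_frame.jpg")

response = model.generate_content([
    camera_frame,
    "The user is pointing at the item in this image and asks: "
    "'What shelf does this go on?' "
    "Answer with the shelf location the robot should navigate to.",
])
print(response.text)
```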
The robots have been tested with a wide range of commands, demonstrating their versatility. For example, they can respond to requests like “Take me to the conference room with the double doors,” “Where can I borrow some hand sanitizer?” or even “I want to store something out of sight from public eyes. Where should I go?”
In one demonstration, a researcher activates the system with an “OK robot” before asking to be led somewhere “he can draw.” The robot responds with “Give me a minute. Thinking with Gemini…” before confidently setting off through the 9,000-square-foot DeepMind office in search of a large whiteboard.
These robots are already familiar with the office layout thanks to a technique the team calls “Multimodal Instruction Navigation with demonstration Tours (MINT).” This involves first manually guiding the robot through the environment, pointing out specific areas and features using natural language; the same effect can be achieved by simply recording a walkthrough video of the space on a smartphone. From this data, the AI generates a topological graph, matching what its cameras are seeing against the “goal frames” from the demonstration video.
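The paper’s exact graph construction isn’t spelled out here, but the underlying idea, where tour frames become nodes and the robot localizes by matching its current camera view to the closest stored frame, can be sketched roughly. In the sketch below, embed_frame is a hypothetical stand-in for a real image encoder, and networkx is simply a convenient container for the graph.

```python
# Illustrative only: tour frames become graph nodes, consecutive frames
# are linked, and localization is nearest-neighbor matching on embeddings.
import numpy as np
import networkx as nx


def embed_frame(frame: np.ndarray) -> np.ndarray:
    """Hypothetical placeholder for an image encoder (e.g. a CLIP-style model)."""
    v = frame.astype(np.float32).ravel()
    return v / (np.linalg.norm(v) + 1e-8)


def build_tour_graph(tour_frames: list[np.ndarray]) -> nx.Graph:
    """Each frame of the demonstration tour becomes a node; consecutive
    frames are connected, giving a topological (not metric) map."""
    graph = nx.Graph()
    for i, frame in enumerate(tour_frames):
        graph.add_node(i, embedding=embed_frame(frame))
        if i > 0:
            graph.add_edge(i - 1, i)
    return graph


def localize(graph: nx.Graph, current_view: np.ndarray) -> int:
    """Return the tour frame whose embedding best matches the camera view."""
    query = embed_frame(current_view)
    return max(
        graph.nodes,
        key=lambda n: float(np.dot(graph.nodes[n]["embedding"], query)),
    )
```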
The team then utilizes a hierarchical Vision-Language-Action (VLA) navigation policy, “combining the environment understanding and common sense reasoning.” This policy instructs the AI on how to translate user requests into navigational actions.
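As a hedged illustration rather than the team’s actual implementation, a hierarchical policy of this kind splits into a high-level reasoning step that picks a goal node in the tour graph and a low-level step that drives toward it. In the sketch below, query_vlm and drive_towards are hypothetical placeholders for the Gemini reasoning call and the robot’s motion controller.

```python
# A rough, hypothetical sketch of a two-level navigation policy:
# a high-level reasoner chooses a goal node in the tour graph,
# and a low-level controller walks the graph toward it.
import networkx as nx


def query_vlm(instruction: str, node_descriptions: dict[int, str]) -> int:
    """Hypothetical stand-in for the Gemini reasoning step: pick the tour
    node whose description best matches the request. Naive keyword overlap
    is used here purely for illustration."""
    words = set(instruction.lower().split())
    return max(
        node_descriptions,
        key=lambda n: len(words & set(node_descriptions[n].lower().split())),
    )


def drive_towards(node_id: int) -> None:
    """Hypothetical low-level controller: in a real system this would emit
    waypoints until the robot's camera view matches the stored goal frame."""
    print(f"driving toward tour frame {node_id}")


def navigate(instruction: str, graph: nx.Graph, current_node: int,
             node_descriptions: dict[int, str]) -> None:
    goal = query_vlm(instruction, node_descriptions)      # high level: reasoning
    path = nx.shortest_path(graph, source=current_node, target=goal)
    for node in path[1:]:                                 # low level: execution
        drive_towards(node)
```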
The results of these tests were highly successful, with the robots achieving “86 percent and 90 percent end-to-end success rates on previously infeasible navigation tasks involving complex reasoning and multimodal user instructions in a large real world environment,” according to the researchers.
Despite this impressive progress, the researchers acknowledge that there is still room for improvement. The robots cannot yet autonomously perform their own demonstration tour. Furthermore, the AI’s inference time, the time it takes to formulate a response, is currently 10 to 30 seconds, which can be a test of patience for users.
This research represents a significant step forward in robotic navigation. The ability of these robots to understand and respond to both verbal and visual cues has the potential to revolutionize the way we interact with and navigate complex environments. As the technology continues to evolve, we can expect to see even more sophisticated robots that can seamlessly assist us in our daily lives.