Human-like Navigation in a World Built for Humans
- University of Illinois Urbana-Champaign
CoRL 2025
Abstract
When navigating in a man-made environment they haven't visited before—like an office building—humans employ behaviors such as reading signs and asking others for directions. These behaviors help humans reach their destinations efficiently by reducing the need to search through large areas. Existing robot navigation systems lack the ability to execute such behaviors and are thus highly inefficient at navigating within large environments. We present ReasonNav, a modular navigation system which integrates these human-like navigation skills by leveraging the reasoning capabilities of a vision-language model (VLM). We design compact input and output abstractions based on navigation landmarks, allowing the VLM to focus on language understanding and reasoning. We evaluate ReasonNav on real and simulated navigation tasks and show that the agent successfully employs higher-order reasoning to navigate efficiently in large, complex buildings.
Method
We let a VLM agent choose navigation landmarks, leveraging its reasoning abilities to recognize patterns like ascending room numbers, while abstracting away details regarding complex spatial data and precise numerical control.
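As a toy illustration of the kind of pattern the VLM is expected to exploit (not code from ReasonNav itself), the sketch below picks a corridor direction from visible room numbers; the function name and inputs are hypothetical.

```python
# Toy illustration only: the ascending-room-number pattern the VLM recognizes
# when choosing among door landmarks. Not part of the ReasonNav system.
def pick_door_toward(target_room: int, visible_doors: dict) -> str:
    """visible_doors maps a direction label to the room number seen that way,
    e.g. {"east": 3104, "west": 3096}; return the direction whose numbers are
    trending toward the target room."""
    return min(visible_doors, key=lambda d: abs(visible_doors[d] - target_room))

# If the target is room 3120 and the corridor shows 3104 to the east and 3096
# to the west, ascending numbers point east.
print(pick_door_toward(3120, {"east": 3104, "west": 3096}))  # -> "east"
```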
ReasonNav is a modular system that integrates human-like navigation behaviors through a Vision-Language Model (VLM) agent. While VLMs excel at language understanding and commonsense reasoning, they struggle with complex spatial data and precise numerical outputs. To address this, we design compact input and output abstractions centered on the concept of landmarks—salient objects critical for navigation, including doors, people, directional signs, and map frontiers.
Our system maintains a memory bank that stores detected landmarks along with navigation-relevant information gathered through interaction.
For doors, we attach room label text; for people, we store summaries of directions they provide;
for signs, we record cardinal directions and their associated text. The VLM receives this information in two forms:
a JSON-formatted memory bank and a top-down map visualization with landmarks plotted by category and index.
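To make the input abstraction concrete, the sketch below shows one way such a landmark memory bank could be serialized to JSON for the VLM prompt. The field names, coordinates, and labels are our illustrative assumptions, not the exact schema used by ReasonNav.

```python
# Illustrative sketch only: structure and field names are assumptions,
# not ReasonNav's released schema.
import json

# Hypothetical memory bank: one entry per detected landmark, grouped by
# category and indexed so the VLM can refer to "door 1" or "sign 0" on the
# top-down map visualization.
memory_bank = {
    "doors": [
        {"index": 0, "room_label": "3102", "position_m": [4.2, -1.5], "visited": False},
        {"index": 1, "room_label": "3104", "position_m": [7.8, -1.6], "visited": False},
    ],
    "people": [
        {"index": 0, "directions_summary": "Said the 3100 corridor continues past the elevators."},
    ],
    "signs": [
        {"index": 0, "text": "3100-3120", "arrow_direction": "east"},
    ],
    "frontiers": [
        {"index": 0, "position_m": [12.0, 3.5]},  # boundary of the explored map
    ],
}

# The JSON string is placed in the VLM prompt alongside the rendered top-down map.
prompt_context = json.dumps(memory_bank, indent=2)
print(prompt_context)
```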
Based on the VLM's selection, ReasonNav executes predefined behavior primitives tailored to each landmark type.
This design enables the VLM to employ higher-order reasoning—such as following ascending room numbers or interpreting directional signs—
without being burdened by low-level control. The modular architecture separates high-level decision-making (VLM-driven) from
low-level execution (localization, mapping, and path planning), allowing ReasonNav to navigate efficiently in large, complex environments
through human-like exploration strategies.
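A minimal sketch of this decision loop is shown below, under our own assumptions about the interfaces: the primitive bodies are stubs standing in for the low-level localization, mapping, and path-planning modules, and the VLM's choice is passed in as a plain dictionary rather than produced by an actual model call.

```python
# Minimal sketch of the high-level/low-level split; names and signatures are
# hypothetical, not ReasonNav's actual API.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Landmark:
    category: str  # "door", "person", "sign", or "frontier"
    index: int

# Stub behavior primitives: in the real system these wrap localization,
# mapping, and path planning, and return new information for the memory bank.
def check_door(lm: Landmark) -> dict:
    return {"doors": [{"index": lm.index, "visited": True}]}

def ask_directions(lm: Landmark) -> dict:
    return {"people": [{"index": lm.index, "directions_summary": "..."}]}

def read_sign(lm: Landmark) -> dict:
    return {"signs": [{"index": lm.index, "text": "..."}]}

def explore_frontier(lm: Landmark) -> dict:
    return {"frontiers": [{"index": lm.index, "reached": True}]}

PRIMITIVES: Dict[str, Callable[[Landmark], dict]] = {
    "door": check_door,
    "person": ask_directions,
    "sign": read_sign,
    "frontier": explore_frontier,
}

def navigation_step(choice: dict, memory_bank: dict) -> dict:
    # `choice` stands in for the VLM's output, e.g. {"category": "door", "index": 1},
    # produced from the goal, the JSON memory bank, and the top-down map.
    landmark = Landmark(choice["category"], choice["index"])
    observation = PRIMITIVES[landmark.category](landmark)
    # Fold the newly gathered information back into the memory bank so the
    # next VLM query can reason over it.
    for category, entries in observation.items():
        memory_bank.setdefault(category, []).extend(entries)
    return memory_bank
```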
Walkthrough
In this video, we show a successful demonstration of ReasonNav delivering a water bottle to a professor's office in an unknown environment. The target room number was obtained automatically via a simple web search for the professor's office. The VLM's step-by-step reasoning is shown alongside a bird's-eye-view map and the RealSense camera view.
Skills
Based on the selected navigation landmark, the robot will execute one of four behavior primitives: reading signs, reading room numbers, asking people for directions, and exploring frontiers.
Read Signs
Check Doors
Explore Frontiers
Examples
Multi-floor Example
Citation
If you find our work useful in your research, please consider citing:
@inproceedings{chandaka2025reasonnav,
  author    = {Chandaka, Bhargav and Wang, Gloria and Chen, Haozhe and Che, Henry and Zhai, Albert and Wang, Shenlong},
  title     = {Human-like Navigation in a World Built for Humans},
  booktitle = {Conference on Robot Learning},
  year      = {2025}
}