Human-like Navigation in a World Built for Humans
- University of Illinois Urbana-Champaign
CoRL 2025
Abstract
When navigating in a man-made environment they haven't visited before—like an office building—humans employ behaviors such as reading signs and asking others for directions. These behaviors help humans reach their destinations efficiently by reducing the need to search through large areas. Existing robot navigation systems lack the ability to execute such behaviors and are thus highly inefficient at navigating within large environments. We present ReasonNav, a modular navigation system which integrates these human-like navigation skills by leveraging the reasoning capabilities of a vision-language model (VLM). We design compact input and output abstractions based on navigation landmarks, allowing the VLM to focus on language understanding and reasoning. We evaluate ReasonNav on real and simulated navigation tasks and show that the agent successfully employs higher-order reasoning to navigate efficiently in large, complex buildings.
Method
We let a VLM agent choose navigation landmarks, leveraging its reasoning abilities to recognize patterns like ascending room numbers, while abstracting away details regarding complex spatial data and precise numerical control.
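As a toy illustration of the kind of pattern the VLM is expected to exploit (not code from ReasonNav itself), the sketch below picks a corridor direction from visible room numbers; the function name and inputs are hypothetical.

```python
# Toy illustration only: the ascending-room-number pattern the VLM recognizes
# when choosing among door landmarks. Not part of the ReasonNav system.
def pick_door_toward(target_room: int, visible_doors: dict) -> str:
    """visible_doors maps a direction label to the room number seen that way,
    e.g. {"east": 3104, "west": 3096}; return the direction whose numbers are
    trending toward the target room."""
    return min(visible_doors, key=lambda d: abs(visible_doors[d] - target_room))

# If the target is room 3120 and the corridor shows 3104 to the east and 3096
# to the west, ascending numbers point east.
print(pick_door_toward(3120, {"east": 3104, "west": 3096}))  # -> "east"
```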
ReasonNav is a modular system that integrates human-like navigation behaviors through a Vision-Language Model (VLM) agent. While VLMs excel at language understanding and commonsense reasoning, they struggle with complex spatial data and precise numerical outputs. To address this, we design compact input and output abstractions centered on the concept of landmarks—salient objects critical for navigation, including doors, people, directional signs, and map frontiers.
Our system maintains a memory bank that stores detected landmarks along with navigation-relevant information gathered through interaction.
For doors, we attach room label text; for people, we store summaries of directions they provide;
for signs, we record cardinal directions and their associated text. The VLM receives this information in two forms:
a JSON-formatted memory bank and a top-down map visualization with landmarks plotted by category and index.
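To make the input abstraction concrete, the sketch below shows one way such a landmark memory bank could be serialized to JSON for the VLM prompt. The field names, coordinates, and labels are our illustrative assumptions, not the exact schema used by ReasonNav.

```python
# Illustrative sketch only: structure and field names are assumptions,
# not ReasonNav's released schema.
import json

# Hypothetical memory bank: one entry per detected landmark, grouped by
# category and indexed so the VLM can refer to "door 1" or "sign 0" on the
# top-down map visualization.
memory_bank = {
    "doors": [
        {"index": 0, "room_label": "3102", "position_m": [4.2, -1.5], "visited": False},
        {"index": 1, "room_label": "3104", "position_m": [7.8, -1.6], "visited": False},
    ],
    "people": [
        {"index": 0, "directions_summary": "Said the 3100 corridor continues past the elevators."},
    ],
    "signs": [
        {"index": 0, "text": "3100-3120", "arrow_direction": "east"},
    ],
    "frontiers": [
        {"index": 0, "position_m": [12.0, 3.5]},  # boundary of the explored map
    ],
}

# The JSON string is placed in the VLM prompt alongside the rendered top-down map.
prompt_context = json.dumps(memory_bank, indent=2)
print(prompt_context)
```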
Based on the VLM's selection, ReasonNav executes predefined behavior primitives tailored to each landmark type.
This design enables the VLM to employ higher-order reasoning—such as following ascending room numbers or interpreting directional signs—
without being burdened by low-level control. The modular architecture separates high-level decision-making (VLM-driven) from
low-level execution (localization, mapping, and path planning), allowing ReasonNav to navigate efficiently in large, complex environments
through human-like exploration strategies.
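A minimal sketch of this decision loop is shown below, under our own assumptions about the interfaces: the primitive bodies are stubs standing in for the low-level localization, mapping, and path-planning modules, and the VLM's choice is passed in as a plain dictionary rather than produced by an actual model call.

```python
# Minimal sketch of the high-level/low-level split; names and signatures are
# hypothetical, not ReasonNav's actual API.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Landmark:
    category: str  # "door", "person", "sign", or "frontier"
    index: int

# Stub behavior primitives: in the real system these wrap localization,
# mapping, and path planning, and return new information for the memory bank.
def check_door(lm: Landmark) -> dict:
    return {"doors": [{"index": lm.index, "visited": True}]}

def ask_directions(lm: Landmark) -> dict:
    return {"people": [{"index": lm.index, "directions_summary": "..."}]}

def read_sign(lm: Landmark) -> dict:
    return {"signs": [{"index": lm.index, "text": "..."}]}

def explore_frontier(lm: Landmark) -> dict:
    return {"frontiers": [{"index": lm.index, "reached": True}]}

PRIMITIVES: Dict[str, Callable[[Landmark], dict]] = {
    "door": check_door,
    "person": ask_directions,
    "sign": read_sign,
    "frontier": explore_frontier,
}

def navigation_step(choice: dict, memory_bank: dict) -> dict:
    # `choice` stands in for the VLM's output, e.g. {"category": "door", "index": 1},
    # produced from the goal, the JSON memory bank, and the top-down map.
    landmark = Landmark(choice["category"], choice["index"])
    observation = PRIMITIVES[landmark.category](landmark)
    # Fold the newly gathered information back into the memory bank so the
    # next VLM query can reason over it.
    for category, entries in observation.items():
        memory_bank.setdefault(category, []).extend(entries)
    return memory_bank
```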
Walkthrough
In this video, we show a successful demonstration of ReasonNav delivering a water bottle to a professor's office in an unknown environment. The target room number was obtained automatically via a simple web search for the professor's office. The VLM's step-by-step reasoning is shown alongside a bird's-eye-view map and the RealSense camera view.
Skills
Based on the selected navigation landmark, the robot will execute one of four behavior primitives: reading signs, reading room numbers, asking people for directions, and exploring frontiers.
Read Signs
Check Doors
Explore Frontiers
Examples
Multi-floor Example
Citation
If you find our work useful in your research, please consider citing:
@inproceedings{chandaka2025reasonnav,
  author    = {Chandaka, Bhargav and Wang, Gloria and Chen, Haozhe and Che, Henry and Zhai, Albert and Wang, Shenlong},
  title     = {Human-like Navigation in a World Built for Humans},
  booktitle = {Conference on Robot Learning},
  year      = {2025}
}