Off-Policy Evaluation with Online Adaptation for Robot Exploration in Challenging Environments
Autonomous exploration has many important applications. However, classic information-gain-based or frontier-based exploration relies only on the robot's current state to determine the immediate exploration goal; it cannot predict the value of future states and therefore makes inefficient exploration decisions. This paper presents a method to learn how "good" states are, measured by the state value function, to guide robot exploration in challenging real-world environments. We formulate our work as an off-policy evaluation (OPE) problem for robot exploration (OPERE). The method consists of offline Monte-Carlo training on real-world data, followed by online Temporal Difference (TD) adaptation that refines the trained value estimator. We also design an intrinsic reward function based on sensor information coverage, which enables the robot to gather more information under sparse extrinsic rewards. Results demonstrate that our method enables the robot to predict the value of future states and thereby better guide exploration. The proposed algorithm achieves better prediction performance than other state-of-the-art OPE methods. To the best of our knowledge, this work is the first to demonstrate value function prediction on real-world datasets for robot exploration in challenging subterranean and urban environments.
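The coverage-based intrinsic reward described above can be illustrated with a minimal sketch. The function name `coverage_reward`, the scale factor `alpha`, and the representation of observations as sets of map-cell IDs are assumptions for illustration, not the paper's exact formulation:

```python
def coverage_reward(seen_cells, new_cells, alpha=0.1):
    """Hypothetical intrinsic reward: proportional to the number of map
    cells observed by the sensors for the first time.

    seen_cells: set of cell IDs already covered (updated in place)
    new_cells:  set of cell IDs covered by the current sensor reading
    alpha:      illustrative scaling factor
    """
    novel = new_cells - seen_cells   # cells never observed before
    seen_cells |= novel              # mark them as covered
    return alpha * len(novel)        # reward only genuinely new coverage
```

Under this sketch, revisiting already-mapped regions yields zero intrinsic reward, so the robot is nudged toward unexplored areas even when extrinsic rewards are sparse.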
Method Overview
Our method consists of offline learning and online adaptation. First, we collect datasets consisting of camera images and projected map images. We then feed the data to the value function network and perform offline Monte-Carlo (MC) learning: the camera image and map-projection image are sent to parallel encoders whose outputs are aggregated to estimate the state value function. During online deployment, we perform one additional TD adaptation step to obtain the refined value function.
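The two learning stages above can be sketched in tabular form. This is a minimal illustration of offline MC return targets and a single online TD(0) adaptation step; the dictionary value estimator, function names, and hyperparameters are assumptions for exposition, not the paper's encoder-based network:

```python
import numpy as np

def mc_returns(rewards, gamma=0.99):
    """Discounted Monte-Carlo return targets for one episode of rewards,
    used as regression targets during offline training."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # G_t = r_t + gamma * G_{t+1}
        returns[t] = g
    return returns

def td_adapt(v, state, next_state, reward, lr=0.1, gamma=0.99):
    """One online TD(0) adaptation step on a tabular value estimator v,
    refining the offline-trained values toward the bootstrapped target."""
    td_target = reward + gamma * v.get(next_state, 0.0)
    td_error = td_target - v.get(state, 0.0)
    v[state] = v.get(state, 0.0) + lr * td_error
    return td_error
```

In the actual system the value estimator is a neural network over camera and map-projection features, so the TD step would be a gradient update on the TD error rather than a table write.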
Dataset Collection Environments
Experiment Results
With the learned value function, the robot can make better exploration decisions.
Regret Analysis in Corridor Environment (left) and Cave Environment (right).
Real Robot Experiments
Robot explores with learned value function.
Third-person views (left column), bag file replays (right column).
Exploration Behaviors Compared with Frontier-based Method
Ours with Learned Value (left column), Frontier-based Method (right column). With the learned value function, our method explores high-value regions that the frontier-based method fails to reach.
BibTeX
@article{2022opere,
author = {Yafei Hu and Junyi Geng and Chen Wang and John Keller and Sebastian Scherer},
title = {Off-Policy Evaluation with Online Adaptation for Robot Exploration in Challenging Environments},
journal = {IEEE Robotics and Automation Letters},
year = {2023},
volume = {8},
number = {6},
pages = {3780-3787},
doi = {10.1109/LRA.2023.3271520}
}