Visual navigation using only a single camera and a topological map has recently become an appealing alternative to methods that require additional sensors and 3D maps. This is typically achieved through an image-relative approach, where control is estimated from a given pair of current observation and subgoal image. However, image-level representations of the world are limited because images are strictly tied to the agent's pose and embodiment. In contrast, objects, being a property of the map, offer an embodiment- and trajectory-invariant world representation. In this work, we present a new paradigm of learning object-relative control that exhibits several desirable characteristics:
              
a) New routes can be traversed without strictly requiring the imitation of prior experience,
              
b) The control prediction problem can be decoupled from solving the image matching problem, and
              
c) High invariance can be achieved in cross-embodiment deployment, covering variations across both training-testing and mapping-execution settings.
              
We propose a topometric map representation in the form of a relative 3D scene graph, which is used to obtain more informative object-level global path planning costs. We train a local controller, dubbed ObjectReact, conditioned directly on a high-level "WayObject Costmap" representation that eliminates the need for an explicit RGB input. We demonstrate the advantages of learning object-relative control over its image-relative counterpart across sensor height variations and multiple navigation tasks that challenge the underlying spatial understanding capability, e.g., navigating a map trajectory in the reverse direction. We further show that our sim-only policy is able to generalize well to real-world indoor environments.
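
The pipeline above can be summarized as: the object-level scene graph supplies a global path cost for every object node, currently visible objects are rasterized into an egocentric "WayObject Costmap" carrying those costs, and the local controller maps this costmap (with no RGB input) to velocity commands. Below is a minimal Python sketch of that data flow; the function and class names (object_path_costs, build_wayobject_costmap, ObjectReactPolicy), the grid encoding, and the toy MLP are illustrative assumptions, not the paper's actual implementation.

    # Minimal sketch of the object-relative control pipeline described above.
    # Assumptions (not from the paper): the graph layout, the costmap encoding,
    # and the small MLP policy are illustrative placeholders.
    import networkx as nx
    import numpy as np
    import torch
    import torch.nn as nn

    def object_path_costs(scene_graph: nx.Graph, goal_object: int) -> dict:
        # Global planning cost for each object node = graph distance to the goal object.
        return nx.single_source_shortest_path_length(scene_graph, goal_object)

    def build_wayobject_costmap(visible_objects, path_costs, grid=(32, 32)):
        # Rasterize currently visible objects into an egocentric costmap.
        # visible_objects: list of (u, v, node_id) with image-plane coords in [0, 1].
        # Each object's cell holds its normalized path cost to the goal; empty
        # cells stay at 1.0 (max cost). This encoding is an assumption.
        costmap = np.ones(grid, dtype=np.float32)
        max_cost = max(path_costs.values()) or 1
        for u, v, node_id in visible_objects:
            if node_id not in path_costs:
                continue
            r, c = int(v * (grid[0] - 1)), int(u * (grid[1] - 1))
            costmap[r, c] = path_costs[node_id] / max_cost
        return costmap

    class ObjectReactPolicy(nn.Module):
        # Toy controller conditioned only on the costmap (no RGB input).
        def __init__(self, grid=(32, 32)):
            super().__init__()
            self.net = nn.Sequential(
                nn.Flatten(),
                nn.Linear(grid[0] * grid[1], 128), nn.ReLU(),
                nn.Linear(128, 2),  # (linear velocity, angular velocity)
            )

        def forward(self, costmap: torch.Tensor) -> torch.Tensor:
            return self.net(costmap)

    # Usage: a tiny 4-object scene graph, with object 3 as the goal.
    g = nx.path_graph(4)  # objects 0-1-2-3 linked by proximity
    costs = object_path_costs(g, goal_object=3)
    costmap = build_wayobject_costmap([(0.2, 0.5, 1), (0.8, 0.4, 2)], costs)
    policy = ObjectReactPolicy()
    velocity_cmd = policy(torch.from_numpy(costmap).unsqueeze(0))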
            
 
Phone video: used to generate the object-level map.
Robot deployment highlights: a) cross-embodiment deployment between mapping and execution; b) avoidance of new obstacles not present during mapping; c) low-light adaptation; d) alternate-goal tasks.
Long mapping video: the mapping run takes a longer path to the goal (using a phone camera). Deployment: the robot takes a direct path to the cardboard cutout goal.
 
       
       
      
      @inproceedings{garg2025objectreact,
        title={ObjectReact: Learning Object-Relative Control for Visual Navigation},
        author={Garg, Sourav and Craggs, Dustin and Bhat, Vineeth and Mares, Lachlan and Podgorski, Stefan and Krishna, Madhava and Dayoub, Feras and Reid, Ian},
        booktitle={Conference on Robot Learning},
        year={2025},
        organization={PMLR}
      }