PIXER: Enhancing Visual Odometry with Reliable Pixel Masking

Conference on Robots and Vision (CRV) 2026

* Equal contribution
Center for Embodied Autonomy and Robotics (CEAR)
University at Buffalo

Abstract

Robust feature detection and matching are fundamental for visual odometry and SLAM, yet most methods lack a principled measure of a feature's reliability prior to downstream use. We present PIXER, a learning-based method that predicts the interest reliability of each pixel for feature-based visual navigation. PIXER is designed as a lightweight, single-shot model that outputs dense reliability maps from a single image, using a generalized Bayesian formulation without requiring Monte Carlo sampling. These outputs are used to selectively filter low-utility features prior to matching, improving downstream matching and pose estimation. Integrated into a standard visual odometry pipeline, PIXER improves average trajectory accuracy by 31% while reducing feature usage by 49% across eight different feature detectors. Our results demonstrate that pre-filtering input imagery based on learned reliability enhances the robustness and efficiency of SLAM systems. Code, models, and datasets will be made publicly available upon publication.

Video Presentation

Coming Soon!

Method

The training of PIXER is a three-step process. First, we train a network with a general understanding of interestingness (i.e., feature point detection), using SiLK in this work (top left). Next, we convert this model to a Bayesian Neural Network (BNN) and retrain with additional probabilistic losses (e.g., KL divergence, top middle). Finally, we train a specialized uncertainty head supervised on feature variance computed via Monte Carlo sampling from the BNN (top right). The PIXER inference model is then the joint feature-point probability and uncertainty networks (bottom middle). The combination of pixel-wise probability and uncertainty forms our definition of featureness F (bottom right), which describes the general utility of the visual information at each pixel.
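To make the featureness idea concrete, here is a minimal sketch of combining a per-pixel interest probability map with an uncertainty map into a single featureness map F. The exact combination rule used by PIXER is not stated above, so the formula below (probability down-weighted by normalized uncertainty) is an illustrative assumption, not the paper's definition.

```python
import numpy as np

def featureness(prob, var, eps=1e-8):
    """Illustrative featureness map: high where the detector is both
    confident (high probability) and stable (low uncertainty).

    prob: per-pixel interest probability in [0, 1]
    var:  per-pixel predictive variance from the uncertainty head
    """
    # Normalize variance to [0, 1] so it acts as a penalty term.
    sigma = (var - var.min()) / (var.max() - var.min() + eps)
    return prob * (1.0 - sigma)

# Toy 2x2 example: one pixel is confident and low-variance,
# another is uncertain despite moderate probability.
prob = np.array([[0.9, 0.2],
                 [0.7, 0.5]])
var = np.array([[0.01, 0.50],
                [0.10, 0.05]])
F = featureness(prob, var)
```

In this sketch the confident, low-variance pixel at (0, 0) receives the highest featureness, while the high-variance pixel at (0, 1) is suppressed to near zero.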

The Davis Dataset

We evaluate PIXER-aided visual odometry on a custom dataset, named "Davis", collected using a ZED 2i camera and a Mosaic X5 GNSS mounted on a Boston Dynamics Spot quadruped. Results in the table below show superior estimation performance, with a mean RMSE improvement of 34% and a mean feature reduction of 41%.

Results

Visual odometry (VO) performance results. Features filtered using PIXER yield a lower RMSE (31% on average across all datasets) and lower frame-to-frame execution time for VO estimation (0.63%, despite the inclusion of model inference). This enables using lighter, faster features like Shi-Tomasi while achieving performance better than SIFT (e.g., KITTI & Davis). We see a considerable reduction in the number of keypoints in all datasets, roughly 49% (KP% shows the percentage reduction, while KPmean shows the mean number of keypoints extracted per image). mean(FArea) is the average percentage of pixels masked with F.
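The keypoint reduction described above can be sketched as a simple pre-matching filter: keypoints from any detector are kept only where the featureness map exceeds a threshold. The threshold value and function name here are illustrative assumptions, not PIXER's actual configuration.

```python
import numpy as np

def filter_keypoints(keypoints, F, threshold=0.5):
    """Keep only keypoints at pixels whose featureness exceeds a threshold.

    keypoints: (N, 2) integer array of (row, col) pixel coordinates
    F:         (H, W) featureness map
    threshold: illustrative cutoff; the real system's setting may differ
    """
    rows, cols = keypoints[:, 0], keypoints[:, 1]
    keep = F[rows, cols] > threshold
    return keypoints[keep]

# Toy example: four detected keypoints on a 2x2 featureness map.
F = np.array([[0.9, 0.1],
              [0.6, 0.3]])
kps = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
filtered = filter_keypoints(kps, F)  # keeps the keypoints at (0,0) and (1,0)
```

Only the surviving keypoints are passed to the matcher, which is where the reported reductions in feature count and execution time come from.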

BibTeX

@inproceedings{turkar2026enhancing,
  title={Enhancing Visual Odometry with Reliable Pixel Masking},
  author={Yash Turkar and Timothy Chase and Christo Aluckal and Karthik K Dantu},
  booktitle={23rd Conference on Robots and Vision},
  year={2026},
  url={https://openreview.net/forum?id=ZgWAW6mIQh}
}