PIXER: Learning Visual Information Utility

* Equal contribution
Center for Embodied Autonomy and Robotics (CEAR)
University at Buffalo

Abstract

Accurate feature detection is fundamental to various computer vision tasks, including autonomous robotics, 3D reconstruction, medical imaging, and remote sensing. Despite advancements in enhancing the robustness of visual features, no existing method measures the utility of visual information before it is processed by feature-specific algorithms. To address this gap, we introduce PIXER and the concept of “Featureness”, which reflects the inherent interest and reliability of visual information for robust recognition, independent of any specific feature type. Leveraging a generalization of Bayesian learning, our approach quantifies both the probability and uncertainty of a pixel's contribution to robust visual utility in a single-shot process, avoiding costly operations such as Monte Carlo sampling and permitting customizable featureness definitions adaptable to a wide range of applications. We evaluate PIXER on visual odometry with featureness selectivity, achieving an average 31% improvement in trajectory RMSE with 49% fewer features.

Video Presentation

Coming Soon!

Method

Training PIXER is a three-step process. First, we train a network with a general understanding of interestingness (i.e., feature-point detection); in this work we use SiLK (top left). Next, we convert this model to a Bayesian Neural Network (BNN) and retrain it with additional probabilistic losses (e.g., KL divergence; top middle). Finally, we train a specialized uncertainty head supervised by feature variance computed via Monte Carlo sampling of the BNN (top right). The PIXER inference model is then the joint feature-point probability and uncertainty networks (bottom middle). The combination of pixel-wise probability and uncertainty forms our definition of featureness F (bottom right), used to describe the general utility of the visual information; a minimal sketch of these two ingredients follows.
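The sketch below illustrates the two ingredients in PyTorch: a Monte Carlo variance target used only at training time to supervise the uncertainty head, and one plausible way to combine per-pixel probability and uncertainty into F. The combination rule, threshold tau, and function names are our illustrative assumptions; the paper leaves the exact featureness definition customizable per application.

import torch

def featureness(prob: torch.Tensor, var: torch.Tensor,
                tau: float = 0.5) -> torch.Tensor:
    """Illustrative featureness map F from per-pixel outputs.

    prob -- (H, W) feature-point probability from the probability network
    var  -- (H, W) predicted variance from the uncertainty head
    tau  -- hypothetical keep threshold (application-dependent)
    """
    # One plausible rule: down-weight probability where the model is
    # uncertain, then threshold into a binary mask of useful pixels.
    score = prob * (1.0 - var.clamp(0.0, 1.0))  # soft featureness score
    return score >= tau                          # binary featureness mask

@torch.no_grad()
def mc_variance_target(bnn: torch.nn.Module, image: torch.Tensor,
                       n_samples: int = 20) -> torch.Tensor:
    """Training-time supervision target for the uncertainty head:
    per-pixel variance over stochastic forward passes of the BNN.
    PIXER's single-shot inference avoids this sampling entirely."""
    samples = torch.stack([bnn(image) for _ in range(n_samples)])
    return samples.var(dim=0)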

The Davis Dataset

We evaluate PIXER-aided visual odometry on a custom dataset, named "Davis", collected with a ZED 2i camera and a Mosaic X5 GNSS receiver mounted on a Boston Dynamics Spot quadruped. Results in the table below show superior estimation performance, with a mean RMSE improvement of 34% and a mean feature reduction of 41%.

Results

Visual odometry (VO) performance results. Filtering features with PIXER yields a lower RMSE (31% on average across all datasets) and a lower frame-to-frame execution time for VO estimation (0.63%, despite the added model inference). This enables using lighter, faster features such as Shi-Tomasi while achieving performance better than SIFT (e.g., on KITTI and Davis). We see a considerable reduction in the number of keypoints across all datasets, roughly 49% (KP% shows the percentage reduction, while KPmean shows the mean number of keypoints extracted per image). mean(FArea) is the average percentage of pixels masked with F. A sketch of mask-based keypoint filtering follows.
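To illustrate how a featureness mask can gate a lightweight detector like Shi-Tomasi, the sketch below passes the mask through OpenCV's goodFeaturesToTrack. This is only one way to wire PIXER into a VO front end; the function name, thresholds, and parameter values are our assumptions, not the paper's pipeline.

import cv2
import numpy as np

def filter_keypoints_with_featureness(gray: np.ndarray,
                                      featureness_mask: np.ndarray,
                                      max_corners: int = 1000):
    """Run Shi-Tomasi detection restricted to high-featureness pixels.

    gray             -- (H, W) uint8 grayscale image
    featureness_mask -- (H, W) binary featureness mask F (nonzero = keep)
    """
    # OpenCV expects an 8-bit single-channel mask; detection is skipped
    # wherever the mask is zero, discarding low-utility regions up front.
    mask = (featureness_mask > 0).astype(np.uint8) * 255
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=7,
                                      mask=mask)
    return corners  # (N, 1, 2) array of (x, y) positions, or None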

BibTeX

@inproceedings{turkar2024pixer,
  title={Learning Visual Information Utility with {PIXER}},
  author={Yash Turkar and Timothy Chase Jr and Christo Aluckal and Karthik Dantu},
  year={2024}
}