PIXER: Learning Visual Information Utility

* Equal contribution
Center for Embodied Autonomy and Robotics (CEAR)
University at Buffalo

Abstract

Accurate feature detection is fundamental to various computer vision tasks, including autonomous robotics, 3D reconstruction, medical imaging, and remote sensing. Despite advancements in enhancing the robustness of visual features, no existing method measures the utility of visual information before it is processed by feature-specific algorithms. To address this gap, we introduce PIXER and the concept of “Featureness”, which reflects the inherent interest and reliability of visual information for robust recognition, independent of any specific feature type. Leveraging a generalization of Bayesian learning, our approach quantifies both the probability and uncertainty of a pixel's contribution to robust visual utility in a single-shot process, avoiding costly operations such as Monte Carlo sampling and permitting customizable featureness definitions adaptable to a wide range of applications. We evaluate PIXER on visual odometry with featureness selectivity, achieving an average 31% improvement in trajectory RMSE with 49% fewer features.

Video Presentation

Coming Soon!

Method

Training PIXER is a three-step process. First, we train a network with a general understanding of interestingness (i.e., feature-point detection); in this work we use SiLK (top left). Next, we convert this model to a Bayesian Neural Network (BNN) and retrain it with additional probabilistic losses (e.g., KL divergence; top middle). Finally, we train a specialized uncertainty head supervised by feature variance computed via Monte Carlo sampling of the BNN (top right). The PIXER inference model is then the joint feature-point probability and uncertainty networks (bottom middle). The combination of pixel-wise probability and uncertainty forms our definition of featureness F (bottom right), used to describe the general utility of the visual information; a minimal sketch of these two ingredients follows.
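The sketch below illustrates the two ingredients in PyTorch: a Monte Carlo variance target used only at training time to supervise the uncertainty head, and one plausible way to combine per-pixel probability and uncertainty into F. The combination rule, threshold tau, and function names are our illustrative assumptions; the paper leaves the exact featureness definition customizable per application.

import torch

def featureness(prob: torch.Tensor, var: torch.Tensor,
                tau: float = 0.5) -> torch.Tensor:
    """Illustrative featureness map F from per-pixel outputs.

    prob -- (H, W) feature-point probability from the probability network
    var  -- (H, W) predicted variance from the uncertainty head
    tau  -- hypothetical keep threshold (application-dependent)
    """
    # One plausible rule: down-weight probability where the model is
    # uncertain, then threshold into a binary mask of useful pixels.
    score = prob * (1.0 - var.clamp(0.0, 1.0))  # soft featureness score
    return score >= tau                          # binary featureness mask

@torch.no_grad()
def mc_variance_target(bnn: torch.nn.Module, image: torch.Tensor,
                       n_samples: int = 20) -> torch.Tensor:
    """Training-time supervision target for the uncertainty head:
    per-pixel variance over stochastic forward passes of the BNN.
    PIXER's single-shot inference avoids this sampling entirely."""
    samples = torch.stack([bnn(image) for _ in range(n_samples)])
    return samples.var(dim=0)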

The Davis Dataset

We evaluate PIXER-aided visual odometry on a custom dataset, named "Davis", collected with a ZED 2i camera and a Mosaic X5 GNSS receiver mounted on a Boston Dynamics Spot quadruped. Results in the table below show superior estimation performance, with a mean RMSE improvement of 34% and a mean feature reduction of 41%.

Results

Visual odometry (VO) performance results. Filtering features with PIXER yields a lower RMSE (31% on average across all datasets) and a lower frame-to-frame execution time for VO estimation (0.63%, despite the added model inference). This enables using lighter, faster features such as Shi-Tomasi while achieving performance better than SIFT (e.g., on KITTI and Davis). We see a considerable reduction in the number of keypoints across all datasets, roughly 49% (KP% shows the percentage reduction, while KPmean shows the mean number of keypoints extracted per image). mean(FArea) is the average percentage of pixels masked with F. A sketch of mask-based keypoint filtering follows.
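To illustrate how a featureness mask can gate a lightweight detector like Shi-Tomasi, the sketch below passes the mask through OpenCV's goodFeaturesToTrack. This is only one way to wire PIXER into a VO front end; the function name, thresholds, and parameter values are our assumptions, not the paper's pipeline.

import cv2
import numpy as np

def filter_keypoints_with_featureness(gray: np.ndarray,
                                      featureness_mask: np.ndarray,
                                      max_corners: int = 1000):
    """Run Shi-Tomasi detection restricted to high-featureness pixels.

    gray             -- (H, W) uint8 grayscale image
    featureness_mask -- (H, W) binary featureness mask F (nonzero = keep)
    """
    # OpenCV expects an 8-bit single-channel mask; detection is skipped
    # wherever the mask is zero, discarding low-utility regions up front.
    mask = (featureness_mask > 0).astype(np.uint8) * 255
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=7,
                                      mask=mask)
    return corners  # (N, 1, 2) array of (x, y) positions, or None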

BibTeX

@inproceedings{turkar2024pixer,
  title={Learning Visual Information Utility with {PIXER}},
  author={Yash Turkar and Timothy Chase Jr and Christo Aluckal and Karthik Dantu},
  year={2024}
}