VISION: Visual Inspection System with Intelligent Observation and Navigation

Yash Turkar, Yashom Dighe, Karthik Dantu
* Equal contribution
Center for Embodied Autonomy and Robotics (CEAR)
University at Buffalo
Code: coming soon

Preliminary results at Culvert 110 (Gasport, NY). A query image (right) is processed by an open-vocabulary VLM to produce region proposals (red boxes) with natural-language rationales and normalized follow-up probabilities, distributing attention across the scene. The viewpoint planner (center) uses geometry/diameter estimates to generate gimbal-feasible next-best views in the culvert coordinate frame. Executing these poses yields targeted, high-resolution inspection imagery (left), which documents features such as concentrated cracking and rough joints, possible micro-cracking/spalling near the throat, white efflorescence/mineral deposits consistent with moisture ingress, and dark wet staining/pooling indicative of seepage. Together the panels illustrate VISION’s closed-loop see → decide → move → re-image workflow from proposals to actionable, validated observations.
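To make the planner's role concrete, the following is a minimal geometric sketch in Python, not the released VISION code: it models the culvert as a cylinder of the estimated diameter, assumes the camera sits near the culvert axis, and converts a proposed wall region into a viewing direction that is checked against assumed gimbal limits. The function name, frame convention, and limit values are illustrative assumptions only.

import numpy as np

def next_best_view(theta, z_region, diameter, z_robot,
                   pan_limit=np.radians(170.0), tilt_limit=np.radians(60.0)):
    """Illustrative sketch: candidate view for a wall region in a cylindrical
    culvert frame (z along the culvert axis). All conventions are assumptions,
    not VISION's actual planner interface."""
    r = diameter / 2.0
    # Proposed region on the culvert wall: angle `theta` around the axis,
    # axial position `z_region`.
    wall_point = np.array([r * np.cos(theta), r * np.sin(theta), z_region])
    # Camera assumed near the culvert axis at axial position `z_robot`.
    camera = np.array([0.0, 0.0, z_robot])
    view = wall_point - camera
    view /= np.linalg.norm(view)
    # Decompose the viewing direction into an azimuth in the cross-section
    # plane and an elevation along the culvert axis, then check them against
    # assumed gimbal pan/tilt limits.
    azimuth = np.arctan2(view[1], view[0])
    elevation = np.arcsin(np.clip(view[2], -1.0, 1.0))
    feasible = abs(azimuth) <= pan_limit and abs(elevation) <= tilt_limit
    return camera, azimuth, elevation, feasible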

Abstract

The culverts beneath the Erie Canal demand frequent, high-fidelity inspection due to age, wear, and heterogeneous environments. Long-tailed, site-specific degradation modes, limited labeled data, and shifting imaging conditions undermine closed-set detection and segmentation approaches. Open-vocabulary vision–language models (VLMs) offer a path around taxonomy lock-in, but remain difficult to adapt and fine-tune for niche infrastructure domains. We introduce VISION, an end-to-end autonomous inspection pipeline that couples web-scale VLMs with viewpoint planning to close the loop—see → decide → move → re-image. Deployed at Culvert 110 (Gasport, NY), VISION repeatedly and accurately localized, prioritized, and re-imaged defects, capturing targeted, high-resolution inspection imagery while producing structured descriptions that support downstream condition assessment.
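As a minimal sketch of the closed loop the abstract describes, the snippet below normalizes proposal scores into follow-up probabilities, selects the highest-probability region, moves to a planned view, and re-images. The vlm, planner, and robot objects and their methods, and the proposal dictionary keys, are hypothetical placeholders under stated assumptions, not VISION's released API.

import numpy as np

def softmax(scores, temperature=1.0):
    """Normalize raw proposal scores into follow-up probabilities."""
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def inspection_step(vlm, planner, robot, query_image):
    # See: open-vocabulary region proposals with rationales and raw scores.
    proposals = vlm.propose_regions(query_image,
                                    prompt="signs of structural defects")
    # Decide: normalize scores into probabilities and pick a target region.
    probs = softmax([p["score"] for p in proposals])
    target = proposals[int(np.argmax(probs))]
    # Move: plan and execute a gimbal-feasible next-best view for the region.
    pose = planner.next_best_view(target["region"])
    robot.move_to(pose)
    # Re-image: capture targeted, high-resolution imagery for documentation.
    return robot.capture(), target["rationale"]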

BibTeX

@inproceedings{turkar2025VISION,
  title={VISION: Visual Inspection System with Intelligent Observation and Navigation},
  author={Yash Turkar and Yashom Dighe and Karthik Dantu},
  year={2025}
}