Crane: Context-guided Prompt Learning and Attention Refinement for Zero-shot Anomaly Detection

Alireza Salehi1, Mohammadreza Salehi2,
Reshad Hosseini1, Cees G. M. Snoek2, Makoto Yamada3, Mohammad Sabokrou3
1 University of Tehran, 2 University of Amsterdam, 3 Okinawa Institute of Science and Technology

Abstract

Anomaly Detection (AD) is crucial for medical diagnostics and industrial defect detection. Traditional AD methods rely on normal training samples, but collecting such data is often impractical, and these methods also struggle to generalize across domains.

Recent advancements like AnomalyCLIP and AdaCLIP leverage CLIP’s zero-shot generalization but struggle to bridge the gap between image-level and pixel-level anomaly detection.

🚀 Context-guided Prompt Learning and Attention Refinement for Zero-shot Anomaly Detection (Crane) improves upon these methods by:

  • Context-Guided Prompt Learning: Dynamically conditioning text prompts using image context.
  • Attention Refinement: Modifying the CLIP vision encoder to enhance feature extraction for fine-grained anomaly detection.

As shown in the radar plot below, our method achieves state-of-the-art results, improving image-level detection accuracy by +0.9% to +4.9% and pixel-level anomaly localization by +2.8% to +29.6% across 14 datasets spanning industrial and medical domains, demonstrating its effectiveness at both anomaly detection and localization.

Radar plot of quantitative results: Crane compared with prior zero-shot methods across the 14 industrial and medical datasets.

Method

We propose a unified framework that uses CLIP as a zero-shot backbone \( M_{\theta} \) for classification and segmentation while adapting it to anomaly detection, bridging the domain gap between CLIP’s pretraining and specialized anomaly detection tasks. As shown in the figure below, we learn class-agnostic input prompts \( P \) together with trainable tokens inserted into the text encoder \( \Phi_t \), guided by visual feedback from the vision encoder \( \Phi_v \).
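As a concrete illustration of this conditioning, the sketch below follows the spirit of conditional prompt learning; the class name `ContextGuidedPrompts`, the `meta_net` projection, and the token dimensions are illustrative assumptions rather than Crane’s exact implementation.

```python
# A minimal sketch of context-guided prompt learning (PyTorch). Module and
# parameter names here are illustrative assumptions, not Crane's exact code.
import torch
import torch.nn as nn

class ContextGuidedPrompts(nn.Module):
    def __init__(self, n_ctx: int = 12, dim: int = 512):
        super().__init__()
        # Class-agnostic learnable prompt tokens P, shared across object classes
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Small network mapping the image's global embedding to a context shift
        self.meta_net = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, dim)
        )

    def forward(self, image_embedding: torch.Tensor) -> torch.Tensor:
        # image_embedding: (B, dim) global feature from the vision encoder Phi_v
        bias = self.meta_net(image_embedding)                # (B, dim)
        # Condition every learnable prompt token on the image context
        prompts = self.ctx.unsqueeze(0) + bias.unsqueeze(1)  # (B, n_ctx, dim)
        return prompts  # prepended to the text tokens fed into Phi_t
```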

To handle dense prediction, we introduce two attention-refinement branches: the spatially aligned E-Attn branch, which strengthens image-text alignment by refining CLIP’s attention, and the D-Attn branch, which integrates knowledge from a strong vision encoder such as DINOv2 (despite its lack of inherent zero-shot compatibility) for finer-grained refinement.
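A rough sketch of the two branches is given below. E-Attn is shown as value-value self-attention, a common recipe in dense CLIP adaptations, and using DINOv2 patch affinities for D-Attn is an assumption about the form of the guidance, not Crane’s exact refinement.

```python
import torch
import torch.nn.functional as F

def e_attn(v: torch.Tensor, scale: float) -> torch.Tensor:
    """E-Attn sketch: value-value self-attention over CLIP patch values v: (B, N, d).
    Attending each token to value-similar tokens keeps features spatially aligned."""
    attn = torch.softmax(v @ v.transpose(-2, -1) * scale, dim=-1)  # (B, N, N)
    return attn @ v

def d_attn(v: torch.Tensor, dino_feats: torch.Tensor, scale: float) -> torch.Tensor:
    """D-Attn sketch: reweight CLIP values v with patch affinities from a stronger
    dense encoder, e.g. DINOv2 features dino_feats: (B, N, d2), for finer detail."""
    f = F.normalize(dino_feats, dim=-1)
    attn = torch.softmax(f @ f.transpose(-2, -1) * scale, dim=-1)  # (B, N, N)
    return attn @ v  # DINOv2-guided aggregation of CLIP patch features
```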

Finally, we introduce a score-based pooling mechanism that fuses anomalous dense features into the global image embedding. The resulting anomaly-aware global embedding enables robust pixel- and image-level zero-shot generalization to previously unseen domains. The figure below gives an overview of our approach; a sketch of the pooling step follows it.

Overview of the Crane framework. The model integrates Context-Guided Prompt Learning and Attention Refinement into CLIP, improving both localization and detection for zero-shot anomaly detection.
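To illustrate the score-based pooling step described above, here is a minimal sketch. The fusion weight `alpha` and the softmax temperature `tau` are illustrative assumptions, not Crane’s actual hyperparameters.

```python
# A minimal sketch of score-based pooling: patch features are weighted by their
# pixel-level anomaly scores and fused into the global image embedding.
import torch
import torch.nn.functional as F

def score_based_pooling(patch_emb, anomaly_scores, cls_emb, alpha=0.5, tau=0.07):
    # patch_emb: (B, N, d) dense features; anomaly_scores: (B, N); cls_emb: (B, d)
    w = torch.softmax(anomaly_scores / tau, dim=-1)    # emphasize anomalous patches
    pooled = (w.unsqueeze(-1) * patch_emb).sum(dim=1)  # (B, d) score-weighted pooling
    fused = (1 - alpha) * cls_emb + alpha * pooled     # anomaly-aware global embedding
    return F.normalize(fused, dim=-1)
```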

Quantitative Results

Qualitative Results

BibTeX