Anomaly Detection (AD) is crucial for medical diagnostics and industrial defect detection. Traditional AD methods rely on normal training samples, but collecting such data is often impractical. Additionally, these methods struggle with generalization across domains.
Recent advancements like AnomalyCLIP and AdaCLIP leverage CLIP’s zero-shot generalization but face challenges in bridging the gap between image-level and pixel-level anomaly detection.
🚀 Context-guided Prompt learning and Attention Refinement for Zero-shot anomaly detection (Crane) improves upon these methods.
As shown in the radar plot below, our method achieves state-of-the-art results, improving image-level detection accuracy by +0.9% to +4.9% and pixel-level anomaly localization by +2.8% to +29.6% across 14 datasets in the industrial and medical domains, demonstrating its effectiveness at both anomaly detection and localization.
We propose a unified framework that utilizes CLIP as a zero-shot backbone \( M_{\theta} \) for classification and segmentation while adapting it for anomaly detection, bridging the domain gap between CLIP’s pretraining and specialized anomaly detection tasks. As shown in the figure below, we learn class-agnostic input prompts \( P \) and trainable tokens inserted into the text encoder \( \Phi_t \), guided by visual feedback from the vision encoder \( \Phi_v \).
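As a rough illustration, the sketch below shows one way class-agnostic context tokens can be conditioned on visual feedback before being passed to the text encoder. The module, dimensions, and names (`ContextGuidedPrompt`, `vis_proj`, `n_ctx`) are illustrative assumptions, not the exact implementation in this repository.

```python
import torch
import torch.nn as nn

class ContextGuidedPrompt(nn.Module):
    """Minimal sketch of class-agnostic prompt learning with visual feedback
    (names and shapes are illustrative, not the repository's actual API)."""

    def __init__(self, n_ctx=12, embed_dim=768, vis_dim=1024):
        super().__init__()
        # Learnable context tokens P, shared across object classes.
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # Projects a global visual feature into the prompt space so the
        # text prompts receive feedback from the vision encoder Phi_v.
        self.vis_proj = nn.Linear(vis_dim, embed_dim)

    def forward(self, image_feat, state_embed):
        """image_feat:  (B, vis_dim) global feature from Phi_v.
        state_embed: (L, embed_dim) embeddings of the state words
        (e.g. 'normal' / 'damaged')."""
        bias = self.vis_proj(image_feat)                   # (B, embed_dim)
        ctx = self.ctx.unsqueeze(0) + bias.unsqueeze(1)    # (B, n_ctx, embed_dim)
        state = state_embed.unsqueeze(0).expand(ctx.size(0), -1, -1)
        # Concatenated prompt fed to the (partially trainable) text encoder Phi_t.
        return torch.cat([ctx, state], dim=1)
```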
To handle dense prediction, we introduce the spatially aligned E-Attn branch, which enhances image-text alignment by refining CLIP’s attention, and the D-Attn branch, integrating knowledge from a strong vision encoder such as DINOv2—despite its lack of inherent zero-shot compatibility—for finer-grained refinements.
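The snippet below sketches the general idea of a D-Attn-style refinement, assuming patch-to-patch affinities from a stronger encoder such as DINOv2 are used to re-weight CLIP's dense tokens before image-text matching. The function and its parameters are hypothetical and should not be read as Crane's actual implementation.

```python
import torch
import torch.nn.functional as F

def refine_patch_tokens(clip_tokens, dino_tokens, temperature=0.07):
    """Hedged sketch: affinities computed from DINOv2 patch features
    aggregate semantically correlated CLIP patch tokens.

    clip_tokens: (B, N, C) patch embeddings from CLIP's vision encoder
    dino_tokens: (B, N, D) patch embeddings from DINOv2 at matching locations
    """
    d = F.normalize(dino_tokens, dim=-1)
    affinity = torch.einsum('bnd,bmd->bnm', d, d) / temperature  # (B, N, N)
    attn = affinity.softmax(dim=-1)
    # Each refined token is a weighted sum of correlated patches.
    return torch.einsum('bnm,bmc->bnc', attn, clip_tokens)
```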
Finally, we introduce a score-based pooling mechanism that fuses anomalous dense features into the global image embedding, yielding a more anomaly-aware global embedding that enables robust pixel- and image-level zero-shot generalization across previously unseen domains. The figure below shows an overview of our approach.
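A minimal sketch of the score-based pooling idea, assuming per-patch anomaly scores serve as pooling weights and the fused embedding is a simple weighted combination (the function name and `alpha` are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def score_based_pooling(global_feat, patch_feats, anomaly_scores, alpha=0.5):
    """Hedged sketch: dense features weighted by per-patch anomaly scores
    are fused into the global embedding, giving an anomaly-aware image
    representation for image-level scoring.

    global_feat:    (B, C)    global image embedding
    patch_feats:    (B, N, C) dense patch embeddings
    anomaly_scores: (B, N)    per-patch anomaly scores (e.g. softmax over
                    normal/abnormal text similarities)
    """
    weights = anomaly_scores.softmax(dim=-1).unsqueeze(-1)   # (B, N, 1)
    pooled = (weights * patch_feats).sum(dim=1)              # (B, C)
    fused = F.normalize(global_feat + alpha * pooled, dim=-1)
    return fused  # compared against text embeddings for image-level detection
```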
Comparison of ZSAD methods in the industrial domain. Unlike AnomalyCLIP and AdaCLIP, which fail to achieve consistent improvements, both versions of our model advance the state-of-the-art in image-level and pixel-level metrics.
Comparison of ZSAD methods in the medical domain. Both versions of our model achieve state-of-the-art image-level performance; our full model also sets a new benchmark at the pixel level, while the other variant remains competitive with AnomalyCLIP in segmentation.
Comparison of localization of ZSAD methods. Crane benefits from a stronger semantic correlation among patches, which simultaneously improves the true positive rate and reduces false positives, demonstrating its superior zero-shot anomaly detection performance.
Zero-shot localization of Crane. Anomaly maps of Crane over several categories in VisA and MVTec-AD. As shown, Crane cleanly outlines anomalous regions, even fine-grained ones.
Zero-shot localization of Crane. Anomaly maps of Crane over several categories in MPDD, DTD-Synthetic and DAGM. As shown, Crane cleanly outlines anomalous regions, even fine-grained ones.