MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

¹MBZUAI, ²CVLAB, EPFL, ³University of Michigan. *Equal contribution
Distinguished Paper Award at the 6th CVPR AdvML@CV

Abstract

Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically designed to bypass existing defenses. To address this vulnerability, we propose MirrorCheck, a robust and model-agnostic detection framework that operates effectively in both unimodal and multimodal settings.

MirrorCheck leverages Text-to-Image (T2I) models to regenerate visual content from captions produced by the target model and assesses semantic consistency by comparing feature-space embeddings between the original and synthesized images. To enhance robustness against adaptive attacks, MirrorCheck introduces a stochastic defense strategy that randomly selects T2I generators and image encoders from a diverse model zoo.

Additionally, we incorporate a novel One-Time-Use (OTU) perturbation applied to the selected encoder embeddings, regulated by a scaling factor, which decreases the effectiveness of adaptive attacks. Extensive experiments across multiple threat scenarios demonstrate that MirrorCheck consistently outperforms baseline methods, and maintains its utility even under strong adaptive adversarial conditions.

MirrorCheck Pipeline

MirrorCheck Pipeline: At inference time, our framework detects adversarial attacks through a simple three-step process: (1) Extract: An input image is passed through the victim model to generate an output (caption / description / answer / classification). (2) Generate: The output is fed to a randomly selected T2I diffusion model, returning a generated image. (3) Compare: We extract and compare feature embeddings from both original and generated images using randomly selected and uniquely perturbed image encoders. Significant embedding discrepancies indicate potential adversarial attacks.
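The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function `mirror_check`, the `threshold` parameter, the use of cosine similarity, and the toy victim, T2I, and encoder stand-ins are all hypothetical. In practice they would be replaced by a real VLM, real diffusion models, and real image encoders.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mirror_check(image, victim, t2i_zoo, encoder_zoo, threshold, rng):
    """Return (similarity, is_flagged) for one input image.

    victim:      callable image -> text (caption, answer, label, ...)
    t2i_zoo:     list of callables text -> image
    encoder_zoo: list of callables image -> 1-D embedding
    threshold:   hypothetical cut-off; lower similarity is flagged
    """
    # (1) Extract: query the victim model.
    text = victim(image)
    # (2) Generate: regenerate an image with a randomly selected T2I model.
    t2i = t2i_zoo[rng.integers(len(t2i_zoo))]
    regenerated = t2i(text)
    # (3) Compare: embed both images with a randomly selected encoder.
    encoder = encoder_zoo[rng.integers(len(encoder_zoo))]
    sim = cosine_similarity(encoder(image), encoder(regenerated))
    return sim, sim < threshold

# Toy stand-ins (assumptions): "images" are 3-D vectors, and the third
# entry mimics a small adversarial perturbation that flips the caption.
PROTOTYPES = {"cat": np.array([1.0, 0.0, 0.0]),
              "dog": np.array([0.0, 1.0, 0.0])}
victim = lambda img: "dog" if img[2] > 0.05 else "cat"
t2i_zoo = [lambda text: PROTOTYPES[text],
           lambda text: 0.5 * PROTOTYPES[text]]
encoder_zoo = [lambda img: np.asarray(img, dtype=float),
               lambda img: 2.0 * np.asarray(img, dtype=float)]

rng = np.random.default_rng(0)
clean_image = np.array([1.0, 0.0, 0.0])
attacked_image = np.array([1.0, 0.0, 0.1])
clean_sim, clean_flag = mirror_check(clean_image, victim, t2i_zoo, encoder_zoo, 0.5, rng)
adv_sim, adv_flag = mirror_check(attacked_image, victim, t2i_zoo, encoder_zoo, 0.5, rng)
print(clean_sim, clean_flag)
print(adv_sim, adv_flag)
```

In this toy setup the clean input regenerates to something that matches the original and is not flagged, while the attacked input yields a wrong caption, a mismatched regeneration, and a low similarity that falls below the cut-off.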

Quantitative Results

MirrorCheck consistently outperforms both traditional unimodal and recent multimodal baseline defenses across a wide variety of VLM architectures and attack types. Below are the comprehensive evaluation tables.

Table 1: Adversarial detection performance across VLM attacks

Victim Model  Attack Setting | Unimodal Approaches          | Multimodal Approaches                       | Ours
                             | FS    MagNet PuVAE DiffPure  | CIDER Naive CLIP  JailGuard SmoothVLM DPS   | MC    Stoch-MC
UniDiffuser   AttackVLM-T    | 0.56  0.74   0.51  0.80      | 0.84  0.68  0.59  0.81      0.82      0.83  | 0.96  0.95
UniDiffuser   AttackVLM-Q    | 0.65  0.85   0.70  0.81      | 0.80  0.65  0.57  0.83      0.83      0.85  | 0.98  0.98
BLIP          AttackVLM-T    | 0.52  0.60   0.50  0.71      | 0.81  0.66  0.61  0.79      0.77      0.81  | 0.90  0.93
BLIP          AttackVLM-Q    | 0.57  0.65   0.80  0.76      | 0.85  0.64  0.55  0.84      0.81      0.84  | 0.89  0.97
BLIP-2        AttackVLM-T    | 0.61  0.73   0.52  0.80      | 0.84  0.70  0.62  0.82      0.80      0.86  | 0.93  0.94
BLIP-2        AttackVLM-Q    | 0.61  0.85   0.72  0.83      | 0.77  0.67  0.58  0.80      0.78      0.83  | 0.92  0.99
BLIP-2        Attack-Bard    | -     -      -     0.79      | 0.87  0.65  0.58  0.89      0.87      0.95  | 0.98  0.95
Img2Prompt    AttackVLM-T    | 0.51  0.56   0.50  0.67      | 0.83  0.61  0.56  0.83      0.83      0.86  | 0.79  0.90
Img2Prompt    AttackVLM-Q    | -     0.65   0.78  0.69      | 0.79  0.60  0.55  0.81      0.74      0.82  | 0.85  0.92
LLaVA         Attack-MMFM    | -     -      -     0.67      | 0.83  0.62  0.52  0.85      0.85      0.85  | 0.82  0.85
OpenFlamingo  Attack-MMFM    | -     -      -     0.65      | 0.84  0.60  0.51  0.87      0.84      0.86  | 0.81  0.81
MiniGPT-4     AttackVLM-T    | 0.54  0.51   0.53  0.62      | 0.85  0.57  0.51  0.85      0.80      0.85  | 0.66  0.67

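At least one MirrorCheck variant achieves the best detection rate on UniDiffuser, BLIP, BLIP-2, and Img2Prompt, with rates as high as 0.99. On LLaVA it ties the strongest multimodal baselines (0.85), while on OpenFlamingo and MiniGPT-4 multimodal baselines score higher.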

Table 2: Similarity scores between original and regenerated images using Stochastic MirrorCheck

Victim Model Task Attack Setting | 1 Encoder           | 3 Encoders          | 5 Encoders          | 7 Encoders          | 10 Encoders
                                 | 5e-6   5e-4   1e-3  | 5e-6   5e-4   1e-3  | 5e-6   5e-4   1e-3  | 5e-6   5e-4   1e-3  | 5e-6   5e-4   1e-3
UniDiffuser  IC   Clean          | 0.721  0.624  0.740 | 0.651  0.701  0.685 | 0.694  0.647  0.715 | 0.648  0.693  0.670 | 0.665  0.670  0.692
UniDiffuser  IC   AttackVLM-T    | 0.502  0.341  0.568 | 0.370  0.501  0.477 | 0.494  0.399  0.549 | 0.402  0.472  0.458 | 0.424  0.441  0.494
UniDiffuser  IC   AttackVLM-Q    | 0.498  0.294  0.542 | 0.336  0.448  0.395 | 0.424  0.332  0.446 | 0.340  0.411  0.375 | 0.366  0.372  0.408
BLIP         IC   Clean          | 0.707  0.610  0.730 | 0.633  0.686  0.672 | 0.676  0.628  0.700 | 0.629  0.675  0.652 | 0.647  0.653  0.676
BLIP         IC   AttackVLM-T    | 0.481  0.323  0.547 | 0.349  0.450  0.454 | 0.419  0.362  0.480 | 0.353  0.423  0.412 | 0.375  0.391  0.448
BLIP         IC   AttackVLM-Q    | 0.508  0.299  0.555 | 0.350  0.460  0.460 | 0.436  0.346  0.407 | 0.354  0.424  0.389 | 0.379  0.384  0.421
BLIP-2       IC   Clean          | 0.729  0.636  0.744 | 0.655  0.705  0.687 | 0.697  0.664  0.718 | 0.651  0.695  0.684 | 0.668  0.675  0.695
BLIP-2       IC   AttackVLM-T    | 0.504  0.345  0.563 | 0.376  0.473  0.475 | 0.443  0.381  0.503 | 0.377  0.447  0.434 | 0.399  0.413  0.467
BLIP-2       ID   Attack-Bard    | 0.484  0.422  0.536 | 0.379  0.444  0.498 | 0.399  0.416  0.468 | 0.377  0.451  0.427 | 0.399  0.420  0.461
Img2Prompt   VQA  Clean          | 0.675  0.563  0.705 | 0.589  0.652  0.637 | 0.637  0.585  0.677 | 0.586  0.638  0.616 | 0.605  0.613  0.642
Img2Prompt   VQA  AttackVLM-T    | 0.482  0.317  0.547 | 0.345  0.449  0.455 | 0.416  0.359  0.479 | 0.349  0.422  0.412 | 0.372  0.388  0.477
Img2Prompt   VQA  AttackVLM-Q    | 0.517  0.309  0.561 | 0.361  0.470  0.467 | 0.447  0.356  0.414 | 0.365  0.431  0.396 | 0.390  0.392  0.427
LLaVA        VQA  Clean          | 0.680  0.823  0.755 | 0.733  0.714  0.741 | 0.728  0.810  0.748 | 0.725  0.706  0.733 | 0.712  0.798  0.742
LLaVA        VQA  Attack-MMFM    | 0.539  0.724  0.626 | 0.599  0.596  0.617 | 0.618  0.710  0.641 | 0.608  0.602  0.625 | 0.595  0.695  0.632
OpenFlamingo VQA  Clean          | 0.690  0.817  0.756 | 0.728  0.723  0.743 | 0.734  0.804  0.749 | 0.720  0.715  0.735 | 0.708  0.791  0.742
OpenFlamingo VQA  Attack-MMFM    | 0.535  0.714  0.618 | 0.584  0.609  0.612 | 0.609  0.701  0.635 | 0.596  0.614  0.620 | 0.582  0.688  0.625
MiniGPT-4    VQA  Clean          | 0.651  0.536  0.684 | 0.561  0.628  0.618 | 0.612  0.560  0.646 | 0.559  0.613  0.593 | 0.578  0.587  0.620
MiniGPT-4    VQA  AttackVLM-T    | 0.568  0.457  0.620 | 0.472  0.548  0.551 | 0.523  0.481  0.576 | 0.469  0.532  0.519 | 0.489  0.504  0.549
DenseNet     CL   Clean          | 0.543  0.740  0.705 | 0.671  0.674  0.667 | 0.692  0.658  0.695 | 0.665  0.688  0.671 | 0.652  0.660  0.679
DenseNet     CL   FGSM           | 0.444  0.666  0.572 | 0.537  0.548  0.553 | 0.579  0.521  0.584 | 0.535  0.572  0.558 | 0.518  0.541  0.567
DenseNet     CL   BIM            | 0.507  0.713  0.593 | 0.554  0.532  0.579 | 0.601  0.542  0.598 | 0.548  0.586  0.571 | 0.531  0.553  0.581
DenseNet     CL   PGD            | 0.495  0.705  0.585 | 0.546  0.524  0.571 | 0.593  0.534  0.590 | 0.540  0.578  0.563 | 0.523  0.545  0.573
DenseNet     CL   DeepFool       | 0.475  0.690  0.565 | 0.525  0.510  0.555 | 0.575  0.515  0.570 | 0.520  0.560  0.545 | 0.505  0.525  0.555
DenseNet     CL   C&W            | 0.460  0.680  0.555 | 0.515  0.500  0.545 | 0.565  0.505  0.560 | 0.510  0.550  0.535 | 0.495  0.515  0.545
MobileNet    CL   Clean          | 0.668  0.790  0.729 | 0.704  0.705  0.719 | 0.745  0.698  0.738 | 0.702  0.726  0.712 | 0.688  0.695  0.721
MobileNet    CL   FGSM           | 0.520  0.712  0.606 | 0.612  0.585  0.607 | 0.635  0.598  0.629 | 0.605  0.618  0.610 | 0.585  0.592  0.615
MobileNet    CL   BIM            | 0.503  0.693  0.581 | 0.565  0.538  0.576 | 0.605  0.572  0.598 | 0.575  0.590  0.582 | 0.558  0.565  0.585
MobileNet    CL   PGD            | 0.495  0.685  0.573 | 0.557  0.530  0.568 | 0.597  0.564  0.590 | 0.567  0.582  0.574 | 0.550  0.557  0.577
MobileNet    CL   DeepFool       | 0.475  0.670  0.555 | 0.540  0.515  0.552 | 0.580  0.548  0.573 | 0.550  0.565  0.557 | 0.535  0.542  0.562
MobileNet    CL   C&W            | 0.465  0.660  0.545 | 0.530  0.505  0.542 | 0.570  0.538  0.563 | 0.540  0.555  0.547 | 0.525  0.532  0.552

Results shown for random CLIP encoder selection with One-Time-Use perturbations across different ensemble sizes and noise scales. Clean images consistently achieve high similarity scores, while adversarial examples show degraded similarity.
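To make the roles of ensemble size and noise scale concrete, here is a minimal sketch of how a stochastic similarity score could be computed. Several details are assumptions rather than facts from this page: the One-Time-Use perturbation is modeled as fresh Gaussian noise added to each embedding and multiplied by the scaling factor, the per-encoder cosine similarities are averaged, and the encoders are random linear maps standing in for real CLIP encoders. The names `otu_perturb`, `stochastic_similarity`, and `make_encoder` are hypothetical.

```python
import numpy as np

def otu_perturb(embedding, scale, rng):
    """Add fresh Gaussian noise, multiplied by `scale`, to an embedding.

    Assumption: the exact form of the One-Time-Use perturbation is not
    given on this page; Gaussian noise is used here for illustration.
    """
    embedding = np.asarray(embedding, dtype=float)
    return embedding + scale * rng.standard_normal(embedding.shape)

def stochastic_similarity(original, regenerated, encoder_zoo, n_encoders, scale, rng):
    """Mean cosine similarity over `n_encoders` randomly drawn encoders."""
    # Draw distinct encoders from the zoo for this single query.
    chosen = rng.choice(len(encoder_zoo), size=n_encoders, replace=False)
    sims = []
    for idx in chosen:
        encoder = encoder_zoo[idx]
        a = otu_perturb(encoder(original), scale, rng)
        b = otu_perturb(encoder(regenerated), scale, rng)
        sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Assumption: scores from the selected encoders are simply averaged.
    return float(np.mean(sims))

def make_encoder(seed, in_dim=4, out_dim=8):
    """Hypothetical stand-in for an image encoder: a fixed random linear map."""
    weights = np.random.default_rng(seed).standard_normal((out_dim, in_dim))
    return lambda x: weights @ np.asarray(x, dtype=float)

encoder_zoo = [make_encoder(seed) for seed in range(10)]
rng = np.random.default_rng(0)
image = np.array([1.0, 0.0, 0.0, 0.0])
faithful = np.array([1.0, 0.0, 0.0, 0.0])   # regeneration that matches the input
unrelated = np.array([0.0, 1.0, 0.0, 0.0])  # regeneration from a wrong caption
sim_same = stochastic_similarity(image, faithful, encoder_zoo, 3, 5e-4, rng)
sim_diff = stochastic_similarity(image, unrelated, encoder_zoo, 3, 5e-4, rng)
print(sim_same, sim_diff)
```

Because the encoders and the noise are redrawn on every call, an attacker cannot know in advance which embedding function will score a given query.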

Table 3: Detection accuracy of Stochastic MirrorCheck across diverse victim models and attack types

Victim Model Setting        | 1 Encoder           | 3 Encoders          | 5 Encoders          | 7 Encoders          | 10 Encoders
                            | 5e-6   5e-4   1e-3  | 5e-6   5e-4   1e-3  | 5e-6   5e-4   1e-3  | 5e-6   5e-4   1e-3  | 5e-6   5e-4   1e-3
UniDiffuser  AttackVLM-T    | 0.913  0.895  0.903 | 0.925  0.943  0.858 | 0.933  0.890  0.910 | 0.908  0.912  0.915 | 0.920  0.918  0.917
UniDiffuser  AttackVLM-Q    | 0.955  0.902  0.968 | 0.952  0.968  0.937 | 0.973  0.975  0.980 | 0.977  0.978  0.980 | 0.975  0.973  0.992
BLIP         AttackVLM-T    | 0.905  0.908  0.908 | 0.918  0.932  0.903 | 0.922  0.915  0.925 | 0.930  0.930  0.918 | 0.930  0.923  0.925
BLIP         AttackVLM-Q    | 0.928  0.887  0.943 | 0.913  0.937  0.915 | 0.958  0.943  0.963 | 0.962  0.947  0.955 | 0.965  0.945  0.955
BLIP-2       AttackVLM-T    | 0.912  0.892  0.912 | 0.920  0.932  0.907 | 0.923  0.923  0.928 | 0.927  0.935  0.932 | 0.930  0.927  0.938
BLIP-2       AttackVLM-Q    | 0.945  0.903  0.967 | 0.948  0.957  0.932 | 0.975  0.970  0.985 | 0.982  0.978  0.977 | 0.978  0.970  0.990
BLIP-2       Attack-Bard    | 0.883  0.790  0.827 | 0.890  0.873  0.900 | 0.903  0.952  0.903 | 0.942  0.913  0.902 | 0.927  0.920  0.938
Img2Prompt   AttackVLM-T    | 0.848  0.840  0.843 | 0.878  0.873  0.853 | 0.882  0.860  0.867 | 0.878  0.875  0.883 | 0.895  0.878  0.882
Img2Prompt   AttackVLM-Q    | 0.880  0.806  0.907 | 0.861  0.863  0.855 | 0.886  0.857  0.905 | 0.870  0.877  0.905 | 0.887  0.882  0.920
LLaVA        Attack-MMFM    | 0.788  0.728  0.733 | 0.812  0.762  0.738 | 0.818  0.812  0.783 | 0.832  0.815  0.812 | 0.845  0.833  0.820
OpenFlamingo Attack-MMFM    | 0.800  0.740  0.765 | 0.777  0.750  0.750 | 0.807  0.785  0.767 | 0.800  0.780  0.797 | 0.797  0.793  0.785
MiniGPT-4    AttackVLM-T    | 0.642  0.623  0.632 | 0.660  0.660  0.642 | 0.655  0.665  0.667 | 0.655  0.665  0.657 | 0.655  0.665  0.655
DenseNet     FGSM           | 0.850  0.840  0.800 | 0.845  0.835  0.795 | 0.852  0.842  0.802 | 0.847  0.837  0.798 | 0.849  0.839  0.801
DenseNet     BIM            | 0.860  0.830  0.830 | 0.855  0.825  0.825 | 0.862  0.832  0.832 | 0.857  0.827  0.828 | 0.859  0.829  0.831
DenseNet     PGD            | 0.845  0.825  0.815 | 0.840  0.820  0.810 | 0.847  0.827  0.817 | 0.842  0.822  0.812 | 0.844  0.824  0.814
DenseNet     DeepFool       | 0.835  0.815  0.795 | 0.830  0.810  0.790 | 0.837  0.817  0.797 | 0.832  0.812  0.792 | 0.834  0.814  0.794
DenseNet     C&W            | 0.825  0.805  0.785 | 0.820  0.800  0.780 | 0.827  0.807  0.787 | 0.822  0.802  0.782 | 0.824  0.804  0.784
MobileNet    FGSM (0.3)     | 0.780  0.770  0.750 | 0.775  0.765  0.745 | 0.782  0.772  0.752 | 0.777  0.767  0.747 | 0.779  0.769  0.749
MobileNet    FGSM (0.1)     | 0.850  0.850  0.790 | 0.845  0.845  0.785 | 0.852  0.852  0.792 | 0.847  0.847  0.787 | 0.849  0.849  0.789
MobileNet    BIM            | 0.790  0.790  0.780 | 0.785  0.785  0.775 | 0.792  0.792  0.782 | 0.787  0.787  0.777 | 0.789  0.789  0.779
MobileNet    PGD            | 0.775  0.775  0.765 | 0.770  0.770  0.760 | 0.777  0.777  0.767 | 0.772  0.772  0.762 | 0.774  0.774  0.764
MobileNet    DeepFool       | 0.760  0.750  0.740 | 0.755  0.745  0.735 | 0.762  0.752  0.742 | 0.757  0.747  0.737 | 0.759  0.749  0.739
MobileNet    C&W            | 0.745  0.735  0.725 | 0.740  0.730  0.720 | 0.747  0.737  0.727 | 0.742  0.732  0.722 | 0.744  0.734  0.724

Stochastic MirrorCheck achieves detection rates between 0.62 and 0.99 across VLM attacks (AttackVLM, Attack-Bard, Attack-MMFM) and classification attacks (FGSM, BIM, PGD, DeepFool, C&W) under randomized encoder selection and One-Time-Use perturbations. Rates are highest on UniDiffuser, BLIP, and BLIP-2 (mostly above 0.90) and lowest on MiniGPT-4 (0.62 to 0.67).
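A detection rate like those above can be derived from similarity scores by thresholding, as in the hedged sketch below. This page does not state how the decision threshold is chosen, so the exhaustive search in `best_threshold` is an assumption. Both function names and the toy scores are hypothetical and are not taken from the tables.

```python
import numpy as np

def detection_accuracy(clean_scores, adv_scores, threshold):
    """Fraction of inputs classified correctly at a given threshold.

    Inputs with similarity >= threshold are treated as clean,
    inputs with similarity < threshold as attacked.
    """
    clean = np.asarray(clean_scores, dtype=float)
    adv = np.asarray(adv_scores, dtype=float)
    correct = np.sum(clean >= threshold) + np.sum(adv < threshold)
    return float(correct) / (clean.size + adv.size)

def best_threshold(clean_scores, adv_scores):
    """Pick the observed score that maximizes detection accuracy (assumed procedure)."""
    candidates = np.unique(np.concatenate([clean_scores, adv_scores]))
    accuracies = [detection_accuracy(clean_scores, adv_scores, t) for t in candidates]
    best = int(np.argmax(accuracies))
    return float(candidates[best]), accuracies[best]

# Toy similarity scores, invented for illustration only.
clean_scores = [0.72, 0.70, 0.65, 0.55]
adv_scores = [0.50, 0.40, 0.60, 0.35]
tau, acc = best_threshold(clean_scores, adv_scores)
print(tau, acc)
```

With these toy scores, one adversarial example (0.60) overlaps the clean range, so no threshold separates the two sets perfectly. The same overlap between clean and adversarial similarity distributions would explain lower detection rates on harder victim models.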

Qualitative Results

Visual comparison showing the MirrorCheck pipeline in action using BLIP as the victim model and Stable Diffusion as the T2I generator. For clean images, the regenerated image shares strong semantic similarities with the original. Under an adversarial attack, the forced incorrect caption leads to a completely disconnected regeneration, triggering the detection.

[Image gallery: ten pairs of clean and adversarial examples (Examples 1 to 7, Flower / Giraffes, Pancake / Skateboarder, Leopard / Adversarial Example 10). Each clean example is labeled "High Similarity → Clean" and each adversarial example "Low Similarity → Attacked".]

BibTeX

@inproceedings{fares2026mirrorcheck,
  title={MirrorCheck: Efficient Adversarial Defense for Vision-Language Models},
  author={Samar Fares and Klea Ziu and Toluwani Aremu and Nikita Durasov and Martin Tak{\'a}{\v{c}} and Pascal Fua and Karthik Nandakumar and Ivan Laptev},
  booktitle={The 6th Workshop of Adversarial Machine Learning on Computer Vision: Safety of Vision-Language Agents},
  year={2026},
  url={https://arxiv.org/abs/2406.09250}
}