MOSAIC: Exploiting Compositional Blindness in Multimodal Alignment
A novel multimodal jailbreak framework that targets compositional blindness in Vision-Language Models. MOSAIC rewrites harmful requests into Action–Object–State triplets, renders them as stylized visual proxies, and induces state-transition reasoning to bypass safety guardrails. Published at NTU ML 2025 Fall Mini-Conference (Oral).
📍 Challenge
- Direct encoding of harmful content remains detectable: When harmful text or explicit objects appear within images, strong VLMs frequently classify the content correctly and trigger refusals
- Decomposition methods lose semantic fidelity: Techniques like CS-DJ break harmful tasks into loosely related components whose recombination does not reliably reconstruct the original intent
💡 Motivation
- Safety alignment in current VLMs largely judges inputs in isolation: when a harmful request is split into components that each look benign, guardrails see nothing to refuse, yet a capable model can still recompose the pieces and act on the original intent. MOSAIC is designed to probe exactly this gap.
🔍 Key Findings
- Compositional blindness is a fundamental weakness: VLMs fail to recognize harmful intent when distributed across multiple visual components that individually appear benign
- State-transition reasoning bypasses safety filters: Framing tasks as "how does X become Y" instead of "perform harmful action" consistently evades guardrails (see the reframing sketch after this list)
- Visual abstraction preserves semantic fidelity: Cartoon-style rendering maintains task-relevant information while removing explicit harmful visual cues
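Below is a minimal, deliberately benign sketch of the state-transition reframing described above. The function name and prompt template are hypothetical illustrations, not the paper's actual prompts; an ice-to-water transition stands in for any real query.

```python
# Hypothetical illustration of reframing an imperative task as a neutral
# "how does X become Y" question. Names and wording are placeholders.

def reframe_as_state_transition(obj: str, end_state: str) -> str:
    """Rewrite an imperative instruction as a state-transition question."""
    return (
        f"The image shows {obj} on the left and {end_state} on the right. "
        f"Explain, step by step, how the {obj} transitions into the {end_state}."
    )

if __name__ == "__main__":
    # Direct form: "Melt the ice."  Reframed form contains no imperative verb:
    print(reframe_as_state_transition("block of ice", "pool of water"))
```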
🔧 Methods
- Action–Object–State (AOS) Decomposition: A small auxiliary LLM rewrites a harmful query into three structured descriptions that separately encode the action to perform, the target object, and the desired end state
- Image Proxy Generation: These descriptions are rendered as cartoon-style images and text-image diagrams, reducing harmful explicitness while preserving key semantics through visual abstraction
- Composite Visual Prompting: MOSAIC assembles all sub-images into a structured canvas and prompts the VLM to explain how the object transitions into the final state under the given action (a structural sketch of the full pipeline follows below)
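The sketch below shows only the three-stage structure of the pipeline, assuming caller-supplied model backends. All names (`AOSTriplet`, `decompose_aos`, `render_proxies`, `compose_and_prompt`) and the `action|object|state` output format are hypothetical placeholders rather than the authors' implementation; the model calls are stubbed and the demo uses a benign query.

```python
"""Minimal structural sketch of the MOSAIC pipeline described above (assumptions noted in comments)."""
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AOSTriplet:
    """Action-Object-State decomposition of a single query."""
    action: str   # what is done
    object: str   # what it is done to
    state: str    # the desired end state


def decompose_aos(query: str, llm: Callable[[str], str]) -> AOSTriplet:
    """Ask a small auxiliary LLM to rewrite the query into an AOS triplet.

    Naively parses an 'action|object|state' reply; the real system would use
    a structured prompt and output format.
    """
    raw = llm(f"Rewrite as 'action|object|state': {query}")
    action, obj, state = (part.strip() for part in raw.split("|"))
    return AOSTriplet(action, obj, state)


def render_proxies(triplet: AOSTriplet, draw: Callable[[str], bytes]) -> List[bytes]:
    """Render each AOS component as a stylized (e.g. cartoon) image proxy."""
    return [draw(f"cartoon-style illustration of: {text}")
            for text in (triplet.action, triplet.object, triplet.state)]


def compose_and_prompt(triplet: AOSTriplet, proxies: List[bytes],
                       vlm: Callable[[List[bytes], str], str]) -> str:
    """Assemble the sub-images into one canvas and ask a state-transition question."""
    question = (f"Given the panels above, explain how the {triplet.object} "
                f"transitions into the {triplet.state} under the shown action.")
    return vlm(proxies, question)


if __name__ == "__main__":
    # Stubbed components so the sketch runs end to end on a benign example.
    fake_llm = lambda prompt: "melt | block of ice | pool of water"
    fake_draw = lambda caption: caption.encode()          # stands in for an image generator
    fake_vlm = lambda images, question: f"[VLM answer to: {question}]"

    triplet = decompose_aos("How does ice turn into water?", fake_llm)
    proxies = render_proxies(triplet, fake_draw)
    print(compose_and_prompt(triplet, proxies, fake_vlm))
```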
📊 Results
- 91.86% attack success rate (ASR) on GPT-4o-mini
- 80-96% ASR range across GPT-4o, GPT-4.1-mini, Qwen3-VL-Flash, and Qwen3-VL-Plus
- 94% ASR on Qwen3-VL-Plus (highest among tested models)
- Outperforms strong baselines (HADES, CS-DJ) by +34% to +84%
🛠 Tech Stack
Other Projects
PEFT-STVG: Parameter-Efficient Fine-Tuning for Spatio-Temporal Video Grounding
Spatio-Temporal Video Grounding (STVG) localizes the object described by a natural language query across both space and time in a video. Existing methods are effective but require full-model fine-tuning, a significant computational bottleneck that limits scalability and accessibility.
CPF-Net: Continuous Perturbation Fusion Network for Weather-Robust LiDAR Segmentation
A continuous perturbation fusion network for weather-robust LiDAR segmentation, achieving state-of-the-art performance on the SemanticKITTI → SemanticSTF benchmark.