AI/ML · Computer Vision · Research · ⭐ Featured

PEFT-STVG: Parameter-Efficient Fine-Tuning for Spatio-Temporal Video Grounding

Spatio-Temporal Video Grounding (STVG) localizes the object described by a natural language query across the frames of a video, in both space and time. Existing methods are effective, but they require full model fine-tuning, creating a significant computational bottleneck that limits scalability and accessibility.

📍 Challenge

Parameter-efficient fine-tuning (PEFT) techniques had not been explored for STVG tasks. Full fine-tuning of large vision-language models demands substantial GPU memory and extended training times, making iterative experimentation costly.

🔍 Key Findings

Analysis across adapter placements revealed a progressive fusion pattern:
  • Early layers align motion cues between modalities
  • Later layers align semantic cues for high-level understanding
Fusing text and video from the earliest layers (early fusion) proved especially effective for complex queries involving multiple objects or temporal transitions, where early cross-modal interaction helps disambiguate targets across frames, as sketched below.
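To make the placement ablation concrete, here is a toy sketch of toggling adapters by encoder depth. The `adapter_mask` helper, the early/late split at the midpoint, and the 12-layer depth are illustrative assumptions, not the paper's actual configuration:

```python
from typing import List

def adapter_mask(num_layers: int, placement: str) -> List[bool]:
    """Choose which encoder layers receive a cross-modal adapter."""
    half = num_layers // 2
    if placement == "early":       # early fusion: align motion cues first
        return [i < half for i in range(num_layers)]
    if placement == "late":        # late fusion: align semantic cues only
        return [i >= half for i in range(num_layers)]
    return [True] * num_layers     # progressive fusion at every depth

print(adapter_mask(12, "early"))   # [True] * 6 + [False] * 6
```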

🔧 Methods

We developed a PEFT-based framework using cross-modal adapters inserted across encoder layers to progressively fuse video and text features. Rather than fine-tuning the entire model, lightweight adapter modules learn task-specific cross-modal alignments while keeping pre-trained weights frozen.
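A minimal PyTorch sketch of one such adapter, assuming a bottleneck design in which video tokens cross-attend to text tokens; the dimensions, the shared down-projection, and the `freeze_backbone` naming convention are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class CrossModalAdapter(nn.Module):
    """Lightweight bottleneck adapter: video tokens attend to text tokens,
    and a residual connection preserves the frozen backbone's features."""

    def __init__(self, d_model: int, bottleneck: int = 64, n_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)    # shared down-projection
        self.attn = nn.MultiheadAttention(bottleneck, n_heads, batch_first=True)
        self.up = nn.Linear(bottleneck, d_model)      # back to model width

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # video: (B, N_video, d_model); text: (B, N_text, d_model)
        q = self.down(self.norm(video))               # video tokens as queries
        kv = self.down(text)                          # text tokens as keys/values
        fused, _ = self.attn(q, kv, kv)               # cross-modal attention
        return video + self.up(fused)                 # adapter adds a small delta

def freeze_backbone(model: nn.Module) -> None:
    """Train only adapter parameters; keep pre-trained weights frozen."""
    for name, p in model.named_parameters():
        p.requires_grad = "adapter" in name
```

Because the adapter is residual and shape-preserving, e.g. `CrossModalAdapter(d_model=256)` maps a (2, 100, 256) video tensor and a (2, 16, 256) text tensor back to (2, 100, 256), it can sit after any encoder layer without altering the frozen backbone's interfaces.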

📸 Figures

Figure 1. Overall pipeline of PEFT-STVG.
Figure 2. Overall pipeline of the cross-modal adapter. The adapter uses a transformer-based architecture to capture cross-modal relationships between video and text.

📊 Results

Figure 3. Quantitative results on the HC-STVG-v2 benchmark.

🛠 Tech Stack

PyTorch · CUDA · Transformer · FlashAttention