Lightweight Grounding Model Combining a CLIP-based Encoder With an Upsampling Decoder
BY BANDI

TASKS


VizWiz-VQA-Grounding

Given a visual question (question-image pair), the task is to return the region in the image used to arrive at the answer to the visual question. Submissions are evaluated by the mean Intersection over Union (IoU) score across all test images; the team that achieves the highest mean IoU wins the challenge.


METHODS


In this section, we provide a detailed description of the three main components of the proposed model.

1) CLIP Encoder. We use the pre-trained CLIP ViT-L/14-336 model [3] to extract embeddings from both the image and the concatenated question-answer pair. The visual embedding is taken from the final transformer layer, excluding the CLS token, and reshaped into a spatial feature map of shape (B, D, H, W).

2) Cross-Attention Block. A single-layer multi-head cross-attention module aligns textual and visual information. Specifically, the image tokens (flattened spatial features) serve as queries, while the text tokens act as keys and values. This allows the model to attend dynamically to the image regions most relevant to the contextual information in the question.

3) Lightweight U-Net Decoder. The fused features from the cross-attention block are reshaped back into spatial form and passed through a lightweight U-Net-style decoder [4], which produces the final binary mask indicating the answer-grounding region.
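The fusion-and-decoding stage described above can be sketched in PyTorch as follows. This is a minimal illustration, not the team's actual implementation: the embedding width, patch-grid size (24x24 for ViT-L/14 at 336px input), head count, and the two-stage transposed-convolution decoder standing in for the U-Net are all assumptions.

```python
import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    """Sketch of cross-attention fusion plus a lightweight upsampling decoder.

    Assumes CLIP image/text embeddings are extracted elsewhere and already
    projected to a shared dimension `dim` (illustrative value, not the paper's).
    """
    def __init__(self, dim=768, heads=8):
        super().__init__()
        # single-layer multi-head cross-attention, batch-first tensors
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # stand-in for the U-Net-style decoder: upsample 24x24 -> 96x96
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=1),  # single-channel mask logits
        )

    def forward(self, img_tokens, txt_tokens, hw=(24, 24)):
        # image tokens are the queries; text tokens provide keys and values
        fused, _ = self.attn(img_tokens, txt_tokens, txt_tokens)
        B, N, D = fused.shape
        H, W = hw
        # reshape fused tokens back into a spatial map for the decoder
        x = fused.transpose(1, 2).reshape(B, D, H, W)
        return self.decoder(x)  # (B, 1, 4H, 4W) mask logits

head = GroundingHead()
img = torch.randn(2, 24 * 24, 768)  # flattened patch tokens
txt = torch.randn(2, 16, 768)       # question-answer tokens
mask_logits = head(img, txt)        # shape (2, 1, 96, 96)
```

A sigmoid plus threshold on the logits would yield the binary grounding mask; the real decoder would also carry skip connections, which this sketch omits for brevity.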


The cross-attention block is applied after visual features are extracted and flattened, but before being reshaped and decoded. This design allows the early integration of semantic cues into the spatial reasoning process.
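The flatten-attend-reshape ordering can be made concrete with a small NumPy example. All shapes here are toy values chosen for illustration, not the model's real dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy shapes: feature dim D, a 4x4 patch grid, 5 text tokens.
D, H, W, T = 8, 4, 4, 5
rng = np.random.default_rng(0)
spatial = rng.standard_normal((D, H, W))  # visual feature map (D, H, W)
text = rng.standard_normal((T, D))        # text token embeddings (T, D)

# 1) flatten the spatial features into image tokens: (H*W, D)
img_tokens = spatial.reshape(D, H * W).T

# 2) cross-attend: image tokens are queries, text tokens keys/values
scores = img_tokens @ text.T / np.sqrt(D)  # (H*W, T)
weights = softmax(scores, axis=-1)         # each row sums to 1
fused = weights @ text                     # (H*W, D)

# 3) reshape back to spatial form before decoding
fused_map = fused.T.reshape(D, H, W)
```

Because the attention happens on the flattened tokens, every spatial location sees the full question context before any spatial decoding takes place, which is the "early integration" referred to above.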

MEMBERS


JIHEE YOON

I'm a master's student in AI at the CVML Lab, Chung-Ang University. My research sits at the intersection of vision, language, and generative AI. In this project, I took the lead on designing the model structure, analyzing failure cases, and driving the overall direction of the study, especially in articulating its limitations and what we learned from them.


SEUNGA LEE

I am currently studying in the Department of Computer Science and Engineering at Chung-Ang University, with an interest in computer vision and backend engineering. In this project, I was responsible for data preprocessing and visualization, and I also participated in experiments to improve model performance.


HAESOL JEONG

I'm currently majoring in Computer Science at Chung-Ang University, with an interest in computer vision. I contributed to improving model performance by modifying the decoder architecture into a lightweight version, and I also served as an MLOps engineer, setting up an experiment version-control system and a structured framework for logging intermediate results.