September 2025

Yu-Cheng Chou$^1$, Hardy Chen$^2$, Haoqin Tu$^2$, Hui Liu$^3$, Xianfeng Tang$^3$, Zeyu Zheng$^4$, Yuyin Zhou$^2$, Alan Yuille$^{1}$, Cihang Xie$^{2*}$, Junfei Xiao$^{1*}$

$^1$Johns Hopkins University, $^2$UC Santa Cruz, $^3$Amazon, $^4$UC Berkeley

*equal advising

Understanding how large language models reason after visual alignment is crucial for building reliable and interpretable multimodal systems.

In this blog, we look at two popular open LLMs — GPT-OSS-20B-A4B [1] and Qwen3-30B-A3B [2] — after they are aligned with vision using InternVL3.5 [3]. The resulting MLLMs, InternVL3_5-GPT-OSS-20B-A4B-Preview [3] and InternVL3_5-30B-A3B [3], are evaluated on MMMU Val [4] to compare how their reasoning changes once vision is added.

We run both models with identical evaluation settings, clean their outputs to remove repetition loops and malformed answers, and then compare accuracy and the distribution of reasoning lengths. To go one step further, we also use Gemini 2.5 Pro [5] and GPT-5 [6] to run a large-scale analysis of error types, response style, formatting habits, and content organization.
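
To make the post-processing step concrete, here is a minimal Python sketch of how the output cleaning and summary statistics could be computed. The file name and the `response`/`answer` fields are hypothetical placeholders, and the repetition-loop thresholds are illustrative, not the exact settings of our pipeline.

```python
# Minimal sketch of the output cleaning described above (hypothetical
# field names and thresholds; the real evaluation harness may differ).
import json
import re
from collections import Counter

def has_repetition_loop(text: str, ngram: int = 20, max_repeats: int = 3) -> bool:
    """Flag outputs where the same n-gram window recurs many times,
    a common failure mode when decoding degenerates into a loop."""
    tokens = text.split()
    if len(tokens) < ngram:
        return False
    counts = Counter(tuple(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1))
    return counts.most_common(1)[0][1] >= max_repeats

def extract_choice(text: str) -> str | None:
    """Heuristically pull the last standalone multiple-choice letter (A-J);
    returns None for malformed answers, which are excluded from accuracy."""
    match = re.search(r"\b([A-J])\b(?!.*\b[A-J]\b)", text.strip(), flags=re.S)
    return match.group(1) if match else None

def summarize(records: list[dict]) -> dict:
    """Accuracy over cleaned responses plus reasoning-length statistics."""
    clean = [r for r in records if not has_repetition_loop(r["response"])]
    preds = [(extract_choice(r["response"]), r["answer"]) for r in clean]
    valid = [(p, a) for p, a in preds if p is not None]
    lengths = [len(r["response"].split()) for r in clean]
    return {
        "total": len(records),
        "kept": len(clean),
        "accuracy": sum(p == a for p, a in valid) / max(len(valid), 1),
        "mean_reasoning_len": sum(lengths) / max(len(lengths), 1),
    }

if __name__ == "__main__":
    # Hypothetical dump of per-question model outputs on MMMU Val.
    with open("mmmu_val_outputs.jsonl") as f:
        records = [json.loads(line) for line in f]
    print(summarize(records))
```

The same summary can be run on both models' dumps, so accuracy and reasoning-length distributions are computed under identical cleaning rules.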

Here’s what we found:

Furthermore, to improve visual–reasoning alignment, we release a new reasoning dataset:

<aside>

[VLAA-Thinking-SFT-v0.1-qwen25vl_32b-26K.jsonl](https://huggingface.co/datasets/UCSC-VLAA/VLAA-Thinking/blob/main/VLAA-Thinking-SFT-v0.1-qwen25vl_32b-26K.jsonl)

</aside>

which can be used during GPT-OSS visual alignment to enable fine-tuning for long visual reasoning. We hope it helps the community build more reliable and interpretable multimodal OSS models.
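
For readers who want to inspect the data before wiring it into an SFT pipeline, here is a hedged sketch that downloads the released JSONL file from the Hub and prints the schema of a few records; the per-record field names are not assumed here and should be checked against the printed keys.

```python
# Sketch: fetch the released SFT file and peek at its schema.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="UCSC-VLAA/VLAA-Thinking",
    filename="VLAA-Thinking-SFT-v0.1-qwen25vl_32b-26K.jsonl",
    repo_type="dataset",
)

with open(path) as f:
    for i, line in enumerate(f):
        example = json.loads(line)
        print(example.keys())  # inspect field names before building an SFT loader
        if i >= 2:
            break
```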


Table of Contents