September 2025

Yu-Cheng Chou$^1$, Hardy Chen$^2$, Haoqin Tu$^2$, Hui Liu$^3$, Xianfeng Tang$^3$, Zeyu Zheng$^4$, Yuyin Zhou$^2$, Alan Yuille$^{1}$, Cihang Xie$^{2*}$, Junfei Xiao$^{1*}$

$^1$Johns Hopkins University, $^2$UC Santa Cruz, $^3$Amazon, $^4$UC Berkeley

*equal advising

Understanding how large language models reason after visual alignment is crucial for building reliable and interpretable multimodal systems.

In this blog, we look at two popular open LLMs — GPT-OSS-20B-A4B [1] and Qwen3-30B-A3B [2] — after they are aligned with vision using InternVL3.5 [3]. The resulting MLLMs, InternVL3_5-GPT-OSS-20B-A4B-Preview [3] and InternVL3_5-30B-A3B [3], are evaluated on MMMU Val [4] to compare how their reasoning changes once vision is added.

We run both models with identical evaluation settings, clean their outputs to remove repetition loops and malformed answers, and then compare accuracy and the distribution of reasoning lengths. To go one step further, we also use Gemini 2.5 Pro [5] and GPT-5 [6] to run a large-scale analysis of error types, response style, formatting habits, and content organization.
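
To make the post-processing step concrete, here is a minimal Python sketch of how the output cleaning and summary statistics could be computed. The file name and the `response`/`answer` fields are hypothetical placeholders, and the repetition-loop thresholds are illustrative, not the exact settings of our pipeline.

```python
# Minimal sketch of the output cleaning described above (hypothetical
# field names and thresholds; the real evaluation harness may differ).
import json
import re
from collections import Counter

def has_repetition_loop(text: str, ngram: int = 20, max_repeats: int = 3) -> bool:
    """Flag outputs where the same n-gram window recurs many times,
    a common failure mode when decoding degenerates into a loop."""
    tokens = text.split()
    if len(tokens) < ngram:
        return False
    counts = Counter(tuple(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1))
    return counts.most_common(1)[0][1] >= max_repeats

def extract_choice(text: str) -> str | None:
    """Heuristically pull the last standalone multiple-choice letter (A-J);
    returns None for malformed answers, which are excluded from accuracy."""
    match = re.search(r"\b([A-J])\b(?!.*\b[A-J]\b)", text.strip(), flags=re.S)
    return match.group(1) if match else None

def summarize(records: list[dict]) -> dict:
    """Accuracy over cleaned responses plus reasoning-length statistics."""
    clean = [r for r in records if not has_repetition_loop(r["response"])]
    preds = [(extract_choice(r["response"]), r["answer"]) for r in clean]
    valid = [(p, a) for p, a in preds if p is not None]
    lengths = [len(r["response"].split()) for r in clean]
    return {
        "total": len(records),
        "kept": len(clean),
        "accuracy": sum(p == a for p, a in valid) / max(len(valid), 1),
        "mean_reasoning_len": sum(lengths) / max(len(lengths), 1),
    }

if __name__ == "__main__":
    # Hypothetical dump of per-question model outputs on MMMU Val.
    with open("mmmu_val_outputs.jsonl") as f:
        records = [json.loads(line) for line in f]
    print(summarize(records))
```

The same summary can be run on both models' dumps, so accuracy and reasoning-length distributions are computed under identical cleaning rules.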

Here’s what we found:

Furthermore, to improve visual–reasoning alignment, we release a new reasoning dataset:

<aside>

[VLAA-Thinking-SFT-v0.1-qwen25vl_32b-26K.jsonl](https://huggingface.co/datasets/UCSC-VLAA/VLAA-Thinking/blob/main/VLAA-Thinking-SFT-v0.1-qwen25vl_32b-26K.jsonl)

</aside>

which can be used during GPT-OSS visual alignment to enable fine-tuning for long visual reasoning. We hope it helps the community build more reliable and interpretable multimodal OSS models.
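
For readers who want to inspect the data before wiring it into an SFT pipeline, here is a hedged sketch that downloads the released JSONL file from the Hub and prints the schema of a few records; the per-record field names are not assumed here and should be checked against the printed keys.

```python
# Sketch: fetch the released SFT file and peek at its schema.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="UCSC-VLAA/VLAA-Thinking",
    filename="VLAA-Thinking-SFT-v0.1-qwen25vl_32b-26K.jsonl",
    repo_type="dataset",
)

with open(path) as f:
    for i, line in enumerate(f):
        example = json.loads(line)
        print(example.keys())  # inspect field names before building an SFT loader
        if i >= 2:
            break
```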


Table of Contents