**Hengguang Zhou$^*$, Xirui Li$^*$, Ruochen Wang†, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh**

$^*$: Project Lead

†: Main Advisor

https://github.com/turningpoint-ai/VisualThinker-R1-Zero



Introduction

Figure 1. The training dynamics of VisualThinker-R1-Zero on the Qwen2-VL base model. Benchmark accuracy is measured on CV-Bench, and the average response length is calculated from rollouts on SAT training samples. Initially, we observed a drop in length because the base model tended to generate HTML code. This behavior was quickly suppressed by RL, leading the model to adopt a more appropriate output format, followed by a steady increase in response length. Later in training, we observed a multimodal ‘aha moment’: the emergence of self-reflection in the model’s responses, as described in the DeepSeek-R1 paper, followed by a consistently positive correlation between response length and benchmark accuracy.

DeepSeek R1 has demonstrated how Reinforcement Learning (RL) with well-designed rule-based incentives can enable a large language model to build unique reasoning capabilities autonomously. Many researchers have attempted to extend this success to multimodal reasoning, but recent efforts have largely struggled to reproduce the increasing response length and thinking pattern exhibited by DeepSeek R1. We are the first to reproduce the emergent “aha moment” and increasing response length for multimodal reasoning on a non-SFT 2B model. Our findings show that longer reasoning can greatly benefit vision-centric tasks. Starting from the Qwen2-VL-2B base model, we directly perform reinforcement learning on the SAT dataset. Without any SFT, the model achieves 59.47% accuracy on CV-Bench, beating the base model by ~30% and the SFT model by ~2%. It even greatly surpasses the instruction-fine-tuned model, which is trained on significantly more data.
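
To make the idea of "rule-based incentives" concrete, the sketch below shows a minimal reward function of the kind used in DeepSeek-R1-style training: a format reward for following a `<think>`/`<answer>` template and an accuracy reward for matching the ground-truth answer. The tag format, equal weighting, and exact-match rule here are illustrative assumptions, not the released implementation.

```python
import re

# Illustrative template assumed for this sketch; the released code may differ.
THINK_ANSWER_PATTERN = re.compile(
    r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL
)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(completion: str) -> float:
    """Reward 1.0 if the rollout follows the <think>/<answer> template, else 0.0."""
    return 1.0 if THINK_ANSWER_PATTERN.search(completion) else 0.0


def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Reward 1.0 if the extracted answer matches the ground truth, else 0.0."""
    match = ANSWER_PATTERN.search(completion)
    if match is None:
        return 0.0
    prediction = match.group(1).strip().lower()
    return 1.0 if prediction == ground_truth.strip().lower() else 0.0


def rule_based_reward(completion: str, ground_truth: str) -> float:
    """Combine format and accuracy rewards; equal weighting is an assumption."""
    return format_reward(completion) + accuracy_reward(completion, ground_truth)
```

Because both signals are computed from the rollout text alone, no supervised reasoning traces or learned reward model are needed, which is what allows RL to be applied directly to the base model.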

To support future research, we have open-sourced our key insights and training code on GitHub, hoping to facilitate future studies on multimodal reasoning.

Contributions:

  1. We are the first to replicate the key characteristics of the R1 success (the “aha moment” and increased reasoning length) on multimodal reasoning tasks with a non-SFT 2B model.
  2. We showed that vision-centric tasks could also benefit from improved reasoning capabilities.
  3. We open-sourced our training code and findings on response length and hope to facilitate future studies on multimodal reasoning.

Key Characteristics of DeepSeek R1

DeepSeek R1 has demonstrated that reinforcement learning can enhance a model’s reasoning abilities without any supervised reasoning data. We summarize the key characteristics that contributed to its success and compare them with our model and other multimodal replications. Specifically, we highlight two emergent phenomena: the “aha moment” and the increasing response length. The “aha moment” refers to the model’s autonomous development of advanced problem-solving strategies during training; the increasing response length indicates that the model naturally learns to allocate more thinking time to reasoning tasks as training progresses. Without observing these key characteristics, it remains questionable whether a replication of DeepSeek-R1 is truly valid.
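
As a rough illustration of how these two signals can be tracked during training, the sketch below computes the average rollout length and counts rollouts containing self-reflection cues. The cue list, whitespace tokenization, and example rollouts are illustrative assumptions rather than our actual logging code.

```python
from statistics import mean

# Illustrative self-reflection cues; the actual "aha moment" phrasing varies by model.
REFLECTION_CUES = ("wait", "let me re-check", "re-examine", "on second thought")


def average_response_length(rollouts: list[str]) -> float:
    """Average rollout length in whitespace tokens; a proxy for 'thinking time'."""
    return mean(len(r.split()) for r in rollouts)


def count_self_reflection(rollouts: list[str]) -> int:
    """Number of rollouts containing at least one self-reflection cue."""
    return sum(any(cue in r.lower() for cue in REFLECTION_CUES) for r in rollouts)


# Example: log both signals every training step alongside benchmark accuracy.
rollouts = [
    "<think>The cube is left of the sphere... wait, let me re-check the depth.</think><answer>left</answer>",
    "<think>Counting objects: three chairs.</think><answer>3</answer>",
]
print(average_response_length(rollouts), count_self_reflection(rollouts))
```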

Comparison Between DeepSeek R1 and Multimodal Replications

|  | DeepSeek R1 | VisualThinker R1 (Ours) | R1-V | R1-Multimodal-Journey | open-r1-multimodal |
| --- | --- | --- | --- | --- | --- |
| Base Model | DeepSeek V3 | Qwen2-VL-2B | Qwen2-VL-2B-Instruct | Qwen2-VL-2B-Instruct | Qwen2-VL-2B/7B-Instruct |
| Modality | Language | Vision + Language | Vision + Language | Vision + Language | Vision + Language |
| Aha Moment | Yes | Yes | No | Yes | No |
| Response Length Dynamics |  |  |  |  |  |