**Hengguang Zhou$^*$, Xirui Li$^*$, Ruochen Wang†, Minhao Cheng, Tianyi Zhou** and Cho-Jui Hsieh
$^*$: Project Lead
†: Main Advisor
https://github.com/turningpoint-ai/VisualThinker-R1-Zero
Figure 1. Training dynamics of VisualThinker-R1-Zero on the Qwen2-VL-2B base model. Benchmark accuracy is measured on CV-Bench, and the average response length is computed from rollouts on SAT training samples. Initially, we observed a drop in response length because the base model tended to generate HTML code. This behavior was quickly suppressed by RL, leading the model to adopt a more appropriate output format, after which response length increased steadily. We then observed a multimodal ‘aha moment’, the emergence of self-reflection in the model’s responses as described in the DeepSeek-R1 paper, followed by a consistently positive correlation between response length and benchmark accuracy.
DeepSeek R1 has demonstrated how Reinforcement Learning (RL) with well-designed rule-based incentives can enable a large language model to build unique reasoning capabilities autonomously. Many researchers have attempted to extend this success to multimodal reasoning, but recent efforts have largely struggled to reproduce the increasing response length and the thinking patterns exhibited by DeepSeek R1. We are the first to successfully produce the emergent “aha moment” and increasing response length for multimodal reasoning on a non-SFT 2B model. Our findings show that longer reasoning can greatly benefit vision-centric tasks. We start from the Qwen2-VL-2B base model and directly perform reinforcement learning on the SAT dataset. Without any SFT, the model achieves 59.47% accuracy on CV-Bench, surpassing the base model by ~30% and the SFT model by ~2%. Our model even greatly surpasses the instruction-fine-tuned model, which was trained on significantly more data.
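To make the rule-based incentive concrete, below is a minimal sketch of a DeepSeek-R1-style reward in Python. It assumes rollouts wrap their reasoning and final answer in `<think>`/`<answer>` tags and that SAT/CV-Bench answers can be string-matched against a ground-truth label; the function names, tag template, and equal weighting of the two terms are illustrative assumptions, not our exact training code (see the GitHub repository for that).

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows the <think>...</think><answer>...</answer> template, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the text inside the <answer> tags matches the ground-truth label, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == ground_truth.strip().lower() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    """Scalar reward used as the RL training signal for one rollout."""
    return format_reward(response) + accuracy_reward(response, ground_truth)

# Example rollout for a spatial-reasoning question:
rollout = "<think>The chair is closer to the camera than the table.</think> <answer>chair</answer>"
print(total_reward(rollout, "chair"))  # 2.0
```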
To support future research, we have open-sourced our key insights and training code on GitHub, hoping to facilitate future studies on multimodal reasoning.
Contributions:
DeepSeek R1 has demonstrated that reinforcement learning can enhance a model’s reasoning abilities without any supervised reasoning data. We summarize the key characteristics that contributed to its success and compare them with our model and other multimodal replications. Specifically, we highlight two emergent phenomena: the "aha moment" and the increasing response length. The "aha moment" refers to the model’s autonomous development of advanced problem-solving strategies during training; the increasing response length indicates that the model naturally learns to spend more thinking time on reasoning tasks as training progresses. Without observing these key characteristics of DeepSeek-R1, it remains questionable whether such replications are truly valid.
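As a rough illustration of how these two signals can be monitored during training, the sketch below computes the average response length and a heuristic self-reflection rate over a batch of rollouts. The keyword list is an assumption made for illustration, not a formal definition of the "aha moment".

```python
# Illustrative heuristic markers of self-reflection in a rollout (an assumption, not a formal criterion).
REFLECTION_MARKERS = ("wait", "let me double-check", "re-examine", "on second thought")

def rollout_stats(rollouts: list[str]) -> dict[str, float]:
    """Average response length (in whitespace tokens) and the fraction of rollouts
    containing a self-reflection marker, computed over one batch of rollouts."""
    lengths = [len(r.split()) for r in rollouts]
    reflective = [any(m in r.lower() for m in REFLECTION_MARKERS) for r in rollouts]
    n = max(len(rollouts), 1)
    return {
        "avg_response_length": sum(lengths) / n,
        "self_reflection_rate": sum(reflective) / n,
    }

# Logging these two numbers per training step yields curves like those in Figure 1.
print(rollout_stats(["<think>Wait, let me re-examine the image before answering.</think> <answer>A</answer>"]))
```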
Comparison Between DeepSeek R1 and Multimodal Replications
| | DeepSeek R1 | VisualThinker R1 (Ours) | R1-V | R1-Multimodal-Journey | open-r1-multimodal |
| --- | --- | --- | --- | --- | --- |
| Base Model | DeepSeek V3 | Qwen2-VL-2B | Qwen2-VL-2B-Instruct | Qwen2-VL-2B-Instruct | Qwen2-VL-2B/7B-Instruct |
| Modality | Language | Vision + Language | Vision + Language | Vision + Language | Vision + Language |
| Aha Moment | Yes | Yes | No | Yes | No |
| Response Length Dynamics | Increasing (↑) | Increasing (↑) | Decreasing (↓) | Decreasing (↓) | Decreasing (↓) |