He Haoran, the first author of the paper, is a Ph.D. student at the Hong Kong University of Science and Technology. His research interests include reinforcement learning and foundation models, with the goal of inspiring superintelligence through experience and rewards. Ye Yuxiao, the co-first author, is a first-year Ph.D. student at the Hong Kong University of Science and Technology. The corresponding author is Pan Ling, an assistant professor in the Department of Electronic and Computer Engineering and the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology.
In mathematical reasoning tasks for large language models (LLMs), reinforcement learning with verifiable rewards (RLVR) has become an important means of enhancing models' reasoning ability. However, mainstream methods such as PPO and GRPO still rely on policy-gradient training objectives designed for traditional RL scenarios, which can be fundamentally characterized as generalized policy iteration, that is, a continuous cycle of policy evaluation and policy improvement. These methods often suffer from unstable training, loss of diversity, and complex hyperparameter tuning.
So, is there a simpler and more fundamental solution for LLM reasoning tasks?
The research team, jointly led by the Hong Kong University of Science and Technology, Step, and Kuaishou, proposed a surprising answer: simply evaluating the value of a completely random policy is sufficient to find the optimal reasoning path. They thus proposed ROVER (Random Policy Valuation for Diverse Reasoning), which subverts the conventional paradigm with a minimalist approach and skips the policy improvement cycle of traditional reinforcement-learning-based reasoning.
ROVER not only significantly outperforms existing methods on multiple mathematical reasoning benchmarks but also achieves high-quality, high-diversity reasoning generation through “minimalism.”
Currently, the paper, code, and model have all been open-sourced.
- Paper link: https://arxiv.org/abs/2509.24981
- Code link: https://github.com/tinnerhrhe/ROVER
On high-difficulty tasks such as AIME24, AIME25, and HMMT25, ROVER significantly improves pass@1 (+8.2) and pass@256 (+16.8) over traditional methods and reaches new heights on various diversity metrics (+17.6%). Moreover, ROVER does not need to maintain an additional value network or a reference model for KL computation, making it more lightweight.
The “Pain Points” of Traditional Reinforcement Learning: Complex Iteration and High Cost
In LLM reasoning optimization, mainstream methods (such as PPO and GRPO) can be characterized as Generalized Policy Iteration: repeatedly alternating between “policy evaluation” (computing the value of the current policy, e.g., estimating the advantage function) and “policy improvement” (updating the policy based on those value estimates). Although these methods can improve performance, they have core pain points:
- Poor training stability: The optimization objective is “non-stationary,” and the model is prone to collapse. Recent work has added complex techniques such as KL regularization, clipped importance sampling, and entropy monitoring. These “patches” keep training precarious, and a slight misstep can lead to “entropy collapse” (a sudden drop in policy diversity, with the model getting stuck in a single reasoning path).
- High computational cost: PPO needs to maintain an independent value network to predict state values and to repeatedly perform policy iteration, and methods like GRPO also need to maintain a reference model for KL computation. This “heavy-asset” setup increases the computational cost of RL optimization.
- Loss of reasoning diversity: Exploration is sacrificed for quality, and pass@k performance saturates. Traditional reward-maximization RL methods push the model to overly pursue single-attempt reasoning accuracy at the expense of exploration: the model generates only a few reasoning paths, sacrificing pass@k (the ability to cover more feasible solutions through multiple reasoning attempts).
The “Minimalist Revolution” of ROVER: The Q-value of a Random Policy is Sufficient to Guide Optimal Decisions
The research team first pointed out that LLM reasoning tasks can be modeled as a finite-horizon Markov decision process (MDP) with the following key characteristics:
- Deterministic state transitions;
- Tree structure (each state has a unique parent node, and the subtrees reached through different actions are disjoint);
- Binary sparse rewards (correct/incorrect).
This is quite different from the complex settings commonly found in traditional RL tasks (such as Atari games and robotic control), which involve stochastic state transitions, cyclic graph structures, and intermediate rewards.
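To make these properties concrete, here is one way to write the setting down (an illustrative formalization consistent with the three points above; the notation is ours, not taken from the paper):

```latex
% Token-level MDP for LLM reasoning (illustrative notation)
% State s_t: the prompt plus the tokens generated so far; action a_t: the next token.
\[
  s_{t+1} = s_t \oplus a_t
  \qquad \text{(deterministic transition: append the chosen token)}
\]
\[
  r(s_t, a_t) =
  \begin{cases}
    1 & \text{if } s_{t+1} \text{ is terminal and the final answer is correct},\\[2pt]
    0 & \text{otherwise.}
  \end{cases}
\]
% Every state is a distinct token prefix, so each state has exactly one parent
% and the reachable states form a tree; rewards are binary and sparse (terminal only).
```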
“Are we using overly complex tools to solve a structurally simpler problem?” This question became the starting point of the ROVER research.
For this simple structure, the research team proved a subversive conclusion: the Q-values of a uniformly random policy directly point to the optimal policy.
Let the environment be an MDP with a finite horizon, a tree-structured state space, and binary rewards. Let $\pi^{U}$ be the uniformly random policy (each action is selected with probability $1/|\mathcal{A}|$), and let $Q^{\pi^{U}}$ be its Q-function. Then the greedy policy $\pi(s) = \arg\max_{a} Q^{\pi^{U}}(s, a)$ is an optimal policy!
The proof is intuitive: in a tree structure, if the subtree reached through an action $a$ contains a correct answer, then $Q^{\pi^{U}}(s, a) > 0$; otherwise $Q^{\pi^{U}}(s, a) = 0$. Therefore, greedily selecting the action with the maximum $Q^{\pi^{U}}$ value is guaranteed to lead to a path containing the correct answer.
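As a quick sanity check of this argument, the toy script below (purely illustrative; the tree, action names, and rewards are made up) computes uniform-policy values on a small decision tree with a single correct leaf and confirms that greedy selection over those values reaches it:

```python
# Toy check: on a tree-structured MDP with binary terminal rewards,
# acting greedily w.r.t. the uniform-random policy's Q-values finds the correct leaf.

# Each node maps an action name to either a child node (dict) or a terminal reward (int).
tree = {
    "a": {"c": 0, "d": 0},          # wrong branch: all leaves give reward 0
    "b": {"e": {"g": 1, "h": 0},    # the correct answer hides under b -> e -> g
          "f": 0},
}

def v_uniform(node):
    """Value of a node under the uniformly random policy (mean over actions)."""
    if not isinstance(node, dict):      # terminal leaf: its reward is its value
        return node
    return sum(v_uniform(child) for child in node.values()) / len(node)

def greedy_path(node):
    """Follow the action with the highest uniform-policy Q-value at every step."""
    path = []
    while isinstance(node, dict):
        action = max(node, key=lambda a: v_uniform(node[a]))  # here Q(s, a) = V(child)
        path.append(action)
        node = node[action]
    return path, node  # node is now the terminal reward

path, reward = greedy_path(tree)
print(path, reward)   # ['b', 'e', 'g'] 1  -> greedy on uniform-Q reaches the correct leaf
```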
Therefore, the policy learning process can be simplified as shown in the figure below.
The ROVER Algorithm: Three Simple Steps, No Iteration Required
(1) Q-value estimation:

ROVER estimates the Q-values of state-action pairs under the uniformly random policy via the generalized Bellman equation. Because the policy is uniform, the backup is expressed with a mean operator:

$$Q^{\pi^{U}}(s, a) = r(s, a) + \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} Q^{\pi^{U}}(s', a'),$$

where $r(s, a)$ is the reward, $s'$ is the new state reached after taking action $a$, and $\mathcal{A}$ is the action space.
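Schematically, the mean-operator backup can be computed as below (a minimal sketch under our own naming, not the paper's code; `q_next` is assumed to hold Q estimates for every action at the next state):

```python
import numpy as np

def mean_operator_target(reward, q_next, terminal):
    """
    One-step Bellman target under the uniformly random policy:
        target = r(s, a) + mean_{a'} Q(s', a')
    reward:   scalar reward for taking action a in state s
    q_next:   array of Q estimates for every action a' at the next state s'
    terminal: True if s' ends the episode (no bootstrap term)
    """
    bootstrap = 0.0 if terminal else float(np.mean(q_next))
    return reward + bootstrap

# Example: sparse binary reward, three candidate next actions (numbers are made up).
print(mean_operator_target(0.0, np.array([0.2, 0.0, 0.1]), terminal=False))  # ≈ 0.1
```

The only difference from a standard Q-learning target is that the max over next-state actions is replaced by a mean, since the policy being evaluated is uniformly random.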
(2) Policy construction:

Although greedy selection guarantees optimality, it can cause a loss of diversity. To address this, ROVER introduces softmax sampling over the Q-values:

$$\pi(a \mid s) \propto \exp\!\big(Q^{\pi^{U}}(s, a) / \tau\big),$$

where $\tau$ is a temperature coefficient that controls the degree of exploration. This approach not only keeps high-value paths prioritized but also explores multiple valid reasoning routes, significantly improving pass@k performance.
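A minimal sketch of this sampling step (assuming Q estimates for the candidate actions are already available; names and values are illustrative):

```python
import numpy as np

def sample_action(q_values, tau=1.0, rng=np.random.default_rng()):
    """Sample an action index with probability proportional to exp(Q / tau)."""
    q = np.asarray(q_values, dtype=float)
    logits = (q - q.max()) / tau            # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(q), p=probs)

# Low tau -> near-greedy (exploit the highest-value path); high tau -> more exploration.
q = [0.9, 0.5, 0.1]
print(sample_action(q, tau=0.1), sample_action(q, tau=5.0))
```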
(3) Training objective:

In the actual implementation, ROVER further introduces:

A Q function internalized directly in the LLM parameters, eliminating the need to train an extra value network. This “self-supervised” parameterization lets the model learn “relative improvement” rather than “absolute value,” reducing computational cost and improving stability.
Group reward centering to reduce variance, i.e., $\tilde{r}_i = r_i - \frac{1}{G}\sum_{j=1}^{G} r_j$ over a group of $G$ sampled responses. This prevents high-variance rewards from interfering with Q-value learning. At the same time, broadcasting the centered rewards to