Reinforcement Learning (with Verifiable Rewards) Doesn't Elicit New Capabilities
Wake up babe, new arXiv paper just dropped
Yesterday, I was doing some searching n’ thinking about AI R&D automation (as per my last post) when I stumbled upon an absolute madlad of a paper. This paper, fresh off the press (published two weeks ago), claims that Reinforcement Learning with Verifiable Rewards (RLVR), the method that made o3, does not elicit fundamentally new capabilities from the base model. Huge if true.
It was linked in a comment on Helen Toner’s excellent recent post on verifiable rewards, generalization, and scaling reasoning training. I’ll cover her post first for context, then come back to the paper.
Toner’s Post
The post lays out how the shiny new reasoning models (e.g. OAI o1, DeepSeek-r1, Gemini Flash Thinking) were developed through post-training using RLVR, which utilizes cheap and scalable automated tests, rather than RLHF’s costly human feedback.
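To make “cheap and scalable automated tests” concrete, here’s a minimal sketch of what a verifiable reward for a math problem might look like. The “Answer:” convention and the exact-match check are my own illustrative assumptions, not something from Toner’s post or any particular training stack:

```python
import re

def math_reward(completion: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the completion's final answer
    matches the ground truth exactly, else 0.0. No human in the loop."""
    # Assumed convention: the model is prompted to end with "Answer: <value>".
    match = re.search(r"Answer:\s*(\S+)", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# Example: grading a single rollout during RL training.
print(math_reward("... so the total is 42.\nAnswer: 42", "42"))  # 1.0
```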
However, reliable RLVR is currently limited to math, coding, tool use, and similar domains. There are thus two key questions that will determine how far AI progresses over the next few years:
Expanding domains – Besides math & coding, in what other domains can we automatically grade answers to hard problems?
Generalization – Does performance in auto-graded domains generalize to performance on other tasks?
Basically, for AI 2027 or similar to materialize, Toner thinks at least one of these two things needs to go pretty well (for developers). Either they need to create unit tests for poetry—or, it needs to turn out that passing coding unit tests also makes an LLM a world-class poet. (Or decent accountant. I thought “poet” sounded more…poetic, but to be honest, we probably care more about accounting).
Toner’s best guess is that (1) auto-grading will be possible for a select few more domains, but (2) this will result in enough generalization that fine-tuning on smaller, human-curated datasets will produce excellent performance on a much wider range of domains.
This doesn’t seem an unreasonable guess, and I agree these questions are key cruxes for near-term timelines. I’d recommend reading the piece in full, as well as the very interesting comment section, which is where I happened upon the paper: “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?” by Yue et al.
Yue et al. Paper
Like Toner’s post, the paper begins with the premise that RLVR has been used to elicit reasoning behaviors not present in base models (e.g. enumeration, self-reflection, iterative refinement). But then, the authors proceed to courageously ask: did it, like, actually, though?
They posit that a model’s true reasoning capabilities may be underestimated if it fails within a few trials but could have succeeded with more attempts. To address this, they employ a simple method: the pass@k metric. The model is allowed to attempt a task k times, and if any of the samples is correct, it passes. They use values of k from 1 to 128, with particular interest in the larger values.
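For reference, pass@k is usually computed with the unbiased combinatorial estimator introduced in OpenAI’s Codex paper (draw n samples per problem, count the c correct ones, and compute 1 - C(n-c, k) / C(n, k)), which as far as I can tell is what Yue et al. use as well. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples, c of them correct:
    the probability that at least one of k randomly drawn samples is correct."""
    if n - c < k:  # fewer incorrect samples than k, so a correct one is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 128 samples per problem, 5 of them correct.
for k in (1, 8, 32, 128):
    print(k, round(pass_at_k(128, 5, k), 3))
```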
The results are quite surprising: RLVR-trained models perform worse than base models at large k values. Apparently, base models can already solve problems previously considered solvable only by RLVR-trained models. And at larger values of k, they reliably outperform RLVR-trained models on all given tasks.

The authors deduce that “RLVR boosts sample efficiency, but reduces scope of reasoning capacity.” In other words, RLVR does not make models more capable; it merely makes them more reliable. In fact, in the process of promoting reliability, RLVR limits the model’s exploration, making it less likely to find optimal solutions outside its narrowed scope.
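A toy illustration of that tradeoff, with entirely made-up numbers (not the paper’s data): suppose a base model solves each of 100 problems with probability 0.1 per sample, while an RLVR-tuned model solves 40 of them with probability 0.9 and essentially never solves the other 60. The tuned model wins at k=1 but plateaus at 40% coverage, while the base model keeps climbing:

```python
def expected_pass_at_k(solve_probs, k):
    """Expected pass@k over a set of problems, given each problem's
    per-sample solve probability (independence assumed for this toy)."""
    return sum(1 - (1 - p) ** k for p in solve_probs) / len(solve_probs)

base = [0.10] * 100                  # broad but unreliable
rlvr = [0.90] * 40 + [0.001] * 60    # reliable on a narrower slice of problems

for k in (1, 4, 16, 64, 256):
    print(f"k={k:3d}  base={expected_pass_at_k(base, k):.2f}  "
          f"rlvr={expected_pass_at_k(rlvr, k):.2f}")
```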

Unlike in traditional RL settings (e.g. AlphaGo), the action space for LLMs is exponentially larger, which makes training from scratch infeasible. As such, RLVR for LLMs begins with a pretrained base model that provides a useful prior. However, in such a complex and highly combinatorial space, most responses generated during training are constrained by the base model’s prior, and thus by the base model’s reasoning capabilities. This is why the authors find that, unlike RLVR, distillation actually does elicit improved capabilities: it injects better priors.
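One way I picture the argument (a cartoon of it, not how either method is actually implemented): RL-style reweighting can only shift probability among answers the base model already assigns nonzero mass to, whereas distillation can put mass on answers outside that support:

```python
# Toy picture of the "prior" argument: a distribution over four candidate
# answers, where the base model gives the truly best answer zero probability.
base_prior = {"A": 0.5, "B": 0.3, "C": 0.2, "D": 0.0}   # D is the best answer

def rlvr_like_reweight(prior, reward):
    """RL-style update, crudely modeled as reward-weighted renormalization.
    Anything with zero prior mass stays at zero: this update can't reach D."""
    scores = {a: p * reward[a] for a, p in prior.items()}
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}

reward = {"A": 0.2, "B": 0.6, "C": 0.4, "D": 1.0}
print(rlvr_like_reweight(base_prior, reward))   # D still has probability 0

# Distillation mixes in a teacher that *does* put mass on D,
# so the student's support (and ceiling) actually expands.
teacher = {"A": 0.1, "B": 0.1, "C": 0.1, "D": 0.7}
student = {a: 0.5 * base_prior[a] + 0.5 * teacher[a] for a in base_prior}
print(student)                                   # D is now reachable
```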
Discussion
This, of course, is big if true. It would mean that a model’s full potential can be determined prior to RLVR. Even if you can’t yet automatically grade a domain, you could run a task a bunch of times and manually review the results; you’d then know that auto-graded training will let the model consistently output from the high end of that distribution, but no higher.
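Concretely, that pre-RLVR audit might look something like the sketch below, where generate() and is_correct() are hypothetical stand-ins for sampling from the model and for the (possibly human) grader:

```python
def estimate_ceiling(tasks, generate, is_correct, n_samples=64):
    """Fraction of tasks the base model solves *at all* within n samples.
    On the paper's framing, this coverage is roughly the ceiling that
    RLVR-style training could later turn into reliable pass@1."""
    solved = 0
    for prompt, reference in tasks:
        samples = [generate(prompt) for _ in range(n_samples)]
        if any(is_correct(sample, reference) for sample in samples):
            solved += 1
    return solved / len(tasks)
```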
In this world, auto-grading is less important than Toner suggests. Sure, reliability is still of practical importance, especially with compute constraints. But pre-training—expensive, time-consuming pre-training—will remain the primary vector for model improvement. RLVR only provides a limited multiplier.
Even to a perennial skeptic like me, this seems a bit wild. One major strike against the paper is that all the experiments are performed on open-weight models. The results might not generalize to frontier-scale models, especially if developers used more sophisticated post-training techniques. I’d be keen to see this research replicated for frontier models, although I’m not exactly sure how one would do this. o3 suggested something it called “Model stealing (ethical edition).”
I also don’t fully trust my paper-reading skills, so I’d be eager to see my wonderful subscribers—the cognitive elite of the elite—read the paper themselves and comment their thoughts below.
Comments
I wonder if this is something that will seem obvious in hindsight. OpenAI compared fine-tuning to dog training in 2023 (https://openai.com/index/how-should-ai-systems-behave/), and intuitively, there's an upper limit to what a dog is capable of learning regardless of how good a dog trainer's reinforcement learning techniques are. Obviously AIs are not dogs, but reinforcement learning does seem like a brute-force way of trying to get your AI to spit out the right answer by rote.
Of course, this is not something we can know before someone actually does the research. Great post discussing a sick paper!
I don’t know enough about the field to have a feel for this, but the paper’s findings surprised me. I thought RL was simply a plus, like a general capability booster.
The paper, as far as I understood it, suggests that RL with simple correctness rewards is great for concentrating probability on good answers, but not for teaching a language model brand-new reasoning tricks. In fact, it can stifle them. If you truly need fresh capabilities, you’ll have to feed it fresh information or rethink the training recipe.
I’d be curious if this effect depends to any degree on model size. And if tweaking reward signals, like giving the AI a biscuit for diversity or something, might help keep the search space from narrowing.