Reinforcement Learning (with Verifiable Rewards) Doesn't Elicit New Capabilities
Wake up babe, new arXiv paper just dropped
Yesterday, I was doing some searching n’ thinking about AI R&D automation (as per my last post) when I stumbled upon an absolute madlad of a paper. This paper, fresh off the press (published two weeks ago), claims that Reinforcement Learning with Verifiable Rewards (RLVR), the method that made o3, does not elicit fundamentally new capabilities from the base model. Huge if true.
It was linked in a comment on Helen Toner’s excellent recent post on verifiable rewards, generalization, and scaling reasoning training. I’ll cover her post first for context, then come back to the paper.
Toner’s Post
The post lays out how the shiny new reasoning models (e.g. OAI o1, DeepSeek-r1, Gemini Flash Thinking) were developed through post-training using RLVR, which utilizes cheap and scalable automated tests, rather than RLHF’s costly human feedback.
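To make “cheap and scalable automated tests” concrete, here’s a minimal sketch of what a verifiable reward for a math problem might look like. The “Answer:” convention and the exact-match check are my own illustrative assumptions, not something from Toner’s post or any particular training stack:

```python
import re

def math_reward(completion: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the completion's final answer
    matches the ground truth exactly, else 0.0. No human in the loop."""
    # Assumed convention: the model is prompted to end with "Answer: <value>".
    match = re.search(r"Answer:\s*(\S+)", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# Example: grading a single rollout during RL training.
print(math_reward("... so the total is 42.\nAnswer: 42", "42"))  # 1.0
```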
However, reliable RLVR is currently limited to math, coding, tool use, and similar domains. There are thus two key questions that will determine how far AI progresses over the next few years:
Expanding domains – Besides math & coding, in what other domains can we automatically grade answers to hard problems?
Generalization – Does performance in auto-graded domains generalize to performance on other tasks?
Basically, for AI 2027 or similar to materialize, Toner thinks at least one of these two things needs to go pretty well (for developers). Either they need to create unit tests for poetry—or, it needs to turn out that passing coding unit tests also makes an LLM a world-class poet. (Or decent accountant. I thought “poet” sounded more…poetic, but to be honest, we probably care more about accounting).
Toner’s best guess is that (1) auto-grading will be possible for a select few more domains, but (2) this will result in enough generalization that fine-tuning on smaller, human-curated datasets will produce excellent performance on a much wider range of domains.
This doesn’t seem an unreasonable guess, and I agree these questions are key cruxes for near-term timelines. I’d recommend reading the piece in full, as well as the very interesting comment section, which is where I happened upon the paper: “Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?” by Yue et al.
Yue et al. Paper
Like Toner’s post, the paper begins with the premise that RLVR has been used to elicit reasoning behaviors not present in base models (e.g. enumeration, self-reflection, iterative refinement). But then, the authors proceed to courageously ask: did it, like, actually, though?
They posit that a model’s true reasoning capabilities may be underestimated if it fails within a few trials but could have succeeded with more attempts. To address this, they employ a simple method: the pass@k metric. The model is allowed to attempt a task k times, and if any of the samples is correct, it passes. They use values of k from 1 to 128, with particular interest in the larger values.
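For reference, pass@k is usually computed with the unbiased combinatorial estimator introduced in OpenAI’s Codex paper (draw n samples per problem, count the c correct ones, and compute 1 - C(n-c, k) / C(n, k)), which as far as I can tell is what Yue et al. use as well. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples, c of them correct:
    the probability that at least one of k randomly drawn samples is correct."""
    if n - c < k:  # fewer incorrect samples than k, so a correct one is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 128 samples per problem, 5 of them correct.
for k in (1, 8, 32, 128):
    print(k, round(pass_at_k(128, 5, k), 3))
```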
The results are quite surprising: RLVR-trained models perform worse than base models at large k values. Apparently, base models can already solve problems previously considered solvable only by RLVR-trained models. And at larger values of k, they reliably outperform RLVR-trained models on all given tasks.

The authors deduce that “RLVR boosts sample efficiency, but reduces scope of reasoning capacity.” In other words, RLVR does not make models more capable; it merely makes them more reliable. In fact, in the process of promoting reliability, RLVR limits the model’s exploration, making it less likely to find optimal solutions outside its narrowed scope.
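A toy illustration of that tradeoff, with entirely made-up numbers (not the paper’s data): suppose a base model solves each of 100 problems with probability 0.1 per sample, while an RLVR-tuned model solves 40 of them with probability 0.9 and essentially never solves the other 60. The tuned model wins at k=1 but plateaus at 40% coverage, while the base model keeps climbing:

```python
def expected_pass_at_k(solve_probs, k):
    """Expected pass@k over a set of problems, given each problem's
    per-sample solve probability (independence assumed for this toy)."""
    return sum(1 - (1 - p) ** k for p in solve_probs) / len(solve_probs)

base = [0.10] * 100                  # broad but unreliable
rlvr = [0.90] * 40 + [0.001] * 60    # reliable on a narrower slice of problems

for k in (1, 4, 16, 64, 256):
    print(f"k={k:3d}  base={expected_pass_at_k(base, k):.2f}  "
          f"rlvr={expected_pass_at_k(rlvr, k):.2f}")
```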

Unlike in traditional RL settings (e.g. AlphaGo), the action space for LLMs is exponentially larger, which makes training from scratch infeasible. As such, RLVR for LLMs begins with a pretrained base model that provides a useful prior. However, in such a complex and highly combinatorial space, most responses generated during training are constrained by the base model’s prior, and thus by the base model’s reasoning capabilities. This is why the authors find that, unlike RLVR, distillation actually does elicit improved capabilities: it injects better priors.
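One way I picture the argument (a cartoon of it, not how either method is actually implemented): RL-style reweighting can only shift probability among answers the base model already assigns nonzero mass to, whereas distillation can put mass on answers outside that support:

```python
# Toy picture of the "prior" argument: a distribution over four candidate
# answers, where the base model gives the truly best answer zero probability.
base_prior = {"A": 0.5, "B": 0.3, "C": 0.2, "D": 0.0}   # D is the best answer

def rlvr_like_reweight(prior, reward):
    """RL-style update, crudely modeled as reward-weighted renormalization.
    Anything with zero prior mass stays at zero: this update can't reach D."""
    scores = {a: p * reward[a] for a, p in prior.items()}
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}

reward = {"A": 0.2, "B": 0.6, "C": 0.4, "D": 1.0}
print(rlvr_like_reweight(base_prior, reward))   # D still has probability 0

# Distillation mixes in a teacher that *does* put mass on D,
# so the student's support (and ceiling) actually expands.
teacher = {"A": 0.1, "B": 0.1, "C": 0.1, "D": 0.7}
student = {a: 0.5 * base_prior[a] + 0.5 * teacher[a] for a in base_prior}
print(student)                                   # D is now reachable
```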
Discussion
This, of course, is big if true. It would mean that a model’s full potential can be determined prior to RLVR. Even if you can’t yet automatically grade a domain, you could run a task a bunch of times and manually review the results; you’d then know that auto-graded training will let the model consistently output from the high end of that distribution, but no higher.
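Concretely, that pre-RLVR audit might look something like the sketch below, where generate() and is_correct() are hypothetical stand-ins for sampling from the model and for the (possibly human) grader:

```python
def estimate_ceiling(tasks, generate, is_correct, n_samples=64):
    """Fraction of tasks the base model solves *at all* within n samples.
    On the paper's framing, this coverage is roughly the ceiling that
    RLVR-style training could later turn into reliable pass@1."""
    solved = 0
    for prompt, reference in tasks:
        samples = [generate(prompt) for _ in range(n_samples)]
        if any(is_correct(sample, reference) for sample in samples):
            solved += 1
    return solved / len(tasks)
```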
In this world, auto-grading is less important than Toner suggests. Sure, reliability is still of practical importance, especially with compute constraints. But pre-training—expensive, time-consuming pre-training—will remain the primary vector for model improvement. RLVR only provides a limited multiplier.
Even to a perennial skeptic like me, this seems a bit wild. One major strike against the paper is that all the experiments are performed on open-weight models. The results might not generalize to frontier-scale models, especially if developers used more sophisticated post-training techniques. I’d be keen to see this research replicated for frontier models, although I’m not exactly sure how one would do this. o3 suggested something it called “Model stealing (ethical edition).”
I also don’t fully trust my paper-reading skills, so I’d be eager to see my wonderful subscribers—the cognitive elite of the elite—read the paper themselves and comment their thoughts below.
Comments
I wonder if this is something that will seem obvious in hindsight. OpenAI compared fine-tuning to dog training in 2023 (https://openai.com/index/how-should-ai-systems-behave/), and intuitively, there's an upper limit to what a dog is capable of learning regardless of how good a dog trainer's reinforcement learning techniques are. Obviously AIs are not dogs, but reinforcement learning does seem like a brute-force way of trying to get your AI to spit out the right answer by rote.
Of course, this is not something we can know before someone actually does the research. Great post discussing a sick paper!
I don’t know enough about the field to have a feel for this, but the paper’s findings surprised me. I thought RL was simply a plus, like a general capability booster.
The paper, as far as I understood it, suggests that RL with simple correctness rewards is great for concentrating probability on good answers, but not for teaching a language model brand-new reasoning tricks. In fact, it can stifle them. If you truly need fresh capabilities, you’ll have to feed it fresh information or rethink the training recipe.
I’d be curious if this effect depends to any degree on model size. And if tweaking reward signals, like giving the AI a biscuit for diversity or something, might help keep the search space from narrowing.