I wonder if this is something that will seem obvious in hindsight. OpenAI compared fine-tuning to dog training in 2023 (https://openai.com/index/how-should-ai-systems-behave/), and intuitively, there's an upper limit to what a dog is capable of learning regardless of how good a dog trainer's reinforcement learning techniques are. Obviously AIs are not dogs, but reinforcement learning does seem like a brute-force way of trying to get your AI to spit out the right answer by rote.
Of course, this isn't something we can know before someone actually does the research. Great post discussing a sick paper!
Thanks!
The OAI post seems to compare model training in general to dog training:
"Though not a perfect analogy, the process is more similar to training a dog than to ordinary programming. An initial “pre-training” phase comes first [...]"
I'm curious what is implied by roteness & brute-forcing, as this seems true of all model training.
The dog training analogy is a bit tricky---from what I understand, the paper is saying that we used to think RLVF actually taught the dog new skills, like rolling over when you say "roll." But it turns out that if you take an untrained dog and just tell it "roll" 128 times, it'll roll over at least once without being taught. So really, learning only happens in the efficiency sense (which is still important for usefulness & costs) rather than the true capabilities sense.
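To put rough numbers on the "say 'roll' 128 times" intuition: pass@k is usually estimated with the unbiased estimator from the Codex paper (Chen et al., 2021) rather than by literally drawing k samples. Here's a minimal sketch; the dog numbers are made up by me, not taken from the paper:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples is
    correct, estimated from n attempts of which c were correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# An "untrained dog" that rolls over on ~3% of attempts still almost
# certainly rolls over at least once when asked 128 times.
print(pass_at_k(n=1000, c=30, k=1))    # ~0.03
print(pass_at_k(n=1000, c=30, k=128))  # ~0.98
```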
Maybe the word "capabilities" is kinda misleading here. If you can ace something 100% of the time & I can only ace it 10% of the time, I would consider you more "capable," but that's not exactly what the paper is gesturing at with AI.
I would be really curious as to whether this result generalizes to frontier models, as my intuition there is ambivalent.
I don’t know enough about the field to have a feel for this, but the paper’s findings surprised me. I thought RL was simply a plus, like a general capability booster.
The paper, as far as I understood it, suggests that RL with simple correctness rewards is great for funneling probability toward good answers the model can already produce, but not for teaching a language model brand‑new reasoning tricks. In fact, it can stifle them. If you truly need fresh capabilities, you'll have to feed the model fresh information or rethink the training recipe.
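To illustrate what "funneling" buys you (and what it doesn't), here's a toy sketch with made-up probabilities and an independent-samples assumption; it's my own illustration, not the paper's setup:

```python
def pass_at_k(p_correct: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** k

# Case 1: the correct answer is already in the base model's support.
# RL just funnels probability mass onto it: pass@1 jumps, pass@128 barely moves.
for name, p in [("base", 0.05), ("after RL", 0.85)]:
    print(f"{name}: pass@1 = {p:.2f}, pass@128 = {pass_at_k(p, 128):.3f}")

# Case 2: a genuinely new trick the base model never produces (p = 0).
# No amount of sampling finds it, so a correctness-only reward has
# nothing to reinforce -- pass@k stays at 0 for every k.
print("new trick:", pass_at_k(0.0, 128))
```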
I’d be curious if this effect depends to any degree on model size. And if tweaking reward signals, like giving the AI a biscuit for diversity or something, might help keep the search space from narrowing.
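For the "biscuit for diversity" idea, one common trick in policy-gradient setups is an entropy bonus on the policy. Whether it would change the paper's conclusion is an open question; the snippet below is just a hypothetical toy sketch of the mechanism, not anything the paper tested:

```python
import torch
import torch.nn.functional as F

# Toy setup: logits over 8 candidate answers for one prompt, with a
# verifier reward of 1 for correct answers and 0 otherwise.
logits = torch.randn(8, requires_grad=True)
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])

log_probs = F.log_softmax(logits, dim=-1)
probs = log_probs.exp()

# REINFORCE-style term: push probability mass toward rewarded answers.
pg_loss = -(rewards * log_probs).sum()

# Entropy bonus (the "biscuit"): penalize collapsing onto one answer.
# beta trades off reward-seeking against keeping the distribution broad.
beta = 0.01
entropy = -(probs * log_probs).sum()
loss = pg_loss - beta * entropy

loss.backward()  # gradients now balance correctness against diversity
```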