Discussion about this post

Harjas Sandhu

I wonder if this is something that will seem obvious in hindsight. OpenAI compared fine-tuning to dog training in 2023 (https://openai.com/index/how-should-ai-systems-behave/), and intuitively, there's an upper limit to what a dog is capable of learning, regardless of how good the trainer's reinforcement techniques are. Obviously AIs are not dogs, but reinforcement learning does seem like a brute-force way of trying to get your AI to spit out the right answer by rote.

Of course, this is not something we can know in advance; someone actually has to do the research. Great post discussing a sick paper!

Woolery

I don’t know enough about the field to have a feel for this, but the paper’s findings surprised me. I thought RL was simply a plus, like a general capability booster.

The paper, as far as I understood it, suggests that RL with simple correctness rewards is great for funneling probability toward answers the model can already produce, but not for teaching a language model brand-new reasoning tricks. In fact, it can stifle them. If you truly need fresh capabilities, you'll have to feed the model fresh information or rethink the training recipe.

I’d be curious whether this effect depends to any degree on model size, and whether tweaking the reward signal, say giving the AI a biscuit for diversity or something, might help keep the search space from narrowing.
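
To make the "biscuit for diversity" idea concrete, here is a minimal, purely illustrative Python sketch of a correctness reward with a small novelty bonus tacked on. None of this comes from the paper: the function names, the novelty measure, and the 0.2 weight are all made up for illustration.

```python
# Hypothetical sketch only: a binary correctness reward plus a crude diversity bonus.
# Names, the novelty heuristic, and the weight are invented, not taken from the paper.
from collections import Counter


def correctness_reward(answer: str, reference: str) -> float:
    """1.0 if the sampled answer matches the reference exactly, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def diversity_bonus(answer: str, previous_answers: list[str]) -> float:
    """A 'biscuit for diversity': larger when this exact answer has been rare
    among earlier samples for the same prompt."""
    if not previous_answers:
        return 1.0
    counts = Counter(previous_answers)
    frequency = counts[answer] / len(previous_answers)
    return 1.0 - frequency


def combined_reward(answer: str, reference: str,
                    previous_answers: list[str],
                    diversity_weight: float = 0.2) -> float:
    """Total reward = correctness + a small weighted diversity term,
    meant to push back against the policy collapsing onto one answer mode."""
    return (correctness_reward(answer, reference)
            + diversity_weight * diversity_bonus(answer, previous_answers))


if __name__ == "__main__":
    # Toy usage: score a batch of samples for a single prompt.
    reference = "42"
    samples = ["42", "42", "six times seven", "42"]
    seen: list[str] = []
    for s in samples:
        r = combined_reward(s, reference, seen)
        seen.append(s)
        print(f"{s!r:20} -> reward {r:.2f}")
```

In a real RL fine-tuning setup the diversity term would more likely be an entropy bonus or a KL penalty toward the base model rather than string novelty, but the tradeoff has the same shape: the correctness term sharpens the distribution while the bonus resists collapse.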

