Hacker News | t55's comments

that's a standard feature in cursor, windsurf, etc.


this is what deepmind did 10 years ago lol


No, they (and many others before them) are genuinely trying to improve on the original research.

The original paper "Playing Atari with Deep Reinforcement Learning" (2013) from DeepMind describes how agents can learn to play Atari games, but these agents had to be trained separately on every individual game using millions of frames. To make this feasible, simulators were run in parallel and much faster than real time.

Also, additional trickery was needed to extract a reward signal from the games, and there is some minor cheating in how inputs are supplied.
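
One concrete example of the reward trickery in that paper is reward clipping: raw game scores vary wildly across Atari titles, so rewards are clipped to their sign before training. A minimal sketch of that idea, not the paper's actual code:

    def clip_reward(raw_reward: float) -> float:
        # DQN-style reward clipping: keep only the sign of the score change,
        # so a single set of hyperparameters works across very different games.
        if raw_reward > 0:
            return 1.0
        if raw_reward < 0:
            return -1.0
        return 0.0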

What Carmack (and others before him) is interested in is learning in a real-life setting, similar to how humans learn.


For a 100k-token context window, all those models are comparable though.

Gemini 2.5 Pro shines at 200k+ tokens.


I can confirm from first-hand experience that even at 100k they are most definitely not comparable for the task of beta reading.


splitting hairs much?


yeah, RLVR is still nascent and hence there's lots of noise.

> How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards?

it's because in those cases, RLVR merely elicits the reasoning strategies already contained in the model through pre-training

this paper, which uses Reasoning Gym, shows that you need to train for much longer than the papers you mentioned to actually uncover novel reasoning strategies: https://arxiv.org/abs/2505.24864
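
(For context, the "verifiable" in RLVR means the reward comes from a programmatic check rather than a learned reward model. A toy sketch of such a reward function, illustrative only and not Reasoning Gym's actual API:)

    def verifiable_reward(model_answer: str, ground_truth: str) -> float:
        # Toy RLVR-style reward: 1.0 if the model's final answer matches the
        # known-correct answer, 0.0 otherwise. Real setups use task-specific
        # verifiers (math checkers, unit tests, puzzle validators, etc.).
        return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0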


agree, the RG evals feel like a breath of fresh air


> prolonged RL training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling

does this mean that previous RL papers claiming the opposite were possibly bottlenecked by small datasets?


No, they do not point to any specific examples of novel reasoning strategies that were uncovered, nor is their sampling that extensive (at most 256 samples vs the 2048 used in https://limit-of-rlvr.github.io/ ).


It seems unreasonable to say that in Figure 5, for example, more sampling (of a reasonable amount) would push the base model to 100%.


so you think it's fake news? another example of a paper with strong claims without much evidence?


I think it's a case of not coming up with alternative explanations for the observed evidence and hence not designing experiments to distinguish between those explanations.

Their results are consistent with novel reasoning strategies, but they're also consistent with more reliable execution of reasoning strategies that the base model can generate in principle, but rarely succeeds at due to a large number of steps. (If you have a model that can do each step independently with 99% success rate and getting the correct result requires 1000 steps, the chance of making it all the way to the end without a single error is only about 0.004%.)
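
(Back-of-the-envelope check of that number:)

    # Chance of completing 1000 independent steps when each succeeds 99% of the time.
    p_step = 0.99
    n_steps = 1000
    print(p_step ** n_steps)  # ~4.3e-05, i.e. roughly 0.004%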


One challenge with this line of argument is that the base model assigns non-zero probability to all possible sequences if we ignore truncation due to numerical precision. So, in a sense you could say any performance improvement is due to shifting probability mass towards good reasoning behaviors and away from bad ones that were already present in the base model.

I agree with your general point though. I.e., we need more thorough empirical investigation of how reasoning behavior evolves during RL training starting from the base model. And, current RL training results seem more like "amplifying existing good behavior" than "inducing emergent good behavior".


While it's true that the model assigns non-zero probabilities to all sequences by design, those probabilities can get a lot smaller. E.g. replace that 99% per-step success probability with 10% and suddenly the overall chance of a correct result is truly astronomically small.

For a novel reasoning strategy, I would expect at least a few individual tokens where the base model assigns much smaller probabilities than the reinforcement-learning trained one, as opposed to just being a little smaller but spread out over many tokens. (Which would better fit a "death by a thousand cuts" scenario.)
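
(One rough way to distinguish the two, assuming you can get per-token log-probs from both the base and the RL-trained model on the same successful trace; the names below are illustrative, not any particular library's API:)

    def biggest_per_token_gaps(logp_base, logp_rl, tokens, top_k=10):
        # logp_base / logp_rl: per-token log-probabilities assigned by the base
        # model and the RL-trained model to the same successful reasoning trace.
        # A few tokens with very large gaps would suggest a genuinely novel step;
        # small gaps spread over many tokens look like "death by a thousand cuts".
        gaps = [(lb - lr, tok) for lb, lr, tok in zip(logp_base, logp_rl, tokens)]
        return sorted(gaps)[:top_k]  # most negative first: tokens the base model finds least likely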


> I personally think that Gemini 2.5 Pro's superiority comes from having hundreds or thousands RL tasks (without any proof whatsoever, so rather a feeling).

Given that GDM pioneered RL, that's a reasonable assumption


Assuming by GDM you mean Google DeepMind: they pioneered RL with deep nets as function approximators, the deep nets themselves being a result of CNNs and massive improvements in hardware parallelization at the time.

RL was established, at the latest, with Q-learning in 1989: https://en.wikipedia.org/wiki/Q-learning
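
(For reference, Watkins' tabular Q-learning update, which predates deep nets entirely; Q here is assumed to be a dict-of-dicts table:)

    def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # Classic tabular Q-learning (Watkins, 1989):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        # DQN's later contribution was replacing the table with a deep network.
        best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])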


i didn't say they invented everything; in science you always stand on the shoulders of giants

i still think my original statement is fair


"gdm pioneered rl" is definitely not actually right, but it's correct to assert that they were huge players.

people who knew from context that your statement was broadly not actually right would know what you mean and agree on vibes. people who didn't could reasonably be misled, i think.


it aged well!


Very cool, reminds me of https://rehearsal.so/

Which sorts of interviews will you support?


Thanks!

Right now we're focused on software engineer interviews, mainly Leetcode-style and a bit of system design. We're still in the early stages, so we're mainly trying to gather feedback and see what people actually want. PM, MLE, and manager interviews are definitely on our radar, but we'll expand based on demand.


great article!

