
reinforcement learning

Dave the human
Author
Homo sapiens in the loop

Reinforcement learning (RL) aims to improve a large language model’s reasoning capabilities by encouraging it to take actions that lead to high rewards.

While inference-time compute scaling improves a model’s reasoning performance without modifying the model itself, RL updates the model’s weights during training, enabling the model to learn through trial and error from the feedback it receives from the environment.
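To make "learning through trial and error" concrete, here is a minimal sketch of a policy-gradient (REINFORCE-style) update on a toy three-action problem. It is not an LLM training loop: the `logits` list stands in for model weights, the fixed `rewards` list stands in for the environment, and the function name, learning rate, and running-average baseline are all illustrative assumptions.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution over actions."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(rewards, steps=2000, lr=0.1, seed=0):
    """Toy REINFORCE loop (hypothetical helper, for illustration only).

    rewards[i] is the reward the stand-in environment returns for action i.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)  # the "weights" that RL updates
    baseline = 0.0                 # running average reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        # Trial: sample an action from the current policy.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        # Feedback: the environment returns a reward for that action.
        reward = rewards[action]
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Update: move the logits along the gradient of log pi(action),
        # scaled by how much better than average the outcome was.
        for a in range(len(logits)):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * advantage * grad
    return softmax(logits)


final_probs = train_policy(rewards=[0.0, 0.2, 1.0])
print(final_probs)
```

After training, the policy should put most of its probability on the highest-reward action. Reasoning RL for LLMs follows the same pattern at a much larger scale, with sampled responses as the actions.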

It is important to distinguish reinforcement learning for reasoning from reinforcement learning from human feedback (RLHF), which is used during preference tuning. Both settings apply reinforcement learning principles, but they differ primarily in how the reward is obtained and validated: RLHF relies on human judgments of model outputs, whereas reasoning RL relies on automated verifiers or environments.
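The difference in reward sources can be sketched in a few lines. Both functions below are hypothetical illustrations, not part of any library. The first checks a response against a known reference answer with no human involved. The second turns a human's choice among candidate responses into rewards. Real RLHF pipelines usually train a reward model on many such human preferences rather than using them directly.

```python
import re
from typing import List


def verifiable_reward(response: str, reference_answer: str) -> float:
    """Reasoning-RL style reward: an automated check against a reference.

    Assumption: the final answer is the last number in the response.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == reference_answer else 0.0


def preference_rewards(preferred_index: int, num_candidates: int) -> List[float]:
    """RLHF style reward: a human picks the response they prefer.

    Simplified stand-in for the human-labeled preference data that a
    reward model would normally be trained on.
    """
    return [1.0 if i == preferred_index else 0.0 for i in range(num_candidates)]


# Automated verifier: no human needed, the answer is either right or wrong.
print(verifiable_reward("2 + 40 = 42, so the answer is 42.", "42"))
print(verifiable_reward("I think the answer is 41.", "42"))

# Human feedback: a person preferred the second of three candidates.
print(preference_rewards(preferred_index=1, num_candidates=3))
```

The verifier only works when correctness can be checked mechanically (math, code with tests), which is one reason reasoning RL tends to focus on such tasks.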


 RL reasoning 
