Reinforcement learning (RL) aims to improve a large language model’s (LLM’s) reasoning capabilities by encouraging it to take actions that lead to high rewards.
While inference-time compute scaling improves a model’s reasoning performance without modifying its weights, RL updates the weights during training, enabling the model to learn through trial and error from the feedback it receives from the environment.
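To make the trial-and-error loop concrete, here is a minimal sketch of a policy-gradient (REINFORCE-style) update on a toy three-armed bandit. It is not how an LLM is trained in practice: the "model" is just a list of three logits, and the names `train_bandit`, `lr`, and the moving-average baseline are illustrative assumptions rather than anything taken from the text. The point is only to show parameters being nudged toward actions that earned above-average reward.

```python
import math
import random


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(rewards, steps=2000, lr=0.1, seed=0):
    """Toy REINFORCE loop (hypothetical helper): one logit per action."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)  # the "weights" being trained
    baseline = 0.0                 # running average reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        # Trial: sample an action from the current policy.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        # Feedback: the environment returns a reward for that action.
        reward = rewards[action]
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Update: gradient of log pi(action) w.r.t. each logit is
        # (1 if i == action else 0) - probs[i].
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_prob
    return softmax(logits)


if __name__ == "__main__":
    # Only the last action is rewarded, so its probability should grow.
    print(train_bandit([0.0, 0.0, 1.0]))
```

After training, most of the probability mass should sit on the rewarded action, which is the same principle, at a vastly smaller scale, as reinforcing reasoning traces that lead to high reward.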
It is important to distinguish reinforcement learning in the context of reasoning from reinforcement learning from human feedback (RLHF), which is used during preference tuning. Both settings use reinforcement learning principles, but they differ primarily in how the reward is obtained and validated: RLHF relies on human preference judgments (typically captured in a learned reward model), whereas reasoning RL relies on automated verifiers or environments.
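As an illustration of what an automated verifier might look like, the sketch below scores a model completion against a known reference answer for a math-style task. The function names (`extract_final_answer`, `verifiable_reward`) and the "last number in the text is the answer" convention are assumptions made for this example; real verifiers are usually stricter about answer formatting.

```python
import re


def extract_final_answer(completion):
    """Return the last number in the completion as a string, or None.

    Assumption for this sketch: the final number mentioned is the answer.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None


def verifiable_reward(completion, reference):
    """Binary reward: 1.0 if the extracted answer equals the reference."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parsable answer, no reward
    return 1.0 if float(answer) == float(reference) else 0.0


if __name__ == "__main__":
    print(verifiable_reward("6 * 7 = 42, so the answer is 42.", "42"))
    print(verifiable_reward("I think the answer is 41.", "42"))
```

No human is in the loop here: the reward is computed by a program, which is what separates this setting from RLHF, where the signal ultimately traces back to human preference judgments.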