Inference-time compute scaling

Inference-time compute scaling (also called inference-compute scaling or test-time scaling) is an approach that aims to improve a large language model’s (LLM’s) reasoning capabilities at inference time, without further training or any other modification of the underlying model weights.

The core idea is to spend more computation at inference time in exchange for better performance. In this way, even a model with fixed weights can become more capable through techniques such as chain-of-thought (CoT) prompting and sampling procedures that generate several candidate answers and select among them.
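
As a rough illustration of trading compute for accuracy, the sketch below uses majority voting over repeated samples (often called self-consistency), which is one possible sampling procedure and not necessarily the one intended here. No real model is called: `make_toy_sampler` is a hypothetical stand-in that returns the correct answer "42" with an assumed probability of 0.6, and the names `majority_vote` and `accuracy` are likewise made up for this example.

```python
import random
from collections import Counter
from typing import Callable


def majority_vote(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples answers and return the most frequent one."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer() for _ in range(n_samples)]
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


def make_toy_sampler(seed: int, p_correct: float = 0.6) -> Callable[[], str]:
    """Hypothetical stand-in for a fixed model (assumption, not a real LLM API).

    Each call returns "42" with probability p_correct, otherwise one of
    three wrong answers chosen uniformly.
    """
    rng = random.Random(seed)

    def sample() -> str:
        if rng.random() < p_correct:
            return "42"
        return rng.choice(["41", "43", "44"])

    return sample


def accuracy(n_samples: int, trials: int = 2000) -> float:
    """Fraction of trials in which the majority-voted answer is correct."""
    hits = 0
    for t in range(trials):
        sampler = make_toy_sampler(seed=t)
        hits += majority_vote(sampler, n_samples) == "42"
    return hits / trials


if __name__ == "__main__":
    # More samples per question means more inference-time compute.
    for n in (1, 5, 15):
        print(f"samples={n:2d}  accuracy={accuracy(n):.3f}")
```

In this toy setting the simulated model never changes; only the number of samples per question does. The benefit depends on the assumption that the correct answer is the single most likely output, which a real model does not guarantee.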