Inference is the runtime step in which a trained model processes a new input and returns an output. Training happens once (or periodically, for fine-tuning); inference happens on every request, so its cumulative cost typically dominates the operating budget of a production AI system.
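A back-of-the-envelope sketch makes the point. All figures below are hypothetical placeholders, not real prices or traffic numbers; the only claim is the shape of the arithmetic:

```python
# Compare a one-time training bill with cumulative inference spend.
# Every constant here is a hypothetical placeholder.

TRAINING_COST_USD = 50_000_000     # one-time training cost (hypothetical)
COST_PER_M_TOKENS_USD = 5.00       # inference price per million tokens (hypothetical)
TOKENS_PER_REQUEST = 1_000         # prompt + completion (hypothetical)
REQUESTS_PER_DAY = 50_000_000      # production traffic (hypothetical)

daily_tokens_m = REQUESTS_PER_DAY * TOKENS_PER_REQUEST / 1e6
daily_inference_cost = daily_tokens_m * COST_PER_M_TOKENS_USD
breakeven_days = TRAINING_COST_USD / daily_inference_cost

print(f"Daily inference cost: ${daily_inference_cost:,.0f}")
print(f"Inference spend matches the training bill after {breakeven_days:.0f} days")
```

Under these made-up numbers, inference spend overtakes the entire training bill in well under a year, which is why per-request efficiency is the lever that matters.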
Key metrics: latency (time to first token, or TTFT, and total generation time), throughput (tokens per second), and cost per million tokens. Frontier-model inference runs on specialised accelerators (Nvidia H100/H200, Google TPU, AWS Trainium/Inferentia, Cerebras); common optimisations include batching, quantisation, and speculative decoding.
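The latency metrics fall out of simple timestamps around a streaming response. A minimal sketch, using a stand-in generator with invented timings in place of a real model client:

```python
import time

def stream_tokens():
    """Stand-in for a model's streaming generator (hypothetical timings)."""
    time.sleep(0.30)          # simulated prefill delay before the first token
    for _ in range(64):
        time.sleep(0.02)      # simulated per-token decode step
        yield "tok"

start = time.perf_counter()
ttft = None
n_tokens = 0
for _ in stream_tokens():
    if ttft is None:
        ttft = time.perf_counter() - start   # time to first token
    n_tokens += 1
total = time.perf_counter() - start

print(f"TTFT:             {ttft * 1000:.0f} ms")
print(f"Total time:       {total * 1000:.0f} ms")
print(f"Decode throughput: {(n_tokens - 1) / (total - ttft):.1f} tokens/s")
```

Reporting decode throughput separately from TTFT matters because the two are dominated by different phases (prefill versus token-by-token generation) and respond to different optimisations.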
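Of the optimisations listed, quantisation is the easiest to show in a few lines. Here is a minimal sketch of symmetric per-tensor int8 weight quantisation; it is illustrative only, and production stacks typically add per-channel scales, calibration data, and fused low-precision kernels:

```python
import numpy as np

def quantise_int8(w: np.ndarray):
    """Symmetric int8 quantisation: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)  # toy weight matrix
q, scale = quantise_int8(w)

err = np.abs(w - dequantise(q, scale)).max()
print(f"memory: {w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB")
print(f"max round-trip error: {err:.4f}")
```

The 4x memory reduction is the point: smaller weights mean more of the model fits in accelerator memory and more requests can be batched, at the cost of a small, bounded approximation error.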