Inference is the runtime step in which a trained model processes a new input and returns an output. Training happens once (or periodically, for fine-tuning); inference happens on every request, so its cumulative cost typically dominates the operating budget of a production AI system.
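A back-of-the-envelope sketch makes the point. All figures below are hypothetical placeholders, not real prices or traffic numbers; the only claim is the shape of the arithmetic:

```python
# Compare a one-time training bill with cumulative inference spend.
# Every constant here is a hypothetical placeholder.

TRAINING_COST_USD = 50_000_000     # one-time training cost (hypothetical)
COST_PER_M_TOKENS_USD = 5.00       # inference price per million tokens (hypothetical)
TOKENS_PER_REQUEST = 1_000         # prompt + completion (hypothetical)
REQUESTS_PER_DAY = 50_000_000      # production traffic (hypothetical)

daily_tokens_m = REQUESTS_PER_DAY * TOKENS_PER_REQUEST / 1e6
daily_inference_cost = daily_tokens_m * COST_PER_M_TOKENS_USD
breakeven_days = TRAINING_COST_USD / daily_inference_cost

print(f"Daily inference cost: ${daily_inference_cost:,.0f}")
print(f"Inference spend matches the training bill after {breakeven_days:.0f} days")
```

Under these made-up numbers, inference spend overtakes the entire training bill in well under a year, which is why per-request efficiency is the lever that matters.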
Key metrics: latency (time to first token, or TTFT, and total generation time), throughput (tokens per second), and cost per million tokens. Frontier-model inference runs on specialised accelerators (Nvidia H100/H200, Google TPU, AWS Trainium/Inferentia, Cerebras); common optimisations include batching, quantisation, and speculative decoding.
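The latency metrics fall out of simple timestamps around a streaming response. A minimal sketch, using a stand-in generator with invented timings in place of a real model client:

```python
import time

def stream_tokens():
    """Stand-in for a model's streaming generator (hypothetical timings)."""
    time.sleep(0.30)          # simulated prefill delay before the first token
    for _ in range(64):
        time.sleep(0.02)      # simulated per-token decode step
        yield "tok"

start = time.perf_counter()
ttft = None
n_tokens = 0
for _ in stream_tokens():
    if ttft is None:
        ttft = time.perf_counter() - start   # time to first token
    n_tokens += 1
total = time.perf_counter() - start

print(f"TTFT:             {ttft * 1000:.0f} ms")
print(f"Total time:       {total * 1000:.0f} ms")
print(f"Decode throughput: {(n_tokens - 1) / (total - ttft):.1f} tokens/s")
```

Reporting decode throughput separately from TTFT matters because the two are dominated by different phases (prefill versus token-by-token generation) and respond to different optimisations.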
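Of the optimisations listed, quantisation is the easiest to show in a few lines. Here is a minimal sketch of symmetric per-tensor int8 weight quantisation; it is illustrative only, and production stacks typically add per-channel scales, calibration data, and fused low-precision kernels:

```python
import numpy as np

def quantise_int8(w: np.ndarray):
    """Symmetric int8 quantisation: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from the int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4096, 4096)).astype(np.float32)  # toy weight matrix
q, scale = quantise_int8(w)

err = np.abs(w - dequantise(q, scale)).max()
print(f"memory: {w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB")
print(f"max round-trip error: {err:.4f}")
```

The 4x memory reduction is the point: smaller weights mean more of the model fits in accelerator memory and more requests can be batched, at the cost of a small, bounded approximation error.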