
Glossary entry

Quantization

Reducing the numerical precision of model weights (e.g., from 16-bit to 4-bit) to shrink memory use and speed up inference at a small cost in quality.

Quantization reduces the numerical precision of model weights. A 70B-parameter model in 16-bit precision needs about 140GB of memory for weights alone; quantized to 4-bit it needs about 35GB, small enough to run on a single 48GB workstation GPU or to split across two 24GB consumer cards.
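
The arithmetic behind those figures is simple: weight memory is parameter count times bits per weight, divided by 8 bits per byte. A quick sketch in Python (weights only; activations, the KV cache, and quantization metadata add real-world overhead):

```python
# Weight-memory footprint of a 70B-parameter model at various precisions.
params = 70e9

for bits in (16, 8, 4):
    gb = params * bits / 8 / 1e9  # bytes per weight = bits / 8
    print(f"{bits:>2}-bit: {gb:.0f} GB")

# Output:
# 16-bit: 140 GB
#  8-bit:  70 GB
#  4-bit:  35 GB
```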

Quantization is essential for running open-source models locally and for cost-effective serving at scale. Quality loss at 8-bit is usually undetectable; at 4-bit there is a measurable but tolerable drop on benchmarks. Tools and formats like llama.cpp, GGUF, and AWQ make quantization accessible to non-researchers.
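
To make the mechanism concrete, here is a minimal sketch of symmetric per-tensor 8-bit quantization in NumPy. It is illustrative only: production schemes such as those in llama.cpp, GGUF, and AWQ use block-wise scales, calibration data, and packed 4-bit layouts rather than a single whole-tensor scale.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: map floats onto int8 via one scale."""
    scale = np.abs(w).max() / 127.0          # largest weight maps to +/-127
    q = np.round(w / scale).astype(np.int8)  # store 1 byte per weight
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights at inference time."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float16)  # a toy 16-bit weight tensor
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print("memory:", w.nbytes, "bytes (fp16) ->", q.nbytes, "bytes (int8)")
print("max abs error:", float(np.abs(w.astype(np.float32) - w_hat).max()))
```

Four-bit schemes follow the same idea but pack two weights per byte and keep a separate scale for each small block of weights, which limits the rounding error that a single per-tensor scale would introduce.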


Written by

John Ethan

Founder & Editor-in-Chief

Founder of MytheAi. Tracking and reviewing AI and SaaS tools since January 2026. Built MytheAi out of frustration with pay-to-rank listicles and SEO-driven AI directories that prioritize ad revenue over honest guidance. Hands-on testing across 500+ tools to date.


See also: all 30 terms · How we research · Last reviewed 2026