A multimodal model accepts or produces more than one modality rather than text alone. GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro are all multimodal: each accepts text plus images, and Gemini 1.5 Pro also accepts audio and video natively.
Multimodal capability matters for screenshot understanding, document OCR, video summarisation, and any workflow where text alone is insufficient. By 2026 most frontier models have native vision; audio is closing the gap; full video understanding still trails text quality but is improving rapidly.
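As a concrete sketch of how image input reaches such a model: in OpenAI's Chat Completions API, an image is sent as an `image_url` content part alongside text, typically base64-encoded into a data URL (Claude and Gemini accept similar mixed-content messages, though with different field names). The helper below only constructs the message payload; the model name and any network call are left out.

```python
import base64

def build_vision_message(prompt: str, image_bytes: bytes,
                         media_type: str = "image/png") -> dict:
    """Build a user message mixing text and one image, in the
    content-parts style used by OpenAI's Chat Completions API."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # The image travels as a data URL inside an image_url part.
            {"type": "image_url",
             "image_url": {"url": f"data:{media_type};base64,{b64}"}},
        ],
    }

# Stand-in bytes for a real screenshot or scanned page.
msg = build_vision_message("What does this screenshot show?", b"\x89PNG...")
```

The same payload shape covers the workflows above: a screenshot, a scanned document for OCR, or a sampled video frame all enter as image parts next to the text instruction.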