A multimodal model accepts or produces more than one modality rather than text alone. GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro are all multimodal: each accepts text plus images, and Gemini 1.5 Pro also accepts audio and video natively.
Multimodal capability matters for screenshot understanding, document OCR, video summarisation, and any workflow where text alone is insufficient. By 2026 most frontier models have native vision; audio is closing the gap; full video understanding still trails text quality but is improving rapidly.
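As a concrete sketch of how image input reaches such a model: in OpenAI's Chat Completions API, an image is sent as an `image_url` content part alongside text, typically base64-encoded into a data URL (Claude and Gemini accept similar mixed-content messages, though with different field names). The helper below only constructs the message payload; the model name and any network call are left out.

```python
import base64

def build_vision_message(prompt: str, image_bytes: bytes,
                         media_type: str = "image/png") -> dict:
    """Build a user message mixing text and one image, in the
    content-parts style used by OpenAI's Chat Completions API."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # The image travels as a data URL inside an image_url part.
            {"type": "image_url",
             "image_url": {"url": f"data:{media_type};base64,{b64}"}},
        ],
    }

# Stand-in bytes for a real screenshot or scanned page.
msg = build_vision_message("What does this screenshot show?", b"\x89PNG...")
```

The same payload shape covers the workflows above: a screenshot, a scanned document for OCR, or a sampled video frame all enter as image parts next to the text instruction.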