Fundamentals
Vision-Language Model
Quick Answer
A multimodal model that understands both images and text, enabling visual reasoning.
Vision-language models combine visual and textual understanding: they can answer questions about images, describe what they see, read text embedded in images (OCR), and reason about visual content. Modern examples such as Claude's vision capabilities, GPT-4V, and Gemini handle complex visual reasoning. Under the hood, they encode an image into a sequence of visual tokens analogous to text tokens, then process the combined sequence with the same transformer architecture. Applications include document analysis, content moderation, visual question answering, and accessibility tooling. By automating visual tasks that previously required human review, vision-language models enable entirely new workflows.
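The "visual tokens" idea can be illustrated with a minimal sketch of a ViT-style patch embedding, which is one common way vision-language models tokenize images (an assumption for illustration; real models differ in architecture and use learned, not random, projections). The image is cut into fixed-size patches, and each patch is projected to an embedding vector, yielding one "token" per patch:

```python
import numpy as np

def image_to_tokens(image, patch_size=16, d_model=64, seed=0):
    """Split an image into patches and project each to a token embedding.

    Hypothetical helper for illustration: a ViT-style patchify followed by
    a linear projection (random here; learned in a real model).
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Rearrange (H, W, C) into (num_patches, patch_size * patch_size * C)
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(
        -1, patch_size * patch_size * c)
    # Project each flattened patch to a d_model-dimensional embedding
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((patches.shape[1], d_model))
    return patches @ proj  # shape: (num_patches, d_model)

image = np.zeros((224, 224, 3))   # dummy 224x224 RGB image
tokens = image_to_tokens(image)
print(tokens.shape)               # (196, 64): a 14x14 grid of visual tokens
```

These 196 visual tokens can then be concatenated with text tokens and fed through the same transformer, which is what lets one model reason jointly over both modalities.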
Last verified: 2026-04-08