Architecture
Tokenizer
Quick Answer
An algorithm that converts text into tokens for model input.
A tokenizer breaks text into tokens that the model can process. Tokenizers operate at different granularities: character-level, word-level, or subword-level. Subword tokenizers such as byte-pair encoding (BPE) and SentencePiece are standard for LLMs because they balance coverage (avoiding unknown tokens) with efficiency (reasonable token counts). Tokenization affects both model quality and efficiency, and different models use different tokenizers, so using a model's own tokenizer is essential for accurate token counting and reproducible results. Tokenization is deterministic, and modern subword tokenizers are generally designed so the original text can be recovered by joining the tokens back together.
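The subword idea can be sketched with a greedy longest-match tokenizer over a toy vocabulary. This is a simplification, not real BPE (which applies learned merge rules in order), and the vocabulary here is invented for illustration:

```python
# Toy vocabulary; a real subword tokenizer learns its vocabulary from data.
VOCAB = {"low", "est", "er", "wid", "token", "ize"}

def tokenize(word, vocab=VOCAB):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:   # longest match first
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append("<unk>")   # no match: emit an unknown token
            i += 1
    return tokens

print(tokenize("lowest"))   # splits into known subwords
print(tokenize("wider"))
```

Note that the process is deterministic (the same input always yields the same tokens), and for inputs fully covered by the vocabulary, joining the tokens recovers the original text.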
Last verified: 2026-04-08