Architecture
Tokenizer
Quick Answer
An algorithm that converts text into tokens for model input.
A tokenizer breaks text into tokens that the model can process. Tokenizers operate at different granularities: character-level, word-level, or subword-level. Subword tokenizers such as byte-pair encoding (BPE) and SentencePiece are standard for LLMs because they balance coverage (avoiding unknown tokens) with efficiency (reasonable token counts). Tokenization affects both model quality and efficiency, and different models use different tokenizers, so using a model's own tokenizer is essential for accurate token counting and reproducible results. Tokenization is deterministic, and modern subword tokenizers are generally designed so the original text can be recovered by joining the tokens back together.
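The subword idea can be sketched with a greedy longest-match tokenizer over a toy vocabulary. This is a simplification, not real BPE (which applies learned merge rules in order), and the vocabulary here is invented for illustration:

```python
# Toy vocabulary; a real subword tokenizer learns its vocabulary from data.
VOCAB = {"low", "est", "er", "wid", "token", "ize"}

def tokenize(word, vocab=VOCAB):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:   # longest match first
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append("<unk>")   # no match: emit an unknown token
            i += 1
    return tokens

print(tokenize("lowest"))   # splits into known subwords
print(tokenize("wider"))
```

Note that the process is deterministic (the same input always yields the same tokens), and for inputs fully covered by the vocabulary, joining the tokens recovers the original text.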
Last verified: 2026-04-08