SentencePiece

Quick Answer

A language-independent subword tokenization library that operates on raw text without assuming word boundaries.

SentencePiece is a tokenization library that learns subword units directly from raw text, with no assumption that words are separated by spaces. This makes it language-agnostic: it works as well for languages written without spaces (such as Japanese or Chinese) as for space-delimited ones. It treats the input as a plain character stream, encoding whitespace with a special marker (▁), and learns its vocabulary with either byte-pair encoding (BPE) or a unigram language model. Because tokenization is lossless, decoding the tokens reproduces the original text exactly, which removes the need for language-specific pre- and post-processing: you feed raw text in directly. It is widely used in modern models, including T5 and LLaMA, and is particularly effective for multilingual models.
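The subword-learning idea can be sketched as a toy BPE loop in plain Python. This is a simplified illustration, not the actual SentencePiece implementation; the ▁ marker mimics SentencePiece's convention of encoding spaces as ordinary symbols so no word boundaries need to be assumed:

```python
from collections import Counter

def learn_bpe(text, num_merges):
    # Treat the input as a raw character stream; replace spaces with "▁"
    # (SentencePiece-style) so whitespace is just another symbol.
    symbols = list(text.replace(" ", "▁"))
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair in the current sequence.
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Greedily merge the most frequent pair into one new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, merges

tokens, merges = learn_bpe("low lower lowest", 4)
# merges[:2] → [('l', 'o'), ('lo', 'w')]: frequent character pairs fuse first.
# Tokenization is lossless: joining tokens and restoring spaces
# reproduces the original string exactly.
```

The real library also supports the unigram model, which starts from a large candidate vocabulary and prunes it probabilistically, but the greedy merge loop above captures the core of the BPE variant.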

Last verified: 2026-04-08