The number of distinct tokens in a tokenizer's vocabulary, i.e. the set of tokens it can produce and the model can predict.
A larger vocabulary in [[LLM]]s:
- increases the model size because the [[embedding]] and output layers must store more token representations
- increases the per-token compute cost of producing next-token probabilities, since the output projection and softmax scale linearly with the vocabulary size
- allows more words to be represented as single tokens rather than being split into subword components; this can reduce the sequence length, since fewer tokens are required to represent a sentence
So the tradeoff is between a larger vocabulary with somewhat higher per-token cost and a smaller vocabulary that often produces longer token sequences.
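
A rough sketch of both sides of this tradeoff. The function names, the example sizes (`32_000`, `128_000`, `d_model = 4096`) and the greedy longest-match tokenizer are all illustrative assumptions, not the behavior of any particular model or tokenizer library; real tokenizers (e.g. BPE) build and apply their vocabularies differently.

```python
from __future__ import annotations


def vocab_params(vocab_size: int, d_model: int, tied: bool = False) -> int:
    """Parameters held by the embedding and output matrices.

    Assumes one d_model-sized vector per token in each matrix; with tied
    weights the two matrices share storage.
    """
    embedding = vocab_size * d_model
    output = 0 if tied else vocab_size * d_model
    return embedding + output


def logits_flops_per_token(vocab_size: int, d_model: int) -> int:
    """Approximate FLOPs of the final projection for one position
    (one multiply and one add per weight)."""
    return 2 * vocab_size * d_model


def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Toy longest-match-first tokenizer; unknown spans fall back to
    single characters. Only meant to show how vocabulary coverage
    affects sequence length."""
    tokens: list[str] = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, shrinking to one character.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


if __name__ == "__main__":
    d_model = 4096  # hypothetical hidden size
    for vocab_size in (32_000, 128_000):  # hypothetical vocabulary sizes
        print(vocab_size,
              vocab_params(vocab_size, d_model),
              logits_flops_per_token(vocab_size, d_model))

    small_vocab = {"tok", "en"}
    large_vocab = {"tok", "en", "token", "ization"}
    print(greedy_tokenize("tokenization", small_vocab))
    print(greedy_tokenize("tokenization", large_vocab))
```

In this toy setup, quadrupling the vocabulary quadruples both the embedding/output parameter count and the per-token projection cost, while the richer vocabulary covers the same word with far fewer tokens.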