
vocabulary

Dave the human
Author
Homo sapiens in the loop

The set of distinct tokens that a tokenizer can map text to. Vocabulary size is the number of tokens in that set.

A larger vocabulary in [[LLM]]s:

  • increases the model size because the [[embedding]] and output layers must store more token representations
  • increases the per-token compute cost of producing next-token probabilities, because the output layer scores every token in the vocabulary
  • allows more words to be represented as single tokens rather than being split into subword components; this can shorten sequences, since fewer tokens are needed to represent a sentence
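
The first two points can be put into rough numbers. The sketch below is a back-of-the-envelope estimate, not a measurement of any real model: it assumes a plain embedding matrix and a plain linear output layer, each of shape vocabulary size by hidden size, with no biases. The function names, the `tied` flag (for models that share one matrix between both layers), and the sizes in the loop are all made up for illustration.

```python
def vocab_dependent_params(vocab_size: int, d_model: int, tied: bool = False) -> int:
    """Parameters in the embedding matrix plus the output projection (no biases)."""
    embedding = vocab_size * d_model
    # With tied weights the output layer reuses the embedding matrix.
    output = 0 if tied else vocab_size * d_model
    return embedding + output


def logits_flops_per_token(vocab_size: int, d_model: int) -> int:
    """Approximate FLOPs to turn one hidden state into logits (one multiply and one add per weight)."""
    return 2 * vocab_size * d_model


if __name__ == "__main__":
    d_model = 4096  # hypothetical hidden size
    for vocab_size in (32_000, 128_000):  # hypothetical vocabulary sizes
        params = vocab_dependent_params(vocab_size, d_model)
        flops = logits_flops_per_token(vocab_size, d_model)
        print(f"vocab={vocab_size:,} params={params:,} logits_flops_per_token={flops:,}")
```

Both quantities grow linearly with vocabulary size, so under these assumptions a vocabulary four times larger means four times the embedding and output parameters and four times the cost of the final projection.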

So the tradeoff is between a larger vocabulary with somewhat higher per-token cost and a smaller vocabulary that often produces longer token sequences.
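
The sequence-length side of the tradeoff can be seen with a toy tokenizer. This is a minimal sketch that uses greedy longest-match splitting over a hand-written vocabulary. It is an assumption chosen for brevity, not how any particular production tokenizer works, and `greedy_tokenize` and both vocabularies are hypothetical.

```python
from __future__ import annotations


def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    max_len = max(len(token) for token in vocab)
    tokens: list[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no vocabulary entry matches {text[i]!r}")
    return tokens


word = "tokenization"
small_vocab = set(word)                            # single characters only
large_vocab = small_vocab | {"token", "ization"}   # adds two subword entries

if __name__ == "__main__":
    print(len(greedy_tokenize(word, small_vocab)), greedy_tokenize(word, small_vocab))
    print(len(greedy_tokenize(word, large_vocab)), greedy_tokenize(word, large_vocab))
```

Adding two entries to the vocabulary cuts the same word from twelve tokens to two, which is the effect the larger vocabulary pays for with the extra per-token cost.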


