GPT tokenizer

Gabriel Gruenberger

Posted Aug 9, 2024

Using byte pair encoding (BPE), I created a tokenizer that converts natural language into indices that can be passed into the token embedding table of a language model. BPE is the algorithm used by OpenAI, and it is useful because the vocabulary size can be adjusted to the requirements of the individual project when training the tokenizer.
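As a rough sketch of how such a BPE trainer could look, here is a minimal Python version; the function names (`get_pair_counts`, `merge`, `train`) are illustrative assumptions, not the project's actual code.

```python
# Minimal BPE training sketch; an assumed illustration, not the project's code.
from collections import Counter

def get_pair_counts(ids):
    # Count how often each adjacent pair of token ids occurs.
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` in `ids` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, vocab_size):
    # Start from raw UTF-8 bytes (ids 0-255) and repeatedly merge the
    # most frequent adjacent pair until the vocabulary reaches vocab_size.
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> merged id, in the order the merges were learned
    for new_id in range(256, vocab_size):
        counts = get_pair_counts(ids)
        if not counts:
            break
        top_pair = counts.most_common(1)[0][0]
        merges[top_pair] = new_id
        ids = merge(ids, top_pair, new_id)
    return merges
```

Because training starts from raw bytes, the vocabulary size is a free parameter: a larger `vocab_size` simply means more merge rounds.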


The tokenizer can be used to convert text into an array of indices and to convert the indices the model outputs back into text.
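To illustrate that round trip, here is an encode/decode sketch built on the helpers from the training example above; again, the names and API are assumptions rather than the project's actual interface.

```python
def build_vocab(merges):
    # Map each token id back to the byte sequence it represents.
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return vocab

def encode(text, merges):
    # Greedily apply the learned merges, earliest-learned first,
    # reusing get_pair_counts() and merge() from the sketch above.
    ids = list(text.encode("utf-8"))
    while len(ids) > 1:
        counts = get_pair_counts(ids)
        pair = min(counts, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no learned merge applies any more
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids, merges):
    # Concatenate each id's byte sequence and decode back to text.
    vocab = build_vocab(merges)
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# Example round trip (hypothetical training text and vocabulary size):
merges = train("hello hello world", vocab_size=260)
ids = encode("hello world", merges)
print(ids)                  # a short list of token ids
print(decode(ids, merges))  # "hello world"
```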
