GPT tokenizer

Gabriel Gruenberger

Posted Aug 9, 2024

Using byte pair encoding (BPE), I created a tokenizer that converts natural language into indices that can be passed into the token embedding table of a language model. BPE is the algorithm used by OpenAI, and it is useful because the vocabulary size can be adjusted to the requirements of the individual project when training the tokenizer.
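As a rough sketch of how such a BPE trainer could look, here is a minimal Python version; the function names (`get_pair_counts`, `merge`, `train`) are illustrative assumptions, not the project's actual code.

```python
# Minimal BPE training sketch; an assumed illustration, not the project's code.
from collections import Counter

def get_pair_counts(ids):
    # Count how often each adjacent pair of token ids occurs.
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` in `ids` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, vocab_size):
    # Start from raw UTF-8 bytes (ids 0-255) and repeatedly merge the
    # most frequent adjacent pair until the vocabulary reaches vocab_size.
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> merged id, in the order the merges were learned
    for new_id in range(256, vocab_size):
        counts = get_pair_counts(ids)
        if not counts:
            break
        top_pair = counts.most_common(1)[0][0]
        merges[top_pair] = new_id
        ids = merge(ids, top_pair, new_id)
    return merges
```

Because training starts from raw bytes, the vocabulary size is a free parameter: a larger `vocab_size` simply means more merge rounds.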


The tokenizer can be used to convert text into an array of indices and to convert the indices the model outputs back into text.
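To illustrate that round trip, here is an encode/decode sketch built on the helpers from the training example above; again, the names and API are assumptions rather than the project's actual interface.

```python
def build_vocab(merges):
    # Map each token id back to the byte sequence it represents.
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return vocab

def encode(text, merges):
    # Greedily apply the learned merges, earliest-learned first,
    # reusing get_pair_counts() and merge() from the sketch above.
    ids = list(text.encode("utf-8"))
    while len(ids) > 1:
        counts = get_pair_counts(ids)
        pair = min(counts, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no learned merge applies any more
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids, merges):
    # Concatenate each id's byte sequence and decode back to text.
    vocab = build_vocab(merges)
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# Example round trip (hypothetical training text and vocabulary size):
merges = train("hello hello world", vocab_size=260)
ids = encode("hello world", merges)
print(ids)                  # a short list of token ids
print(decode(ids, merges))  # "hello world"
```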
