Using byte pair encoding (BPE), I created a tokenizer that converts natural language text into token indices, which can then be passed into the token embedding table of a language model. BPE is the approach used by OpenAI's tokenizers, and it is useful because the vocabulary size can be adjusted to the requirements of the individual project when training the tokenizer.
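
A minimal sketch of how a byte-level BPE tokenizer like this might work (the function names, the toy corpus, and the simplified encoder that applies merges in training order are illustrative assumptions, not taken from any particular library). Training starts from the 256 raw byte values and repeatedly merges the most frequent adjacent pair until the target vocabulary size is reached, which is where the adjustable vocabulary size comes from:

```python
from collections import Counter


def get_pair_counts(ids):
    """Count occurrences of each adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Learn merge rules until the vocabulary reaches `vocab_size`.

    The base vocabulary is the 256 raw byte values, so `vocab_size`
    must be at least 256. Returns an ordered mapping of
    (id, id) -> new merged id.
    """
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        counts = get_pair_counts(ids)
        if not counts:
            break  # nothing left to merge
        pair = counts.most_common(1)[0][0]  # most frequent adjacent pair
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Convert text to token ids by applying learned merges in order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


if __name__ == "__main__":
    corpus = "low lower lowest low low"
    # vocab_size=260 means 4 merges on top of the 256 base bytes
    merges = train_bpe(corpus, vocab_size=260)
    print(encode("lowest", merges))
```

Raising `vocab_size` learns more merges, yielding shorter token sequences at the cost of a larger embedding table; lowering it does the opposite, which is the trade-off tuned when training the tokenizer for a given project.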