tiktoken is a fast BPE tokeniser for OpenAI’s models. In this tutorial, we will walk through some examples showing how to use it.
Install
You can install it with pip:
pip install tiktoken
Supported Models
tiktoken supports several OpenAI models; you can find the full model-to-encoding mapping in tiktoken/model.py.
For example:
MODEL_TO_ENCODING: dict[str, str] = {
    # chat
    "gpt-4o": "o200k_base",
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "gpt-3.5": "cl100k_base",  # Common shorthand
    "gpt-35-turbo": "cl100k_base",  # Azure deployment name
    # base
    "davinci-002": "cl100k_base",
    "babbage-002": "cl100k_base",
    # embeddings
    "text-embedding-ada-002": "cl100k_base",
    "text-embedding-3-small": "cl100k_base",
    "text-embedding-3-large": "cl100k_base",
    # DEPRECATED MODELS
    # text (DEPRECATED)
    "text-davinci-003": "p50k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-001": "r50k_base",
    "text-curie-001": "r50k_base",
    ...
}
Then, we can use it to convert a string to token IDs.
Create an encoder based on a model name
For example:
import tiktoken

content = "this is a test example, tutorialexample.com"
ENCODER = tiktoken.encoding_for_model("gpt-4o")
Here we use the gpt-4o model name to create an encoder.
tokens = ENCODER.encode(content)
print(tokens)
Running this code, we will get:
[851, 382, 261, 1746, 4994, 11, 24000, 18582, 1136]
Decode
We can also use the token IDs to recover the original string.
ctx = ENCODER.decode(tokens)
print(ctx)
Output:
this is a test example, tutorialexample.com
If you want to know which token each token ID corresponds to, you can use the code below:
print(ENCODER.decode_tokens_bytes(tokens))
Output:
[b'this', b' is', b' a', b' test', b' example', b',', b' tutorial', b'example', b'.com']