Python tiktoken Tutorial with Examples

By | October 29, 2024

tiktoken is a fast BPE tokeniser for OpenAI’s models. In this tutorial, we will use some examples to show you how to use it.

Install

You can use pip to install it.

pip install tiktoken

Supported Models

tiktoken supports several OpenAI models; you can find them in tiktoken/model.py.

For example:

MODEL_TO_ENCODING: dict[str, str] = {
    # chat
    "gpt-4o": "o200k_base",
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "gpt-3.5": "cl100k_base",  # Common shorthand
    "gpt-35-turbo": "cl100k_base",  # Azure deployment name
    # base
    "davinci-002": "cl100k_base",
    "babbage-002": "cl100k_base",
    # embeddings
    "text-embedding-ada-002": "cl100k_base",
    "text-embedding-3-small": "cl100k_base",
    "text-embedding-3-large": "cl100k_base",
    # DEPRECATED MODELS
    # text (DEPRECATED)
    "text-davinci-003": "p50k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-001": "r50k_base",
    "text-curie-001": "r50k_base",
    ...
}


Then, we can use it to convert a string to token IDs.

Create an encoder based on a model name

For example:

import tiktoken

content = "this is a test example, tutorialexample.com"
ENCODER = tiktoken.encoding_for_model("gpt-4o")

Here we use the gpt-4o model to create an encoder.

tokens = ENCODER.encode(content)
print(tokens)

Running this code, we get:

[851, 382, 261, 1746, 4994, 11, 24000, 18582, 1136]

Decode

We can also use the token IDs to recover the original string.

ctx = ENCODER.decode(tokens)
print(ctx)

Output:

this is a test example, tutorialexample.com

If you want to know which token each token ID corresponds to, you can use the code below:

print(ENCODER.decode_tokens_bytes(tokens))

Output:

[b'this', b' is', b' a', b' test', b' example', b',', b' tutorial', b'example', b'.com']
