Understand transformers.BertTokenizer with Examples – PyTorch Tutorial

June 13, 2023

When we plan to load a BERT model or another transformers model, we may use BertTokenizer to convert a sequence to ids. In this tutorial, we will show you how to use it.

Preliminary

In order to use BertTokenizer, we should import it from transformers first.

# -*- coding: utf-8 -*-
from transformers import BertTokenizer
import torch

Load BertTokenizer

After importing the BertTokenizer class, we should initialize it.

In this tutorial, we will use HuggingFace Transformers to load this tokenizer.

For example, we will use the GanymedeNil/text2vec-large-chinese model, which you can download here:

https://huggingface.co/GanymedeNil/text2vec-large-chinese


Then, we can load it with the example code below.

# Load model from HuggingFace Hub
tokenizer = BertTokenizer.from_pretrained('GanymedeNil/text2vec-large-chinese')
print(tokenizer)
print(tokenizer.vocab_files_names)

Run this code and we will see:

BertTokenizer(name_or_path='GanymedeNil/text2vec-large-chinese', vocab_size=21128, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)
{'vocab_file': 'vocab.txt'}

We will find:

  • BertTokenizer will use vocab.txt as its vocabulary
  • vocab_size = 21128
  • model_max_length = 1000000000000000019884624838656
  • special_tokens = {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}

This information can be found in tokenizer_config.json, tokenizer.json and special_tokens_map.json.
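
Instead of opening these config files, we can also read the same values directly from the tokenizer object. Here is a minimal sketch, assuming the tokenizer loaded above:

# A minimal sketch: inspect the loaded tokenizer directly
# (assumes `tokenizer` was created with BertTokenizer.from_pretrained above)
print(tokenizer.vocab_size)          # 21128
print(tokenizer.model_max_length)    # 1000000000000000019884624838656
print(tokenizer.special_tokens_map)  # {'unk_token': '[UNK]', 'sep_token': '[SEP]', ...}

# Map special tokens to their ids in vocab.txt
print(tokenizer.cls_token, tokenizer.cls_token_id)
print(tokenizer.sep_token, tokenizer.sep_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)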

Convert sentences to ids

We can use tokenizer to convert sentences to ids directly. For example:

sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
print(encoded_input)

Run this code and we will see:

{'input_ids': tensor([[ 101, 1963,  862, 3291, 2940, 5709, 1446, 5308, 2137, 7213, 6121, 1305,
          102],
        [ 101, 5709, 1446, 3291, 3121, 5308, 2137, 7213, 6121, 1305,  102,    0,
            0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}

It means encoded_input is a dictionary-like object, which contains input_ids, token_type_ids and attention_mask.

We can use encoded_input["input_ids"] to get sentence ids.
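
If we want to see the intermediate tokens rather than going straight to ids, we can call tokenize() and convert_tokens_to_ids() separately. Here is a minimal sketch, assuming the same tokenizer; the printed tokens are only an illustration:

# A minimal sketch: tokenize one sentence step by step
# (assumes the same `tokenizer` loaded above)
tokens = tokenizer.tokenize('如何更换花呗绑定银行卡')
print(tokens)  # e.g. ['如', '何', '更', '换', '花', '呗', '绑', '定', '银', '行', '卡']
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)     # raw ids without [CLS]/[SEP], unlike calling tokenizer(...) directly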

Convert ids to sentences

Meanwhile, we can also convert ids back to sentences. Here is the example code:

print(encoded_input["input_ids"].detach().cpu().tolist())
for ids in encoded_input["input_ids"].detach().cpu().tolist():
    print(tokenizer.decode(ids))

Run this code and we will see:

[[101, 1963, 862, 3291, 2940, 5709, 1446, 5308, 2137, 7213, 6121, 1305, 102], [101, 5709, 1446, 3291, 3121, 5308, 2137, 7213, 6121, 1305, 102, 0, 0]]
[CLS] 如 何 更 换 花 呗 绑 定 银 行 卡 [SEP]
[CLS] 花 呗 更 改 绑 定 银 行 卡 [SEP] [PAD] [PAD]

From the result, we can find:

  • [CLS] and [SEP] are added at the beginning and end of each sentence automatically.
  • [PAD] tokens are also added automatically to make sure all sentences have the same length.
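
If we do not want these special tokens in the decoded text, we can pass skip_special_tokens=True to decode() or batch_decode(). Here is a minimal sketch, assuming the encoded_input from the example above:

# A minimal sketch: decode without [CLS], [SEP] and [PAD]
# (assumes `encoded_input` from the example above)
for ids in encoded_input["input_ids"].tolist():
    print(tokenizer.decode(ids, skip_special_tokens=True))

# Or decode the whole batch at once
print(tokenizer.batch_decode(encoded_input["input_ids"], skip_special_tokens=True))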