Understand transformers.BertTokenizer with Examples – PyTorch Tutorial

June 13, 2023

When we plan to load a BERT model or another transformers model, we may use BertTokenizer to convert a sequence to ids. In this tutorial, we will show you how to use it.

Preliminary

In order to use BertTokenizer, we should import it from transformers first.

# -*- coding: utf-8 -*-
from transformers import BertTokenizer
import torch

Load BertTokenizer

After importing the BertTokenizer class, we should initialize it.

In this tutorial, we will use HuggingFace Transformers to load this tokenizer.

For example, we will use the GanymedeNil/text2vec-large-chinese model, which you can download here:

https://huggingface.co/GanymedeNil/text2vec-large-chinese


Then, we can load it with the example code below.

# Load model from HuggingFace Hub
tokenizer = BertTokenizer.from_pretrained('GanymedeNil/text2vec-large-chinese')
print(tokenizer)
print(tokenizer.vocab_files_names)

Run this code and we will see:

BertTokenizer(name_or_path='GanymedeNil/text2vec-large-chinese', vocab_size=21128, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)
{'vocab_file': 'vocab.txt'}

We will find:

  • BertTokenizer will use vocab.txt as its vocabulary
  • vocab_size = 21128
  • model_max_length = 1000000000000000019884624838656
  • special_tokens = {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}

This information can be found in tokenizer_config.json, tokenizer.json and special_tokens_map.json.
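
Instead of opening these config files, we can also read the same values directly from the tokenizer object. Here is a minimal sketch, assuming the tokenizer loaded above:

# A minimal sketch: inspect the loaded tokenizer directly
# (assumes `tokenizer` was created with BertTokenizer.from_pretrained above)
print(tokenizer.vocab_size)          # 21128
print(tokenizer.model_max_length)    # 1000000000000000019884624838656
print(tokenizer.special_tokens_map)  # {'unk_token': '[UNK]', 'sep_token': '[SEP]', ...}

# Map special tokens to their ids in vocab.txt
print(tokenizer.cls_token, tokenizer.cls_token_id)
print(tokenizer.sep_token, tokenizer.sep_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)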

Convert sentences to ids

We can use tokenizer to convert sentences to ids directly. For example:

sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
print(encoded_input)

Run this code and we will see:

{'input_ids': tensor([[ 101, 1963,  862, 3291, 2940, 5709, 1446, 5308, 2137, 7213, 6121, 1305,
          102],
        [ 101, 5709, 1446, 3291, 3121, 5308, 2137, 7213, 6121, 1305,  102,    0,
            0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}

It means encoded_input is a dictionary-like object, which contains input_ids, token_type_ids and attention_mask.

We can use encoded_input["input_ids"] to get sentence ids.
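
If we want to see the intermediate tokens rather than going straight to ids, we can call tokenize() and convert_tokens_to_ids() separately. Here is a minimal sketch, assuming the same tokenizer; the printed tokens are only an illustration:

# A minimal sketch: tokenize one sentence step by step
# (assumes the same `tokenizer` loaded above)
tokens = tokenizer.tokenize('如何更换花呗绑定银行卡')
print(tokens)  # e.g. ['如', '何', '更', '换', '花', '呗', '绑', '定', '银', '行', '卡']
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)     # raw ids without [CLS]/[SEP], unlike calling tokenizer(...) directly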

Convert ids to sentences

Meanwhile, we can also convert ids back to sentences. Here is the example code:

print(encoded_input["input_ids"].detach().cpu().tolist())
for ids in encoded_input["input_ids"].detach().cpu().tolist():
    print(tokenizer.decode(ids))

Run this code and we will see:

[[101, 1963, 862, 3291, 2940, 5709, 1446, 5308, 2137, 7213, 6121, 1305, 102], [101, 5709, 1446, 3291, 3121, 5308, 2137, 7213, 6121, 1305, 102, 0, 0]]
[CLS] 如 何 更 换 花 呗 绑 定 银 行 卡 [SEP]
[CLS] 花 呗 更 改 绑 定 银 行 卡 [SEP] [PAD] [PAD]

From the result, we can find:

  • [CLS] and [SEP] are added at the beginning and end of each sentence automatically.
  • [PAD] tokens are also added automatically to make sure all sentences have the same length.
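
If we do not want these special tokens in the decoded text, we can pass skip_special_tokens=True to decode() or batch_decode(). Here is a minimal sketch, assuming the encoded_input from the example above:

# A minimal sketch: decode without [CLS], [SEP] and [PAD]
# (assumes `encoded_input` from the example above)
for ids in encoded_input["input_ids"].tolist():
    print(tokenizer.decode(ids, skip_special_tokens=True))

# Or decode the whole batch at once
print(tokenizer.batch_decode(encoded_input["input_ids"], skip_special_tokens=True))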