When we plan to load a BERT model with the transformers library, we often use BertTokenizer to convert a sequence to token ids. In this tutorial, we will show you how to use it.
Preliminary
In order to use BertTokenizer, we should import it from transformers first.
# -*- coding: utf-8 -*-
from transformers import BertTokenizer
import torch
Load BertTokenizer
After we have imported BertTokenizer, we should initialize it.
In this tutorial, we will use HuggingFace Transformers to load this tokenizer.
For example, we will use the GanymedeNil/text2vec-large-chinese model, which you can download here:
https://huggingface.co/GanymedeNil/text2vec-large-chinese
Then, we can load it with the example code below.
# Load model from HuggingFace Hub
tokenizer = BertTokenizer.from_pretrained('GanymedeNil/text2vec-large-chinese')
print(tokenizer)
print(tokenizer.vocab_files_names)
Run this code and we will see:
BertTokenizer(name_or_path='GanymedeNil/text2vec-large-chinese', vocab_size=21128, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)
{'vocab_file': 'vocab.txt'}
We will find:
- BertTokenizer uses vocab.txt as its vocabulary
- vocab_size = 21128
- model_max_length = 1000000000000000019884624838656
- special_tokens = {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}
This information can be found in tokenizer_config.json, tokenizer.json and special_tokens_map.json.
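We do not have to open these JSON files to check these values; the tokenizer object exposes them directly. Here is a small sketch (not part of the original example) that reads them from the tokenizer loaded above:

# Inspect the loaded tokenizer directly
print(tokenizer.vocab_size)          # 21128
print(tokenizer.model_max_length)    # 1000000000000000019884624838656
print(tokenizer.special_tokens_map)  # {'unk_token': '[UNK]', 'sep_token': '[SEP]', ...}
print(tokenizer.cls_token_id)        # 101 for this vocabulary, matching the ids below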
Convert sentences to ids
We can use the tokenizer to convert sentences to ids directly. For example:
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
print(encoded_input)
Run this code and we will see:
{'input_ids': tensor([[ 101, 1963, 862, 3291, 2940, 5709, 1446, 5308, 2137, 7213, 6121, 1305, 102], [ 101, 5709, 1446, 3291, 3121, 5308, 2137, 7213, 6121, 1305, 102, 0, 0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}
It means encoded_input is a dictionary-like object, which contains input_ids, token_type_ids and attention_mask.
We can use encoded_input["input_ids"] to get the sentence ids.
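For example, here is a small sketch (reusing the encoded_input above) that pulls out the ids and attention mask, and maps the ids of the first sentence back to tokens:

input_ids = encoded_input["input_ids"]            # tensor of shape [2, 13]
attention_mask = encoded_input["attention_mask"]  # 1 = real token, 0 = [PAD]
print(input_ids[0])
# convert_ids_to_tokens() maps ids back to tokens without joining them into a string
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))
# ['[CLS]', '如', '何', '更', '换', '花', '呗', '绑', '定', '银', '行', '卡', '[SEP]']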
Convert ids to sentences
Meanwhile, we can also convert ids back to sentences. Here is the example code:
print(encoded_input["input_ids"].detach().cpu().tolist())

for ids in encoded_input["input_ids"].detach().cpu().tolist():
    print(tokenizer.decode(ids))
Run this code and we will see:
[[101, 1963, 862, 3291, 2940, 5709, 1446, 5308, 2137, 7213, 6121, 1305, 102], [101, 5709, 1446, 3291, 3121, 5308, 2137, 7213, 6121, 1305, 102, 0, 0]]
[CLS] 如 何 更 换 花 呗 绑 定 银 行 卡 [SEP]
[CLS] 花 呗 更 改 绑 定 银 行 卡 [SEP] [PAD] [PAD]
From the result, we can find:
- [CLS] and [SEP] are added at the beginning and end of each sentence automatically.
- [PAD] tokens are also added automatically to make sure all sentences have the same length (see the sketch below for how to skip special tokens when decoding).
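If we do not want these special tokens in the decoded text, decode() accepts a skip_special_tokens argument. A minimal sketch based on the code above:

# Decode again, but drop [CLS], [SEP] and [PAD] from the output
for ids in encoded_input["input_ids"].tolist():
    print(tokenizer.decode(ids, skip_special_tokens=True))
# 如 何 更 换 花 呗 绑 定 银 行 卡
# 花 呗 更 改 绑 定 银 行 卡

Note that the characters are still separated by spaces, because BertTokenizer splits Chinese text into single characters and decode() joins tokens with spaces.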