Understand tokenizer add_special_tokens with Examples in LLM – LLM Tutorial

January 10, 2024

A tokenizer instance has an add_special_tokens parameter. In this tutorial, we will explain what it means and what effect it has.

LLM tokenizer

For example, LlamaTokenizer accepts these parameters:

(
    vocab_file,
    unk_token = '<unk>',
    bos_token = '<s>',
    eos_token = '</s>',
    pad_token = None,
    sp_model_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None,
    add_bos_token = True,
    add_eos_token = False,
    clean_up_tokenization_spaces = False,
    use_default_system_prompt = False,
    spaces_between_special_tokens = False,
    legacy = None,
    **kwargs
)

LlamaTokenizer is also a child class of PreTrainedTokenizer.

We can confirm this in its source code:

class LlamaTokenizer(PreTrainedTokenizer):
    """
    Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
    """

When we call a PreTrainedTokenizer instance (its __call__ method), it accepts these parameters:

(
    text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None,
    text_pair: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None,
    text_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None,
    text_pair_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None,
    add_special_tokens: bool = True,
    padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False,
    truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = None,
    max_length: typing.Optional[int] = None,
    stride: int = 0,
    is_split_into_words: bool = False,
    pad_to_multiple_of: typing.Optional[int] = None,
    return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None,
    return_token_type_ids: typing.Optional[bool] = None,
    return_attention_mask: typing.Optional[bool] = None,
    return_overflowing_tokens: bool = False,
    return_special_tokens_mask: bool = False,
    return_offsets_mapping: bool = False,
    return_length: bool = False,
    verbose: bool = True,
    **kwargs
)

The add_special_tokens parameter is defined as:

add_special_tokens (bool, optional, defaults to True) — Whether or not to add special tokens when encoding the sequences. This will use the underlying PreTrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are automatically added to the input ids. This is useful if you want to add bos or eos tokens automatically.

In other words, the build_inputs_with_special_tokens() function determines the effect of add_special_tokens.

In LlamaTokenizer, build_inputs_with_special_tokens() is implemented as follows:

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        bos_token_id = [self.bos_token_id] if self.add_bos_token else []
        eos_token_id = [self.eos_token_id] if self.add_eos_token else []

        output = bos_token_id + token_ids_0 + eos_token_id

        if token_ids_1 is not None:
            output = output + bos_token_id + token_ids_1 + eos_token_id

        return output
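
Which special tokens actually get added therefore depends on the tokenizer's add_bos_token and add_eos_token flags. As a minimal sketch (assuming the same local model_path used in the example below), we could load the tokenizer with add_eos_token=True so that </s> is appended as well:

from transformers import LlamaTokenizer

model_path = r"D:\10_LLM\pretrained\LLM\llama2"  # local Llama 2 checkpoint

# add_bos_token defaults to True; add_eos_token defaults to False
tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=True)

ids = tokenizer("Hello word", add_special_tokens=True)["input_ids"]
print(ids)                    # BOS id + token ids + EOS id, e.g. [1, 15043, 1734, 2]
print(tokenizer.decode(ids))  # roughly: <s>Hello word</s>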

The effect of add_special_tokens

Here we will use an example to show the effect of add_special_tokens.

from transformers import LlamaConfig, LlamaTokenizer

# Path to a local Llama 2 checkpoint
model_path = r"D:\10_LLM\pretrained\LLM\llama2"

if __name__ == "__main__":
    config = LlamaConfig.from_pretrained(model_path)  # load the Llama 2 config
    # print(config)
    tokenizer_1 = LlamaTokenizer.from_pretrained(model_path)
    model_inputs_1 = tokenizer_1(
        ["Hello word"], return_tensors="pt", add_special_tokens=False
    )
    print(model_inputs_1)
    x1 = tokenizer_1.decode(model_inputs_1["input_ids"][0])
    print(x1)

Because add_special_tokens = False here, we will get:

{'input_ids': tensor([[15043,  1734]]), 'attention_mask': tensor([[1, 1]])}
Hello word

If we set add_special_tokens = True:

model_inputs_1 = tokenizer_1(
        ["Hello word"], return_tensors="pt", add_special_tokens=True
    )

We will get:

{'input_ids': tensor([[    1, 15043,  1734]]), 'attention_mask': tensor([[1, 1, 1]])}
<s>Hello word

We should notice: build_inputs_with_special_tokens() runs only when add_special_tokens = True. Because LlamaTokenizer defaults to add_bos_token = True and add_eos_token = False, only the BOS token <s> (id 1) is prepended here, and no EOS token is appended.
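
To verify this, a minimal sketch (reusing tokenizer_1 from the example above) can compare the two paths: tokenizing with add_special_tokens=False and then calling build_inputs_with_special_tokens() manually should reproduce the add_special_tokens=True result.

# Reusing tokenizer_1 from the example above
plain = tokenizer_1("Hello word", add_special_tokens=False)["input_ids"]
with_specials = tokenizer_1("Hello word", add_special_tokens=True)["input_ids"]

# Apply build_inputs_with_special_tokens to the plain ids ourselves
manual = tokenizer_1.build_inputs_with_special_tokens(plain)

print(plain)                    # [15043, 1734]
print(with_specials)            # [1, 15043, 1734]
print(manual == with_specials)  # True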