A tokenizer instance in the transformers library accepts an add_special_tokens parameter. In this tutorial, we will introduce what this parameter means.
LLM tokenizer
For example, the LlamaTokenizer constructor accepts these parameters:
(
    vocab_file,
    unk_token = '<unk>',
    bos_token = '<s>',
    eos_token = '</s>',
    pad_token = None,
    sp_model_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None,
    add_bos_token = True,
    add_eos_token = False,
    clean_up_tokenization_spaces = False,
    use_default_system_prompt = False,
    spaces_between_special_tokens = False,
    legacy = None,
    **kwargs
)
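Notice the add_bos_token and add_eos_token parameters: they control whether the BOS and EOS special tokens are added automatically. As a minimal sketch (assuming a local Llama 2 checkpoint at the model_path used later in this tutorial), we can pass them through from_pretrained, which forwards keyword arguments to the constructor, and then inspect the resulting attributes:

from transformers import LlamaTokenizer

model_path = r"D:\10_LLM\pretrained\LLM\llama2"  # local Llama 2 checkpoint (assumption)

# Keyword arguments are forwarded to the LlamaTokenizer constructor
tokenizer = LlamaTokenizer.from_pretrained(model_path, add_bos_token=True, add_eos_token=False)

print(tokenizer.bos_token, tokenizer.eos_token)          # <s> </s>
print(tokenizer.add_bos_token, tokenizer.add_eos_token)  # True False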
LlamaTokenizer is also a child class of PreTrainedTokenizer. We can confirm this from the source code.
class LlamaTokenizer(PreTrainedTokenizer):
    """
    Construct a Llama tokenizer. Based on byte-level Byte-Pair-Encoding.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
    """
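We can also verify this inheritance at runtime. Here is a minimal sketch:

from transformers import LlamaTokenizer, PreTrainedTokenizer

# LlamaTokenizer is a subclass of PreTrainedTokenizer
print(issubclass(LlamaTokenizer, PreTrainedTokenizer))  # True
print(LlamaTokenizer.__mro__)  # shows the full class hierarchy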
When we call a PreTrainedTokenizer instance to encode text (its __call__ method), it accepts these parameters:
(
    text: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None,
    text_pair: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None,
    text_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]]] = None,
    text_pair_target: typing.Union[str, typing.List[str], typing.List[typing.List[str]], NoneType] = None,
    add_special_tokens: bool = True,
    padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False,
    truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy] = None,
    max_length: typing.Optional[int] = None,
    stride: int = 0,
    is_split_into_words: bool = False,
    pad_to_multiple_of: typing.Optional[int] = None,
    return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None,
    return_token_type_ids: typing.Optional[bool] = None,
    return_attention_mask: typing.Optional[bool] = None,
    return_overflowing_tokens: bool = False,
    return_special_tokens_mask: bool = False,
    return_offsets_mapping: bool = False,
    return_length: bool = False,
    verbose: bool = True,
    **kwargs
)
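Several of these parameters interact with add_special_tokens. For example, return_special_tokens_mask tells us which positions in the output were added as special tokens. A minimal sketch, again assuming a local Llama 2 checkpoint at model_path (the token ids shown as comments are the ones we will see later in this tutorial):

from transformers import LlamaTokenizer

model_path = r"D:\10_LLM\pretrained\LLM\llama2"
tokenizer = LlamaTokenizer.from_pretrained(model_path)

# add_special_tokens defaults to True, so the BOS token <s> is prepended
encoded = tokenizer("Hello word", return_special_tokens_mask=True)
print(encoded["input_ids"])            # e.g. [1, 15043, 1734]
print(encoded["special_tokens_mask"])  # e.g. [1, 0, 0] -> position 0 is a special token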
add_special_tokens is defined in the documentation as:

add_special_tokens (bool, optional, defaults to True) — Whether or not to add special tokens when encoding the sequences. This will use the underlying PreTrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are automatically added to the input ids. This is useful if you want to add bos or eos tokens automatically.
We can see that the build_inputs_with_special_tokens function determines the effect of add_special_tokens. In LlamaTokenizer, it is implemented as follows:
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    bos_token_id = [self.bos_token_id] if self.add_bos_token else []
    eos_token_id = [self.eos_token_id] if self.add_eos_token else []

    output = bos_token_id + token_ids_0 + eos_token_id

    if token_ids_1 is not None:
        output = output + bos_token_id + token_ids_1 + eos_token_id

    return output
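We can call this method directly to see what it adds. A minimal sketch (the token ids below are the ones we get for "Hello word" later in this tutorial):

from transformers import LlamaTokenizer

model_path = r"D:\10_LLM\pretrained\LLM\llama2"
tokenizer = LlamaTokenizer.from_pretrained(model_path)

# By default add_bos_token=True and add_eos_token=False,
# so only the BOS id (1) is prepended.
print(tokenizer.build_inputs_with_special_tokens([15043, 1734]))  # [1, 15043, 1734]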
The effect of add_special_tokens
Here we will use an example to show the effect of add_special_tokens.
from transformers import LlamaConfig  # load llama2 config
from tqdm import tqdm
from transformers import LlamaForCausalLM, LlamaForSequenceClassification, LlamaModel, LlamaTokenizer

model_path = r"D:\10_LLM\pretrained\LLM\llama2"

if __name__ == "__main__":
    config = LlamaConfig.from_pretrained(model_path)
    #print(config)
    tokenizer_1 = LlamaTokenizer.from_pretrained(model_path)

    model_inputs_1 = tokenizer_1(
        ["Hello word"], return_tensors="pt", add_special_tokens=False
    )
    print(model_inputs_1)

    x1 = tokenizer_1.decode(model_inputs_1["input_ids"][0])
    print(x1)
Here add_special_tokens = False, and we will get:
{'input_ids': tensor([[15043, 1734]]), 'attention_mask': tensor([[1, 1]])}
Hello word
If add_special_tokens = True:
model_inputs_1 = tokenizer_1(
    ["Hello word"], return_tensors="pt", add_special_tokens=True
)
We will get:
{'input_ids': tensor([[    1, 15043,  1734]]), 'attention_mask': tensor([[1, 1, 1]])}
<s>Hello word
We should notice: build_inputs_with_special_tokens is only run when add_special_tokens = True.
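Finally, recall from build_inputs_with_special_tokens that add_bos_token and add_eos_token decide which special tokens get added. As a minimal sketch (not part of the original example), if we reload the tokenizer with add_eos_token=True, the EOS token </s> should be appended as well:

tokenizer_2 = LlamaTokenizer.from_pretrained(model_path, add_bos_token=True, add_eos_token=True)

model_inputs_2 = tokenizer_2(
    ["Hello word"], return_tensors="pt", add_special_tokens=True
)
print(model_inputs_2)
# expected: {'input_ids': tensor([[1, 15043, 1734, 2]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

print(tokenizer_2.decode(model_inputs_2["input_ids"][0]))
# expected: <s>Hello word</s>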