Why are "add_bos_token" and "add_eos_token" missing in tokenizer_config.json?

#140 · opened by ekurtic

Without these two fields in tokenizer_config.json, I find it impossible to initialize the Llama-3 tokenizer with BOS-token addition disabled.

This behaves as expected:

from transformers import AutoTokenizer
llama2_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama3_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

>>> llama2_tok("hello")
{'input_ids': [1, 22172], 'attention_mask': [1, 1]}

>>> llama3_tok("hello")
{'input_ids': [128000, 15339], 'attention_mask': [1, 1]}

As we can see, the BOS token is added correctly by both tokenizers.
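
As a quick sanity check (a minimal sketch reusing the two tokenizers loaded above), we can decode the leading ids to confirm they really are the respective BOS tokens:

>>> llama2_tok.decode([1])
'<s>'

>>> llama3_tok.decode([128000])
'<|begin_of_text|>'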

Let's now try to disable adding the BOS token and enable adding the EOS token:

from transformers import AutoTokenizer
llama2_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", add_bos_token=False, add_eos_token=True)
llama3_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B", add_bos_token=False, add_eos_token=True)

>>> llama2_tok("hello")
{'input_ids': [22172, 2], 'attention_mask': [1, 1]}        <----- Good: BOS token not added, EOS token added.

>>> llama3_tok("hello")
{'input_ids': [128000, 15339], 'attention_mask': [1, 1]}   <----- Not good: BOS token still added, EOS token not added.

As can be seen, the Llama-3 tokenizer completely ignores the given add_bos_token and add_eos_token arguments.
From what I have been able to trace, this might be due to add_bos_token and add_eos_token being missing from the Llama-3 model's tokenizer_config.json.
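
In the meantime, here is a possible workaround, sketched under the assumption that the checkpoint loads as a plain PreTrainedTokenizerFast whose BOS handling lives in the post-processor stored in tokenizer.json: replace that post-processor with a TemplateProcessing that appends only the EOS token. Note that _tokenizer is a private attribute of the fast tokenizer, so this may break in future transformers versions.

from transformers import AutoTokenizer
from tokenizers import processors

llama3_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Swap the underlying post-processor: emit the raw tokens ($A) followed by
# the EOS token, with no BOS. Accessing _tokenizer relies on private internals.
eos = llama3_tok.eos_token  # '<|end_of_text|>' for the base model
llama3_tok._tokenizer.post_processor = processors.TemplateProcessing(
    single=f"$A {eos}",
    pair=f"$A {eos} $B:1 {eos}:1",
    special_tokens=[(eos, llama3_tok.eos_token_id)],
)

After this, tokenizing should yield no BOS and an appended EOS (128001 being the id of <|end_of_text|> for the base model):

>>> llama3_tok("hello")
{'input_ids': [15339, 128001], 'attention_mask': [1, 1]}

A simpler per-call alternative, if you only need to skip the BOS token, is llama3_tok("hello", add_special_tokens=False), which is honored at call time; the EOS id can then be appended manually via llama3_tok.eos_token_id.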
