[BUG/Help] ice_text.model vocabulary size does not match the vocab_size set in the config


Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

The ice_text.model vocabulary contains 130344 tokens:

>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
>>> len(tokenizer.get_vocab())
130344

config:

"vocab_size": 130528

Model parameters:

transformer.word_embeddings.embedding_table torch.Size([130528, 4096]) torch.float16
lm_head.weight torch.Size([130528, 4096]) torch.float16
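(For reference, the parameter shapes above can be dumped with a plain loop over named_parameters. This is a minimal sketch, assuming there is enough memory to load the fp16 checkpoint; exact parameter names may vary between checkpoint revisions.)

>>> from transformers import AutoModel
>>> model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half()
>>> for name, param in model.named_parameters():
...     print(name, param.shape, param.dtype)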

Because the vocabulary sizes differ, the model sometimes generates token IDs that fall outside the tokenizer's vocabulary, and decoding then exits with an index-out-of-range error.
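(The mismatch can be confirmed without running generation by comparing the tokenizer against the model config. A minimal sketch; AutoConfig avoids loading the full weights, and the two numbers shown are the ones reported above.)

>>> from transformers import AutoTokenizer, AutoConfig
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
>>> config = AutoConfig.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
>>> len(tokenizer.get_vocab()), config.vocab_size
(130344, 130528)
>>> assert len(tokenizer.get_vocab()) == config.vocab_size  # fails for this checkpoint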

Expected Behavior

The tokenizer vocabulary size should be consistent with the config and the model parameters.

Steps To Reproduce

>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
>>> len(tokenizer.get_vocab())
130344

Environment

- OS:
- Python:
- Transformers:
- PyTorch:
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

