---
license: apache-2.0

widget:
- text: "Thủ đô của nước Việt Nam là <mask> Nội."
  example_title: "Example 1"
- text: "Cà phê được trồng nhiều ở khu vực Tây <mask> của Việt Nam."
  example_title: "Example 2"
---


# <a name="introduction"></a> CafeBERT: A Pre-Trained Language Model for Vietnamese (NAACL-2024 Findings)

CafeBERT is a state-of-the-art pre-trained language model for Vietnamese *(cafe, or coffee, is a popular morning drink in Vietnam)*.

CafeBERT is a large-scale multilingual language model with strong support for Vietnamese. The model is based on XLM-RoBERTa (a state-of-the-art multilingual language model) and is enhanced with a large Vietnamese corpus covering many domains, including Wikipedia and newspapers. CafeBERT performs strongly on the VLUE benchmark and on other tasks such as machine reading comprehension, text classification, natural language inference, and part-of-speech tagging.

The general architecture and experimental results of CafeBERT can be found in our [paper](https://arxiv.org/abs/2403.15882):

```
  @misc{do2024vlue,
        title={VLUE: A New Benchmark and Multi-task Knowledge Transfer Learning for Vietnamese Natural Language Understanding}, 
        author={Phong Nguyen-Thuan Do and Son Quoc Tran and Phu Gia Hoang and Kiet Van Nguyen and Ngan Luu-Thuy Nguyen},
        year={2024},
        eprint={2403.15882},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
  }
```

Please **CITE** our paper when CafeBERT is used to help produce published results or is incorporated into other software.

**Installation** 

Install the `transformers` and `SentencePiece` packages, plus `torch`, which the example below imports:

    pip install transformers
    pip install SentencePiece
    pip install torch

**Example usage**
```python
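from transformers import AutoModel, AutoTokenizer
import torch

# Load the pre-trained encoder and its matching tokenizer from the Hugging Face Hub
model = AutoModel.from_pretrained('uitnlp/CafeBERT')
tokenizer = AutoTokenizer.from_pretrained('uitnlp/CafeBERT')

# Tokenize a Vietnamese sentence and return PyTorch tensors
encoding = tokenizer('Cà phê được trồng nhiều ở khu vực Tây Nguyên của Việt Nam.', return_tensors='pt')

# Run a forward pass without tracking gradients (inference only)
with torch.no_grad():
    output = model(**encoding)

# Contextual token embeddings: shape (batch_size, sequence_length, hidden_size)
token_embeddings = output.last_hidden_state
```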
```
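
The example above yields one vector per token. If a single vector per sentence is needed, one common option is to average the token vectors while skipping padded positions. The sketch below is only an illustration of that idea and is not taken from the paper: the helper name `mean_pool` is hypothetical, and the small dummy tensors stand in for `output.last_hidden_state` and `encoding['attention_mask']`.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring padded positions.

    `mean_pool` is a hypothetical helper for illustration, not part of CafeBERT.
    """
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    # Sum only the vectors of real (non-padding) tokens
    summed = (last_hidden_state * mask).sum(dim=1)
    # Count real tokens per sequence; clamp avoids division by zero
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Dummy inputs: one sequence of three tokens, the last of which is padding
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

sentence_vector = mean_pool(hidden, mask)
print(sentence_vector)  # tensor([[2., 3.]])
```

With the model loaded as in the example above, the same helper could be called as `mean_pool(output.last_hidden_state, encoding['attention_mask'])`. Whether mean pooling is the best choice for a given downstream task is an open assumption here.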