---
license: apache-2.0
widget:
- text: "Thủ đô của nước Việt Nam là <mask> Nội."
  example_title: "Example 1"
- text: "Cà phê được trồng nhiều ở khu vực Tây <mask> của Việt Nam."
  example_title: "Example 2"
---

# CafeBERT: A Pre-Trained Language Model for Vietnamese (NAACL-2024 Findings)

The pre-trained CafeBERT model is a state-of-the-art language model for Vietnamese *(cafe, or coffee, is a popular morning drink in Vietnam)*:

CafeBERT is a large-scale multilingual language model with strong support for Vietnamese. The model is based on XLM-RoBERTa (a state-of-the-art multilingual language model) and is enhanced with a large Vietnamese corpus spanning many domains, such as Wikipedia and newspapers.

CafeBERT performs strongly on the VLUE benchmark and on other tasks such as machine reading comprehension, text classification, natural language inference, and part-of-speech tagging.

The general architecture and experimental results of CafeBERT can be found in our [paper](https://arxiv.org/abs/2403.15882):

```
@misc{do2024vlue,
      title={VLUE: A New Benchmark and Multi-task Knowledge Transfer Learning for Vietnamese Natural Language Understanding},
      author={Phong Nguyen-Thuan Do and Son Quoc Tran and Phu Gia Hoang and Kiet Van Nguyen and Ngan Luu-Thuy Nguyen},
      year={2024},
      eprint={2403.15882},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Please **CITE** our paper when CafeBERT is used to help produce published results or is incorporated into other software.

**Installation**

Install the `transformers` and `SentencePiece` packages:

    pip install transformers
    pip install SentencePiece

**Example usage**

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load the pre-trained encoder and its matching tokenizer from the Hugging Face Hub.
model = AutoModel.from_pretrained('uitnlp/CafeBERT')
tokenizer = AutoTokenizer.from_pretrained('uitnlp/CafeBERT')

# Input sentence: "Coffee is widely grown in the Central Highlands region of Vietnam."
encoding = tokenizer('Cà phê được trồng nhiều ở khu vực Tây Nguyên của Việt Nam.', return_tensors='pt')

# Run a forward pass without tracking gradients (inference only).
with torch.no_grad():
    output = model(**encoding)
```
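
The `output` returned above holds one hidden vector per token in `output.last_hidden_state`. If a single vector per sentence is needed, one common convention is to average the token vectors while ignoring padding. The sketch below illustrates that idea. It is not a method prescribed by the CafeBERT authors, and the helper names `mean_pool` and `embed_sentences` are hypothetical, introduced here for illustration only.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over real (non-padding) positions.

    last_hidden_state: float tensor of shape (batch, seq_len, hidden)
    attention_mask:    0/1 tensor of shape (batch, seq_len)
    returns:           float tensor of shape (batch, hidden)
    """
    # Broadcast the mask over the hidden dimension so padded tokens contribute zero.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero if a row were entirely padding.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


def embed_sentences(sentences, model_name="uitnlp/CafeBERT"):
    """Hypothetical helper: encode a list of sentences into mean-pooled vectors."""
    # Imported lazily so that mean_pool can be used without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    # padding=True pads shorter sentences so the batch forms a rectangular tensor.
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    return mean_pool(output.last_hidden_state, batch["attention_mask"])
```

Calling `embed_sentences(['Cà phê được trồng nhiều ở khu vực Tây Nguyên của Việt Nam.'])` downloads the checkpoint on first use and returns one row per input sentence. Whether mean pooling, the first-token vector, or a fine-tuned task head works best depends on the downstream task.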