# bert-mini-amharic
This model has the same architecture as bert-mini and was pretrained from scratch on the Amharic subsets of the oscar and mc4 datasets, totaling 137 million tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 24k.
It achieves the following results on the evaluation set:

- Loss: 3.57
- Perplexity: 35.52
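Perplexity here is simply the exponential of the evaluation cross-entropy loss, which can be checked with a one-liner:

```python
import math

# Perplexity = exp(cross-entropy loss). With the reported evaluation loss:
loss = 3.57
perplexity = math.exp(loss)
print(round(perplexity, 2))  # ≈ 35.52
```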
Even though this model has only 9.7 million parameters, its performance is only slightly behind that of the 28x larger, 279 million parameter xlm-roberta-base model on the same Amharic evaluation set.
## How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-mini-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")
[{'score': 0.4713546335697174,
  'token': 9308,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
 {'score': 0.25726795196533203,
  'token': 9540,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
 {'score': 0.07067586481571198,
  'token': 10354,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
 {'score': 0.07064681500196457,
  'token': 11212,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
 {'score': 0.012558948248624802,
  'token': 10588,
  'token_str': 'ወራት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ወራት ተቆጥሯል ።'}]
```
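If you prefer working below the pipeline abstraction, the standard transformers masked-LM classes can be used directly. This sketch uses ordinary Hugging Face API calls (not anything specific to this card) to pick the top prediction at the `[MASK]` position:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-mini-amharic")
model = AutoModelForMaskedLM.from_pretrained("rasyosef/bert-mini-amharic")

text = "ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] token and take the highest-scoring vocabulary entry there.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(top_id))
```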
## Fine-tuning
The following GitHub repository contains a notebook that fine-tunes this model for an Amharic text classification task:

https://github.com/rasyosef/amharic-news-category-classification
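A fine-tuning setup along these lines can be sketched with the standard transformers sequence-classification API. The `num_labels` value below is an illustrative assumption; the actual label set is defined in the linked notebook:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "rasyosef/bert-mini-amharic"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Assumed number of news categories, for illustration only.
num_labels = 6
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=num_labels
)
# The classification head is freshly initialized; train it with
# transformers.Trainer or a plain PyTorch loop on the labeled news data.
```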
### Fine-tuned Model Performance
Since this is a multi-class classification task, the reported precision, recall, and F1 metrics are macro averages.
| Model | Size (# params) | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| bert-mini-amharic | 9.67M | 0.87 | 0.83 | 0.83 | 0.83 |
| bert-small-amharic | 25.7M | 0.89 | 0.86 | 0.87 | 0.86 |
| xlm-roberta-base | 279M | 0.90 | 0.88 | 0.88 | 0.88 |
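To make "macro average" concrete: each class's metric is computed independently and the results are averaged with equal weight, regardless of class frequency. A toy example with made-up labels (not the actual evaluation data), using scikit-learn:

```python
from sklearn.metrics import f1_score

# Toy 3-class labels, purely illustrative.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Per-class F1: class 0 -> 0.5, class 1 -> 0.8, class 2 -> 2/3;
# the macro F1 is their unweighted mean.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(round(macro_f1, 4))  # ≈ 0.6556
```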