
bert-mini-amharic

This model has the same architecture as bert-mini and was pretrained from scratch on the Amharic subsets of the oscar and mc4 datasets, a total of 137 million tokens. The tokenizer was trained from scratch on the same corpus and has a vocabulary size of 24k. The model achieves the following results on the evaluation set:

  • Loss: 3.57
  • Perplexity: 35.52
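
For context, the reported perplexity is simply the exponential of the evaluation cross-entropy loss. A minimal sketch verifying this relationship and inspecting the published tokenizer (the exact vocabulary size printed is whatever the tokenizer on the Hub contains):

import math
from transformers import AutoTokenizer

# Perplexity of a language model is the exponential of its cross-entropy loss
print(math.exp(3.57))  # ≈ 35.52, matching the reported perplexity

# The tokenizer was trained from scratch on the same corpus
tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-mini-amharic")
print(tokenizer.vocab_size)  # expected to be around 24k per the description above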

Even though this model has only 9.7 million parameters, its performance is only slightly behind that of the 28x larger xlm-roberta-base model (279 million parameters) on the same Amharic evaluation set.
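
The parameter count can be verified directly; a minimal sketch, assuming the standard transformers API:

from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("rasyosef/bert-mini-amharic")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.2f}M parameters")  # ≈ 9.7M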

How to use

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-mini-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")

[{'score': 0.4713546335697174,
  'token': 9308,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
 {'score': 0.25726795196533203,
  'token': 9540,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
 {'score': 0.07067586481571198,
  'token': 10354,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
 {'score': 0.07064681500196457,
  'token': 11212,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
 {'score': 0.012558948248624802,
  'token': 10588,
  'token_str': 'ወራት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ወራት ተቆጥሯል ።'}]

Fine-tuning

The following GitHub repository contains a notebook that fine-tunes this model for an Amharic text classification task.

https://github.com/rasyosef/amharic-news-category-classification
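
For orientation, here is a minimal sketch of the general fine-tuning recipe with the Trainer API. The dataset file, column names, and num_labels below are placeholders; the linked notebook is the authoritative version.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Hypothetical CSV with "text" and "label" columns; the linked notebook
# loads the actual Amharic news category dataset.
dataset = load_dataset("csv", data_files={"train": "amharic_news_train.csv"})

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-mini-amharic")
model = AutoModelForSequenceClassification.from_pretrained(
    "rasyosef/bert-mini-amharic", num_labels=6)  # num_labels is an assumption

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-mini-amharic-news",
                           num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
)
trainer.train()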

Fine-tuned Model Performance

Since this is a multi-class classification task, the reported precision, recall, and F1 metrics are macro averages; a short computation sketch follows the table below.

Model               Size (# params)   Accuracy   Precision   Recall   F1
bert-mini-amharic   9.67M             0.87       0.83        0.83     0.83
bert-small-amharic  25.7M             0.89       0.86        0.87     0.86
xlm-roberta-base    279M              0.90       0.88        0.88     0.88
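
As a reference for how these numbers are computed, a short sketch using scikit-learn's macro averaging (the label arrays are illustrative placeholders, not the actual evaluation data):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder predictions and gold labels for a multi-class task
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
# average="macro" computes each metric per class, then takes the unweighted mean
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")
print(accuracy, precision, recall, f1)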