metadata
license: apache-2.0
language:
  - bn
metrics:
  - wer
  - cer
tags:
  - seq2seq
  - ipa
  - bengali
  - byt5
widget:
  - text: <Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।
    example_title: Narail Text
  - text: <Rangpur> এখন এই কুলো তার শেষ অই কুলো তার শেষ।
    example_title: Rangpur Text
  - text: <Chittagong> খয়দে সিআরের এইল্লা কি অবস্থা!
    example_title: Chittagong Text
  - text: <Kishoreganj> আটাইশ করছিলাম দের কানি ক্ষেত, ইবার মাইর কাইছি।
    example_title: Kishoreganj Text
  - text: <Narsingdi> তারা তো ওই খারাপ খেইলাই আসে না।
    example_title: Narsingdi Text
  - text: <Tangail> আর সব থেকে ফানি কথা হইতেছে দেখ?
    example_title: Tangail Text

Regional Bengali text to IPA transcription - ByT5-small

This model is a fine-tuned version of google/byt5-small that generates IPA transcriptions from regional Bengali text. It was trained on the dataset from the Bengali.AI competition “ভাষামূল: মুখের ভাষার খোঁজে”.

Model performance:

  • Word error rate (wer): 0.0124279344454407
  • Char error rate (cer): 0.00427635805681347
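For readers unfamiliar with these metrics, here is a minimal sketch of how WER and CER are commonly defined: the Levenshtein edit distance between hypothesis and reference, divided by the reference length in words or characters. This is an illustration only; the function names are hypothetical, and the exact implementation used to produce the scores above is not stated in this card.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up strings for illustration, not real model output
print(wer("ami baɽi", "ami bari"))
print(cer("ami baɽi", "ami bari"))
```

Lower is better for both metrics; a value of 0 means the hypothesis matches the reference exactly.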

Supported district tokens:

  • Kishoreganj
  • Narail
  • Narsingdi
  • Chittagong
  • Rangpur
  • Tangail
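Since the model expects each input to begin with one of these district tokens in angle brackets, a small helper can build and validate the input string. The sketch below is an assumption about convenient usage, not part of the released code; `SUPPORTED_DISTRICTS` and `format_input` are hypothetical names.

```python
# District names copied from the list above
SUPPORTED_DISTRICTS = (
    "Kishoreganj", "Narail", "Narsingdi",
    "Chittagong", "Rangpur", "Tangail",
)


def format_input(district, text):
    """Return '<district> text', rejecting districts the model was not trained on."""
    if district not in SUPPORTED_DISTRICTS:
        raise ValueError(
            f"Unsupported district {district!r}; expected one of {SUPPORTED_DISTRICTS}"
        )
    return f"<{district}> {text.strip()}"


print(format_input("Narail", "আমি সে বাবুর মামু বাড়ি গিছিলাম।"))
```

Failing early on an unknown district avoids silently feeding the model a prefix it has never seen.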

Loading & using the model

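# Load the model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("teamapocalypseml/ben2ipa-byt5small")
model = AutoModelForSeq2SeqLM.from_pretrained("teamapocalypseml/ben2ipa-byt5small")

# The input text MUST have the format: <district> <bengali_text>
text = "<district> bengali_text_here"
text_ids = tokenizer(text, return_tensors="pt").input_ids

# A bare forward pass, model(text_ids), fails for a seq2seq model because
# no decoder inputs are given; use generate() to produce the transcription.
output_ids = model.generate(text_ids, max_length=1024)
ipa = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(ipa)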

Using the pipeline

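# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = pipeline("text2text-generation", model="teamapocalypseml/ben2ipa-byt5small", device=device)

# Each entry of `texts` must have the format: <district> <contents>
texts = [
    "<Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।",
    "<Rangpur> এখন এই কুলো তার শেষ অই কুলো তার শেষ।",
]
batch_size = 2  # example value; adjust to your hardware

outputs = pipe(texts, max_length=1024, batch_size=batch_size)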
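The pipeline call above returns one result per input, with the generated string stored under the `generated_text` key. A minimal sketch for pairing each input with its transcription follows; `extract_transcriptions` is a hypothetical helper, and the unwrapping of one-element lists is a defensive assumption in case a given transformers version nests the results.

```python
def extract_transcriptions(texts, outputs):
    """Pair each input text with its generated transcription."""
    results = []
    for text, out in zip(texts, outputs):
        # Assumption: unwrap if a result arrives nested in a one-element list.
        if isinstance(out, list):
            out = out[0]
        results.append((text, out["generated_text"]))
    return results


# Stand-in data shaped like pipeline output; the strings are placeholders,
# not real model transcriptions.
demo_texts = ["<Narail> text one", "<Rangpur> text two"]
demo_outputs = [{"generated_text": "ipa one"}, [{"generated_text": "ipa two"}]]

for source, ipa in extract_transcriptions(demo_texts, demo_outputs):
    print(source, "->", ipa)
```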

Credits

Developed by S M Jishanul Islam, Sadia Ahmmed, and Sahid Hossain Mustakim.