smji committed on
Commit
18d9156
1 Parent(s): c137030

Update README.md

  1. README.md +74 -1
README.md CHANGED
@@ -1,3 +1,76 @@
  ---
- license: mit
+ license: apache-2.0
+ language:
+ - bn
+ metrics:
+ - wer
+ - cer
+ tags:
+ - seq2seq
+ - ipa
+ - bengali
+ - byt5
+ widget:
+ - text: <Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।
+   example_title: Narail Text
+ - text: <Rangpur> এখন এই কুলো তার শেষ অই কুলো তার শেষ।
+   example_title: Rangpur Text
+ - text: <Chittagong> খয়দে সিআরের এইল্লা কি অবস্থা!
+   example_title: Chittagong Text
+ - text: <Kishoreganj> আটাইশ করছিলাম দের কানি ক্ষেত, ইবার মাইর কাইছি।
+   example_title: Kishoreganj Text
+ - text: <Narsingdi> তারা তো ওই খারাপ খেইলাই আসে না।
+   example_title: Narsingdi Text
+ - text: <Tangail> আর সব থেকে ফানি কথা হইতেছে দেখ?
+   example_title: Tangail Text
  ---
+
+ # Regional Bengali text to IPA transcription - umt5-base
+
+
+ This is a fine-tuned version of [google/umt5-base](https://huggingface.co/google/umt5-base) that generates IPA transcriptions from regional Bengali text.
+ It was trained on the dataset from the competition [“ভাষামূল: মুখের ভাষার খোঁজে”](https://www.kaggle.com/competitions/regipa/overview) by Bengali.AI.
+
+ Test scores achieved so far:
+ - **Word error rate (WER)**: 0.02390405721962450
+ - **Character error rate (CER)**: 0.01011514943093060
+
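For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and CER are commonly computed from Levenshtein edit distance. It is not the competition's scoring code: the function names are illustrative, and details such as text normalization, counting code points rather than grapheme clusters, and averaging across samples are assumptions that may differ from how the scores above were produced.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

With a four-word reference and one substituted word, `wer` gives 0.25; lower is better for both metrics.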
+ Supported district tokens:
+ - Kishoreganj
+ - Narail
+ - Narsingdi
+ - Chittagong
+ - Rangpur
+ - Tangail
+
+ ---
+
+ ## Loading & using the model
+ ```python
+ # Load the tokenizer and model directly
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+ tokenizer = AutoTokenizer.from_pretrained("teamapocalypseml/ben2ipa-umt5base")
+ model = AutoModelForSeq2SeqLM.from_pretrained("teamapocalypseml/ben2ipa-umt5base")
+ # The input text MUST have the format: <district> <bengali_text>
+ text = "<district> bengali_text_here"
+ text_ids = tokenizer(text, return_tensors="pt").input_ids
+ # A seq2seq model needs generate(); a bare forward pass lacks decoder inputs
+ output_ids = model.generate(text_ids, max_length=512)
+ ipa = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
+ ```
+
+
+ ## Using the pipeline
+ ```python
+ # Use a pipeline as a high-level helper
+ import torch
+ from transformers import pipeline
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ pipe = pipeline("text2text-generation", model="teamapocalypseml/ben2ipa-umt5base", device=device)
+ # Each entry of `texts` must have the format: <district> <contents>
+ texts = ["<Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।"]
+ outputs = pipe(texts, max_length=512, batch_size=8)  # batch_size is an example value
+ ```
+
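Both snippets above expect every input to start with a district token. A small helper along these lines could build and validate such inputs before they reach the model. This is only a sketch: `format_input` and `SUPPORTED_DISTRICTS` are hypothetical names that are not part of the model repository, and the case-sensitive match against the listed district names is an assumption.

```python
# Districts listed under "Supported district tokens" in this README.
SUPPORTED_DISTRICTS = (
    "Kishoreganj", "Narail", "Narsingdi", "Chittagong", "Rangpur", "Tangail",
)


def format_input(district, text):
    """Hypothetical helper: build the '<district> <text>' input string."""
    if district not in SUPPORTED_DISTRICTS:
        raise ValueError(
            f"Unsupported district {district!r}; expected one of {SUPPORTED_DISTRICTS}"
        )
    text = text.strip()
    if not text:
        raise ValueError("text must not be empty")
    return f"<{district}> {text}"


# Build a batch from two of the widget examples in the model card metadata.
texts = [
    format_input("Narail", "আমি সে বাবুর মামু বাড়ি গিছিলাম।"),
    format_input("Tangail", "আর সব থেকে ফানি কথা হইতেছে দেখ?"),
]
```

The resulting `texts` list can be passed to the pipeline call or tokenized directly; rejecting unknown districts early avoids silently sending the model a token it was not trained on.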
+ ## Credits
+ Developed by [S M Jishanul Islam](https://huggingface.co/smji), [Sadia Ahmmed](https://huggingface.co/sadiaahmmed), and [Sahid Hossain Mustakim](https://huggingface.co/rhsm15).