Additional Languages - Turkish

#1
by kemalcankara - opened

Hello, congratulations on this amazing work.

Do you have any plans to incorporate the Turkish language into this model? The Turkish language is widely studied in academia, and there is a significant community of individuals developing commercial applications with natural language processing (NLP). Additionally, it is worth noting that the government supports an annual competition specifically focused on Turkish NLP.

Technology Innovation Institute org

Not at this time; this is primarily an English-only model. We've added some related European languages, which should not incur too much of a performance penalty.

High-quality multilingual models are an interesting topic, though, and I'm sure we will get back to them at some point.

I was excited to hear that there was a model coming from an institution based in the UAE. I came racing here, expecting it to be versatile with Arabic, but was quite disappointed to find that it wasn't trained on it at all. Should we expect an upcoming version, in the near future, trained extensively on Arabic sources?

Does it support Swedish as well as OpenAI's models do?

Technology Innovation Institute org

Hi Kemal and Hatem,

For this model, we focused on English first and foremost, and added European languages for which we could gather enough data in our web crawl. To avoid issues with tokenization, we only included European languages using the Latin alphabet.
We have also been working on state-of-the-art Arabic language models, and hopefully you get to hear about them soon 🀞.

@hassanback , we do not have good evaluation coverage in Swedish, so this is difficult to answer. Happy to hear back from you if you end up testing this!
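
For anyone wondering how the tokenizer copes with languages outside this mix (Swedish, Turkish, ...), one quick sanity check is to compare tokens per word against English. A minimal sketch, assuming the tiiuae/falcon-40b tokenizer on the Hub and made-up sample sentences:

```python
# Quick check of tokenization efficiency: fewer tokens per word usually means the
# tokenizer (and hence the model) is more comfortable with a language.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")

samples = {  # illustrative sentences, not taken from any training data
    "English": "The weather is very nice today and we are going for a walk.",
    "Swedish": "Vädret är mycket fint idag och vi ska ta en promenad.",
    "Turkish": "Bugün hava çok güzel ve yürüyüşe çıkıyoruz.",
}

for language, sentence in samples.items():
    n_tokens = len(tokenizer(sentence)["input_ids"])
    n_words = len(sentence.split())
    print(f"{language}: {n_tokens} tokens for {n_words} words "
          f"({n_tokens / n_words:.2f} tokens/word)")
```

Languages that were well represented when the tokenizer was trained typically land around one to two tokens per word; much higher ratios hint that the model will be both slower and weaker in that language.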

FalconLLM changed discussion status to closed

First of all thank you very much for this model.
Turkish is a European language written in the Latin alphabet, and Turkey and its culture are very different from Arab countries (by far): secular, Latin alphabet, no Islamic rule, free and governed by law. It is also far more democratic than most Western countries; however, that is not enough for its citizens, which is why people see it as anti-democratic (relatively speaking, it is not).

So I've been giving it a try, fine-tuning it with 15% of the Stanford Alpaca instruction set translated to Turkish. It seems promising. Would fine-tuning it afterwards with QLoRA differ much from further pretraining it?
I am using an instruction-based JSON dataset. Would it make sense to feed it plain text, such as Turkish Wikipedia, before the instruction-based data?

Btw it is going like this:

Saving model checkpoint to ./falcon-40b-instruct-4bit-alpaca/checkpoint-5300
Trainer.model is not a PreTrainedModel, only saving its state dict.
Deleting older checkpoint [falcon-40b-instruct-4bit-alpaca/checkpoint-5150] due to args.save_total_limit
{'loss': 0.7811, 'learning_rate': 9.39177797950979e-05, 'epoch': 2.06}
{'loss': 0.9781, 'learning_rate': 9.372325249643366e-05, 'epoch': 2.06}
{'loss': 0.9802, 'learning_rate': 9.35287251977694e-05, 'epoch': 2.07}
{'loss': 0.7647, 'learning_rate': 9.333419789910517e-05, 'epoch': 2.07}
{'loss': 0.8621, 'learning_rate': 9.313967060044092e-05, 'epoch': 2.07}
{'loss': 1.0175, 'learning_rate': 9.294514330177668e-05, 'epoch': 2.07}
{'loss': 0.8003, 'learning_rate': 9.275061600311242e-05, 'epoch': 2.07}
{'loss': 0.9179, 'learning_rate': 9.255608870444818e-05, 'epoch': 2.08}
{'loss': 0.9157, 'learning_rate': 9.236156140578393e-05, 'epoch': 2.08}
{'loss': 0.9958, 'learning_rate': 9.216703410711969e-05, 'epoch': 2.08}
69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ ...

Questions reminder: pretraining vs. QLoRA? Plain text first, then instruction-based data with QLoRA?
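
On the QLoRA question: a common recipe for Falcon-40B on a single GPU is to load the base model in 4-bit and train LoRA adapters on its attention projections with an Alpaca-style dataset. A minimal sketch, assuming transformers, peft, bitsandbytes, and datasets are installed; the dataset path, hyperparameters, and output directory are illustrative, not the exact run logged above:

```python
# Minimal QLoRA sketch for Falcon-40B on an Alpaca-style JSON dataset.
# File names and hyperparameters are placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

model_id = "tiiuae/falcon-40b"

# 4-bit NF4 quantization is the "Q" in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Falcon has no pad token by default

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on Falcon's fused attention projection; the 4-bit base stays frozen
model = get_peft_model(model, LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],
))

# Alpaca-style records: {"instruction": ..., "input": ..., "output": ...}
dataset = load_dataset("json", data_files="alpaca_tr.json", split="train")  # hypothetical file

def tokenize(example):
    prompt = f"### Instruction:\n{example['instruction']}\n"
    if example["input"]:  # Alpaca records carry an "input" field, often empty
        prompt += f"### Input:\n{example['input']}\n"
    prompt += f"### Response:\n{example['output']}{tokenizer.eos_token}"
    return tokenizer(prompt, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=TrainingArguments(
        output_dir="./falcon-40b-qlora-alpaca-tr",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-4,
        num_train_epochs=3,
        logging_steps=10,
        save_total_limit=2,
        bf16=True,
    ),
)
trainer.train()
```

Compared with further pretraining, QLoRA only updates the small adapter matrices on top of the frozen 4-bit base, which is why it fits on a single card. A plain-text pass (e.g. Turkish Wikipedia) before the instruction data is the same script with a different dataset and formatting step, though how much it helps for a language the base model barely saw is an open question.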

I have finished that. Results are promising; the entire Stanford Alpaca dataset should take about a day on an A100 40GB with Falcon-40B.

Great job, can we try it somewhere if you do the entire dataset?

I don't plan to train on the entire dataset. However, it seems wiser to have the model generate its most probable answers to instructions (tuning top_p, top_k, temperature), then use the gpt-3.5-turbo API to translate them to Turkish and feed them back into the model. Such a much smaller dataset gave far better output than Stanford Alpaca: only 4k instructions exceeded my previous 12k Stanford Alpaca fine-tuning and answered many questions coherently. So if I try this again, I will do it that way with around 20k instructions, and then I will share it.
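
That bootstrapping loop is easy to sketch. A hedged example, assuming the openai>=1.0 Python client and the (hypothetical) adapter checkpoint from the QLoRA sketch above; the seed instruction list and file names are placeholders:

```python
# Sketch of the idea above: sample answers from the model, translate them to Turkish
# with gpt-3.5-turbo, and save them as a new instruction dataset for another round.
import json
import torch
from openai import OpenAI
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, "./falcon-40b-qlora-alpaca-tr")  # hypothetical adapter

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(instruction: str) -> str:
    prompt = f"### Instruction:\n{instruction}\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,  # the sampling knobs mentioned above
        top_p=0.9,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens, keep only the newly generated answer
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def translate_to_turkish(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Translate the user's text into Turkish."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

instructions = ["Explain what a large language model is."]  # placeholder seed instructions
records = []
for instruction in instructions:
    answer = generate_answer(instruction)
    records.append({
        "instruction": translate_to_turkish(instruction),
        "input": "",
        "output": translate_to_turkish(answer),
    })

with open("alpaca_tr_distilled.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```

Since every record costs two translation calls, it is worth deduplicating and filtering the generated answers before sending them to the API.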

Bard says yes, via fine-tuning. Can LLMs be fine-tuned to add new languages?
