Additional Languages - Turkish

#1
by kemalcankara - opened

Hello, congratulations on this amazing work.

Do you have any plans to incorporate the Turkish language into this model? The Turkish language is widely studied in academia, and there is a significant community of individuals developing commercial applications with natural language processing (NLP). Additionally, it is worth noting that the government supports an annual competition specifically focused on Turkish NLP.

Technology Innovation Institute org

Not at this time; this is primarily an English-only model. We've added some related European languages, which should not incur too much of a performance penalty.

High-quality multilingual models are an interesting topic, though, and I'm sure we will get back to them at some point.

I was excited to hear that there was a model coming from an institution based in the UAE. I came racing here, expecting it to be versatile with Arabic, but was quite disappointed to find that it wasn't trained on it at all. Should we expect an upcoming version, in the near future, trained extensively on Arabic sources?

Does it support Swedish as well as OpenAI's models do?

Technology Innovation Institute org

Hi Kemal and Hatem,

For this model, we focused on English first and foremost, and added European languages for which we could gather enough data in our web crawl. To avoid issues with tokenization, we only included European languages using the Latin alphabet.
We have also been working on state-of-the-art Arabic language models, and hopefully you get to hear about them soon 🀞.

@hassanback , we do not have good evaluation coverage in Swedish, so this is difficult to answer. Happy to hear back from you if you end up testing this!
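
For anyone wondering how the tokenizer copes with languages outside this mix (Swedish, Turkish, ...), one quick sanity check is to compare tokens per word against English. A minimal sketch, assuming the tiiuae/falcon-40b tokenizer on the Hub and made-up sample sentences:

```python
# Quick check of tokenization efficiency: fewer tokens per word usually means the
# tokenizer (and hence the model) is more comfortable with a language.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")

samples = {  # illustrative sentences, not taken from any training data
    "English": "The weather is very nice today and we are going for a walk.",
    "Swedish": "Vädret är mycket fint idag och vi ska ta en promenad.",
    "Turkish": "Bugün hava çok güzel ve yürüyüşe çıkıyoruz.",
}

for language, sentence in samples.items():
    n_tokens = len(tokenizer(sentence)["input_ids"])
    n_words = len(sentence.split())
    print(f"{language}: {n_tokens} tokens for {n_words} words "
          f"({n_tokens / n_words:.2f} tokens/word)")
```

Languages that were well represented when the tokenizer was trained typically land around one to two tokens per word; much higher ratios hint that the model will be both slower and weaker in that language.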

FalconLLM changed discussion status to closed

First of all thank you very much for this model.
Turkish is a European language written in the Latin alphabet, and Turkey and its culture are very different from Arab countries (by far): secular, Latin alphabet, no Islamic rule, free and governed by law. It is also far more democratic than most Western countries; however, that is not enough for its citizens, which is why people see it as anti-democratic (relatively speaking, it is not).

So I've been giving it a try, fine-tuning it with 15% of the Stanford Alpaca instruction set translated to Turkish. It seems promising. Would fine-tuning it afterwards with QLoRA differ much from further pretraining it?
I am using an instruction-based JSON dataset. Would it make sense to feed it plain text, such as Turkish Wikipedia, before the instruction-based data?

Btw it is going like this:

Saving model checkpoint to ./falcon-40b-instruct-4bit-alpaca/checkpoint-5300
Trainer.model is not a PreTrainedModel, only saving its state dict.
Deleting older checkpoint [falcon-40b-instruct-4bit-alpaca/checkpoint-5150] due to args.save_total_limit
{'loss': 0.7811, 'learning_rate': 9.39177797950979e-05, 'epoch': 2.06}
{'loss': 0.9781, 'learning_rate': 9.372325249643366e-05, 'epoch': 2.06}
{'loss': 0.9802, 'learning_rate': 9.35287251977694e-05, 'epoch': 2.07}
{'loss': 0.7647, 'learning_rate': 9.333419789910517e-05, 'epoch': 2.07}
{'loss': 0.8621, 'learning_rate': 9.313967060044092e-05, 'epoch': 2.07}
{'loss': 1.0175, 'learning_rate': 9.294514330177668e-05, 'epoch': 2.07}
{'loss': 0.8003, 'learning_rate': 9.275061600311242e-05, 'epoch': 2.07}
{'loss': 0.9179, 'learning_rate': 9.255608870444818e-05, 'epoch': 2.08}
{'loss': 0.9157, 'learning_rate': 9.236156140578393e-05, 'epoch': 2.08}
{'loss': 0.9958, 'learning_rate': 9.216703410711969e-05, 'epoch': 2.08}
69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ ...

Questions reminder: pretraining vs. QLoRA? Plain text first, then instruction-based data with QLoRA?
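
On the QLoRA question: a common recipe for Falcon-40B on a single GPU is to load the base model in 4-bit and train LoRA adapters on its attention projections with an Alpaca-style dataset. A minimal sketch, assuming transformers, peft, bitsandbytes, and datasets are installed; the dataset path, hyperparameters, and output directory are illustrative, not the exact run logged above:

```python
# Minimal QLoRA sketch for Falcon-40B on an Alpaca-style JSON dataset.
# File names and hyperparameters are placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

model_id = "tiiuae/falcon-40b"

# 4-bit NF4 quantization is the "Q" in QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Falcon has no pad token by default

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on Falcon's fused attention projection; the 4-bit base stays frozen
model = get_peft_model(model, LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],
))

# Alpaca-style records: {"instruction": ..., "input": ..., "output": ...}
dataset = load_dataset("json", data_files="alpaca_tr.json", split="train")  # hypothetical file

def tokenize(example):
    prompt = f"### Instruction:\n{example['instruction']}\n"
    if example["input"]:  # Alpaca records carry an "input" field, often empty
        prompt += f"### Input:\n{example['input']}\n"
    prompt += f"### Response:\n{example['output']}{tokenizer.eos_token}"
    return tokenizer(prompt, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=TrainingArguments(
        output_dir="./falcon-40b-qlora-alpaca-tr",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-4,
        num_train_epochs=3,
        logging_steps=10,
        save_total_limit=2,
        bf16=True,
    ),
)
trainer.train()
```

Compared with further pretraining, QLoRA only updates the small adapter matrices on top of the frozen 4-bit base, which is why it fits on a single card. A plain-text pass (e.g. Turkish Wikipedia) before the instruction data is the same script with a different dataset and formatting step, though how much it helps for a language the base model barely saw is an open question.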

I have finished that. Results are promising; the entire Stanford Alpaca dataset should take about a day on an A100 40GB with Falcon-40B.

Great job, can we try it somewhere if you do the entire dataset?

I don't plan to train on the entire dataset. However, it seems wiser to have the model generate its most probable answers to instructions (tuning top_p, top_k, temperature), then use the gpt-3.5-turbo API to translate them to Turkish and feed them back into the model. Such a much smaller dataset gave far better output than Stanford Alpaca: only 4k instructions exceeded my previous 12k Stanford Alpaca fine-tuning and answered many questions coherently. So if I try this again, I will do it that way with around 20k instructions, and then I will share it.
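
That bootstrapping loop is easy to sketch. A hedged example, assuming the openai>=1.0 Python client and the (hypothetical) adapter checkpoint from the QLoRA sketch above; the seed instruction list and file names are placeholders:

```python
# Sketch of the idea above: sample answers from the model, translate them to Turkish
# with gpt-3.5-turbo, and save them as a new instruction dataset for another round.
import json
import torch
from openai import OpenAI
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, "./falcon-40b-qlora-alpaca-tr")  # hypothetical adapter

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(instruction: str) -> str:
    prompt = f"### Instruction:\n{instruction}\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,  # the sampling knobs mentioned above
        top_p=0.9,
        top_k=50,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens, keep only the newly generated answer
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def translate_to_turkish(text: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Translate the user's text into Turkish."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

instructions = ["Explain what a large language model is."]  # placeholder seed instructions
records = []
for instruction in instructions:
    answer = generate_answer(instruction)
    records.append({
        "instruction": translate_to_turkish(instruction),
        "input": "",
        "output": translate_to_turkish(answer),
    })

with open("alpaca_tr_distilled.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```

Since every record costs two translation calls, it is worth deduplicating and filtering the generated answers before sending them to the API.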

Bard says yes, via fine-tuning. Can LLMs be fine-tuned to add new languages?
