Kocdigital-LLM-8b-v0.1

This model is a fine-tuned version of a Llama3 8B large language model (LLM) for Turkish. It was trained on high-quality Turkish instruction sets created from various open-source and internal resources. The Turkish instruction dataset was carefully annotated so that the model carries out Turkish instructions in an accurate and organized manner. Training used the QLoRA method.

Model Details

Base Model: Llama3 8B-based LLM
Training Dataset: High-quality Turkish instruction sets
Training Method: SFT with QLoRA

QLoRA Fine-Tuning Configuration

lora_alpha: 128
lora_dropout: 0
r: 64
target_modules: "q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"
bias: "none"
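
To give a feel for what this configuration implies, the sketch below estimates the number of trainable adapter parameters and the LoRA scaling factor. It is an illustration rather than the authors' training code. The layer shapes are an assumption taken from the public Llama 3 8B architecture (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128), and the scaling is assumed to be the standard `lora_alpha / r`. All constant and function names are hypothetical.

```python
# Rough estimate of the LoRA adapter size implied by the configuration above.
# Assumption: layer shapes follow the public Llama 3 8B architecture.

R = 64
LORA_ALPHA = 128

HIDDEN = 4096          # assumed hidden size
INTERMEDIATE = 14336   # assumed MLP (feed-forward) size
KV_DIM = 8 * 128       # assumed: 8 key/value heads of dimension 128
NUM_LAYERS = 32        # assumed number of transformer layers

# (in_features, out_features) of each targeted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """Parameters in one adapter: A is (r x in), B is (out x r); bias "none" adds nothing."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R

print(f"trainable LoRA parameters: {total:,}")
print(f"LoRA scaling (lora_alpha / r): {scaling}")
```

Under these assumptions the adapters add roughly 168 million trainable parameters, a small fraction of the 8B base weights, which stay frozen and quantized in QLoRA.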

Usage Examples


from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the model; 4-bit loading requires the bitsandbytes package
tokenizer = AutoTokenizer.from_pretrained(
    "KOCDIGITAL/Kocdigital-LLM-8b-v0.1",
    model_max_length=4096,
)
model = AutoModelForCausalLM.from_pretrained(
    "KOCDIGITAL/Kocdigital-LLM-8b-v0.1",
    load_in_4bit=True,
)

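# System prompt, in English: "You are a general-purpose assistant that speaks Turkish.
# Always carry out the user's instructions accurately, concisely, and with good grammar."
system = 'Sen Türkçe konuşan genel amaçlı bir asistansın. Her zaman kullanıcının verdiği talimatları doğru, kısa ve güzel bir gramer ile yerine getir.'

# Prompt template: system text, then "###Talimat" (instruction) and "###Yanıt" (response)
template = "{}\n\n###Talimat\n{}\n###Yanıt\n"
# User instruction, in English: "Can you list Turkey's 3 largest provinces?"
content = template.format(system, 'Türkiyenin 3 büyük ilini listeler misin.')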

conv = []
conv.append({'role': 'user', 'content': content})
# Render the conversation to a prompt string (tokenize=False returns text, not tensors)
inputs = tokenizer.apply_chat_template(conv,
                                       tokenize=False,
                                       add_generation_prompt=True)

print(inputs)

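# Tokenize the rendered prompt; the chat template already added the special tokens
inputs = tokenizer([inputs],
                   return_tensors="pt",
                   add_special_tokens=False).to("cuda")

# Sample a completion of up to 512 new tokens
outputs = model.generate(**inputs,
                         max_new_tokens=512,
                         use_cache=True,
                         do_sample=True,
                         top_k=50,
                         top_p=0.60,
                         temperature=0.3,
                         repetition_penalty=1.1)

# Decode the full sequence (prompt plus completion) back to text
out_text = tokenizer.batch_decode(outputs)[0]
print(out_text)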
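
The example above prints the prompt together with the completion. When only the model's answer is needed, a small pair of helpers along the following lines may be convenient. This is a sketch, not part of the original card: `build_prompt` and `decode_new_tokens` are hypothetical names, and the only library call relied on is the tokenizer's `decode` method with `skip_special_tokens=True`.

```python
# Hypothetical helpers (not from the original card) for the prompt format shown above.

TEMPLATE = "{}\n\n###Talimat\n{}\n###Yanıt\n"


def build_prompt(system: str, instruction: str) -> str:
    """Fill the card's prompt template with a system text and a user instruction."""
    return TEMPLATE.format(system, instruction)


def decode_new_tokens(tokenizer, output_ids, prompt_length: int) -> str:
    """Decode only the tokens generated after the prompt.

    output_ids is the 1-D sequence of token ids for one sample (prompt plus
    completion); prompt_length is the number of prompt tokens to skip.
    """
    return tokenizer.decode(output_ids[prompt_length:], skip_special_tokens=True)
```

With the objects from the usage example, the answer alone could then be obtained with `decode_new_tokens(tokenizer, outputs[0], inputs["input_ids"].shape[1])`.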

Open LLM Turkish Leaderboard v0.2 Evaluation Results

Metric	Value
Avg.	49.11
AI2 Reasoning Challenge_tr-v0.2	44.03
HellaSwag_tr-v0.2	46.73
MMLU_tr-v0.2	49.11
TruthfulQA_tr-v0.2	48.51
Winogrande_tr-v0.2	54.98
GSM8k_tr-v0.2	51.78

Considerations on Limitations, Risks, Bias, and Ethical Factors

Limitations and Recognized Biases

Core Functionality and Usage: KocDigital LLM is an autoregressive language model whose primary purpose is to predict the next token in a text sequence. Although such models are commonly applied across many contexts, comprehensive real-world testing has not been conducted, so its effectiveness and consistency in diverse situations remain largely unvalidated.
Language Understanding and Generation: The model's training focused mainly on standard English and Turkish. Its ability to understand and generate slang, colloquial language, or other languages may be limited, which can lead to errors or misinterpretations.
Production of Misleading Information: Users should be aware that KocDigital LLM may generate incorrect or misleading information. Outputs should be treated as starting points or suggestions rather than definitive conclusions.

Ethical Concerns and Potential Risks

Risk of Misuse: KocDigital LLM can generate language that is offensive or harmful. We strongly advise against using it for such purposes and stress the importance of thorough, application-specific safety and fairness assessments before deployment.
Unintended Biases and Content: The model was trained on a vast corpus of text data without explicit vetting for offensive material or inherent biases. Consequently, it may inadvertently generate content reflecting these biases or inaccuracies.
Toxicity: Despite efforts to curate appropriate training data, the model has the capacity to produce harmful content, particularly when prompted explicitly. We encourage active participation from the open-source community to devise strategies aimed at mitigating such risks.

Guidelines for Secure and Ethical Utilization

Human Oversight: We recommend adding human oversight or output filters to monitor and improve the quality of outputs, particularly in public-facing applications. This reduces the likelihood of unexpectedly generating objectionable content.
Tailored Testing for Specific Applications: Developers planning to use KocDigital LLM should carry out comprehensive safety assessments and tuning tailored to their applications. This step is essential because the model's responses can be unpredictable and may occasionally be biased, inaccurate, or offensive.
Responsible Development and Deployment: Developers and users of KocDigital LLM bear the responsibility for ensuring its ethical and secure application. We encourage users to be cognizant of the model's limitations and to implement appropriate measures to prevent misuse or adverse outcomes.
