An analysis of evaluations of 7-billion-parameter Italian LLMs

Community Article · Published April 10, 2024

Samuele Colombo and I are the maintainers of the Italian Leaderboard, and we have contributed several evaluation tasks for different languages to lm-evaluation-harness, mainly driven by our interest in Italian LLMs, as in this PR. As part of that work we have evaluated many different Italian open source models on different tasks. From all our experiments we have collected many data points, and we conducted a simple exploratory analysis. In this article we share the data and some interesting findings.

List of metrics used in the evals:

  • HellaSwag: evaluates how well an LLM can complete a sentence. https://rowanzellers.com/hellaswag/
  • MMLU (Massive Multitask Language Understanding): evaluates how well the LLM handles knowledge across a wide range of subjects and tasks. https://github.com/hendrycks/test
  • ARC-c: the Challenge subset of the AI2 Reasoning Challenge (ARC), a large-scale collection of multiple-choice questions that require reasoning and commonsense knowledge to answer.
  • Belebele: a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset.
  • LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects): a benchmark whose task is very similar to language modeling. The assignment is to recover a missing word from a portion of text, where the missing word is always the last word of its sentence.
  • XCOPA (Cross-lingual Choice of Plausible Alternatives): a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.

Commands used to reproduce the data:

Zero-shot tasks:

lm_eval --model hf --model_args pretrained=YOURHUGGINGFACEMODEL --tasks xcopa_it,hellaswag_it,lambada_openai_mt_it,belebele_ita_Latn,arc_it --device cuda:0 --batch_size 8

Only mmlu_it, with 5-shot:

lm_eval --model hf --model_args pretrained=YOURHUGGINGFACEMODEL --tasks mmlu_it --num_fewshot 5 --device cuda:0 --batch_size 8
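The same evaluations can also be launched from Python instead of the CLI. A minimal sketch, assuming a recent lm-evaluation-harness (v0.4+) where `lm_eval.simple_evaluate` is available, and with `YOURHUGGINGFACEMODEL` as a placeholder for the model id:

```python
import lm_eval

MODEL = "YOURHUGGINGFACEMODEL"  # placeholder: any Hugging Face model id

# Zero-shot tasks (same list as the first CLI command above)
zero_shot = lm_eval.simple_evaluate(
    model="hf",
    model_args=f"pretrained={MODEL}",
    tasks=["xcopa_it", "hellaswag_it", "lambada_openai_mt_it",
           "belebele_ita_Latn", "arc_it"],
    batch_size=8,
    device="cuda:0",
)

# mmlu_it with 5-shot prompting
five_shot = lm_eval.simple_evaluate(
    model="hf",
    model_args=f"pretrained={MODEL}",
    tasks=["mmlu_it"],
    num_fewshot=5,
    batch_size=8,
    device="cuda:0",
)

# Per-task metrics (acc, acc_norm, perplexity, ...) live under "results"
print(zero_shot["results"])
print(five_shot["results"])
```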

Data Analysis

The data can be viewed and downloaded from this gsheet, and the simple analyses and visualizations were produced with this colab.

| model | mmlu_it acc (5-shot) | belebele_ita_Latn acc | belebele_ita_Latn acc_norm | hellaswag_it acc | hellaswag_it acc_norm | lambada_openai_mt_it perplexity | lambada_openai_mt_it acc | xcopa_it acc | arc_it acc | arc_it acc_norm | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| giux78/zefiro-7b-sft-qlora-ITA-v0.5 | 0.5246 | 0.4656 | 0.4656 | 0.4636 | 0.6097 | 22.5232 | 0.5154 | 0.67 | 0.1642 | 0.4397 | 0.4798222222 |
| mii-llm/maestrale-chat-v0.2-alpha | 0.5163 | 0.4678 | 0.4678 | 0.519 | 0.6852 | 26.0037 | 0.4987 | 0.722 | 0.1206 | 0.4585 | 0.4951 |
| FinancialSupport/saiga-7b | 0.4933 | 0.5222 | 0.5222 | 0.4824 | 0.6342 | 30.2369 | 0.4671 | 0.672 | 0.16 | 0.4748 | 0.4920222222 |
| giux78/zefiro-7b-beta-ITA-v0.1 | 0.5203 | 0.45 | 0.45 | 0.4607 | 0.6129 | 25.8213 | 0.5013 | 0.666 | 0.0838 | 0.4294 | 0.4638222222 |
| raicritis/Hermes7b_ITA | 0.3574 | 0.3689 | 0.3689 | 0.4112 | 0.5407 | 34.7106 | 0.4677 | 0.66 | 0.1249 | 0.3524 | 0.4057888889 |
| DeepMount/Mistral-Ita-7b | 0.3879 | 0.38 | 0.38 | 0.3978 | 0.5123 | 89.99 | 0.3361 | 0.592 | 0 | 0.3747 | 0.3872444444 |
| galatolo/cerbero-7B | 0.5137 | 0.5089 | 0.5089 | 0.4722 | 0.6135 | 23.4551 | 0.4964 | 0.672 | 0.1001 | 0.4465 | 0.4813555556 |
| mii-llm/maestrale-chat-v0.3-alpha | 0.5164 | 0.5911 | 0.5911 | 0.5046 | 0.66 | 38.2427 | 0.4378 | 0.692 | 0.1343 | 0.4568 | 0.5093444444 |
| giux78/zefiro-7b-dpo-qlora-ITA-v0.7 | 0.5203 | 0.4778 | 0.4778 | 0.4914 | 0.6428 | 23.6041 | 0.5174 | 0.684 | 0.1805 | 0.4611 | 0.4947888889 |
| mii-llm/maestrale-chat-v0.3-beta | 0.5129 | 0.5644 | 0.5644 | 0.5067 | 0.6581 | 53.0646 | 0.4207 | 0.72 | 0.1463 | 0.4559 | 0.5054888889 |
| swap-uniba/LLaMAntino-2-7b-hf-ITA | 0.3696 | 0.2433 | 0.2433 | 0.4113 | 0.5428 | 33.6146 | 0.4696 | 0.678 | 0.139 | 0.3456 | 0.3825 |
| mistralai/Mistral-7B-v0.1 | 0.5253 | 0.41 | 0.41 | 0.4486 | 0.6122 | 30.2635 | 0.4894 | 0.658 | 0.1061 | 0.4149 | 0.4527222222 |

Model ranking

In the chart below, the models are ordered by their average over all evaluation metrics, excluding perplexity:

[Chart: models ranked by average score, perplexity excluded]

In the chart below, the models are ordered by their average over all evaluation metrics, with perplexity included after normalizing it as

perplexity_norm = (100 - perplexity) / 100:

[Chart: models ranked by average score including normalized perplexity]
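Both rankings boil down to a simple average over the table above. A minimal sketch of the computation, assuming the gsheet has been exported as `results.csv` with the column names shown in the table (the file name is an assumption):

```python
import pandas as pd

# Assumed export of the gsheet linked above
df = pd.read_csv("results.csv")

excluded = {"model", "lambada_openai_mt_it perplexity", "Average"}
acc_cols = [c for c in df.columns if c not in excluded]

# Ranking 1: mean over the accuracy-style metrics, perplexity excluded
df["avg_no_perplexity"] = df[acc_cols].mean(axis=1)

# Ranking 2: include perplexity, normalized as (100 - perplexity) / 100
df["perplexity_norm"] = (100 - df["lambada_openai_mt_it perplexity"]) / 100
df["avg_with_perplexity"] = df[acc_cols + ["perplexity_norm"]].mean(axis=1)

print(df.sort_values("avg_no_perplexity", ascending=False)
        [["model", "avg_no_perplexity", "avg_with_perplexity"]])
```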

Rankings on individual metrics

MMLU_IT (5-shot) is never improved

This is a very interesting finding: no model improves on mmlu_it compared to the mistral-7B-v0.1 base model. It seems that none of the fine-tuning strategies (continual pre-training, SFT or DPO) is able to improve this metric. I suspect that deep knowledge of specific tasks is forgotten when the model is updated with language-specific knowledge: the more broad language knowledge is added, the less capable the model becomes on specific tasks.

[Chart: mmlu_it (5-shot) accuracy by model]

Belebele

Maestrale works very well on this task. It is also interesting that saiga-7b, a merged model, performs well here.

[Chart: Belebele accuracy by model]

Hellaswag

The maestrale series has very strong performance on hellaswag, coming close to Llama-70B, a 10x bigger model with about 70% accuracy, and Mixtral-8x7B, which has about 75% on this task.

[Chart: HellaSwag accuracy by model]

Lambada Perplexity

Perplexity measures how surprised the model is when it sees new data: the lower the perplexity, the better. It is interesting that SFT models seem to work better than DPO models, and, looking at the maestrale versions, that more SFT leads to worse perplexity.
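For reference, perplexity is the exponential of the average negative log-likelihood the model assigns to the tokens it has to predict. A tiny illustrative sketch with made-up numbers:

```python
import math

# Made-up per-token log-probabilities (natural log) assigned by a model;
# these numbers are purely illustrative
log_probs = [-0.4, -2.1, -0.9, -1.5]

# Perplexity = exp(mean negative log-likelihood); lower means the model
# was less "surprised" by the text
perplexity = math.exp(-sum(log_probs) / len(log_probs))
print(perplexity)  # ~3.4
```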

[Chart: Lambada perplexity by model]

Lambada openai

The zefiro trilogy performs well on this benchmark.

[Chart: Lambada accuracy by model]

XCOPA

The maestrale series is very strong on this evaluation.

[Chart: XCOPA accuracy by model]

Arc-c

Saiga-7b, a merged model, is the best model on this task. It could be valuable to try to understand why it is so strong on this metric compared to mistral-7b.

[Chart: ARC-c accuracy by model]

Below, all models are compared on all tasks:

[Chart: all models compared on all tasks]

The zefiro trilogy

The training strategies and datasets behind the zefiro trilogy are described in depth in this [article](https://medium.com/@giuxale/the-trilogy-of-zefiro-f0cd6f95ef94). As described above and shown in the chart below, mmlu_it is the only metric that is not improved by applying continual pre-training, SFT or DPO. In the majority of the other evaluations, every training strategy (continual pre-training, SFT, DPO) seems to improve the LLM's capabilities. On average, compared to the base mistral-7B-v0.1 model, there is a gain of about 5 percentage points, from 45% to 50%. In my opinion this is a good indicator that models can be improved on language-specific tasks, and I suspect that with more training on more data it is possible to improve a lot more. Another interesting finding is that on many metrics there is a continuous improvement from continual pre-training to SFT to DPO.

[Chart: zefiro series across all metrics]

Maestrale series

Maestrale is a series of very strong LLMs on Italian tasks and one of the best on many metrics. The data suggests that from version 0.3-alpha to 0.3-beta there has been a small degradation of performance in some cases, in particular on the perplexity metric, which would be interesting to understand and discuss. In any case it performs very well, being the best LLM on many evals and on average.

[Chart: maestrale series across all metrics]

Conclusion

Fine-tuned open source models trained on single GPUs have been able to improve on the foundational base model by about 5 percentage points on average across the evaluation tasks considered in this article. With more training time and more data this number can likely be improved a lot. Already on important metrics such as arc-c and hellaswag, a 7-billion-parameter model specialized on a specific language can come very close to 10x bigger open source models such as Llama-70B and Mixtral-8x7B. MMLU remains a difficult metric to improve. Evaluation will become a central part of the LLM ecosystem: LLMs can be specialized in different directions, even at the same time, many more specialized evaluation datasets and benchmarks will be born, and every LLM can have peculiarities waiting to be discovered.