[Feedback welcome] Add evaluation results to model card metadata

#40
opened by Wauplin (HF staff), Hugging Face H4 org, edited Nov 29, 2023

This is a work in progress. The goal is to list evaluation results in the model card metadata, in particular the results from the Open LLM Leaderboard. This PR was not created automatically.
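For context, these results live under the `model-index` key of the model card's YAML metadata. Below is a minimal sketch of what a single leaderboard entry could look like, using ARC as an example; the model name, few-shot count, metric, and score are illustrative placeholders rather than values taken from this PR:

```yaml
model-index:
- name: your-model-name                   # placeholder: repo name of the evaluated model
  results:
  - task:
      type: text-generation
    dataset:
      type: ai2_arc                       # dataset id on the Hub
      name: AI2 Reasoning Challenge (25-Shot)
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25                  # assumed leaderboard setting
    metrics:
    - type: acc_norm                      # metric shown on the leaderboard
      value: 0.0                          # placeholder, not an actual score
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
      name: Open LLM Leaderboard
```

Each benchmark would then be one entry in the `results` list.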

Pending questions:

  1. Should we report all metrics for each task (especially the _stderr ones), or only the one that is displayed on the Open LLM Leaderboard?
  2. Are the dataset type/name/config/split/num_few_shot accurate in the suggested changes?
  3. How to report the MMLU results? There are 57 different hendrycksTest datasets, for a total of 228 metrics 😵
  4. How to report MT-Bench results? (asking since they are reported in the model card but not in the metadata)
  5. How to report AlpacaEval results? (asking since they are reported in the model card but not in the metadata)

Related thread: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/370#65663f60589e212284db2ffc.
Related PR in the Hub docs: https://github.com/huggingface/hub-docs/pull/1144.

Thanks to @clefourrier who guided me with the Open LLM Leaderboard results 🤗

cc @julien-c @lewtun @Weyaxi

Wauplin changed pull request title from "[WIP] Add evaluation results to model card metadata" to "[Feedback welcome] Add evaluation results to model card metadata"

> Should we report all metrics for each task (especially the _stderr ones), or only the one that is displayed on the Open LLM Leaderboard?

  1. In my opinion, the one displayed on the Open LLM Leaderboard would be the better choice, because those are the results people generally want to see, and reporting everything could be a little confusing. On the other hand, the other metrics would give a more detailed view of the results.

> How to report the MMLU results? There are 57 different hendrycksTest datasets, for a total of 228 metrics 😵

  1. Hmm, I think something like 'Overall MMLU' could work, but I'm not sure about that.
Hugging Face H4 org
  1. For the leaderboard, only one metric (the reported one, as @Weyaxi suggested) should be enough, especially if you provide a hyperlink to the details.
  2. From a first look (see the metadata sketch after this list):
    • ARC: OK
    • HellaSwag: dataset = hellaswag, split = validation
    • DROP: the harness actually applies a post-processing step to the drop dataset, but I think reporting drop is fine anyway; split = validation
    • TruthfulQA: OK
    • GSM8K: config = main
    • MMLU: dataset = cais/mmlu, config = all of them (if you want to provide the full list, it is in the About section of the leaderboard), split = test
    • Winogrande: dataset = winogrande, config = winogrande_xl, split = validation
  3. For MMLU, we report the average of all acc scores, so something like "Aggregated MMLU" with "avg(acc)" as the metric, for example. People who want the details should go read them themselves, as it would just be overwhelming otherwise.
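To make the suggestions above concrete, here is a rough sketch of how the Winogrande and aggregated MMLU entries could look in the model-index results list; the few-shot counts are assumed from the leaderboard setup and the scores are placeholders:

```yaml
- task:
    type: text-generation
  dataset:
    type: winogrande
    name: Winogrande (5-shot)       # 5-shot is the assumed leaderboard setting
    config: winogrande_xl
    split: validation
    args:
      num_few_shot: 5
  metrics:
  - type: acc
    value: 0.0                      # placeholder, not the model's actual score
    name: accuracy
- task:
    type: text-generation
  dataset:
    type: cais/mmlu
    name: MMLU (5-shot)             # 5-shot is the assumed leaderboard setting
    config: all                     # the "all" config covering the 57 subjects
    split: test
    args:
      num_few_shot: 5
  metrics:
  - type: acc                       # average of the per-subject acc scores
    value: 0.0                      # placeholder
    name: accuracy
```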
Hugging Face H4 org, edited Nov 30, 2023

Thanks both for the feedback!

I pushed changes in 5ae48397:

  • only 1 metric per benchmark (keeping the one on the leaderboard as suggested)
  • add MMLU results, keeping only one aggregated result
  • add Winogrande (thanks @clefourrier for noticing it was missing :D)
  • corrected the few dataset/config/split values that were not accurate.

Looks like we have a good final version now :)

Hugging Face H4 org

Thanks a lot for adding this clean evaluation index! I think for AlpacaEval we can point the dataset type to https://huggingface.co/datasets/tatsu-lab/alpaca_eval.
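For illustration, an AlpacaEval entry could then look roughly like this; the metric name and value are assumptions for the sketch, not numbers from this PR:

```yaml
- task:
    type: text-generation
  dataset:
    type: tatsu-lab/alpaca_eval     # dataset id suggested above
    name: AlpacaEval
  metrics:
  - type: win_rate                  # hypothetical metric name
    value: 0.0                      # placeholder
```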

Apart from that, this LGTM 🔥

Hugging Face H4 org

Thanks everyone for the feedback! Let's merge this :)

lewtun changed pull request status to merged

Great PR, all!
