Understanding raw result data files

#729
by jerome-white - opened

I'm interested in parsing the detailed results files for each submission on the leaderboard (open-llm-leaderboard/details_*). It looks like each benchmark has its own format -- are there:

  1. Open source parsers for any of them? I've been rolling my own, but if there's something I can lean on that'd be better.
  2. Documentation on data semantics (what does each key mean?). Some key-values seem straightforward, but it'd be nice to get an authoritative answer.
Hugging Face H4 org

Hi! You can find a way to download the details using the Hub API; see the sketch below for one approach. We do not have an official way to do it in bulk, as we do not need that feature ourselves.
As for the meaning of the keys: we went through multiple iterations of this logging system, so they might differ a bit between submissions. However, we usually keep:
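
For a single details repo, a minimal sketch using `huggingface_hub` might look like the following. The repo id is only an illustrative example, and `snapshot_download` pulls the whole repo rather than individual files:

```python
from huggingface_hub import snapshot_download

# Illustrative details repo; substitute any open-llm-leaderboard/details_* dataset.
repo_id = "open-llm-leaderboard/details_meta-llama__Llama-2-7b-hf"

# Download every file in that one repo (repo_type must be "dataset").
local_dir = snapshot_download(repo_id=repo_id, repo_type="dataset")
print(local_dir)
```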

  • For multiple-choice tasks:
    • the input given to the model
    • the different available choices
    • the log-likelihood the model produced for each of those choices
    • the tokenized versions of both the input and the choices
  • For generative tasks:
    • the input, in text and tokenized form
    • the generated answer, in text and tokenized form
    • the target
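
As a rough way to see which of these fields a given dump actually uses, you can read one downloaded per-benchmark file and inspect its columns. This sketch assumes the files are parquet (older dumps may be JSON instead), and the key names in the checks are guesses, since they changed across logging iterations:

```python
import glob

import pandas as pd

# Path to a downloaded details repo, e.g. from snapshot_download() above.
local_dir = "path/to/downloaded/details_repo"

files = glob.glob(f"{local_dir}/**/*.parquet", recursive=True)
df = pd.read_parquet(files[0])
print(df.columns.tolist())

# Hypothetical key names -- the real columns vary between logging iterations.
row = df.iloc[0].to_dict()
if any(k in row for k in ("choices", "log_probs", "loglikelihoods")):
    print("looks like a multiple-choice record")
elif any(k in row for k in ("generated_text", "prediction", "output")):
    print("looks like a generative record")
```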

Thanks!

you can find a way to download the details using the hub api

I started by using the datasets API, but I've found the Hub API to be a lot more straightforward (and reliable).
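
In that spirit, here is a small sketch of the Hub-API route: list the per-benchmark files in a details repo and fetch just the one you need. The repo id and the `.parquet` filter are assumptions, not a documented layout:

```python
from huggingface_hub import HfApi, hf_hub_download

repo_id = "open-llm-leaderboard/details_meta-llama__Llama-2-7b-hf"  # illustrative example

# Enumerate the files in the details repo without downloading all of them.
api = HfApi()
files = [f for f in api.list_repo_files(repo_id, repo_type="dataset") if f.endswith(".parquet")]
for name in files:
    print(name)

# Fetch a single file of interest.
local_path = hf_hub_download(repo_id=repo_id, filename=files[0], repo_type="dataset")
```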

we went through multiple iterations of this logging system therefore they might be a bit different

One difference I'm really trying to get right is how the metrics are stored. I've noticed three variations:

  1. As a dictionary in a "metrics" column, where keys are metric names and values are the corresponding results
  2. As metric-named columns (a column literally named "acc", for example, alongside the other columns for prompts and so on)
  3. As metric-named columns prefixed with "metric." (a column named "metric.acc", for example)

Are those the only variations I might find?

Hugging Face H4 org

Yeah, those should be the only variations. The metrics dict will contain different values inside, and those can change; for example, the exact_match metric might sometimes be called em.
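
Based on that, one way to flatten the three variations (plus the em/exact_match alias) into a single dict per row could look like the sketch below. Only "acc" and "exact_match"/"em" are confirmed above; the rest of the bare column-name set is an assumption:

```python
# Hypothetical normaliser: collapse the three storage variations into one
# flat {metric_name: value} dict per row of a details file.
ALIASES = {"em": "exact_match"}

# Only "acc" (and "em"/"exact_match") appear in the thread; the rest is a guess.
BARE_METRIC_COLUMNS = {"acc", "acc_norm", "exact_match", "em", "f1", "mc1", "mc2"}

def extract_metrics(row: dict) -> dict:
    metrics = {}
    # Variation 1: a "metrics" column holding a {name: value} dict.
    if isinstance(row.get("metrics"), dict):
        metrics.update(row["metrics"])
    for key, value in row.items():
        # Variation 3: columns prefixed with "metric.".
        if key.startswith("metric."):
            metrics[key[len("metric."):]] = value
        # Variation 2: bare metric-named columns.
        elif key in BARE_METRIC_COLUMNS:
            metrics[key] = value
    # Map aliases (e.g. "em") onto a canonical name.
    return {ALIASES.get(name, name): value for name, value in metrics.items()}
```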
