Understanding raw result data files

#729
by jerome-white - opened

I'm interested in parsing the detailed results files for each submission on the leaderboard (open-llm-leaderboard/details_*). It looks like each benchmark has its own format -- are there:

  1. Open source parsers for any of them? I've been rolling my own, but if there's something I can lean on that'd be better.
  2. Documentation on data semantics (what does each key mean?). Some key-values seem straightforward, but it'd be nice to get an authoritative answer.
Hugging Face H4 org

Hi! You can find a way to download the details using the Hub API; see the sketch below for one approach. We do not have an official way to do it in bulk, as we do not need that feature ourselves.
As for the meaning of the keys: we went through multiple iterations of this logging system, so they might differ a bit between submissions. However, we usually keep:
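
For a single details repo, a minimal sketch using `huggingface_hub` might look like the following. The repo id is only an illustrative example, and `snapshot_download` pulls the whole repo rather than individual files:

```python
from huggingface_hub import snapshot_download

# Illustrative details repo; substitute any open-llm-leaderboard/details_* dataset.
repo_id = "open-llm-leaderboard/details_meta-llama__Llama-2-7b-hf"

# Download every file in that one repo (repo_type must be "dataset").
local_dir = snapshot_download(repo_id=repo_id, repo_type="dataset")
print(local_dir)
```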

  • For multiple-choice tasks:
    • the input given to the model
    • the different available choices
    • the log-likelihood the model produced for each of those choices
    • the tokenized versions of both the input and the choices
  • For generative tasks:
    • the input, in text and tokenized form
    • the generated answer, in text and tokenized form
    • the target
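
As a rough way to see which of these fields a given dump actually uses, you can read one downloaded per-benchmark file and inspect its columns. This sketch assumes the files are parquet (older dumps may be JSON instead), and the key names in the checks are guesses, since they changed across logging iterations:

```python
import glob

import pandas as pd

# Path to a downloaded details repo, e.g. from snapshot_download() above.
local_dir = "path/to/downloaded/details_repo"

files = glob.glob(f"{local_dir}/**/*.parquet", recursive=True)
df = pd.read_parquet(files[0])
print(df.columns.tolist())

# Hypothetical key names -- the real columns vary between logging iterations.
row = df.iloc[0].to_dict()
if any(k in row for k in ("choices", "log_probs", "loglikelihoods")):
    print("looks like a multiple-choice record")
elif any(k in row for k in ("generated_text", "prediction", "output")):
    print("looks like a generative record")
```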

Thanks!

you can find a way to download the details using the hub api

I started by using the datasets API, but I've found the Hub API to be a lot more straightforward (and reliable).
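
In that spirit, here is a small sketch of the Hub-API route: list the per-benchmark files in a details repo and fetch just the one you need. The repo id and the `.parquet` filter are assumptions, not a documented layout:

```python
from huggingface_hub import HfApi, hf_hub_download

repo_id = "open-llm-leaderboard/details_meta-llama__Llama-2-7b-hf"  # illustrative example

# Enumerate the files in the details repo without downloading all of them.
api = HfApi()
files = [f for f in api.list_repo_files(repo_id, repo_type="dataset") if f.endswith(".parquet")]
for name in files:
    print(name)

# Fetch a single file of interest.
local_path = hf_hub_download(repo_id=repo_id, filename=files[0], repo_type="dataset")
```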

we went through multiple iterations of this logging system therefore they might be a bit different

One difference I'm really trying to get right is how the metrics are stored. I've noticed three variations:

  1. As a dictionary in a "metrics" column, where keys are metric names and values are the corresponding results
  2. As metric-named columns (a column literally named "acc", for example, alongside the other columns for prompts and so on)
  3. As metric-named columns prefixed with "metric." (a column named "metric.acc", for example)

Are those the only variations I might find?

Hugging Face H4 org

Yeah, those should be the only variations. The metrics dict will contain different values inside, and those can change; for example, the exact_match metric might sometimes be called em.
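
Based on that, one way to flatten the three variations (plus the em/exact_match alias) into a single dict per row could look like the sketch below. Only "acc" and "exact_match"/"em" are confirmed above; the rest of the bare column-name set is an assumption:

```python
# Hypothetical normaliser: collapse the three storage variations into one
# flat {metric_name: value} dict per row of a details file.
ALIASES = {"em": "exact_match"}

# Only "acc" (and "em"/"exact_match") appear in the thread; the rest is a guess.
BARE_METRIC_COLUMNS = {"acc", "acc_norm", "exact_match", "em", "f1", "mc1", "mc2"}

def extract_metrics(row: dict) -> dict:
    metrics = {}
    # Variation 1: a "metrics" column holding a {name: value} dict.
    if isinstance(row.get("metrics"), dict):
        metrics.update(row["metrics"])
    for key, value in row.items():
        # Variation 3: columns prefixed with "metric.".
        if key.startswith("metric."):
            metrics[key[len("metric."):]] = value
        # Variation 2: bare metric-named columns.
        elif key in BARE_METRIC_COLUMNS:
            metrics[key] = value
    # Map aliases (e.g. "em") onto a canonical name.
    return {ALIASES.get(name, name): value for name, value in metrics.items()}
```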
