Does head-to-head loss on the evaluation set measure capabilities or merely style?

#1
by lewtun (HF staff) · opened

Hello Nous folks - congrats on launching this novel benchmark!

Do I understand correctly that you compute each model's loss on the (dynamic) evaluation set? If so, can you share any insights on how this “ensures a fair and accurate consensus on which model most closely mirrors GPT-4's performance”?

The reason I ask is that it's generally understood that cross-entropy will only tell you how well a given LLM can mimic the GPT-4 outputs, but doesn't translate directly into a measure of capabilities across e.g. reasoning, code, math etc. For an example of what I'm referring to, see this paper, which explores some cases where LLaMA-13B fine-tuned on GPT-3.5 outputs can produce similar-looking responses, but with more mistakes than the original model.
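To make the concern concrete, here is a minimal toy sketch (not the benchmark's actual code; all token IDs and probabilities are made up) of what cross-entropy against a fixed reference completion measures: the probability mass a model puts on the reference tokens, regardless of whether those tokens encode correct reasoning.

```python
import math

def cross_entropy(reference_tokens, predicted_distributions):
    """Mean negative log-likelihood the model assigns to the reference tokens."""
    nll = 0.0
    for token, dist in zip(reference_tokens, predicted_distributions):
        nll += -math.log(dist[token])
    return nll / len(reference_tokens)

# Hypothetical reference (e.g. GPT-4) token IDs for one completion.
reference = [3, 1, 2]

# Model A closely mimics the reference distribution at every step.
model_a = [
    {3: 0.9, 1: 0.05, 2: 0.05},
    {3: 0.05, 1: 0.9, 2: 0.05},
    {3: 0.05, 1: 0.05, 2: 0.9},
]

# Model B spreads its probability mass elsewhere.
model_b = [
    {3: 0.2, 1: 0.4, 2: 0.4},
    {3: 0.4, 1: 0.2, 2: 0.4},
    {3: 0.4, 1: 0.4, 2: 0.2},
]

print(cross_entropy(reference, model_a))  # ≈ 0.105 (low loss: good mimic)
print(cross_entropy(reference, model_b))  # ≈ 1.609 (high loss: poor mimic)
```

The metric rewards token-level agreement with the reference, which is exactly why it's unclear how it separates capability from surface imitation.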

Thanks!

NousResearch org


There is a wealth of evidence contradicting the opinion that nothing more than style cloning is going on here; I'd argue the paper you linked is wrong. See papers like Orca, CodeLlama, and Phi for just a few examples of finetuning (even at continued-pretraining scale), or look at the Hugging Face leaderboard or Hermes, to see that you can improve reasoning, knowledge, task accuracy, and everything else. In the limit, a loss of 0 would equate to having the exact same accuracy as GPT-4.

teknium changed discussion status to closed
