Re-evaluate GPT-4! Add an Elo graph over time to the leaderboard

#19
by cmp-nct - opened

In the past 4 weeks GPT-4 has degraded significantly; countless people have noticed it.
I've been using it heavily for code refactoring, and it has become extremely bad.
It can no longer follow instructions and deviates from them again and again.

Non-frozen models like these need regular re-evaluation.

An Elo graph showing user votes/rankings per day would also make it possible to see how a model's quality, or how people's perception of its quality, changes over time.

There was a new release two days ago, gpt-4-0125-preview, that could be added to the arena.

This model completes tasks like code generation more thoroughly than the previous preview model and is intended to reduce cases of “laziness” where the model doesn’t complete a task.

https://openai.com/blog/new-embedding-models-and-api-updates

During my first tests I found it as incapable as the previous one.
It's far worse than where GPT-4 was 2 months ago.

I'm not sure whether it is better at not being lazy; the previous GPT-4 "turbo" insisted on "// fill all code here" type comments even when told many times not to do that.
The answers I've gotten from it so far have not been great.

Also, on the lmsys leaderboard it appears that both "turbo" variants have now been mixed together.
It would have been an opportunity to start fresh.

Same here. I have been facing issues with GPT-4 on coding-related tasks. It has degraded significantly, and my experience with the newly released model is the same.

I know there are a bunch of Elo variants, but I never learned the exact differences. Here is one summary:

To make a long story super short, Elo is the grandfather of most systems like it. It has been around for ages and is super simple. Elo doesn't care about inactivity or inconsistencies. The process starts from day 1 and moves chronologically through time; every competitor starts with a starter rating, which is then modified with each result. Glicko-1 is a very similar system to Elo, except it has the concept of "rating deviation," which allows competitors' ratings to deviate more or less depending on when they last competed. There is also a second version of Glicko, which tosses in a factor called volatility; it is a major complication with extremely limited benefit.
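For concreteness, here is a minimal sketch of that plain chronological Elo update (the K-factor, starting rating, and battle format are illustrative assumptions, not what lmsys actually uses):

```python
# Minimal Elo: everyone starts at the same rating, and each result,
# processed strictly in chronological order, nudges both ratings.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo (logistic, base-10/400) model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a: 1.0 if A won, 0.0 if A lost, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b - k * (score_a - e_a)

ratings = {}
battles = [("gpt-4", "model-x", 1.0), ("model-x", "gpt-4", 0.5)]  # toy (a, b, score_a) data
for a, b, score_a in battles:
    ra, rb = ratings.get(a, 1000.0), ratings.get(b, 1000.0)
    ratings[a], ratings[b] = elo_update(ra, rb, score_a)
```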

In comes WHR. Again, it is based on Elo, but it is set up to take numerous passes through the whole history. With each pass, it "learns" from what happened in surrounding events. This makes it an excellent system for reviewing the past and for trying to determine when a competitor was really at their peak. Whether it paints a more accurate ranking picture… who knows?
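To make the "numerous passes" idea concrete, here is a rough sketch of a simplified WHR fit. It uses plain gradient ascent rather than Coulom's Newton method, and the toy data, step size, and pass count are all assumptions: each player gets one rating per active day, games pull those ratings around, and a Wiener-process prior with variance w² per elapsed day ties consecutive days together.

```python
import math
from collections import defaultdict

games = [(0, "A", "B"), (3, "B", "A"), (7, "A", "B")]  # toy (day, winner, loser) data
W2 = 5.0       # Wiener-process variance per day (the w^2 asked about further down)
LR = 0.1       # gradient-ascent step size
PASSES = 200   # "numerous passes throughout history"

active_days = defaultdict(set)
for d, w, l in games:
    active_days[w].add(d)
    active_days[l].add(d)
active_days = {p: sorted(ds) for p, ds in active_days.items()}

# One rating per player per active day, on the natural-log scale
# (multiply by 400 / ln(10) ~= 173.7 for Elo-like numbers).
r = {p: {d: 0.0 for d in ds} for p, ds in active_days.items()}

def p_win(ra: float, rb: float) -> float:
    """Bradley-Terry probability that a player rated ra beats one rated rb."""
    return 1.0 / (1.0 + math.exp(rb - ra))

for _ in range(PASSES):  # repeated passes over the whole history
    grad = {p: {d: 0.0 for d in ds} for p, ds in active_days.items()}
    # Likelihood term: each game pulls the winner up and the loser down.
    for d, w, l in games:
        p = p_win(r[w][d], r[l][d])
        grad[w][d] += 1.0 - p
        grad[l][d] -= 1.0 - p
    # Wiener-process prior: consecutive ratings of the same player are tied
    # together with variance W2 per day elapsed (this is where w^2 enters).
    for pl, ds in active_days.items():
        for d1, d2 in zip(ds, ds[1:]):
            pull = (r[pl][d2] - r[pl][d1]) / (W2 * (d2 - d1))
            grad[pl][d1] += pull
            grad[pl][d2] -= pull
    # Simple gradient-ascent step on the log posterior.
    for pl, ds in active_days.items():
        for d in ds:
            r[pl][d] += LR * grad[pl][d]
```

Because every pass revisits the whole history, a later result can revise what the system believes a player's rating was months earlier, which is exactly the behavior described above.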

I know Glicko has a measure of uncertainty built-in, not sure how that compares to lmsys' bootstrap method.
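I'm not sure exactly how the leaderboard computes its intervals either, but the usual bootstrap recipe looks roughly like this (a sketch assuming a plain Elo fit; whatever rating fit the leaderboard actually uses could be swapped in for `compute_ratings`):

```python
import random

def compute_ratings(battles, k=32.0, start=1000.0):
    """Plain chronological Elo over a list of (model_a, model_b, score_a) battles."""
    ratings = {}
    for a, b, score_a in battles:
        ra, rb = ratings.get(a, start), ratings.get(b, start)
        ea = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
        ratings[a] = ra + k * (score_a - ea)
        ratings[b] = rb - k * (score_a - ea)
    return ratings

def bootstrap_interval(battles, model, n_rounds=1000):
    """Resample battles with replacement, refit ratings, take 2.5%/97.5% percentiles."""
    samples = []
    for _ in range(n_rounds):
        resampled = random.choices(battles, k=len(battles))
        ratings = compute_ratings(resampled)
        if model in ratings:
            samples.append(ratings[model])
    samples.sort()
    return samples[int(0.025 * len(samples))], samples[int(0.975 * len(samples))]
```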

Maybe WHR would be a better choice? I know WHR is used to track rock climber skill over time, for instance. From their own paper, they say:

Experiments demonstrate that, in comparison to Elo, Glicko, TrueSkill, and decayed-history algorithms, WHR produces better predictions.

WHR can show how models change in skill over time, and how confident we can be in the measurement:

[Figure: WHR ratings of models over time, with uncertainty bands]

Does anyone know (or want to research) how to choose w² values for the Whole History Rating metric?
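Not a definitive answer, but if I remember right the WHR paper picks w² for predictive performance, so one option is a hold-out sweep: fit on battles up to a cutoff date for each candidate w², then keep the value that best predicts the later battles. A sketch (the `fit` callable and the candidate values are hypothetical placeholders, e.g. a wrapper around the WHR sketch earlier in the thread):

```python
import math

def choose_w2(train_battles, holdout_battles, fit, candidates=(1.0, 2.0, 5.0, 10.0, 20.0)):
    """Pick the w^2 value whose fitted ratings best predict held-out battles.

    fit(battles, w2) must return {model: rating on the natural-log scale} (hypothetical helper).
    Battles are (winner, loser) pairs, split by date into train/holdout.
    Assumes every model in the holdout set also appears in training.
    """
    def holdout_loglik(ratings):
        total = 0.0
        for winner, loser in holdout_battles:
            p = 1.0 / (1.0 + math.exp(ratings[loser] - ratings[winner]))
            total += math.log(p)
        return total

    return max(candidates, key=lambda w2: holdout_loglik(fit(train_battles, w2)))
```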

Also, how should ties be handled?

https://www.jstor.org/stable/2283595

The usual practice is either to force a definite expression of preference, or to treat ties when they occur by ignoring, splitting or randomly allocating them.

https://www.jstor.org/stable/2282923

In such cases, the usual practice is either to force the judge to express a definite preference, or, if this is not done, to treat these ties in one of the following ways: (a) they are completely ignored; (b) they are divided equally between the tied members; or, (c) they are divided randomly between the tied members. Several research workers have shown that the method of ignoring the tied observations is the best procedure (in terms of power) for experiments of this type.
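In code, those three options are just a preprocessing choice before the rating fit. A sketch, assuming a hypothetical vote format of `(model_a, model_b, outcome)` with `outcome` in `{"a", "b", "tie"}`:

```python
import random

def drop_ties(votes):
    """(a) Ignore tied votes entirely (the option the papers above favor)."""
    return [(a, b, 1.0 if out == "a" else 0.0) for a, b, out in votes if out != "tie"]

def split_ties(votes):
    """(b) Divide each tie equally: half a win for each side (score 0.5)."""
    score = {"a": 1.0, "b": 0.0, "tie": 0.5}
    return [(a, b, score[out]) for a, b, out in votes]

def randomize_ties(votes):
    """(c) Randomly allocate each tie as a full win for one side."""
    return [(a, b, random.choice([1.0, 0.0]) if out == "tie" else (1.0 if out == "a" else 0.0))
            for a, b, out in votes]
```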

Updated for March and tried to fix the anchor to match the leaderboard (average of mixtral-8x7b-instruct-v0.1 = 1114), but it's still wider for some reason (probably from not including ties?):

[Figure: WHR Rating Over Time, w² = 5, pre = 10]
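For reference, a minimal sketch of one way such an anchor can be applied (assuming a single additive shift after converting to Elo-like units; the actual script behind the plot may do it differently):

```python
import math

ELO_SCALE = 400 / math.log(10)   # ~173.7: converts natural-log ratings to Elo-like units

def anchor_ratings(ratings_by_day, anchor_model="mixtral-8x7b-instruct-v0.1", anchor_value=1114.0):
    """Shift all ratings by one constant so the anchor model's average hits the leaderboard value.

    ratings_by_day: {model: {day: rating on the natural-log scale}}, e.g. the output of the
    WHR sketch above (this structure is an assumption, not lmsys' actual format).
    """
    elo = {m: {d: r * ELO_SCALE for d, r in days.items()} for m, days in ratings_by_day.items()}
    anchor = elo[anchor_model]
    offset = anchor_value - sum(anchor.values()) / len(anchor)
    return {m: {d: r + offset for d, r in days.items()} for m, days in elo.items()}
```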
