How does GPT-4 Turbo do so well?

#10
by endolith - opened

This leaderboard is a blind head-to-head comparison, so it should be fair and unbiased, and more robust than any static benchmark whose test data can leak into training. GPT-4 Turbo scores much higher than any other model, yet there is broad consensus that GPT-4 Turbo is in practice a significant downgrade from the original GPT-4 (gpt-4-0613). It no longer follows Custom Instructions, is more vague, more "lazy", less helpful, etc.

I guess the prompts people are entering in the arena aren't representative of the kind of prompts we're using in the actual chat interface? Or is it the chat interface (and context) itself that causes the problems, and the API version is genuinely better?

Just a guess: the most difficult, most elaborate tasks, the kind that push GPT-4 to its limits, are not very appealing to try in the arena, because you'll likely get two bad answers (from sub-GPT-4 models). It might be that those are the tasks that gpt-4-turbo has trouble with, while doing better at medium difficulty tasks (perhaps due to more months of RLHF).

@masharpe Yeah, that's the best explanation I can come up with, too. It's better than other models at responding to a single question, but not at more complex tasks, and people aren't testing those in the lmsys Arena because long threads diverge between the two models, so entering the same prompt into both sides makes less and less sense the longer the thread gets.

I did not see GPT-4-Turbo as a downgrade, it was very capable for a long time.
It (as well as GPT-4) degraded significantly over the past 4-5 weeks; OpenAI has either implemented some catastrophic safety measure or is devaluing its models in preparation for the GPT-5 release.
They did the same thing, just without damaging the frozen API models, when they released GPT-3.5 Turbo.

In any case, the Elo rating is worthless if it's not re-evaluated regularly.

I did not see GPT-4-Turbo as a downgrade, it was very capable for a long time.

I had Custom Instructions for months before the update, and with the release of GPT-4 Turbo it immediately stopped following them. I had Custom Instructions set so that it would repeatedly remind itself of the overall conversation history as a sort of long-term memory, and it stopped making those sections as soon as GPT-4 Turbo was turned on. I also had instructions to avoid giving disclaimers, refusing to answer questions, saying "as an AI language model", etc. and it got worse at following all of those.

In any case, the Elo rating is worthless if it's not re-evaluated regularly.

What do you mean?

I did not see GPT-4-Turbo as a downgrade, it was very capable for a long time.

I had Custom Instructions for months before the update, and with the release of GPT-4 Turbo it immediately stopped following them. I had Custom Instructions set so that it would repeatedly remind itself of the overall conversation history as a sort of long-term memory, and it stopped making those sections as soon as GPT-4 Turbo was turned on. I also had instructions to avoid giving disclaimers, refusing to answer questions, saying "as an AI language model", etc. and it got worse at following all of those.

Yes, GPT-4 Turbo behaved differently, but it had the same capabilities, the same sort of intelligence as the normal GPT-4.
On the other hand, GPT-3.5 Turbo is a far weaker sibling of GPT-3 (they retired that model to hide that GPT-3 is almost equal to GPT-4).

In any case, the Elo rating is worthless if it's not re-evaluated regularly.

What do you mean?

Roughly a month ago they degraded all GPT-4 models; GPT-4 (with and without Turbo) is now just a notch better than GPT-3.5, and I'd guess it's below the original GPT-3.
So the Elo rating on lmsys is not accurate anymore; it's based on how GPT-4 performed a month ago.

So the Elo rating on lmsys is not accurate anymore; it's based on how GPT-4 performed a month ago.

I don't think that's correct; it's constantly being updated as people enter queries in the Arena. That's how it can compare with Gemini, which was just released, for instance.

So the Elo rating on lmsys is not accurate anymore; it's based on how GPT-4 performed a month ago.

I don't think that's correct; it's constantly being updated as people enter queries in the Arena. That's how it can compare with Gemini, which was just released, for instance.

No, Gemini was rated against all models and they calculated an Elo for that.
The GPT-4 ratings certainly can change, but right now we have 60,000 or so rating votes on that model, and 50,000 of those votes were based on the more capable GPT-4 versions.

We need to see Elo-over-time graphs; that would make it possible to spot model degradation visually even if a model's cumulative Elo is still high.
Maybe also something like a running average that gives more weight to new ratings than to older ones.

No, Gemini was rated against all models and they calculated an Elo for that.

… and adjusted the ratings for the other models at the same time

and 50,000 of those votes were based on the more capable GPT-4 versions.

And those are rated independently, showing, for instance, that GPT-4-0613 was a downgrade and that GPT-4 Turbo then did better.

We need to see Elo-over-time graphs; that would make it possible to spot model degradation visually even if a model's cumulative Elo is still high.

Yes, it would be good to see scores for each API model over time, to check whether performance changes even when the model name doesn't. Is all the data published for that? It wouldn't be necessary for fixed local models, though, whose frozen weights should keep their performance stable.
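For illustration, here's a rough sketch of how Elo-over-time trajectories could be computed by replaying a battle log chronologically with the classic online Elo update. The field names (model_a, model_b, winner, tstamp) are assumptions based on the clean_battle JSON linked below, and the K-factor and starting rating are arbitrary placeholders, not necessarily what lmsys uses:

```python
from collections import defaultdict

def expected_score(r_a, r_b):
    """Probability that A beats B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_over_time(battles, k=4, init=1000):
    """Replay battles in timestamp order and record each model's rating
    after every battle, so per-model trajectories can be plotted.

    Each battle is assumed to be a dict with model_a, model_b, tstamp
    and winner ('model_a', 'model_b', or a tie value) fields.
    """
    ratings = defaultdict(lambda: init)
    history = []  # (tstamp, model, rating) rows for plotting
    for b in sorted(battles, key=lambda x: x["tstamp"]):
        a, m = b["model_a"], b["model_b"]
        if b["winner"] == "model_a":
            s_a = 1.0
        elif b["winner"] == "model_b":
            s_a = 0.0
        else:  # ties count as half a win for each side
            s_a = 0.5
        e_a = expected_score(ratings[a], ratings[m])
        ratings[a] += k * (s_a - e_a)
        ratings[m] += k * ((1.0 - s_a) - (1.0 - e_a))
        history.append((b["tstamp"], a, ratings[a]))
        history.append((b["tstamp"], m, ratings[m]))
    return dict(ratings), history
```

As I understand it, the leaderboard itself recomputes ratings over all battles at once rather than replaying them sequentially like this, which is also why the 50,000 older GPT-4 votes still weigh on its current score.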

Maybe also something like a running average that gives more weight to new ratings than to older ones

There are a lot of variations of Elo calculations, such as Glicko, WHR, etc. I'm not sure exactly what lmsys is using, but the code is here: https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/monitor/elo_analysis.py

There's another calculation in this notebook: https://colab.research.google.com/drive/17L9uCiAivzWfzOxo2Tb9RMauT7vS6nVU#scrollTo=QLGc6DwxyvQc

See also: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard/discussions/8#658c676ebc95ccc5624bec9a
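One common way to do this kind of calculation (and roughly the style used in the linked notebook, if I'm reading it right) is to fit a Bradley-Terry model with logistic regression, one coefficient per model, and rescale the coefficients to an Elo-like scale. A simplified sketch, with ties dropped and column names (model_a, model_b, winner) assumed to match the battle JSON, not a faithful copy of the lmsys code:

```python
import math
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def bradley_terry_ratings(df, scale=400, base=10, init=1000):
    """Fit one score per model from pairwise outcomes via logistic regression.

    `df` needs model_a, model_b and winner columns; ties are simply
    dropped here to keep the sketch short.
    """
    df = df[df["winner"].isin(["model_a", "model_b"])]
    names = pd.concat([df["model_a"], df["model_b"]]).unique()
    idx = pd.Series(np.arange(len(names)), index=names)

    # Design matrix: +log(base) for the model shown as A, -log(base) for B.
    X = np.zeros((len(df), len(names)))
    X[np.arange(len(df)), idx[df["model_a"]].values] = +math.log(base)
    X[np.arange(len(df)), idx[df["model_b"]].values] = -math.log(base)
    y = (df["winner"] == "model_a").astype(float).values

    lr = LogisticRegression(fit_intercept=False, C=1e6)  # ~unregularized
    lr.fit(X, y)
    scores = scale * lr.coef_[0] + init
    return pd.Series(scores, index=names).sort_values(ascending=False)
```

One nice property of this formulation: adding a new model's battles (e.g. Gemini's) and re-fitting shifts every model's score at once, which matches the point above about the other ratings being adjusted at the same time.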

Is the latest battle data released anywhere so other metrics can be run on it? I see https://drive.google.com/uc?id=1gjs-APnGZjw8vmN5pwykV2SukS0KYE-z from the notebook, but that's clean_battle_20230522.json from last year.

https://huggingface.co/datasets/lmsys/lmsys-chat-1m is from August 2023.

https://huggingface.co/datasets/lmsys/chatbot_arena_conversations is from June 2023.

Confirmed that the vast majority of conversations have only one turn:

[plot: distribution of turns per conversation]
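For anyone who wants to reproduce that count, something like the following works against the chatbot_arena_conversations dataset above (not necessarily exactly what produced the plot; I'm assuming each conversation is stored as a list of role/content messages):

```python
from collections import Counter
from datasets import load_dataset

# June 2023 arena release linked above; may require accepting the
# dataset terms on the Hugging Face page first.
ds = load_dataset("lmsys/chatbot_arena_conversations", split="train")

def num_turns(conversation):
    """Count user messages in a list of {'role', 'content'} dicts."""
    return sum(1 for msg in conversation if msg["role"] == "user")

counts = Counter(num_turns(row["conversation_a"]) for row in ds)
total = sum(counts.values())
for turns, n in sorted(counts.items()):
    print(f"{turns} turn(s): {n} conversations ({100 * n / total:.1f}%)")
```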
