Human-level representation?

#8
by ehalit - opened

I know it is hard to do online, but maybe we could have offline human-written responses to user queries. This way, we could see how the models fare against human-level intelligence.

Would the human writers be allowed to do research? Otherwise it's going to be a lot of "I don't know" on encyclopedic questions.

The "Both are bad" button could be counted as a win for human intelligence against both models. Otherwise, the "Both are bad" and "Tie" buttons have no effect on the Elo ranking, since it's based purely on pairwise defeats.

        # "sa" is the score recorded for model_a in the Elo update:
        # 1 = win, 0 = loss, 0.5 = tie. Note that "tie" and
        # "tie (bothbad)" are scored identically.
        if winner == "model_a":
            sa = 1
        elif winner == "model_b":
            sa = 0
        elif winner == "tie" or winner == "tie (bothbad)":
            sa = 0.5
        else:
            raise Exception(f"unexpected vote {winner}")

Wait, no, that's wrong. A tie does affect the ranking; it pulls the two Elo scores toward each other:

In the case of a tie game, the lower-ranked team gains Elo points (albeit fewer than if it had won!) while the higher-ranked team loses that exact amount.
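
To make that concrete, here is a minimal sketch of a single Elo update (the K-factor and ratings are illustrative, not the arena's actual parameters). With sa = 0.5, a tie pulls the two ratings toward each other:

    def elo_update(ra, rb, sa, k=32):
        """Return updated ratings given model A's score sa (1, 0.5, or 0)."""
        ea = 1 / (1 + 10 ** ((rb - ra) / 400))  # expected score for A
        return ra + k * (sa - ea), rb + k * ((1 - sa) - (1 - ea))

    # A tie between a 1200-rated and a 1000-rated model moves them together:
    print(elo_update(1200, 1000, 0.5))  # A drops ~8.3 points, B gains ~8.3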

But the "Both are bad" and "Tie" buttons currently have the same effect. They could be changed so that "Both are bad" is counted as a sort of loss against a hypothetical perfect AI, and the ELO score for the "Perfect AI" is also listed for comparison.

+1 on the suggestion above to give "tie (bothbad)" a different effect than a simple tie. Is there a reason this isn't the current design?

Well, without the "hypothetical perfect AI" concept to compare against, there isn't anything else you can do with ties. I'm not sure why there are both "Both are bad" and "Tie" buttons, though, since they currently do the same thing.
