Which task is most similar to the average? An analysis via the Kendall Tau coefficient

by varun4

The Kendall Tau coefficient gives us a natural way to measure which task's model ranking is most similar to the ranking induced by the overall MTEB average.
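For reference, here is a minimal sketch of how such a comparison can be computed with `scipy.stats.kendalltau`. The `scores` table of per-model task averages is hypothetical; real values would come from the MTEB leaderboard.

```python
from scipy.stats import kendalltau

# Hypothetical per-model task averages: {model: {task: score}}.
scores = {
    "model_a": {"Clustering": 45.2, "Retrieval": 50.1, "STS": 80.3},
    "model_b": {"Clustering": 43.8, "Retrieval": 52.4, "STS": 78.9},
    "model_c": {"Clustering": 41.0, "Retrieval": 47.6, "STS": 82.1},
}

models = list(scores)
tasks = list(next(iter(scores.values())))

# Overall average score per model across tasks (unweighted).
overall = [sum(scores[m].values()) / len(tasks) for m in models]

# Kendall Tau between each task's scores and the overall average;
# tau operates on the implied rankings, so raw scores are fine.
for task in tasks:
    per_task = [scores[m][task] for m in models]
    tau, _ = kendalltau(per_task, overall)
    print(f"{task}: tau = {tau:.4f}")
```

A tau near 1 means the task ranks models almost identically to the average; a tau near 0 means essentially no association between the two rankings.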

--------------------------------------------------------------------------------
Task                                               | Kendall Tau    
--------------------------------------------------------------------------------
Clustering Average (11 datasets)                   | 0.1692         
Pair Classification Average (3 datasets)           | 0.1654         
STS Average (10 datasets)                          | 0.1266         
Retrieval Average (15 datasets)                    | 0.1178         
Classification Average (12 datasets)               | -0.0050        
Reranking Average (4 datasets)                     | -0.0188        
Summarization Average (1 dataset)                  | -0.0426        
--------------------------------------------------------------------------------

The above result was for the unweighted average of each task, i.e. ignoring how many datasets each task contains. Below is the weighted average, i.e. each task weighted by its number of datasets.

--------------------------------------------------------------------------------
Task                                               | Kendall Tau    
--------------------------------------------------------------------------------
Retrieval Average (15 datasets)                    | 0.1717         
Classification Average (12 datasets)               | 0.0815         
Clustering Average (11 datasets)                   | 0.0702         
STS Average (10 datasets)                          | 0.0376         
Pair Classification Average (3 datasets)           | 0.0113         
Reranking Average (4 datasets)                     | -0.0401        
Summarization Average (1 dataset)                  | -0.0990        
--------------------------------------------------------------------------------

This result is less informative, since the number of datasets per task becomes the primary driver of the correlation.
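For concreteness, a sketch of the two averaging schemes, using a hypothetical mapping of tasks to per-dataset scores for a single model:

```python
# Hypothetical per-task dataset scores for one model.
task_scores = {
    "Retrieval": [50.1] * 15,       # 15 datasets
    "Classification": [63.0] * 12,  # 12 datasets
    "Summarization": [30.2],        # 1 dataset
}

# Unweighted: every task counts once, regardless of dataset count.
unweighted = sum(sum(s) / len(s) for s in task_scores.values()) / len(task_scores)

# Weighted: every dataset counts once, so tasks with many datasets dominate.
all_scores = [x for s in task_scores.values() for x in s]
weighted = sum(all_scores) / len(all_scores)

print(f"unweighted = {unweighted:.2f}, weighted = {weighted:.2f}")
```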

An intuitive takeaway from these results is that the Clustering Average and Pair Classification Average tasks are good single-task heuristics for the MTEB average. In contrast, the tasks with a near-zero Kendall Tau coefficient (Classification Average, Reranking Average, and Summarization Average) show an "absence of association" with the average ranking. In theory, this means these tasks are orthogonal in the evaluation space.

Massive Text Embedding Benchmark org

Very cool! The orthogonality of tasks is a good thing in my opinion, as it means that we are indeed measuring different skills. Many models still seem to struggle to master these different skills equally well.
