arxiv:2405.01535

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Published on May 2

· Featured in Daily Papers on May 3

Upvote

Authors:

Seungone Kim ,

Juyoung Suk ,

Shayne Longpre ,

Bill Yuchen Lin ,

Sean Welleck ,

Graham Neubig ,

Minjoon Seo

Abstract

Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements. Moreover, it is capable of processing both direct assessment and pair-wise ranking formats grouped with a user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are all publicly available at https://github.com/prometheus-eval/prometheus-eval.

View arXiv page View PDF Add to collection

Community

AdinaY

16 days ago

Really cool work🔥 Would be great to upload the model or build a demo on the hub!

seungone

Paper author 16 days ago

@AdinaY Thanks for your interest in our paper!

You could access the models here:
https://huggingface.co/prometheus-eval/prometheus-8x7b-v2.0
https://huggingface.co/prometheus-eval/prometheus-7b-v2.0

Here's the github repo where we prepared (possibly) every functionality you might need:
https://github.com/prometheus-eval/prometheus-eval

librarian-bot

15 days ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

julien-c

9 days ago

@librarian-bot recommend

mikelabs

12 days ago

There's a plain-english rewrite of this paper available here: https://www.aimodels.fyi/papers/arxiv/prometheus-2-open-source-language-model-specialized

alvarobartt

9 days ago

Hi here @seungone et al! Congrats on the paper and the release 🎉

I was just wondering whether you guys did experiment with multi-prompt settings to e.g. critique the last assistant response/s, while using a conversation as input instead of an instruction.

Plus also the fact that some responses to a given instruction can be conditioned by the system prompt, whether you did consider adding a system prompt to the template or if you did run some ablations on that too.

Thanks in advance!

seungone

Paper author 8 days ago

Hey @alvarobartt , thanks for your interest!

We did experiments using MT-Bench that is a multi-turn chat-based benchmark.
All you have to do is append the whole interaction at the {instruction} and insert the latest response in {response} from the template.

Also, we appended the system prompt to the {instruction} placeholder as well. Please let us know your experiences after using Prometheus 2:)