Alvaro Bartolome

alvarobartt

AI & ML interests

ml @argilla / open-source and machine learning

Articles

Organizations

Posts 4

view post
Post
1106
🔥 Prometheus 2 was recently released by Kaist AI as an alternative and closely mirroring both human and GPT-4 evaluation, and surpassing the former Prometheus!

prometheus-eval/prometheus-7b-v2.0
prometheus-eval/prometheus-8x7b-v2.0

🌬️Fine-tuned on top of mistralai/Mistral-7B-Instruct-v0.2 and mistralai/Mixtral-8x7B-Instruct-v0.1
🗂️The datasets used for fine-tuning have been publicly released i.e. prometheus-eval/Feedback-Collection and prometheus-eval/Preference-Collection
🤝🏻Unified LM evaluator for absolute (a single prompt-completion pair) and relative (two completions for a given prompt) due to model merging
❌No longer needs a mandatory reference / golden answer, but can still be provided optionally
🔝Surpasses the former version of Prometheus, and has a high correlation with human, GPT-4, and Claude 3 Opus scores when evaluating LMs
📝Apache 2.0 license

Long-story short, an amazing job from Kaist AI bridging the gap with LLM evaluators other than proprietary and bigger models!

This week at Argilla, we decided to add a new task to use Prometheus 2 as an LLM evaluator using distilabel, so we implemented PrometheusEval.

😱 Using PrometheusEval running their 7B variant with vLLM in a single L40 on top of HuggingFaceH4/instruction-dataset, we got the 327 existing prompt-completion pairs evaluated and pushed to the Hub in less than 2 minutes!

Find the generated dataset and the code at distilabel-internal-testing/instruction-dataset-prometheus
view post
Post
2090
🦫 We have just released argilla/Capybara-Preferences in collaboration with Kaist AI ( @JW17 , @nlee-208 ) and Hugging Face ( @lewtun )

A new synthetic preference dataset built using distilabel on top of the awesome LDJnr/Capybara from @LDJnr

The current dataset combines the already generated alternative completions from argilla/distilabel-capybara-dpo-7k-binarized, while also adding the remaining ones using the same approach!

Here are some key features on how we built it:

- 🧹 Duplicate removal, keeping the conversation besides the last assistant response, and some slight pre-processing

- 🤖 Generation of alternative completions for the existing conversations (last turn only) with: mlabonne/NeuralBeagle14-7B, argilla/notus-7b-v1, and teknium/OpenHermes-2.5-Mistral-7B

- 👨🏻‍🏫 Running UltraFeedback via GPT-4 to generate the critique i.e. ratings and rationales, for the last assistant responses

- 🎉 Finally, we selected the chosen and rejected responses based on their UltraFeedback score, and applied some slight post-processing!

Sounds simple right? Start building your own synthetic datasets with https://github.com/argilla-io/distilabel already!