455 150 489

Daniel van Strien PRO

davanstrien

https://danielvanstrien.xyz/

vanstriendaniel

davanstrien

AI & ML interests

Machine Learning Librarian

Articles

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Mar 20

• 14

Extracting Insights from Model Cards Using Open Large Language Models

Nov 27, 2023

Creating open machine learning datasets? Share them on the Hugging Face Hub!

Oct 30, 2023

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Aug 22, 2023

• 5

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

Aug 2, 2023

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Jun 12, 2023

• 1

Introducing BERTopic Integration with Hugging Face Hub

May 31, 2023

Organizations

Posts 12

Post

653

Only 14 languages have DPO preference style datasets on the Hugging Face Hub ( DIBT/preference_data_by_language) Let's improve that! How?

The Cohere For AI Aya dataset CohereForAI/aya_dataset has human-annotated prompt-completion pairs in 71 languages. We can use this to create DPO datasets for more languages!

Using Aya's prompt/response pairs as a starting point we can use an LLM to generate an additional response to each prompt. We then use an LLM Judge to rank each response.

✅ In some/many languages, human responses may be better than LLM ones but we may want to check that assumption for some languages.
🚀 We use Argilla's distilabel library to push data to Argilla for validation. This also allows us to determine if an LLM judge is effective for different languages.

As an example of what this pipeline produces:
- DIBT/aya_dutch_dpo a DPO style dataset for Dutch using Llama 3 as a generator/judge LM.
- An annotation Space that anyone with a HF account can contribute to: https://dibt-demo-argilla-space.hf.space/dataset/924ef8a8-a447-4563-8806-0e2a668a5314/annotation-mode?page=1&status=pending

As part of Data is Better Together we want to build more DPO datasets. Join us here: https://github.com/huggingface/data-is-better-together#4-dpoorpo-datasets-for-more-languages 🤗

Post

957

Can you create domain-specific synthetic datasets in under 20 minutes?

@burtenshaw recently launched the Domain Specific Dataset Project as part of Data is Better Together. As part of this, Ben created a Space that you can use to define some key perspectives and concepts from a domain. This seed dataset can then be used to generate a synthetic dataset for a particular domain.

In less than 30 minutes this afternoon, I created a domain-specific dataset focused on data-centric machine learning using these tools: davanstrien/data-centric-ml-sft.

You can create your own domain specific datasets using this approach. Find the steps to follow here: https://github.com/huggingface/data-is-better-together/blob/main/domain-specific-datasets/README.md

View all posts