Qian Liu (SivilTaram)
✨ Today, we're excited to share the full data processing script used to develop our Sailor models. The repo provides an end-to-end data processing pipeline for LLM training. 🚀

💻 Code: https://github.com/sail-sg/sailcraft
🤗 Model: sail/sailor-language-models-65e19a749f978976f1959825
📜 Paper: Sailor: Open Language Models for South-East Asia (2404.03608)
🌐 Homepage: https://sailorllm.github.io

# Overview 🔍

The pipeline consists of four stages 🧹:
1️⃣ Initial data cleaning
2️⃣ Near deduplication
3️⃣ Exact deduplication
4️⃣ Second round of data cleaning

Special attention was given to data cleaning for South-East Asian (SEA) languages 🌏
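To make the flow concrete, here is a minimal, self-contained sketch of the four stages. The function names and heuristics are illustrative, not the actual sailcraft implementation (the real pipeline uses much richer cleaning rules and MinHash-based near deduplication):

```python
import hashlib

def initial_clean(docs):
    # Stage 1 (toy rule): drop documents shorter than five words.
    return [d for d in docs if len(d.split()) >= 5]

def near_dedup(docs, ngram=3, threshold=0.8):
    # Stage 2 (toy stand-in for MinHash-LSH): greedy fuzzy dedup
    # via n-gram Jaccard similarity against already-kept documents.
    kept, kept_shingles = [], []
    for d in docs:
        tokens = d.split()
        shingles = {" ".join(tokens[i:i + ngram])
                    for i in range(len(tokens) - ngram + 1)}
        if all(len(shingles & s) / max(len(shingles | s), 1) < threshold
               for s in kept_shingles):
            kept.append(d)
            kept_shingles.append(shingles)
    return kept

def exact_dedup(docs):
    # Stage 3: remove byte-identical documents via content hashing.
    seen, kept = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(d)
    return kept

def second_clean(docs):
    # Stage 4: a second cleaning pass on the deduplicated corpus.
    return [d.strip() for d in docs if d.strip()]

def process_corpus(docs):
    for stage in (initial_clean, near_dedup, exact_dedup, second_clean):
        docs = stage(docs)
        print(f"{stage.__name__}: {len(docs)} documents remain")
    return docs
```

The per-stage print mirrors the filtered-data counts the repo reports after each processing stage.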

# Use Cases ✨

With this codebase, you can clean your own dataset and:

✅ Get filtered data counts after each processing stage
✅ Easily configure language-specific cleaning rules (we support Arabic, Bengali, Catalan, Spanish, Basque, French, Hindi, Portuguese, and Urdu, and optimize for English, Indonesian, Vietnamese, Chinese, Thai, Lao, and Malay; see the sketch below this list)
✅ Investigate what data was removed at each processing stage
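To illustrate why language-specific rules matter, here is a hypothetical per-language rule table (invented for this post, not the actual sailcraft config schema). Thai and Lao, for example, are written without spaces between words, so a naive word-count filter would wrongly discard valid documents:

```python
# Hypothetical per-language cleaning rules (not the sailcraft schema).
CLEANING_RULES = {
    "default": {"min_words": 5, "max_symbol_ratio": 0.3},
    # Thai and Lao scripts do not delimit words with spaces, so we
    # filter on character count instead of word count.
    "th": {"min_chars": 30, "max_symbol_ratio": 0.3},
    "lo": {"min_chars": 30, "max_symbol_ratio": 0.3},
}

def rules_for(lang: str) -> dict:
    """Return the cleaning rules for a language, falling back to defaults."""
    return CLEANING_RULES.get(lang, CLEANING_RULES["default"])
```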

# Acknowledgement 🙏

The main credit goes to @dreamerdeo, the first author of our Sailor paper ❤️! He put tremendous effort into the data processing pipeline, which underpins the models' strong performance. We believe this mini repo will be a valuable resource for researchers working on dataset curation for large language models. 🎉

Sharing the recipe openly aligns with our commitment to open language model development. 💪 This repo would not have been possible without contributions from the open community, including the BigScience data cleaning tool, the all-in-one deduplication tool by @chenghao, and the deduplication project from Google. 🧠

# What's Next 🚀

Share your thoughts or leave a comment on what you'd like the Sailor models to do! We also have some exciting news coming soon, so please stay tuned. 🚄

---

โš“๏ธ Sailor: A New Multilingual Open LLM for South-East Asia ๐ŸŒ

Last month we released a new family of multilingual language models called **Sailor**, ranging from 0.5B to 7B parameters and continually pre-trained from the Qwen1.5 models. In our extensive benchmarking, the Sailor models demonstrate exceptional performance on South-East Asian languages, taking us one step closer to multilingual LLMs that can serve the diverse needs of the region and beyond.

Today, we're more than excited to share the key technical details behind the Sailor models! 💪

**Key highlights**:
🔍 Data curation: merging short examples, document-level code-switching, and aggressive data cleaning and deduplication.
🤖 Tokenization robustness: we find that BPE dropout is highly effective at handling prompt variations (see the sketch after this list).
🔍 Optimizing the data mixture: we propose a new approach to automatically balance capabilities across different languages!
🌟 Recipe for continual pre-training: we discover a powerful metric that helps predict how well the Sailor models will perform on the original domain (e.g., English) after continual pre-training.
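As a quick illustration of BPE dropout, here is a generic sketch using the Hugging Face `tokenizers` library (not the Sailor training code; the file paths and dropout rate are placeholders):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Load an existing BPE vocabulary and merge list with dropout enabled
# (placeholder paths). With probability `dropout`, each merge is randomly
# skipped during encoding, so the same text can tokenize differently
# across calls, exposing the model to many segmentations of one string.
tokenizer = Tokenizer(BPE.from_file("vocab.json", "merges.txt", dropout=0.1))

for _ in range(3):
    print(tokenizer.encode("Sailor models for South-East Asia").tokens)
```

Seeing these varied segmentations during training is what makes the model less sensitive to small prompt variations at inference time.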

We are thrilled to share these technical details with the community and invite you to explore the Sailor models. We hope they bring the world one step closer to truly multilingual LLMs! 🌏✨

To learn more, please read our research paper or reach out to our team.
🔗 Paper: Sailor: Open Language Models for South-East Asia (2404.03608)
🧩 Model: sail/sailor-language-models-65e19a749f978976f1959825
💻 Code: https://github.com/sail-sg/sailor-llm