Aymeric Roucher

m-ric

AI & ML interests

MLE at Hugging Face šŸ¤— LLMs, Agents, RAG, Multimodal.

Posts

šŸ’°āŒ š‘šžš¬šžššš«šœš” šŸšØš« š­š”šž šÆšžš«š² š†šš” ššØšØš« - š’šœššš„š¢š§š  š„ššš°š¬ š«šžš©š„š¢šœššš­š¢šØš§

šŸŽ† Good news: š˜†š—¼š˜‚ š—°š—®š—» š—±š—¼ š—°š˜‚š˜š˜š—¶š—»š—“-š—²š—±š—“š—² š—暝—²š˜€š—²š—®š—暝—°š—µ š˜„š—¶š˜š—µ š—® š—°š—®š—¹š—°š˜‚š—¹š—®š˜š—¼š—æ š—®š—»š—± š— š—¶š—°š—暝—¼š˜€š—¼š—³š˜ š—£š—®š—¶š—»š˜ šŸ®šŸ¬šŸ¬šŸ²!

The Chinchilla experiments (by Google DeepMind) ran hundreds of pre-trainings, with models ranging from 70M up to 16B parameters (I do not want to imagine how much that cost), to š—³š—¶š—»š—± š˜š—µš—² š—¼š—½š˜š—¶š—ŗš—®š—¹ š—暝—®š˜š—¶š—¼ š—¼š—³ š—ŗš—¼š—±š—²š—¹ š˜€š—¶š˜‡š—² š˜ƒš˜€ š˜š—暝—®š—¶š—»š—¶š—»š—“ š˜š—¼š—øš—²š—»š˜€. Why is this question so important?
Well, you only ever have access to a fixed compute budget, counted in FLOPs (floating-point operations). So if your model is bigger, you have fewer tokens to train it on, and if you want to train on more tokens, your model has to be smaller. When model trainings cost millions, you absolutely need to get this trade-off right.
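To put numbers on that trade-off, here is a back-of-the-envelope sketch in Python, using the common approximation that training compute C is about 6*N*D FLOPs for N parameters and D tokens (the budget below is made up):

```python
# Fixed compute budget C (in FLOPs); with C ~ 6*N*D, the token count is D = C / (6*N).
# Double the parameter count and you halve the tokens you can afford to train on.
C = 1e24  # hypothetical FLOP budget
for n_params in (35e9, 70e9, 140e9):
    n_tokens = C / (6 * n_params)
    print(f"{n_params / 1e9:.0f}B params -> {n_tokens / 1e12:.2f}T tokens")
```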

The new paper "Chinchilla Scaling: A replication attempt" by Epoch AI sets out on the ambitious goal of reproducing this.

But since the authors do not have infinite money, they decided to work directly from DeepMind's own published experiments! They took the figure from the last experiment (cf. the figure below), measured the positions of the points, decoded the color scale, and ended up reconstructing the underlying data.

šŸ’„ They then fit the scaling law proposed by the Chinchilla authors to this reconstructed data, but arrived at wildly different results! They find that, as a rough rule of thumb, you should use about 20 training tokens for each parameter in your model, instead of the ~70 implied by the original paper's fit. They also point out inconsistencies in the paper and unrealistically narrow confidence intervals.
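For the curious, here is a minimal sketch of what such a fit looks like: the Chinchilla parametric form L(N, D) = E + A/N^alpha + B/D^beta, fitted with a Huber loss on log-residuals. The data points and initial guess below are placeholders, not the reconstructed values:

```python
import numpy as np
from scipy.special import logsumexp, huber
from scipy.optimize import minimize

# Placeholder (model size N, training tokens D, final loss L) triples;
# the real fit uses the points reconstructed from DeepMind's figure.
N = np.array([4e8, 1e9, 2.5e9, 7e9, 16e9])
D = np.array([8e9, 2e10, 5e10, 1.4e11, 3e11])
L = np.array([2.95, 2.72, 2.51, 2.32, 2.18])

def objective(theta, delta=1e-3):
    a, b, e, alpha, beta = theta
    # log L_hat(N, D) = logsumexp(a - alpha*log N, b - beta*log D, e)
    log_pred = logsumexp([a - alpha * np.log(N),
                          b - beta * np.log(D),
                          e * np.ones_like(N)], axis=0)
    # Huber loss on the log-residuals
    return huber(delta, log_pred - np.log(L)).sum()

# The real fit sweeps a grid of initialisations; one starting point for illustration.
res = minimize(objective, x0=np.array([6.0, 7.0, 0.5, 0.35, 0.35]), method="L-BFGS-B")
a, b, e, alpha, beta = res.x
print({"A": np.exp(a), "B": np.exp(b), "E": np.exp(e), "alpha": alpha, "beta": beta})
```

The disagreement between the two papers essentially comes down to the values of these five fitted parameters, and the tokens-per-parameter ratio they imply.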

āž”ļø This only contradicts the results from the last (out of 3) experiments in the Chinchilla paper. And the model trained at the end of the Chinchilla paper still seems properly scaled.

āœ… But it does show that a tiny bit more theoretical work can go a long way, especially given the huge financial costs that such an error can have!
šššš©šžš« š‘šžšÆš¢šžš°: š‘š”šØ-šŸ - šƒšØ š§šØš­ š®š¬šž ššš„š„ š­šØš¤šžš§š¬ šžšŖš®ššš„š„š² š¢š§ š²šØš®š« š­š«ššš¢š§š¢š§š ! āš–ļøā›”ļø

A new paper topping the Daily Papers questions a hidden assumption in LLM training:

šŸ¤” š™Žš™š™¤š™Ŗš™”š™™ š™¬š™š š™§š™šš™–š™”š™”š™® š™Ŗš™Øš™š š™–š™”š™” š™©š™¤š™ š™šš™£š™Ø š™šš™¦š™Ŗš™–š™”š™”š™® š™žš™£ š™¤š™Ŗš™§ š™‡š™‡š™ˆ'š™Ø š™©š™§š™–š™žš™£š™žš™£š™œ ?

Some tokens are more relevant than others, and some are mostly noise (just look up the history of š˜šš˜°š˜­š˜Ŗš˜„š˜Žš˜°š˜­š˜„š˜”š˜¢š˜Øš˜Ŗš˜¬š˜¢š˜³š˜±).

So this paper introduces š—¦š—²š—¹š—²š—°š˜š—¶š˜ƒš—² š—Ÿš—®š—»š—“š˜‚š—®š—“š—² š— š—¼š—±š—²š—¹š—¶š—»š—“, which is actually really simple:
āž”ļø A specific metric measures the relevance of each token. Then during training, only the top k% tokens for this relevance metric count in the loss calculation.

The authors test this method by training math models and evaluating them on the difficult MATH benchmark (competition mathematics problems only).

āž”ļø Their technique seems like a new must-do in LLM training: Training is much faster and reaches an impressive performance!

š‘šžš¬š®š„š­š¬:
ā—† ā±ļø Training is x5 to x10 faster to reach equivalent performance compared to standard language modeling.
ā—† šŸ’Ŗ Their 1B model achieves close to GPT4 Chain-of-Thought performance on MATH!
ā—† šŸš€ Their 7B model match performance of the state-of-the-art DeepSeek for the same size, while trained on only 3% of tokens

š€ššš¢š­š¢šØš§ššš„ š¢š§š¬š¢š š”š­š¬ šŸ’”
ā—† Datasets used for pre-training, even after pre-filtering, still contain a large proportion of noisy tokens šŸ˜–
ā—† The authors show that when you reduce the loss on noisy tokens, you actually reduce accuracy (Figure 7). So Selective Language Modeling seems fundamental! āœ…

Find great reads in @akhaliq's Daily Papers šŸ‘‰ https://huggingface.co/papers
Paper added to my collection šŸ‘‰ m-ric/spinning-up-in-llms-659e698f9dd5a71bd3f579a7
