Aymeric Roucher

m-ric

AI & ML interests

MLE at Hugging Face šŸ¤— LLMs, Agents, RAG, Multimodal.

Posts

šŸ’°āŒ š‘šžš¬šžššš«šœš” šŸšØš« š­š”šž šÆšžš«š² š†šš” ššØšØš« - š’šœššš„š¢š§š  š„ššš°š¬ š«šžš©š„š¢šœššš­š¢šØš§

šŸŽ† Good news: š˜†š—¼š˜‚ š—°š—®š—» š—±š—¼ š—°š˜‚š˜š˜š—¶š—»š—“-š—²š—±š—“š—² š—暝—²š˜€š—²š—®š—暝—°š—µ š˜„š—¶š˜š—µ š—® š—°š—®š—¹š—°š˜‚š—¹š—®š˜š—¼š—æ š—®š—»š—± š— š—¶š—°š—暝—¼š˜€š—¼š—³š˜ š—£š—®š—¶š—»š˜ šŸ®šŸ¬šŸ¬šŸ²!

The Chinchilla experiments (by Google DeepMind) ran hundreds of pre-trainings, with models ranging from 70M up to 16B parameters (I do not want to imagine how much that cost), to š—³š—¶š—»š—± š˜š—µš—² š—¼š—½š˜š—¶š—ŗš—®š—¹ š—暝—®š˜š—¶š—¼ š—¼š—³ š—ŗš—¼š—±š—²š—¹ š˜€š—¶š˜‡š—² š˜ƒš˜€ š˜š—暝—®š—¶š—»š—¶š—»š—“ š˜š—¼š—øš—²š—»š˜€. Why is this question so important?
Well, you only ever have access to a fixed compute budget, counted in FLOPs (floating-point operations). So if your model is bigger, you have fewer tokens to train it on, and if you want to train on more tokens, your model has to be smaller. When model trainings cost millions, you absolutely need to get this trade-off right.
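To put numbers on that trade-off, here is a back-of-the-envelope sketch in Python, using the common approximation that training compute C is about 6*N*D FLOPs for N parameters and D tokens (the budget below is made up):

```python
# Fixed compute budget C (in FLOPs); with C ~ 6*N*D, the token count is D = C / (6*N).
# Double the parameter count and you halve the tokens you can afford to train on.
C = 1e24  # hypothetical FLOP budget
for n_params in (35e9, 70e9, 140e9):
    n_tokens = C / (6 * n_params)
    print(f"{n_params / 1e9:.0f}B params -> {n_tokens / 1e12:.2f}T tokens")
```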

The new paper "Chinchilla Scaling: A replication attempt" by Epoch AI sets out on the ambitious goal of reproducing this.

But since the authors do not have infinite money, they decided to work directly from DeepMind's own published experiments! They took the figure from the last experiment (cf. the figure below), measured the positions of the points, decoded the color scale, and ended up reconstructing the underlying data.

šŸ’„ They then fit the scaling law proposed by the Chinchilla authors to this reconstructed data, but arrived at wildly different results! They find that, as a rough rule of thumb, you should use about 20 training tokens for each parameter in your model, instead of the ~70 implied by the original paper's fit. They also point out inconsistencies in the paper and unrealistically narrow confidence intervals.
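For the curious, here is a minimal sketch of what such a fit looks like: the Chinchilla parametric form L(N, D) = E + A/N^alpha + B/D^beta, fitted with a Huber loss on log-residuals. The data points and initial guess below are placeholders, not the reconstructed values:

```python
import numpy as np
from scipy.special import logsumexp, huber
from scipy.optimize import minimize

# Placeholder (model size N, training tokens D, final loss L) triples;
# the real fit uses the points reconstructed from DeepMind's figure.
N = np.array([4e8, 1e9, 2.5e9, 7e9, 16e9])
D = np.array([8e9, 2e10, 5e10, 1.4e11, 3e11])
L = np.array([2.95, 2.72, 2.51, 2.32, 2.18])

def objective(theta, delta=1e-3):
    a, b, e, alpha, beta = theta
    # log L_hat(N, D) = logsumexp(a - alpha*log N, b - beta*log D, e)
    log_pred = logsumexp([a - alpha * np.log(N),
                          b - beta * np.log(D),
                          e * np.ones_like(N)], axis=0)
    # Huber loss on the log-residuals
    return huber(delta, log_pred - np.log(L)).sum()

# The real fit sweeps a grid of initialisations; one starting point for illustration.
res = minimize(objective, x0=np.array([6.0, 7.0, 0.5, 0.35, 0.35]), method="L-BFGS-B")
a, b, e, alpha, beta = res.x
print({"A": np.exp(a), "B": np.exp(b), "E": np.exp(e), "alpha": alpha, "beta": beta})
```

The disagreement between the two papers essentially comes down to the values of these five fitted parameters, and the tokens-per-parameter ratio they imply.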

āž”ļø This only contradicts the results from the last (out of 3) experiments in the Chinchilla paper. And the model trained at the end of the Chinchilla paper still seems properly scaled.

āœ… But it does show that a tiny bit more theoretical work can go a long way, especially given the huge financial costs that such an error can have!
šššš©šžš« š‘šžšÆš¢šžš°: š‘š”šØ-šŸ - šƒšØ š§šØš­ š®š¬šž ššš„š„ š­šØš¤šžš§š¬ šžšŖš®ššš„š„š² š¢š§ š²šØš®š« š­š«ššš¢š§š¢š§š ! āš–ļøā›”ļø

A new paper topping the Daily Papers questions a hidden assumption in LLM training:

šŸ¤” š™Žš™š™¤š™Ŗš™”š™™ š™¬š™š š™§š™šš™–š™”š™”š™® š™Ŗš™Øš™š š™–š™”š™” š™©š™¤š™ š™šš™£š™Ø š™šš™¦š™Ŗš™–š™”š™”š™® š™žš™£ š™¤š™Ŗš™§ š™‡š™‡š™ˆ'š™Ø š™©š™§š™–š™žš™£š™žš™£š™œ ?

Some tokens are more relevant than others, and some are mostly noise (just look up the history of š˜šš˜°š˜­š˜Ŗš˜„š˜Žš˜°š˜­š˜„š˜”š˜¢š˜Øš˜Ŗš˜¬š˜¢š˜³š˜±).

So this paper introduces š—¦š—²š—¹š—²š—°š˜š—¶š˜ƒš—² š—Ÿš—®š—»š—“š˜‚š—®š—“š—² š— š—¼š—±š—²š—¹š—¶š—»š—“, which is actually really simple:
āž”ļø A specific metric measures the relevance of each token. Then during training, only the top k% tokens for this relevance metric count in the loss calculation.

The authors test this method by training math models and evaluating them on the difficult MATH benchmark (competition mathematics problems only).

āž”ļø Their technique seems like a new must-do in LLM training: Training is much faster and reaches an impressive performance!

š‘šžš¬š®š„š­š¬:
ā—† ā±ļø Training is x5 to x10 faster to reach equivalent performance compared to standard language modeling.
ā—† šŸ’Ŗ Their 1B model achieves close to GPT4 Chain-of-Thought performance on MATH!
ā—† šŸš€ Their 7B model match performance of the state-of-the-art DeepSeek for the same size, while trained on only 3% of tokens

š€ššš¢š­š¢šØš§ššš„ š¢š§š¬š¢š š”š­š¬ šŸ’”
ā—† Datasets used for pre-training, even after pre-filtering, still contain a large proportion of noisy tokens šŸ˜–
ā—† The authors show that when you reduce the loss on noisy tokens, you actually reduce accuracy (Figure 7). So Selective Language Modeling seems fundamental! āœ…

Find great reads in @akhaliq's Daily Papers šŸ‘‰ https://huggingface.co/papers
Paper added to my collection šŸ‘‰ m-ric/spinning-up-in-llms-659e698f9dd5a71bd3f579a7
