Tim Dolan

macadeliccc

macadeliccc's activity

posted an update about 1 month ago
Fine-tune Phi-3 using a Samantha-themed dataset and the Hugging Face SFT trainer!

In this Colab, we simply apply a supervised fine-tune to Phi-3 using the ShareGPT format.

def formatting_prompts_func(examples):
    # EOS_TOKEN is assumed to be defined beforehand, e.g. EOS_TOKEN = tokenizer.eos_token
    convos = examples["conversations"]
    texts = []
    # Map ShareGPT roles ("from" fields) to chat-template role tags
    mapper = {"system": "system\n", "human": "\nuser\n", "gpt": "\nassistant\n"}
    end_mapper = {"system": "", "human": "", "gpt": ""}
    for convo in convos:
        # Concatenate every turn into one training string, then append EOS
        text = "".join(f"{mapper[(turn := x['from'])]} {x['value']}\n{end_mapper[turn]}" for x in convo)
        texts.append(f"{text}{EOS_TOKEN}")
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
print(dataset['text'][8])
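
From there, a minimal TRL SFTTrainer setup over the mapped dataset might look like this (a sketch: the hyperparameters are illustrative, and model/tokenizer are assumed to be loaded already, e.g. from microsoft/Phi-3-mini-4k-instruct):

from transformers import TrainingArguments
from trl import SFTTrainer

# model and tokenizer are assumed to be loaded beforehand
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # the column produced by formatting_prompts_func
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="phi-3-samantha",  # hypothetical output path
    ),
)
trainer.train()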

Opus Samantha consists of 1,848 samples with the Samantha personality. The dataset covers a wide variety of topics such as logical reasoning, mathematics, legal reasoning, and roleplay.

This notebook serves as a viable option for fine-tuning Phi-3 until Unsloth supports it, which should be very soon. When that happens, check out AutoSloth for SFT, DPO, and Langfuse-format RAG fine-tuning on free-tier Colab hardware.

Resources:
Dataset: macadeliccc/opus_samantha
Colab: https://colab.research.google.com/drive/1e8LILflDQ2Me52hwS7uIfuJ9DxE2oQzM?usp=sharing
AutoSloth: https://colab.research.google.com/drive/1Zo0sVEb2lqdsUm9dy2PTzGySxdF9CNkc#scrollTo=bpimlPXVz-CZ
posted an update 3 months ago
Quantize 7B-parameter models in 60 seconds using Half-Quadratic Quantization (HQQ).

This game-changing technique allows for rapid quantization of models like Llama-2-70B in under 5 minutes, outperforming traditional methods by 50x in speed and offering high-quality compression without calibration data.

Mobius Labs' innovative approach not only significantly reduces memory requirements but also enables the use of large models on consumer-grade GPUs, paving the way for more accessible and efficient machine learning research.

Mobius Labs' method utilizes a robust optimization formulation to determine the optimal quantization parameters, specifically targeting the minimization of errors between the original and dequantized weights. This involves a loss function that promotes sparsity through a non-convex ℓp norm (with p < 1), making the problem challenging yet solvable through a half-quadratic solver.

This solver simplifies the problem by introducing an extra variable and dividing the optimization into manageable sub-problems. Their implementation cleverly fixes the scale parameter to simplify calculations and focuses on optimizing the zero-point, utilizing closed-form solutions for each sub-problem to bypass the need for gradient calculations.
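
To make this concrete, here is a rough NumPy sketch of the alternating scheme (per-tensor rather than per-group for brevity; the shrinkage operator follows the blog post's generalized soft-thresholding, but the function names and β schedule here are illustrative, not Mobius Labs' exact code):

import numpy as np

def shrink_lp(x, beta, p=0.7):
    # Generalized soft-thresholding: the closed-form proximal step
    # for the sparsity-promoting non-convex |x|^p loss with p < 1.
    mag = np.abs(x) + 1e-8  # epsilon avoids 0 ** (p - 1)
    return np.sign(x) * np.maximum(np.abs(x) - (1.0 / beta) * mag ** (p - 1), 0.0)

def optimize_zero_point(W, scale, z, iters=20, beta=1.0, kappa=1.01):
    # The scale is held fixed; only the zero-point z is optimized.
    for _ in range(iters):
        W_q = np.round(W / scale + z)         # quantize
        W_r = scale * (W_q - z)               # dequantize
        W_e = shrink_lp(W - W_r, beta)        # sub-problem 1: prox step on the error
        z = np.mean(W_q - (W - W_e) / scale)  # sub-problem 2: closed-form zero-point
        beta *= kappa                         # anneal the half-quadratic penalty
    return z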

Check out the Colab demo, where you can quantize models (text generation and multimodal) for use with the vLLM or timm backends as well as transformers!

AutoHQQ: 👉 https://colab.research.google.com/drive/1cG_5R_u9q53Uond7F0JEdliwvoeeaXVN?usp=sharing
Code: https://github.com/mobiusml/hqq
HQQ Blog post: https://mobiusml.github.io/hqq_blog/

Edit: Here is an example of how powerful HQQ can be: macadeliccc/Nous-Hermes-2-Mixtral-8x7B-DPO-HQQ

Citations:

@misc{badri2023hqq,
  title  = {Half-Quadratic Quantization of Large Machine Learning Models},
  url    = {https://mobiusml.github.io/hqq_blog/},
  author = {Hicham Badri and Appu Shaji},
  month  = {November},
  year   = {2023}
}
posted an update 3 months ago
Create synthetic instruction datasets using open-source LLMs and Bonito 🐟!

With Bonito, you can generate synthetic datasets for a wide variety of supported tasks.

The Bonito model introduces a novel approach for conditional task generation, transforming unannotated text into task-specific training datasets to facilitate zero-shot adaptation of large language models on specialized data.

This methodology not only improves the adaptability of LLMs to new domains but also showcases the effectiveness of synthetic instruction tuning datasets in achieving substantial performance gains.
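
A minimal usage sketch, following the Bonito README (the example dataset and parameters come from there; swap in your own unannotated corpus):

from bonito import Bonito
from vllm import SamplingParams
from datasets import load_dataset

# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1")

# Load a small slice of unannotated text
unannotated = load_dataset(
    "BatsResearch/bonito-experiment", "unannotated_contract_nli"
)["train"].select(range(10))

# Generate a synthetic NLI instruction-tuning dataset
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic = bonito.generate_tasks(
    unannotated,
    context_col="input",
    task_type="nli",
    sampling_params=sampling_params,
)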

AutoBonito🐟: https://colab.research.google.com/drive/1l9zh_VX0X4ylbzpGckCjH5yEflFsLW04?usp=sharing
Original Repo: https://github.com/BatsResearch/bonito?tab=readme-ov-file
Paper: Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation (2402.18334)
replied to vladbogo's post 4 months ago

Thank you! I will try it again and let you know if there are any issues.

replied to vladbogo's post 4 months ago

Love the concept! I am not sure if you are just making requests to an API, but I gave it a simple paragraph and it ran for 3800 seconds with no response.

I was going to suggest more powerful hardware, but that may not be the answer.

replied to their post 4 months ago

If you are interested in a notebook, do let me know. I have one, but the whole experiment will cost at minimum 250 dollars. I do not know how much it would cost using the new vLLM method; it seems to offer quite a few improvements.

posted an update 4 months ago
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

UCLA-AGI has shown that large language models, even weaker ones, can improve themselves using data produced only by the original model. The question they answer in their paper is:

"Can we empower a weak LLM to improve itself without acquiring additional human annotated data?"

They answer this question by proposing and testing a novel fine-tuning method they call Self-Play fIne-tuNing (SPIN). The process starts by applying a supervised fine-tune (SFT) to zephyr-7b using all 200k samples of HuggingFaceH4/ultrachat_200k, eliminating the need for a human annotator.

Once the model has completed SFT, the SPIN method generates 50k synthetic data pairs of 'chosen' and 'rejected' samples. The model is then fine-tuned on those generations, and the process repeats for another 3 iterations, for a total of 200k samples.
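
As a rough sketch of how those pairs are built in one iteration (illustrative only: the column names are hypothetical, and the actual SPIN implementation handles batched generation and the full training objective):

from datasets import Dataset

def build_spin_pairs(model, tokenizer, sft_dataset):
    # One SPIN iteration: ground-truth replies become 'chosen',
    # the current model's own generations become 'rejected'.
    pairs = []
    for sample in sft_dataset:
        inputs = tokenizer(sample["prompt"], return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=256, do_sample=True)
        reply = tokenizer.decode(
            out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        pairs.append({
            "prompt": sample["prompt"],
            "chosen": sample["response"],  # human-annotated reply
            "rejected": reply,             # synthetic reply from the current model
        })
    # Fine-tune on these pairs, then repeat with the updated model.
    return Dataset.from_list(pairs)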

This experiment is unique because they propose that their method can yield upwards of 10% performance gains without using any additional human-annotated data. The strategy was designed to improve weaker language models, but with further experimentation it could be a formidable strategy for improving stronger models as well.

If you would like to explore this strategy for yourself, here are some resources:
Colab: https://colab.research.google.com/drive/1IjDeNVBsRru2-hM_9aauD6gVI-VvJnpk?usp=sharing
Github: https://github.com/uclaml/SPIN
The product of the experiment: UCLA-AGI/zephyr-7b-sft-full-SPIN-iter3
Paper: Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models (2401.01335)
replied to their post 4 months ago

You are correct. Unsloth does utilize PEFT and TRL, but as I understand it they are not necessarily cross-compatible.

replied to their post 4 months ago

Thank you, I will test this feature and integrate it into the notebook.

posted an update 4 months ago
Fine-tune 7B models on free-tier Colab hardware using Unsloth 🦥

Unsloth is a framework for fine-tuning language models that boasts 0% loss in accuracy while using no approximation methods. It offers trainers for both supervised fine-tuning (SFT) and direct preference optimization (DPO) that can speed up fine-tuning by up to 5x.
This is achieved by adding LoRA adapters, so only 1 to 10% of the total parameters need to be trained. You can export the LoRA adapter or merge it to 16-bit for a full fine-tune. The resulting model is ready for use in vLLM for faster inference.
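
A minimal sketch of that setup through Unsloth's API (the LoRA hyperparameters here are illustrative defaults, not the only valid choice):

from unsloth import FastLanguageModel

# Load a 4-bit base model through Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of the weights train
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)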

Additionally, Hugging Face has integrated Unsloth into the documentation for DPO training and reported 18.6% performance gains on a T4.

This sets a new standard for fine-tuning large language models. If you would like to explore this methodology for yourself, I have provided a notebook, "AutoSloth," where you can fine-tune using either SFT or DPO; it will upload the result to HF with a prefilled Unsloth README 🦥 and a Q8_0 quantization.

The SFT example is set up for free-tier usage, but the DPO example is set up for an A100. The DPO example can be altered to work on a T4, but I wanted to include more than one example.

Colab Stats during training:
+ Model: unsloth/mistral-7b-bnb-4bit
+ Dataset: yahma/alpaca-cleaned
+ Batch size: 2
+ Gradient steps: 4
+ System RAM: 8.5 / 51.0 GB
+ VRAM (T4): 13.6 / 15.0 GB

Resources:
🦥Unsloth: https://github.com/unslothai/unsloth
🦥AutoSloth: https://colab.research.google.com/drive/1Zo0sVEb2lqdsUm9dy2PTzGySxdF9CNkc?usp=sharing
🤗HF-Unsloth-docs: https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth
🤗HF-Unsloth Blog Post: https://huggingface.co/blog/unsloth-trl
replied to their post 4 months ago

I tried a 7B on my 4090 and it used ~16.4 GB. The T4 has 16 GB, so it very well may not work.

You could definitely do some experimenting with a smaller model and a T4 though

replied to their post 4 months ago

No, a 7B model should work on a V100 as well. If you want to test a larger model, then it's likely you will need the A100.

replied to their post 4 months ago

I believe that this is related to the particular laser method. rmt_laser_snr and laser_snr_math are functional. I will look into the errors and see if it's related to my notebook.

posted an update 4 months ago
Reducing perplexity in LLMs through layer-selective rank reduction

Layer-Selective Rank Reduction (LASER) is a denoising method that improves reasoning by strategically removing higher-order components from weight matrices in the multi-layer perceptron (MLP) layers, without the need for additional parameters or training data. This process leverages singular value decomposition to identify and eliminate these components. This simple yet effective method has been shown to improve question-answering performance by up to 27.4 percentage points.

laserRMT implements this by calculating the signal-to-noise ratio (SNR) for each layer and selectively reducing the rank of those layers. The SNR computation leverages singular value decomposition (SVD) to separate the signal (higher-order components) from the noise (lower-order components) within the weight matrices of the model's layers. The SNR calculation determines which layers would benefit from rank reduction without compromising the model's integrity.

If a layer is identified that could benefit from rank reduction, its weight matrices enter an incremental process in which they are reduced and reconstructed by retaining only the singular values that surpass a threshold. In laserRMT, the threshold is calculated via the Marchenko-Pastur law:
@staticmethod
def marchenko_pastur_threshold(sigma, n, m):
    # sigma: noise-level estimate; n, m: weight-matrix dimensions
    beta = n / m if n < m else m / n
    # Upper edge of the Marchenko-Pastur distribution of singular values
    threshold = sigma * np.sqrt((1 + np.sqrt(beta)) ** 2)
    return threshold
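
With the threshold in hand, the rank-reduction step itself can be sketched with a plain SVD (an illustrative sketch, not laserRMT's exact code; it assumes the helper above is callable as a free function):

import numpy as np

def reduce_rank(W, sigma):
    # Decompose the weight matrix and rebuild it, keeping only the
    # singular values that exceed the Marchenko-Pastur threshold.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    t = marchenko_pastur_threshold(sigma, *W.shape)
    k = max(int(np.sum(S > t)), 1)  # retained rank (at least 1)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]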

The two primary benefits of applying this method are reduced computational overhead for large language models and, at the same time, improved output quality.

Credit to @ehartford @fernandofernandes @DavidGF for laserRMT

Resources:
☄️ AutoLaser: https://colab.research.google.com/drive/11j0e-w6BfvqeFN1gUrpOqdW0vcKqfVqP?usp=sharing
laserRMT: https://github.com/cognitivecomputations/laserRMT
The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction (2312.13558)
posted an update 4 months ago
Benefits of imatrix quantization in place of QuIP#

QuIP# is a quantization method proposed by Cornell-RelaxML (https://github.com/Cornell-RelaxML) that claims tremendous performance gains using only 2-bit precision.

RelaxML shows that by quantizing a model from 16-bit to 2-bit precision, Llama-2-70B can run on a single 24GB GPU.

QuIP# aims to revolutionize model quantization through a blend of incoherence processing and advanced lattice codebooks. By switching to a Hadamard transform-based incoherence approach, QuIP# enhances GPU efficiency, making weight matrices more Gaussian-like and ideal for quantization with its improved lattice codebooks.

This new method has already seen some adoption by projects like llama.cpp, where the QuIP# methodology has been implemented in the form of imatrix calculations. The importance matrix is calculated from a calibration dataset such as wiki.train.raw, and the tool outputs the perplexity on the given dataset.
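
For reference, the imatrix step in llama.cpp looks roughly like this (binary names and flags as of early-2024 builds; check the repo for current usage):

# Compute an importance matrix from a calibration file
./imatrix -m model-f16.gguf -f wiki.train.raw -o imatrix.dat

# Use it to guide the quantization
./quantize --imatrix imatrix.dat model-f16.gguf model-Q4_K_M.gguf Q4_K_M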

This interim step can improve the results of the quantized model. If you would like to explore this process for yourself:

llama.cpp - https://github.com/ggerganov/llama.cpp/
QuIP# paper - https://cornell-relaxml.github.io/quip-sharp/
AutoQuip# colab - https://colab.research.google.com/drive/1rPDvcticCekw8VPNjDbh_UcivVBzgwEW?usp=sharing

Other impressive quantization projects to watch:
+ AQLM
https://github.com/Vahe1994/AQLM
https://arxiv.org/abs/2401.06118