The LASER technique: Evaluating SVD compression

Community Article Published April 4, 2024

Introduction

What is Truncated Single Value Decomposition

Main thesis

Applying the LASER technique onto the top layers

Evaluation on a simple generation task

Evaluation on HumanEval

Conclusion

Resources

Acknoledgements

Citation

Introduction

A recent paper by Sharma et al showcases the effectiveness of truncated Single Value Decomposition (tSVD) in improving the results of various Large Language Models (LLMs). The authors named this technique LAyer SElective Rank reduction (LASER). In this short post, I want to apply tSVD onto Mistral-7B-instruct-v0.1 and record the accuracy and memory gains of the LASER technique for this very specific LLM.

Rather than just reproduce the results of the paper, this post aims to measure how many layers can be approximated with LASER and still obtain a model that is at least as performant as the original one. The code to reproduce the result can be found in this github repo.

What is Truncated Single Value Decomposition

The aim of tSVD is to reduce an arbitrary matrix into 3 linear operation: rotation + rescaling + another rotation. The rotation parts are achieved through linear transformations using unitary matrices, while rescaling employs a set of scalar values called singular values. By sorting the set of singular values in descending order and trucating it to the first q one achieves an approximate description of the original matrix. More formally, given an original $m \times n$ matrix $M$ then tSVD builds an approximate matrix

$\tilde{M} = U\,\Sigma\,V$

by calculating the $q \times n$ matrix $U$ , the $m \times q$ matrix $V$ , and the diagonal square $q \times q$ matrix $\Sigma$ .

In pytorch one can use the svd_lowrank command to compute $U$ , $\Sigma$ , and $V$

U, sigma, V = torch.svd_lowrank(weight, q)

The rank r of the original $m \times n$ matrix is defined within this post as $\min(m, n)$ . In the following, I use the ratio $q / r$ instead of $q$ to describe how many singular values are calculated. This ratio then defines a single parameter for all the weight matrices upon which the approximation is applied to.

Notice that if this ratio is small enough then the total number of tSVD parameters (in $U$ , $\Sigma$ , and $V$ ) is smaller than the number of parameters in the original matrix $M$ .

Main thesis

The main point of LASER is that by substituting weight matrices with their tSVD approximation one can achieve

Better performance (on some tasks)
Lower memory footprint (fewer parameters)

Regarding point 1: The better performance is thought to be achieved by eliminating statistical noise in a Principal Component Analysis fashion.

Regarding point 2: By representing the weights using the tSVD matrices there are fewer parameters. This is a straightforward corollary of the authors' work.

Please note that the performance degrades if the "wrong layers" are subjected to the procedure. For the tested language model, the top layers seems to be most effective where to apply LASER. Please also keep in mind that at no point the model is fine tuned. The only operation over the original model is the spectral decomposition algorithm described above.

Applying the LASER technique onto the top layers

In the following only one model is used: Mistral-7B-instruct-v0.1.

The chosen ratios for tSVD are 0.1, 0.25, and 0.5. For each ratio value, the number of total parameters per layer decreases as:

Ratio	Parameter count as percentage of the original
0.1	~17 %
0.25	~37 %
0.5	~70 %

The code that applies the LASER method goes through every layer - and every Linear transformation for each - then substitutes the weight matrices with the tSVD triplet. This is how the "LASER model" is created and saved.

import sys
import torch

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


class LaserLinear(torch.nn.modules.module.Module):
    U: torch.Tensor
    sigma: torch.Tensor
    V: torch.Tensor

    def __init__(self, weight: torch.Tensor, ratio: float):
        super().__init__()
        max_rank = min(weight.shape)
        q = int(max_rank * ratio)
        U, sigma, V = torch.svd_lowrank(weight, q=q, niter=2)
        self.U = torch.nn.Parameter(U)
        self.sigma = torch.nn.Parameter(sigma)
        self.V = torch.nn.Parameter(V)

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        return input @ (self.U @ torch.diag(self.sigma) @ self.V.T).T


if __name__ == "__main__":
    ratio = 0.1
    mlp_names = [
        "mlp.down_proj",
        "mlp.up_proj",
        "mlp.gate_proj",
        "self_attn.q_proj",
        "self_attn.k_proj",
        "self_attn.v_proj",
        "self_attn.o_proj",
    ]
    sys.stdout.flush()
    for layer_index in reversed(range(len(model.base_model.layers))):
        for mlp_name in mlp_names:
            original_linear = eval(f"model.base_model.layers[layer_index].{mlp_name}")
            weight = original_linear.weight
            exec(f"model.model.layers[layer_index].{mlp_name} = LaserLinear(weight, ratio)")
            sys.stdout.flush()

        torch.save(model.state_dict(), f"mistral-7b-instruct-laser-{ratio}")

The saved parameters are then merged with the original model up to a threshold layer n (chosen between 20 to 31) as in the following picture:

This procedure creates a set of models controlled by the parameter n. The idea is to evaluate how many top layers can be transformed using the LASER technique and still retain the functionality of the original model.

Evaluation on a simple generation task

First, the model is evaluated on a simple generation task. The prompt is the same for all generations: "the capital of Britain is". (Britain has no capital per se but this discussion is outside the scope of this post).

The maximum number of tokens is set to 40.

When the threshold layer is 31, the whole original model is used. When the threshold layer is 30, the LASER technique is applied only to the top layer. More generally, the original model layers are the ones below the threshold. The LASER technique is applied above the threshold layer.

Simple generation - Ratio=0.1

Threshold layer 31: <s> the capital of Britain is London.</s>
Threshold layer 30: <s> the capital of Britain is London.</s>
Threshold layer 29: <s> the capital of Britain is London. London is located in south east England and is the largest city in the United Kingdom. It is known for its history, culture, and landmarks such as Buck
Threshold layer 28: <s> the capital of Britain is London. London is located in southern England and is one of the oldest cities in the world. It is known for its rich history, culture, and landmarks such as
Threshold layer 27: <s> the capital of Britain is London London is located on England' London is known for its iconic landmarks such as Tower Bridge, Tower Eye, Buckingham Palace, West West West West West West
Threshold layer 26: <s> the capital of Britain is London. London is located southwest England near London River. London is known for its historical landmarks such as Buckingham Palace, Tower Bridge and Tower Tower. London is
Threshold layer 25: <s> the capital of Britain is London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London
Threshold layer 24: <s> the capital of Britain is London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London
Threshold layer 23: <s> the capital of Britain is London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London
Threshold layer 22: <s> the capital of Britain is London London is located southwest London London city London city London city London city London city London city London city London city London city London city London city London city London city London
Threshold layer 21: <s> the capital of Britain is London London is located London is located London London is located London London is located London London London London London London London London London London London London London London London London London London London
Threshold layer 20: <s> the capital of Britain is London London is located London is located London is located London is located London is located London is located London is located London is located London is located London is located London London London

Simple generation - Ratio=0.25

Threshold layer 31: <s> the capital of Britain is London.</s>
Threshold layer 30: <s> the capital of Britain is London.</s>
Threshold layer 29: <s> the capital of Britain is London. London is located in south east England. London is one of the most populous cities in Europe. London is known for its history, culture, and landmarks
Threshold layer 28: <s> the capital of Britain is London. London is located in southern England and is the largest city in the United Kingdom. It is known for its rich history, cultural diversity, and iconic landmarks
Threshold layer 27: <s> the capital of Britain is London. London is located in southern England. London is one of the most populous cities in Europe. London is known for its iconic landmarks such as Buckingham
Threshold layer 26: <s> the capital of Britain is London. London is located in England, which is part of Britain. London is the largest city in Britain and Europe. London is known for its iconic landmarks such
Threshold layer 25: <s> the capital of Britain is London. London is capital city of England and UK. London is capital city of England and UK. London is capital city of England and UK. London is capital city of
Threshold layer 24: <s> the capital of Britain is London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London London
Threshold layer 23: <s> the capital of Britain is London. London is located on the River Th London is located on the River Th London is located on the River Th London is located on the River Th London is located on
Threshold layer 22: <s> the capital of Britain is London. London is located on the England's southwest coast. London is the capital city of England and the capital city of London is London. London is located on
Threshold layer 21: <s> the capital of Britain is London. London is located on the River Th London is located on the River Th London is located on the River Th London is located on the River Th London is located on
Threshold layer 20: <s> the capital of Britain is London London is capital city of Britain London is capital city of Britain London is capital city of Britain London is capital

Simple generation - Ratio=0.5

Threshold layer 31: <s> the capital of Britain is London.</s>
Threshold layer 30: <s> the capital of Britain is London.</s>
Threshold layer 29: <s> the capital of Britain is London. London is located in south east England and is the largest city in the United Kingdom. It is one of the most populous cities in Europe and is known for
Threshold layer 28: <s> the capital of Britain is London. London is the largest city in Britain and one of the most populous cities in Europe. London is known for its rich history, culture, and landmarks such
Threshold layer 27: <s> the capital of Britain is London. London is located in southern England and is the largest city in Britain. London is known for its rich history, culture, and landmarks such as Buckingham Palace
Threshold layer 26: <s> the capital of Britain is London. London is the largest city in Britain and Europe. London is known for its history, culture, and landmarks such as Buckingham Palace, Tower Bridge, and
Threshold layer 25: <s> the capital of Britain is London. London is the largest city in Britain and the United Kingdom's capital and largest city. London is located in the southeast of England on the River Thames
Threshold layer 24: <s> the capital of Britain is London. London is the capital city of England and the United Kingdom. London is located in south-east England, on the River Thames. London is one of the
Threshold layer 23: <s> the capital of Britain is London. London is one of the most populous cities in Europe and is home to many famous landmarks such as Buckingham Palace, the Tower of London, and the
Threshold layer 22: <s> the capital of Britain is London. London is located in south east England. London is the largest city in Britain and the fourth largest city in Europe by population. London is home to many famous land
Threshold layer 21: <s> the capital of Britain is London. London is located in south east England. London is the largest city in Britain and the fourth largest city in Europe. London is home to many famous landmarks such
Threshold layer 20: <s> the capital of Britain is London. London is located in south England. London is the largest city in Britain and the United Kingdom. London is home to many famous landmarks including Buckingham Palace,

Evaluation on HumanEval

Using the same merging technique as above, one can evaluate the resulting model on the HumanEval dataset. In the interest of computing time, only the metric Pass@1 is considered and only the top 4 layers are substituted. The maximum number of tokens for the generation task is set to 1024. For each ratio value the best result is marked in bold.

Threshold layer	Pass@1 with ratio=0.1	Pass@1 with ratio=0.25	Pass@1 with ratio=0.5
31 (original model)	0.1768	0.1768	0.1768
30	0.1707	0.1403	0.1829
29	0.1524	0.2012	0.2134
28	0.0183	0.1463	0.2134
27	0.0060	0.0366	0.2195

Discussion

The LASER technique - as implemented here - does indeed seem to bring higher performance, at least on the HumanEval test. One can see that by using a ratio $q / r$ of 0.25 the best results are obtained by substituting the weights in the top 2 layers with their tSVD approximation. For ratio = 0.5 the best result is achieved by using the LASER technique on the top 4 layers.

Substituting the original weight matrix with a tSVD approximation brings some memory gains. The number of total parameters as percentage of the original is presented below:

Threshold layer	Parameters % for ratio=0.1	Parameters % for ratio=0.25	Parameters % for ratio=0.5
31 (original model)	100%	100%	100%
30	~ 97%	~ 98%	~ 99%
29	~ 95%	~ 96%	~ 98%
28	~ 92%	~ 94%	~ 97%
27	~ 90%	~ 92%	~ 96%

Conclusion

The work presented in this post aims at building the smallest possible model that can perform. For this reason, it is not directly comparable to the original paper.

Even so, the LASER technique - which implements tSVD to approximate linear transformations - is effective in memory reduction and can succesfully be applied onto the Mistral-7B-instruct model. In the limited testing done in this blog there is some evidence of performance increase.

It is quite puzzling to see an approximation work better than the original model! This method looks like a promising start for more exploratory work. In the end - within the narrow and temporary scope of this blog - the tSVD process performs well if:

the approximation is applied only to a handful of layers at the top of the model.
the ratio $q / r$ is greater or equal than 0.25 - i.e., the number of truncated singular values should be at least 25% of the rank of the original matrix.

Resources

The paper itself is here.
The blog post by the authors is here. It contains a link to the paper's code.
The github repository associated to this blog post (not the original paper's) is here.

Acknoledgements

This paper has been suggested to me by @MickeyShaughnes on Twitter.

Citation

@misc{sharma2023truth,
      title={The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction}, 
      author={Pratyusha Sharma and Jordan T. Ash and Dipendra Misra},
      year={2023},
      eprint={2312.13558},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Upvote