torch.cuda.OutOfMemoryError: CUDA out of memory.

#114
by sonwh98 - opened

I have an NVIDIA RTX 4070 Super and a Threadripper with 64 GB of RAM, but I'm running into memory problems. This should be enough hardware to run Llama 3 locally, right?

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████| 4/4 [00:13<00:00, 3.38s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sto/.local/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 1108, in pipeline
    return pipeline_class(model=model, framework=framework, task=task, **kwargs)
  File "/home/sto/.local/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 84, in __init__
    super().__init__(*args, **kwargs)
  File "/home/sto/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 882, in __init__
    self.model.to(self.device)
  File "/home/sto/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2692, in to
    return super().to(*args, **kwargs)
  File "/home/sto/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/home/sto/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/sto/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/sto/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/home/sto/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/home/sto/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 11.72 GiB of which 94.50 MiB is free. Including non-PyTorch memory, this process has 11.60 GiB memory in use. Of the allocated memory 11.42 GiB is allocated by PyTorch, and 1.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
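A quick back-of-the-envelope check (assuming roughly 8.03B parameters as listed on the model card and 2 bytes per parameter for bfloat16; nothing here is measured) shows why this overflows a 12 GB card: the weights alone need about 15 GiB, and device="cuda" tries to place all of them on the GPU, so the 64 GB of system RAM never comes into play.

# Rough VRAM estimate for the weights alone (assumption: ~8.03e9 params, bf16 = 2 bytes each)
n_params = 8.03e9
bytes_per_param = 2  # bfloat16
print(f"~{n_params * bytes_per_param / 2**30:.1f} GiB of weights")  # ~15.0 GiB vs. the 11.72 GiB total capacity reported above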


It should work. I'm running Llama 3 8B and it works fine (in Ollama, that is), so it is possible :)

I keep asking myself why Ollama runs so much more smoothly, while with transformers it sometimes doesn't run at all. I'm using an RTX 3070 12 GB and a Ryzen 7.

Have you found any solution?

Yes, I was missing the quantization part! @Littlebox692

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization; computation runs in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# dir is the local checkpoint path or a Hub id such as "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(
    dir,
    device_map="auto",
    quantization_config=bnb_config,
)

This "quantization_config" passed to AutoModel constructor did the trick. Basicallt a conversion of data from a 32-bit floating-point number (FP32) to an 8-bit or 4-bit integer.

Meta Llama org

Hi,
Indeed, using quantized versions of the model can help reduce the VRAM required to run it. Please see https://huggingface.co/docs/transformers/quantization for more details on all supported quantization schemes.
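As one example from that page, a sketch of the 8-bit bitsandbytes variant (assuming bitsandbytes is installed; an 8B model in 8-bit needs roughly 8-9 GB for weights, so it is tighter than 4-bit on a 12 GB card):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)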
