torch.cuda.OutOfMemoryError: CUDA out of memory.

#114
by sonwh98 - opened

I have an NVIDIA RTX 4070 Super and a Threadripper with 64 GB of RAM, but I'm running into memory problems. This should be enough hardware to run Llama 3 locally, right?

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████| 4/4 [00:13<00:00, 3.38s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sto/.local/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 1108, in pipeline
    return pipeline_class(model=model, framework=framework, task=task, **kwargs)
  File "/home/sto/.local/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 84, in __init__
    super().__init__(*args, **kwargs)
  File "/home/sto/.local/lib/python3.10/site-packages/transformers/pipelines/base.py", line 882, in __init__
    self.model.to(self.device)
  File "/home/sto/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2692, in to
    return super().to(*args, **kwargs)
  File "/home/sto/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/home/sto/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/sto/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/sto/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/home/sto/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/home/sto/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 11.72 GiB of which 94.50 MiB is free. Including non-PyTorch memory, this process has 11.60 GiB memory in use. Of the allocated memory 11.42 GiB is allocated by PyTorch, and 1.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
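A quick back-of-the-envelope check (assuming roughly 8.03B parameters as listed on the model card and 2 bytes per parameter for bfloat16; nothing here is measured) shows why this overflows a 12 GB card: the weights alone need about 15 GiB, and device="cuda" tries to place all of them on the GPU, so the 64 GB of system RAM never comes into play.

# Rough VRAM estimate for the weights alone (assumption: ~8.03e9 params, bf16 = 2 bytes each)
n_params = 8.03e9
bytes_per_param = 2  # bfloat16
print(f"~{n_params * bytes_per_param / 2**30:.1f} GiB of weights")  # ~15.0 GiB vs. the 11.72 GiB total capacity reported above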


It should work. I'm running Llama 3 8B and it works fine (in Ollama, that is), so it is possible :)

I keep asking myself why Ollama runs so much more smoothly, while with transformers it sometimes doesn't run at all. I'm using an RTX 3070 12 GB and a Ryzen 7.

Have you found any solution?

Yes, I was missing the quantization part! @Littlebox692

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization; computation runs in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# dir is the local checkpoint path or a Hub id such as "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(
    dir,
    device_map="auto",
    quantization_config=bnb_config,
)

This "quantization_config" passed to AutoModel constructor did the trick. Basicallt a conversion of data from a 32-bit floating-point number (FP32) to an 8-bit or 4-bit integer.

Meta Llama org

Hi,
Indeed, using quantized versions of the model can help reduce the VRAM required to run it. Please see https://huggingface.co/docs/transformers/quantization for more details on all supported quantization schemes.
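As one example from that page, a sketch of the 8-bit bitsandbytes variant (assuming bitsandbytes is installed; an 8B model in 8-bit needs roughly 8-9 GB for weights, so it is tighter than 4-bit on a 12 GB card):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)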
