HuggingFace's bitsandbytes vs AutoGPTQ?

#33 · opened by chongcy

Hi all,

Sorry for asking here, but I don't really understand the trade-offs between bitsandbytes and AutoGPTQ for quantization, or what each one is meant for.

I just tried loading a base Meta model with "load_in_8bit=True" in AutoModelForCausalLM.from_pretrained(), and the speed was crazy fast. The model was quantized to 8-bit/4-bit through the bitsandbytes integration just by adding one extra parameter, with no separate quantization step like AutoGPTQ requires. VRAM usage was also lower, which I assume is good. I'm still in the exploration phase, trying to understand the best and easiest way to quantize a model.

Hope someone can shed some light on this. Thanks.
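For reference, this is roughly what I ran; the model ID below is just a placeholder, and I'm using the explicit BitsAndBytesConfig form that newer transformers versions prefer over the bare load_in_8bit flag:

```python
# Rough sketch of 8-bit loading via the bitsandbytes integration.
# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model ID

# Equivalent to load_in_8bit=True, expressed as an explicit config object
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # places the quantized weights on the GPU
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```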

I just read https://huggingface.co/blog/overview-quantization-transformers, and it says that bitsandbytes is easier to use but a bit slower at larger batch sizes. Both are now natively supported in Hugging Face's transformers. I am also learning about quantization.
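Since both are native to transformers, a model that has already been quantized with GPTQ can be loaded through the same API. A minimal sketch, assuming the extra packages are installed and using an example GPTQ repo name:

```python
# Sketch: loading a pre-quantized GPTQ checkpoint directly with transformers.
# Requires: pip install transformers optimum auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer

gptq_model_id = "TheBloke/Llama-2-7B-GPTQ"  # example repo with GPTQ weights

tokenizer = AutoTokenizer.from_pretrained(gptq_model_id)
model = AutoModelForCausalLM.from_pretrained(
    gptq_model_id,
    device_map="auto",  # GPTQ kernels run on the GPU
)
```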

GPTQ is usually much faster than bitsandbytes and is supposed to use less memory?

I think the reason it was slow might have been that your model was doing inference on the CPU instead of the GPU, whereas bitsandbytes automatically runs inference on the GPU.
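A quick way to check where the weights actually ended up, assuming a transformers model loaded as in the sketches above:

```python
# Sketch: confirm the model is running on GPU rather than CPU.
print(next(model.parameters()).device)        # e.g. cuda:0 vs cpu
print(getattr(model, "hf_device_map", None))  # layer-to-device map when device_map="auto" was used
```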

Also, the fastest inference right now would be ExLlama with GPTQ, but that only supports Llama models or fine-tuned variants like this one. It can easily reach over 40 tokens per second on a free Colab.
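Transformers can also toggle the ExLlama kernels for GPTQ models. This is only a sketch, not a definitive recipe: the flag name depends on your transformers version, and the repo name is just an example.

```python
# Sketch: enabling ExLlama kernels for a GPTQ model through transformers.
# Flag name varies by version: use_exllama in recent releases, disable_exllama in older ones.
from transformers import AutoModelForCausalLM, GPTQConfig

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",  # example GPTQ repo
    device_map="auto",
    quantization_config=GPTQConfig(bits=4, use_exllama=True),
)
```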
