
compatible with Llama

#29
by cArlIcon - opened
No description provided.
richardllin changed pull request status to open
richardllin changed pull request status to merged

Yi-34B's generation became 10x slower on 4xA10 GPUs after replacing YiForCausalLM with LlamaForCausalLM.
Any idea why?

Hi @rodrigo-nogueira, I'm not sure what the root cause is, but could you give Flash Attention a try by loading the model with use_flash_attention_2=True?

More context can be found here:
https://huggingface.co/docs/transformers/v4.35.2/en/perf_infer_gpu_one#Flash-Attention-2
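As a minimal sketch of what that looks like (assuming transformers v4.35.x with flash-attn installed; the repo id, dtype, and prompt below are illustrative assumptions, not taken from this thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "01-ai/Yi-34B"  # assumed repo id, for illustration only

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # Flash Attention 2 requires fp16/bf16 weights
    device_map="auto",            # shard the model across the available GPUs
    use_flash_attention_2=True,   # the flag suggested above
)

inputs = tokenizer("Hello,", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that in later transformers releases this flag was superseded by attn_implementation="flash_attention_2", but the keyword above matches the v4.35.2 docs linked here.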

Thank you very much, it is much faster now.
