Question: Maximizing GPU Utilization for Inference

#24
by ric1732 - opened

I have a code snippet for text generation using Hugging Face's Transformers library. I am running inference on a machine with 8 GPUs, but during inference only 2 or 3 of them are active and GPU utilization stays below 32%. I want to optimize my code so that all 8 available GPUs are fully used.

Here is the code I am currently using:

from transformers import AutoModelForCausalLM, AutoTokenizer
import json
import math

batch_size = 32

if __name__=='__main__':
    txt_list = [
            "The King is dead. Long live the Queen.",
            "Once there were four children whose names were Peter, Susan, Edmund, and Lucy.",
            "The story so far: in the beginning, the universe was created.",
            "It was a bright cold day in April, and the clocks were striking thirteen.",
            "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.",
            "The sweat wis lashing oafay Sick Boy; he wis trembling.",
            "124 was spiteful. Full of Baby's venom.",
            "As Gregor Samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect.",
            "I write this sitting in the kitchen sink.",
            "We were somewhere around Barstow on the edge of the desert when the drugs began to take hold.",
        ] * 500
    lf = len(txt_list)

    # Load the tokenizer; device_map='auto' splits the model's layers across the available GPUs.
    tokenizer = AutoTokenizer.from_pretrained('mistralai/Mixtral-8x7B-v0.1')
    tokenizer.pad_token_id = tokenizer.eos_token_id
    tokenizer.padding_side = "left"
    model = AutoModelForCausalLM.from_pretrained('mistralai/Mixtral-8x7B-v0.1', device_map='auto')

    out_list = []
    n_steps = math.ceil(lf/batch_size)

    # Generate continuations batch by batch.
    for btx in range(n_steps):
        t_sens = txt_list[btx*batch_size:(btx+1)*batch_size]
        t_toks = tokenizer(t_sens, return_tensors='pt', padding=True).to('cuda')

        opt = model.generate(**t_toks, max_new_tokens=200)

        # Decode each sequence and strip the prompt prefix; iterate over len(t_sens)
        # so the final, smaller batch does not index past the end of the batch.
        for jty in range(len(t_sens)):
            ctxt = tokenizer.decode(opt[jty], skip_special_tokens=True)
            ctxt = ctxt[len(t_sens[jty]):].strip()
            out_list.append({'input': t_sens[jty], 'output': ctxt})

    # Write one JSON record per line.
    str_list = [json.dumps(xx) for xx in out_list]
    with open('rrr', 'w') as otf:
        otf.write('\n'.join(str_list))
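For what it's worth, the device placement can be inspected right after the model is loaded. As far as I understand, models loaded with a device_map expose an hf_device_map attribute mapping each module to the device it was assigned to, so a quick check inside the __main__ block looks like this:

from collections import Counter

print(model.hf_device_map)                    # which device each module ended up on
print(Counter(model.hf_device_map.values()))  # how many modules landed on each device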

While the code is functional, at most 3 GPUs are ever busy and utilization never comes close to the machine's full potential. How can I modify the code to maximize GPU utilization during inference?
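For context, one direction I have been considering, though I have not verified it, is to run two independent replicas of the model, each sharded over 4 of the 8 GPUs, and split the prompts between them in separate processes. This is only a rough sketch: run_replica, the GPU grouping, and the output file names are made up for illustration, and it assumes a copy of the model actually fits in the memory of 4 of my GPUs.

import os
import math
import json
import torch.multiprocessing as mp
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = 'mistralai/Mixtral-8x7B-v0.1'
BATCH_SIZE = 32

def run_replica(gpu_ids, prompts, out_path):
    # Pin this process to its own group of GPUs before anything touches CUDA,
    # so that device_map='auto' only shards this replica across those devices.
    os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(str(i) for i in gpu_ids)

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    tokenizer.pad_token_id = tokenizer.eos_token_id
    tokenizer.padding_side = "left"
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map='auto')

    results = []
    for b in range(math.ceil(len(prompts) / BATCH_SIZE)):
        batch = prompts[b * BATCH_SIZE:(b + 1) * BATCH_SIZE]
        toks = tokenizer(batch, return_tensors='pt', padding=True).to('cuda')
        out = model.generate(**toks, max_new_tokens=200,
                             pad_token_id=tokenizer.eos_token_id)
        for j, prompt in enumerate(batch):
            text = tokenizer.decode(out[j], skip_special_tokens=True)
            results.append({'input': prompt, 'output': text[len(prompt):].strip()})

    # Each replica writes its own output file.
    with open(out_path, 'w') as f:
        f.write('\n'.join(json.dumps(r) for r in results))

if __name__ == '__main__':
    mp.set_start_method('spawn')  # avoid forking once CUDA may be involved

    prompts = ["The King is dead. Long live the Queen."] * 5000  # same txt_list as above
    half = len(prompts) // 2
    jobs = [
        ([0, 1, 2, 3], prompts[:half], 'out_replica0.jsonl'),
        ([4, 5, 6, 7], prompts[half:], 'out_replica1.jsonl'),
    ]
    procs = [mp.Process(target=run_replica, args=job) for job in jobs]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

Even then I suspect each replica would still execute its layers sequentially across its 4 cards, so this might only roughly double throughput rather than keep all 8 GPUs busy, which is why I am asking whether there is a better way.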

Thank you!

