Runpod Deployment Troubleshooting
Yeah, you can get it to work by reloading the pod a few times; each reload resumes the download from the last shard, and eventually all of them will load.
I'm working on pushing 8-bit and 4-bit models to the hub, which will reduce the download size (and time) and maybe side-step the issue. I've already updated the Runpod template to download the 8-bit weights. I'm testing that now and will get it working on Monday.
Ok, great, thank you. I already tried restarting the different pods about 20 times yesterday but never got past shard number 6. Will try again.
I tried it again, and it seems it now used the 8-bit branch with only 5 shards. The download completed, but I only get an empty response from the API:
{"generated_text":""}
Server log says:
generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(200), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None } total_time="15.697707697s" validation_time="465.588µs" queue_time="86.82µs" inference_time="15.697155549s" time_per_token="78.485777ms" seed="None"}: text_generation_router::server: router/src/server.rs:289: Success
With generate_stream I get:
data:{"token":{"id":0,"text":"","logprob":null,"special":true},"generated_text":null,"details":null}
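To reproduce the symptom above, this is roughly the request I'd send to the pod. A minimal sketch: the `/generate` route and `inputs`/`parameters` payload shape follow TGI's API, but the pod URL and the helper names here are my own placeholders.

```python
import json

def build_generate_payload(prompt: str, max_new_tokens: int = 200) -> dict:
    """Build the JSON body for TGI's /generate route."""
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }

def is_empty_generation(response_body: str) -> bool:
    """True if the server answered but produced no text (the symptom above)."""
    return json.loads(response_body).get("generated_text", "") == ""

# POST the payload to http://<your-pod-url>/generate with
# Content-Type: application/json, then pass the body to is_empty_generation.
```

Checking `is_empty_generation` on the logged response `{"generated_text":""}` confirms the server is succeeding (note the `Success` line in the router log) while producing zero tokens, which points at the prompt rather than the deployment.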
Hi folks, some guidance here.
Best Current Approach (use the main branch with --quantize eetq):
- I've just set the pod to download the 16-bit weights from the main branch.
- There are often issues downloading the weights: the download gets stuck at various points and requires you to click the three-line menu and then "Restart Pod". Typically I need to restart 3-4 times to get all of the weights downloaded. After the download, it can take 10-15 minutes for the shards to load onto the GPU (at least on an A6000).
- The Runpod template is the one on the main model card.
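Given the restart-and-wait loop described above, a small script that polls the pod until the shards finish loading saves some manual checking. A minimal sketch, assuming the pod exposes TGI's `/health` route; the base URL and timeouts are placeholders to adjust.

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(base_url: str, timeout_s: int = 1800, interval_s: int = 30) -> bool:
    """Poll GET {base_url}/health until it returns 200; False on timeout."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=10) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not up yet: still downloading or loading shards.
            pass
        time.sleep(interval_s)
    return False
```

If this times out, that's the cue to hit "Restart Pod" and run it again rather than watching the logs.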
Work in Progress #A
- Ideally, rather than downloading the full 16-bit weights, we would download 8-bit or 4-bit (nf4) weights.
- However, there is a bug preventing 8-bit and 4-bit weights from being pushed to the hub. I have opened issues (4bit, 8bit) and will write back here when I have updates.
Work in Progress #B
- I don't know the root cause of the weight download getting stuck, but I see the same issue with the raw Mixtral model as well. I have posted an issue about it on TGI.
I have been able to get past the stuck model weights, but once the model deploys successfully, generated_text is still empty. Is there a fix for the empty response after a successful deployment?
Howdy Reed. I've just tested again and the template on the model card is working. For example - using the ADVANCED inference repo - I'm getting:
user: What clothes should I wear? I am in Dublin
function_call: {
"name": "get_current_weather",
"arguments": {
"city": "Dublin"
}
}
function_response: {
"temperature": "18 C",
"condition": "Partly Cloudy"
}
assistant: You should wear a sweater
This matches the YouTube video about Mixtral. I also ran a test with no functions and it ran fine (a speed test, as per the YouTube video).
Are you using apply_chat_template? The prompt formatting is crucial.
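To illustrate why the formatting matters: if the prompt doesn't carry the instruct wrapping the model was trained on, the model can emit an immediate end-of-sequence token, which shows up as exactly the empty generated_text above. This hand-rolls Mixtral-Instruct's [INST] format as a rough sketch; in practice, use tokenizer.apply_chat_template from transformers, which applies the template shipped with the model rather than anything hand-written like this.

```python
def format_mixtral_prompt(messages: list[dict]) -> str:
    """Wrap alternating user/assistant turns in Mixtral's [INST] format.

    Illustrative only; the authoritative template lives in the model's
    tokenizer config and is applied by apply_chat_template.
    """
    out = "<s>"
    for m in messages:
        if m["role"] == "user":
            out += f"[INST] {m['content']} [/INST]"
        elif m["role"] == "assistant":
            out += f" {m['content']}</s>"
    return out
```

Sending the raw user text instead of the wrapped prompt is the most common way to end up with an empty response from an otherwise healthy deployment.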
P.S. I'm working on making an AWQ template now that should be quicker to download.
Ok, the AWQ one-click Runpod template is now on the model card. This is now the recommended way to run inference. The model is about 25 GB (instead of ~100 GB), so it will be quicker to download.
I'm therefore closing this issue.
The downloading bug with TGI remains open on their GitHub here.
If you face new issues, just create a new issue - and please provide enough details that I can replicate.
This is working for me now, thanks!