Runpod Deployment Troubleshooting
Yeah, you can get it to work by reloading the pod a few times; each reload resumes the download from the last shard, and eventually all of them will load.
I'm working on pushing 8-bit and 4-bit models to the hub, which will reduce the download size (and time) and maybe side-step the issue. I've already updated the Runpod template to download the 8-bit weights. I'm testing that now and will get it working on Monday.
Ok, great, thank you. I already tried restarting the different pods about 20 times yesterday but never got past shard number 6. Will try again.
I tried it again, and it seems it now used the 8-bit branch with only 5 shards. The download completed, but I only get an empty response from the API:
{"generated_text":""}
Server log says:
generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(200), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None } total_time="15.697707697s" validation_time="465.588µs" queue_time="86.82µs" inference_time="15.697155549s" time_per_token="78.485777ms" seed="None"}: text_generation_router::server: router/src/server.rs:289: Success
With generate_stream I get:
data:{"token":{"id":0,"text":"","logprob":null,"special":true},"generated_text":null,"details":null}
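To reproduce the symptom above, this is roughly the request I'd send to the pod. A minimal sketch: the `/generate` route and `inputs`/`parameters` payload shape follow TGI's API, but the pod URL and the helper names here are my own placeholders.

```python
import json

def build_generate_payload(prompt: str, max_new_tokens: int = 200) -> dict:
    """Build the JSON body for TGI's /generate route."""
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }

def is_empty_generation(response_body: str) -> bool:
    """True if the server answered but produced no text (the symptom above)."""
    return json.loads(response_body).get("generated_text", "") == ""

# POST the payload to http://<your-pod-url>/generate with
# Content-Type: application/json, then pass the body to is_empty_generation.
```

Checking `is_empty_generation` on the logged response `{"generated_text":""}` confirms the server is succeeding (note the `Success` line in the router log) while producing zero tokens, which points at the prompt rather than the deployment.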
Hi folks, some guidance here.
Best Current Approach (use the main branch with --quantize eetq):
- I've just set the pod to download the 16-bit weights from the main branch.
- There are often issues downloading the weights: the download gets stuck at various points and requires you to click the three-line menu and then "Restart Pod". Typically I need to restart 3-4 times to get all of the weights downloaded. After the download, it can take 10-15 minutes for the shards to load onto the GPU (at least on an A6000).
- The Runpod template is the one on the main model card.
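Given the restart-and-wait loop described above, a small script that polls the pod until the shards finish loading saves some manual checking. A minimal sketch, assuming the pod exposes TGI's `/health` route; the base URL and timeouts are placeholders to adjust.

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(base_url: str, timeout_s: int = 1800, interval_s: int = 30) -> bool:
    """Poll GET {base_url}/health until it returns 200; False on timeout."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=10) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not up yet: still downloading or loading shards.
            pass
        time.sleep(interval_s)
    return False
```

If this times out, that's the cue to hit "Restart Pod" and run it again rather than watching the logs.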
Work in Progress #A
- Ideally, rather than downloading the full 16-bit weights, we would download 8-bit or 4-bit (nf4) weights.
- However, there is a bug preventing 8-bit and 4-bit weights from being pushed to the hub. I have opened issues (4bit, 8bit) and will write back here when I have updates.
Work in Progress #B
- I don't know the root cause of the weight download getting stuck, but I see the same issue with the raw Mixtral model as well. I have posted an issue about it on TGI.
I have been able to get past the stuck model weights, but once the model deploys successfully, generated_text is still empty. Is there a fix for the empty response after a successful deployment?
Howdy Reed. I've just tested again and the template on the model card is working. For example - using the ADVANCED inference repo - I'm getting:
user: What clothes should I wear? I am in Dublin
function_call: {
"name": "get_current_weather",
"arguments": {
"city": "Dublin"
}
}
function_response: {
"temperature": "18 C",
"condition": "Partly Cloudy"
}
assistant: You should wear a sweater
This matches the YouTube video about Mixtral. I also ran a test with no functions and it ran fine (a speed test, as per the YouTube video).
Are you using apply_chat_template? The prompt formatting is crucial.
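To illustrate why the formatting matters: if the prompt doesn't carry the instruct wrapping the model was trained on, the model can emit an immediate end-of-sequence token, which shows up as exactly the empty generated_text above. This hand-rolls Mixtral-Instruct's [INST] format as a rough sketch; in practice, use tokenizer.apply_chat_template from transformers, which applies the template shipped with the model rather than anything hand-written like this.

```python
def format_mixtral_prompt(messages: list[dict]) -> str:
    """Wrap alternating user/assistant turns in Mixtral's [INST] format.

    Illustrative only; the authoritative template lives in the model's
    tokenizer config and is applied by apply_chat_template.
    """
    out = "<s>"
    for m in messages:
        if m["role"] == "user":
            out += f"[INST] {m['content']} [/INST]"
        elif m["role"] == "assistant":
            out += f" {m['content']}</s>"
    return out
```

Sending the raw user text instead of the wrapped prompt is the most common way to end up with an empty response from an otherwise healthy deployment.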
P.S. I'm working on making an AWQ template now that should be quicker to download.
Ok, the AWQ one-click Runpod template is now on the model card. This is now the recommended way to run inference. The model is about 25 GB (instead of ~100 GB), so it will be quicker to download.
I'm therefore closing this issue.
The downloading bug with TGI remains open on their GitHub here.
If you face new issues, just create a new issue - and please provide enough details that I can replicate.
This is working for me now, thanks!