gguf

#24
by LaferriereJC - opened

(textgen) [root@pve-m7330 user]# python text-generation-webui/llama.cpp/convert.py text-generation-webui/models/Phi-3-mini-128k-instruct/
Loading model file text-generation-webui/models/Phi-3-mini-128k-instruct/model-00001-of-00002.safetensors
Loading model file text-generation-webui/models/Phi-3-mini-128k-instruct/model-00001-of-00002.safetensors
Loading model file text-generation-webui/models/Phi-3-mini-128k-instruct/model-00002-of-00002.safetensors
Traceback (most recent call last):
File "/home/user/text-generation-webui/llama.cpp/convert.py", line 1483, in
main()
File "/home/user/text-generation-webui/llama.cpp/convert.py", line 1430, in main
params = Params.load(model_plus)
File "/home/user/text-generation-webui/llama.cpp/convert.py", line 317, in load
params = Params.loadHFTransformerJson(model_plus.model, hf_config_path)
File "/home/user/text-generation-webui/llama.cpp/convert.py", line 229, in loadHFTransformerJson
raise NotImplementedError(f'Unknown rope scaling type: {typ}')
NotImplementedError: Unknown rope scaling type: su
(textgen) [root@pve-m7330 user]#

Did you try convert-hf-to-gguf.py? convert.py is a specialized script and has never worked for any Phi models.

Microsoft org

microsoft/Phi-3-mini-128k-instruct does not support llama.cpp due to its rope scaling type.
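
For context, the "su" in that error comes from the rope_scaling block in this model's config.json. A quick sketch to see it locally (the path is a placeholder for wherever you downloaded the model):

import json

# Placeholder path: point this at your local copy of the model.
with open("Phi-3-mini-128k-instruct/config.json") as f:
    cfg = json.load(f)

# The 128k variant declares rope_scaling with "type": "su",
# which convert.py does not recognize, hence the NotImplementedError.
print(cfg["rope_scaling"]["type"])
print(list(cfg["rope_scaling"].keys()))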

@caiomms that's not true, because someone else did it: https://huggingface.co/PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed

Does this version actually work? I tried to import several GGUF files from two different people, and in both cases ollama, although able to import them, was unable to run them.

@BigDeeper gosh, I hope that isn't the case. I just spent the last 18 hours fine-tuning this version with plans of making it into a GGUF.

I hope your version works too. Maybe I and everyone else can use it.

microsoft/Phi-3-mini-128k-instruct does not support llama.cpp due to its rope scaling type.

The latest version of llama.cpp seems to be able to load it at least (the Q8_0), but it gives some gibberish, hopefully just because I didn't prompt it correctly.

I don't know why, but I can run this model (pjh64/Phi-3-mini-128K-Instruct.gguf/phi-3-mini-128K-Instruct_q8_0.gguf) using llama.cpp/main (I still need to figure out the correct prompting; the stuff here must be wrong). But when I convert the same model with the latest ollama (built today), I get "Error: llama runner process no longer running: -1", even though ollama serve is still running.

I tried the same thing as the original poster. Same result. Not sure how others got the conversion done.

(Pythogora) developer@ai:/mnt$ python ~/llama.cpp/convert.py ./Phi-3-mini-128k-instruct --outtype q8_0
Loading model file Phi-3-mini-128k-instruct/model-00001-of-00002.safetensors
Loading model file Phi-3-mini-128k-instruct/model-00001-of-00002.safetensors
Loading model file Phi-3-mini-128k-instruct/model-00002-of-00002.safetensors
Traceback (most recent call last):
File "/home/developer/llama.cpp/convert.py", line 1555, in
main()
File "/home/developer/llama.cpp/convert.py", line 1498, in main
params = Params.load(model_plus)
File "/home/developer/llama.cpp/convert.py", line 328, in load
params = Params.loadHFTransformerJson(model_plus.model, hf_config_path)
File "/home/developer/llama.cpp/convert.py", line 237, in loadHFTransformerJson
raise NotImplementedError(f'Unknown rope scaling type: {typ}')
NotImplementedError: Unknown rope scaling type: su
(Pythogora) developer@ai:/mnt$ python ~/llama.cpp/convert.py ./Phi-3-mini-128k-instruct --outtype q8_0

The default instruction template is wrong. It places <|endoftext|> where it should place <|assistant|>. That is why the gibberish occurs.
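
For anyone still seeing gibberish, the Phi-3 instruct format should, as far as I can tell, look roughly like this (an optional <|system|> turn goes first in the same pattern):

<|user|>
Your question here<|end|>
<|assistant|>

The model then ends its own turn with <|end|>.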

Same error as the OP with the latest b2731:

Loading model file M:\Storage\LLMs\Microsoft-Phi-3-Mini-128k-Instruct\model-00001-of-00002.safetensors
Loading model file M:\Storage\LLMs\Microsoft-Phi-3-Mini-128k-Instruct\model-00001-of-00002.safetensors
Loading model file M:\Storage\LLMs\Microsoft-Phi-3-Mini-128k-Instruct\model-00002-of-00002.safetensors
Traceback (most recent call last):
File "M:\Storage\Softwares and drivers\To Add\Programming & Dev Tools\LLM-Tools\llama.cpp-b2731\llama.cpp\convert.py", line 1555, in
main()
File "M:\Storage\Softwares and drivers\To Add\Programming & Dev Tools\LLM-Tools\llama.cpp-b2731\llama.cpp\convert.py", line 1498, in main
params = Params.load(model_plus)
^^^^^^^^^^^^^^^^^^^^^^^
File "M:\Storage\Softwares and drivers\To Add\Programming & Dev Tools\LLM-Tools\llama.cpp-b2731\llama.cpp\convert.py", line 328, in load
params = Params.loadHFTransformerJson(model_plus.model, hf_config_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "M:\Storage\Softwares and drivers\To Add\Programming & Dev Tools\LLM-Tools\llama.cpp-b2731\llama.cpp\convert.py", line 237, in loadHFTransformerJson
raise NotImplementedError(f'Unknown rope scaling type: {typ}')
NotImplementedError: Unknown rope scaling type: su

Anyone found a solution?

Obviously, some people were able to create GGUF files, so they found a workaround. You may still have a problem running it. I am running it using ollama, but I had to reduce the context to 60K, and there is some strange problem with ollama loading multiple duplicate weights into my GPUs.
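
For reference, the Modelfile I'm importing with looks roughly like this; the GGUF filename is just an example, and the template is my best guess at the Phi-3 format:

FROM ./phi-3-mini-128k-instruct-q8_0.gguf
PARAMETER num_ctx 60000
PARAMETER stop "<|end|>"
TEMPLATE """<|user|>
{{ .Prompt }}<|end|>
<|assistant|>
"""

Then ollama create phi3-128k -f Modelfile and ollama run phi3-128k.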

Solved

Thanks @aberrio for suggesting the convert-hf-to-gguf.py script!

The following worked with no issues:

python .\convert-hf-to-gguf.py .\Microsoft-Phi-3-Mini-128k-Instruct --outtype f32 --outfile MS-Phi-3-mini-128k-Instruct-F32.bin

I converted to an FP32 .bin to prevent quality loss, as the model uses BF16.

From there quantize as normal:

quantize .\MS-Phi-3-mini-128k-Instruct-F32.bin .\MS-Phi-3-mini-128k-Instruct.Q8_0.gguf Q8_0
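
From there, a quick smoke test with llama.cpp's main should look roughly like this (standard main flags; the prompt follows the Phi-3 chat format, and -e makes main expand the \n escapes):

main -m .\MS-Phi-3-mini-128k-Instruct.Q8_0.gguf -c 4096 -n 256 -e -p "<|user|>\nWhy is the sky blue?<|end|>\n<|assistant|>\n"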

Well, you can convert the formats alright, but the resulting gguf files are somehow defective. I cannot use ollama to import them and then run them. There is an error about the context. Some people produced gguf files that I can run by reducing the context size substantially. I just tried creating Q8_0 and Q6_K quants, and I cannot run either of them.

I don't know I've been using it a bit and it honestly seems great so far

Which specific flavor are you using?

Do you use llama.cpp/main or llama.cpp/server?

llama.cpp/server

Does ollama have a cap on its context size, or is it my machine? I also have to use a context below 60,000 for ollama not to crash. Curious what context size you are running?

Well, for the one I got working before, I did a binary search and found 60,000 to be the highest context that worked. For the two quants I made myself, I tested several pretty low numbers and none of them worked.
For the quants that I did, ollama serve complained about not recognizing "phi3" as the architecture. That's odd. Why does the other one work at all? It is also phi3.

Yeah, 128,000 context was out of reach for me too. Running a modest RTX 3090 and 32 GB of system RAM, I got an "out of memory" error when trying to load it with -c 128000, lol. 70,000 seems doable though.
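
Back-of-envelope, assuming Phi-3-mini's published dims (32 layers, hidden size 3072, no GQA) and an fp16 KV cache:

2 (K and V) x 32 layers x 3072 x 2 bytes ≈ 384 KiB per token
128,000 tokens x 384 KiB ≈ 47 GiB for the KV cache alone
70,000 tokens x 384 KiB ≈ 26 GiB

which would roughly explain why ~70K is about the ceiling on a 3090 plus 32 GB of system RAM.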

Can someone at Microsoft publish an official GGUF version of this model, so the llama.cpp community can use it as an official version for comparison with other quantizations/conversions?

If it matters, I tried to fine-tune a model using SFT from Hugging Face on the 128k version. When I tried to combine that LoRA with the base model, it told me the tensors were off by 20. Also, when you go to fine-tune these models, it says that you need to allow custom remote code or something.

Microsoft may have made this MIT open source, but they did a bunch of funny things that aren’t letting any of us fine-tune it or convert it properly to GGUF.

I’m calling on whoever’s reading this at Microsoft to sort this out. Otherwise, sticking MIT on this is silly, because none of us can alter it.

The fact that it is asking you to set a flag to allow remote code to run is fairly normal. When I first saw this, it also made me back out. The model depends on some custom code hosted on HF, and thus needs the flag to run.
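
For what it's worth, the flag in question is just trust_remote_code in transformers. A minimal sketch (nothing assumed here beyond the published model id):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-128k-instruct"

# trust_remote_code=True lets the repo's custom modeling code run locally,
# which is exactly what that warning is asking you to allow.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)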

@BigDeeper can you explain why I trained a LoRA like I always do, but it took 24 hours on an L4? And then, when I tried to merge it with the base Phi-3 128K, it told me the tensors were off by 20 (for example, 64220 vs 64260) and it couldn’t merge. I was crushed.

It might be one of the universe's mysteries, like why it expands at different rates in different directions. :-) I don't want to make light of the fact you lost time and money (probably), it would piss off anyone.

When loading in ollama, I'm getting "llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'phi3'".
Am I the only one?

Microsoft org

I know that llama.cpp and llama-cpp-python already support Phi-3 on their main branches.

Still haven’t tested with Ollama, but will update here soon.

@gugarosa it appears to support Phi-3, but there is no longrope implementation in llama.cpp. You can see the relevant comments starting around here in GitHub issue #6849.

It sounds like the way to get 128k supported in llama.cpp is to implement LongRoPE. If anyone on your team has C++ expertise, they could accelerate that.

python convert-hf-to-gguf.py Phi-3-mini-128k-instruct --outtype f16 --outfile MS-Phi-3-mini-128k-Instruct-F32.bin
Using this, I got the issue below:
Loading model: Phi-3-mini-128k-instruct
Traceback (most recent call last):
File "/home/himanshu/Desktop/office/llamaCpp/llama.cpp/convert-hf-to-gguf.py", line 1354, in
main()
File "/home/himanshu/Desktop/office/llamaCpp/llama.cpp/convert-hf-to-gguf.py", line 1335, in main
model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian)
File "/home/himanshu/Desktop/office/llamaCpp/llama.cpp/convert-hf-to-gguf.py", line 57, in init
self.model_arch = self._get_model_architecture()
File "/home/himanshu/Desktop/office/llamaCpp/llama.cpp/convert-hf-to-gguf.py", line 254, in _get_model_architecture
raise NotImplementedError(f'Architecture "{arch}" not supported!')
NotImplementedError: Architecture "Phi3ForCausalLM" not supported!

@himka420 You have an issue with llama.cpp. This is the model page; it is not the right place to ask for help with llama.cpp. Phi3ForCausalLM was added to llama.cpp, so please update to the latest version.
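
If it helps, updating is just a pull and rebuild (standard llama.cpp steps; adjust if you use CMake or a GPU build):

cd llama.cpp
git pull
make clean && make

And if you go through the Python bindings instead, pip install -U llama-cpp-python should pull in a build that includes the new architecture.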

nguyenbh changed discussion status to closed
