Nomic-ai-embedding fine tuning with SentenceTransformersFinetuneEngine

#18
by Miheer29 - opened

Hi

I'm trying to fine-tune Nomic-ai-embedding using SentenceTransformersFinetuneEngine and am running into an issue:

from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,  # Dataset to be trained on
    model_id="nomic-ai/nomic-embed-text-v1.5",  # HuggingFace reference to base embeddings model
    model_output_path="llama_model_v1",  # Output directory for fine-tuned embeddings model
    val_dataset=test_dataset,  # Dataset to validate on
    epochs=2,  # Number of epochs to train for
)

Error:
image.png

Nomic AI org

I would reach out to the SentenceTransformers maintainers, as I don't have deep knowledge of what's going on there.

zpn changed discussion status to closed
Nomic AI org
edited Apr 25

Hello!

I'm afraid this isn't conveniently possible right now: the SentenceTransformer instance that the engine creates must be initialized with trust_remote_code=True, because the model has to pull its modeling code from Hugging Face. I would recommend opening an issue in LlamaIndex for it.

That said, I think you should be able to solve your problem. First, download the model to a local directory. Then download the two Python files named in the auto_map entries below (configuration_hf_nomic_bert.py and modeling_hf_nomic_bert.py) and place them in that same directory.

Then, you must update your local config.json to no longer say:

  "auto_map": {
    "AutoConfig": "nomic-ai/nomic-embed-text-v1--configuration_hf_nomic_bert.NomicBertConfig",
    "AutoModel": "nomic-ai/nomic-embed-text-v1--modeling_hf_nomic_bert.NomicBertModel",
    "AutoModelForMaskedLM": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForPreTraining"
  },

but instead to say:

  "auto_map": {
    "AutoConfig": "configuration_hf_nomic_bert.NomicBertConfig",
    "AutoModel": "modeling_hf_nomic_bert.NomicBertModel"
  },

Now these files are local, and we don't need to download them from Hugging Face. As a result, you should now be able to initialize the SentenceTransformersFinetuneEngine with the path to your local directory. It should then no longer complain about the lack of trust_remote_code=True.
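The config.json edit described above can also be scripted. This is only a minimal sketch under stated assumptions: localize_auto_map is a hypothetical helper name, and it assumes the model has already been downloaded to a local directory (for example with git clone) that contains config.json. It keeps only the AutoConfig and AutoModel entries and strips the "repo--" prefix from each, which matches the before/after shown above.

```python
import json
from pathlib import Path


def localize_auto_map(model_dir):
    """Rewrite auto_map in <model_dir>/config.json to point at local modules.

    Hypothetical helper, not part of any library. An entry such as
    "org/repo--module.Class" becomes "module.Class"; entries other than
    AutoConfig and AutoModel are dropped, as in the edit described above.
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))

    auto_map = config.get("auto_map", {})
    local_map = {}
    for key in ("AutoConfig", "AutoModel"):
        if key in auto_map:
            # Keep only the part after the "repo--" prefix, if there is one.
            local_map[key] = auto_map[key].split("--")[-1]

    config["auto_map"] = local_map
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return local_map
```

Calling localize_auto_map on the local model directory once, after copying the two Python files into it, should leave the rest of config.json unchanged.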

@Miheer29

  • Tom Aarsen

Thank you, Tom!

Do I need just the model tensors and config.json, or would I need to clone the entire repo?

Nomic AI org

You should probably just clone the entire repo

thank you!

Also, how do I use the model with SentenceTransformersFinetuneEngine? It only has a model_id parameter, so there is no way to pass the actual model.

Would you recommend cloning the repo, making the changes, and uploading the model to Hugging Face? If so, would I need to make any other changes to the files?

Nomic AI org

How do I use the model with SentenceTransformersFinetuneEngine?

model_id can also be a path to a local model; you should use that instead.

And no, I wouldn't upload it to Hugging Face for this, because then it still has to pull code from Hugging Face and it'll still need trust_remote_code=True.
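Before passing the local path as model_id, it may help to check that the directory is actually self-contained. The sketch below is an illustration only: find_local_model_problems is a hypothetical helper, the directory name in the usage comment is made up, and the two .py file names are inferred from the auto_map entries quoted earlier in this thread rather than confirmed against the repository.

```python
import json
from pathlib import Path

# Assumption: these are the files the local directory needs so that no
# code has to be fetched from Hugging Face (names inferred from auto_map).
REQUIRED_FILES = (
    "config.json",
    "configuration_hf_nomic_bert.py",
    "modeling_hf_nomic_bert.py",
)


def find_local_model_problems(model_dir):
    """Return a list of human-readable problems; an empty list means none found."""
    model_dir = Path(model_dir)
    problems = [
        f"missing file: {name}"
        for name in REQUIRED_FILES
        if not (model_dir / name).is_file()
    ]

    config_path = model_dir / "config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        for key, target in config.get("auto_map", {}).items():
            # A "repo--module.Class" value still refers to a remote repository.
            if isinstance(target, str) and "--" in target:
                problems.append(f"auto_map[{key!r}] still points at a remote repo: {target}")
    return problems


# Hypothetical usage, reusing the arguments from the first post:
#
# local_dir = "nomic-embed-text-v1.5-local"  # made-up directory name
# if not find_local_model_problems(local_dir):
#     finetune_engine = SentenceTransformersFinetuneEngine(
#         train_dataset,
#         model_id=local_dir,  # local path instead of the Hub ID
#         model_output_path="llama_model_v1",
#         val_dataset=test_dataset,
#         epochs=2,
#     )
```

An empty result only means these particular checks passed; it does not guarantee that the engine will load the model.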


Hi @tomaarsen, is there anything else I can do to solve my issue?
