Fine-tuning CLIP model for image-image search

#9
by AFRF - opened

Hi all, I've been working on image-image search tasks and CLIP has worked really well for me. Now I want to push the performance of my approach further, and I was thinking of fine-tuning the CLIP model for this task.

My current pipeline is simple (a sketch follows below): I generate embeddings for all my images, store them in a vector index, and then compute the cosine similarity between the embedding of my search image and all the embeddings in the index. I'm not doing any zero-shot classification or image-text comparison. However, all the fine-tuning approaches for CLIP that I've read about use image-text pairs, so I don't understand how I should fine-tune the model to improve my application. Should I use image-text pairs anyway? Or should I fine-tune only the visual encoder, and if that's the case, does anyone have examples of how to do it?
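For reference, here is a minimal sketch of the retrieval setup I described above, assuming the `openai/clip-vit-base-patch32` checkpoint via Hugging Face `transformers`; the image paths and the in-memory "index" are just placeholders for illustration:

```python
# Sketch of the current pipeline: embed images with CLIP's vision tower,
# keep the embeddings in a simple in-memory index, and rank by cosine similarity.
# Model name and image paths are illustrative placeholders.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(model_name).eval()
processor = CLIPProcessor.from_pretrained(model_name)

@torch.no_grad()
def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    # Unit-normalize so a dot product equals cosine similarity
    return F.normalize(feats, dim=-1)

# Build the "vector index" (here just a tensor; in practice FAISS or similar)
gallery_paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]  # placeholders
gallery = embed_images(gallery_paths)

# Query: cosine similarity of the search image against every indexed embedding
query = embed_images(["query.jpg"])  # placeholder path
scores = query @ gallery.T           # shape (1, N) of cosine similarities
ranking = scores.squeeze(0).argsort(descending=True)
for idx in ranking:
    print(gallery_paths[idx], float(scores[0, idx]))
```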
