Error converting Salesforce/blip-image-captioning-base

#15
by 99s42m - opened

Hello,

I am very new to Hugging Face and machine learning in general. I understand that the BLIP model is not supported for conversion to Core ML. Is there a way I can write my own conversion code?

Thanks

Conversion Settings:

        Model: Salesforce/blip-image-captioning-base
        Task: None
        Framework: None
        Compute Units: None
        Precision: None
        Tolerance: None
        Push to: None

        Error: "blip is not supported yet. Only ['bart', 'beit', 'bert', 'big_bird', 'bigbird_pegasus', 'blenderbot', 'blenderbot_small', 'bloom', 'convnext', 'ctrl', 'cvt', 'data2vec', 'distilbert', 'ernie', 'gpt2', 'gpt_neo', 'levit', 'm2m_100', 'marian', 'mobilebert', 'mobilevit', 'mvp', 'pegasus', 'plbart', 'roberta', 'roformer', 'segformer', 'splinter', 'squeezebert', 't5', 'vit', 'yolos'] are supported. If you want to support blip please propose a PR or open up an issue."


        
Core ML Projects org

Hello @99s42m!

Thanks for reporting this! We'll take a look and see if we can add support for BLIP soon. In the meantime, you could try using coremltools directly. coremltools is a Python package created by Apple that can convert PyTorch and TensorFlow models to Core ML. This conversion Space is based on exporters, which in turn uses coremltools under the hood.
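For reference, the basic flow with coremltools looks roughly like the snippet below. This is a toy module just to illustrate the API, not BLIP itself; adapting it to a real model takes more care around input shapes and generation.

import torch
import coremltools as ct

# A tiny stand-in model so the example is self-contained.
toy = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example = torch.rand(1, 3, 224, 224)

# Trace with a concrete example input, then hand the traced module to coremltools.
traced = torch.jit.trace(toy, example)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=example.shape)],
    convert_to="mlprogram",
)
mlmodel.save("toy.mlpackage")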

@pcuenq Thank you so much for your response.

Here's where I have gotten so far:

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

img_url = 'https://images.nationalgeographic.org/image/upload/t_edhub_resource_key_image/v1638882947/EducationHub/photos/tourists-at-victoria-falls.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "The main geographical feature in this photo is a"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))

import coremltools as ct
import torch

example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, inputs['input_ids'])
out = traced_model(example_input)

The above code throws the following error:
RuntimeError: Input type (long int) and bias type (float) should be the same

I understand that you are busy and this might be a basic question, but any help would be greatly appreciated.
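For anyone who runs into the same RuntimeError: it most likely comes from torch.jit.trace(model, inputs['input_ids']). The first positional argument of the BLIP model's forward is pixel_values, so the integer token IDs end up flowing into the vision encoder's convolution, which expects a float tensor. One way forward is to trace only the vision encoder with a proper image tensor. The sketch below continues from the snippet above (so model and inputs are already defined); the VisionWrapper name is just illustrative.

import torch
import coremltools as ct

# Wrap the vision encoder so tracing sees a float image input and
# returns a plain tensor instead of a ModelOutput object.
class VisionWrapper(torch.nn.Module):
    def __init__(self, vision_model):
        super().__init__()
        self.vision_model = vision_model

    def forward(self, pixel_values):
        return self.vision_model(pixel_values).last_hidden_state

wrapper = VisionWrapper(model.vision_model).eval()

# Reuse the preprocessed image so the example input has the right dtype
# (float) and the right spatial size for this checkpoint.
example_pixel_values = inputs["pixel_values"]

with torch.no_grad():
    traced = torch.jit.trace(wrapper, example_pixel_values)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="pixel_values", shape=example_pixel_values.shape)],
    convert_to="mlprogram",
)
mlmodel.save("blip_vision_encoder.mlpackage")

Note that this only exports the image encoder; the text decoder and the generate loop still need to be handled separately, which is the part exporters would normally take care of.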
