blog: https://huggingface.co/blog/paligemma

#2
by NickyNicky - opened

Thanks for the model. I am following the steps in the blog and have completed some of them, but when I run the training step it gives me the following:

[screenshots of the error attached]

I'm sharing the Colab link:
https://colab.research.google.com/drive/1eSJoBGOO0_oulB5gfXqkhtIqiLngKBwy?usp=sharing

I would also like to know whether it is also necessary to set:
model.hidden_activation = "gelu_pytorch_tanh"
It asks me for this in a warning message.

[screenshot of the warning attached]

Is it also possible to use flash-attn?
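Something like this is what I mean (just a sketch; it assumes flash-attn is installed and that this checkpoint supports the flag):

import torch
from transformers import PaliGemmaForConditionalGeneration

# sketch: FlashAttention-2 requires the weights in fp16/bf16
model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-224",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)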

I also wanted to know if it is compatible with the library:
from trl import SFTTrainer

thank you so much.

Hello, SFTTrainer is just a wrapper around the Trainer so I think it should work, although it has some features on top like NEFTune which I don't know if they would work. About the Gemma warnings, you can ignore them. For the index error let me check, I wrote that part and ran it a ton of times so it shouldn't have happened 😅

Google org

In the meanwhile, let me give you my training script that works for sure, while I figure out which line I missed when I moved it to the blog:

from datasets import load_dataset
from transformers import AutoTokenizer, PaliGemmaForConditionalGeneration, PaliGemmaProcessor
import torch
import os
from PIL import Image
from transformers import TrainingArguments, Trainer

def collate_fn(examples):
  # prefix + question, then a newline, then the answer as the target suffix
  texts = ["answer " + example["question"] + "\n" + example['multiple_choice_answer'] for example in examples]
  images = [example["image"].convert("RGB") for example in examples]

  tokens = processor(text=texts, images=images,
                    return_tensors="pt", padding="longest",
                    tokenize_newline_separately=False)

  labels = tokens["input_ids"].clone()

  # ignore padding (and token id 256000) when computing the loss
  labels[labels == processor.tokenizer.pad_token_id] = -100
  labels[labels == 256000] = -100
  tokens["labels"] = labels

  # cast floating-point tensors to the model dtype and move everything to the GPU
  tokens = tokens.to(DTYPE).to("cuda")
  return tokens

ds = load_dataset('HuggingFaceM4/VQAv2', split="train")

ds_remove = ["question_type", "answer_type", "answers", "image_id", "question_id"]
ds = ds.remove_columns(ds_remove)

model_id = "google/paligemma-3b-pt-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16) 
processor = PaliGemmaProcessor.from_pretrained(model_id)
print("initialized processor")

DTYPE = model.dtype

for param in model.vision_tower.parameters():
    param.requires_grad = False

# todo: try again with projector unfrozen
for param in model.multi_modal_projector.parameters():
    param.requires_grad = False

ds = ds.train_test_split(test_size=0.1)
train_ds = ds["train"]
val_ds = ds["test"]


args=TrainingArguments(
            num_train_epochs=2,
            remove_unused_columns=False,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            learning_rate=2e-5,
            weight_decay=1e-6,
            adam_beta2=0.999,
            logging_steps=100,
            output_dir="./output10",
            optim="adamw_hf",
            save_strategy="steps",
            save_steps=1000,
            #optim="paged_adamw_8bit",
            push_to_hub=True,
            save_total_limit=1,
            bf16=True,
            report_to=["tensorboard"],
            dataloader_pin_memory=False
        )

trainer = Trainer(
        model=model,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        data_collator=collate_fn,
        args=args
        )
print("initialized trainer")
print("Current device:", trainer.model.device)

trainer.train()

trainer.push_to_hub()
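
For reference, a minimal inference sketch for the resulting checkpoint (the image path and question are placeholders; if the processor was not saved with the checkpoint, load it from the base model_id instead):

import torch
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

# sketch: load the fine-tuned checkpoint and ask a single question
model = PaliGemmaForConditionalGeneration.from_pretrained(
    "./output10", torch_dtype=torch.bfloat16
).to("cuda")
processor = PaliGemmaProcessor.from_pretrained("./output10")

image = Image.open("my_image.jpg").convert("RGB")   # placeholder image
prompt = "answer What is in the image?"             # same "answer ..." prefix as training

inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = inputs.to(model.dtype).to("cuda")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=20)

# drop the prompt (and image) tokens before decoding
generated = output[0][inputs["input_ids"].shape[1]:]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))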

@NickyNicky in the blog post I forgot to pass remove_unused_columns=False, hence the error 🤦‍♀️ Unrelated, but when using a data collator we also need to pass dataloader_pin_memory=False (relevant when loading data from CPU to GPU).
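
Concretely, just the two flags in question (a sketch; the remaining arguments are the same as in the full script above):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./output10",
    remove_unused_columns=False,   # keep the raw "image"/"question" columns for the collator
    dataloader_pin_memory=False,   # the collator already moves batches to the GPU
)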

Don't worry, we all make mistakes. Thank you very much for the prompt response, I'm going to try the code.

I also have another question: wasn't this model trained with a template?

How does the model know where the beginning and end of a response are without those tokens, or which tokens were used for this model?
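
One way to inspect which special tokens the tokenizer actually defines (a sketch, using the processor from the script above):

# sketch: print the special tokens PaliGemma's tokenizer knows about
print(processor.tokenizer.special_tokens_map)
print(processor.tokenizer.bos_token, processor.tokenizer.eos_token, processor.tokenizer.pad_token)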

Your code:

def collate_fn(examples):
  texts = ["answer " + example["question"] + "\n" + example['multiple_choice_answer'] for example in examples]
  images = [example["image"].convert("RGB") for example in examples]

  tokens = processor(text=texts, images=images,
                    return_tensors="pt", padding="longest",
                    tokenize_newline_separately=False)

  labels = tokens["input_ids"].clone()#.squeeze()

  labels[labels == processor.tokenizer.pad_token_id] = -100
  labels[labels == 256000] = -100
  tokens["labels"] = labels

  tokens = tokens.to(DTYPE).to("cuda")
  return tokens

I added this code but I don't know if it's right.

device = "cuda"

image_token = processor.tokenizer.convert_tokens_to_ids("<image>")
def collate_fn(examples):
  
  # texts = ["answer " + example["question"] + "\n" + example['multiple_choice_answer'] for example in examples]
  # prompt= template.replace("{text_user}",example["question"]).replace("{text_user}",example['multiple_choice_answer'])
  template= """<bos><start_of_turn>system\nyou are a useful AI.<end_of_turn>\n<start_of_turn>user\n{text_user}<end_of_turn>\n<start_of_turn>model\n{text_model}<end_of_turn><eos>"""
  texts = [template.replace("{text_user}",example["question"]).replace("{text_model}",example['multiple_choice_answer']) for example in examples]
  images = [example["image"].convert("RGB") for example in examples]
  tokens = processor(text=texts, images=images,
                    return_tensors="pt", padding="longest",
                    tokenize_newline_separately=False)
  labels = tokens["input_ids"].clone()
  labels[labels == processor.tokenizer.pad_token_id] = -100
  labels[labels == image_token] = -100
  tokens["labels"] = labels
  tokens = tokens.to(torch.bfloat16).to(device)
  return tokens

The new collate_fn code:

template= """<bos><start_of_turn>system\nyou are a useful AI.<end_of_turn>\n<start_of_turn>user\n{text_user}<end_of_turn>\n<start_of_turn>model\n{text_model}<end_of_turn><eos>"""
texts = [template.replace("{text_user}",example["question"]).replace("{text_model}",example['multiple_choice_answer']) for example in examples]

Hello, SFTTrainer is just a wrapper around the Trainer so I think it should work, although it has some features on top like NEFTune which I don't know if they would work. About the Gemma warnings, you can ignore them. For the index error let me check, I wrote that part and ran it a ton of times so it shouldn't have happened 😅

They can be used without problems: neftune_noise_alpha=10, AdaLoRA, and LoftQ.

@NickyNicky this is not really a conversational/multi-turn model, it's a single-turn model, and the newline is what conditions the model to generate the response here; that's also why the newline tokenization flag is needed during fine-tuning but not at inference. An eos token could maybe be added, but not heavy chat templates.
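
In other words, a minimal sketch of that single-turn format (the question/answer strings are only examples):

def build_text(question: str, answer: str, eos_token: str = "<eos>") -> str:
    # task prefix, then a newline that conditions the answer, then the answer
    prefix = "answer " + question
    return prefix + "\n" + answer + eos_token

# fine-tuning target
print(build_text("What color is the car?", "red"))
# at inference only the prefix is passed: "answer What color is the car?"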

thank you so much.

close.

NickyNicky changed discussion status to closed

Hello @merve ,

I ran your code and fine-tuned Paligemma, but the output model is behaving strangely and replying with more questions. Here is the demo space: https://huggingface.co/spaces/taesiri/sample-paligemma-finetuned.

I am getting this warning when loading the model:

The tokenizer class you load from this checkpoint is 'LlamaTokenizer'. 
The class this function is called from is 'GemmaTokenizerFast'.

Are we sure that the training dataset format, tokenizer, and other configurations are set correctly? How can I debug this? Many thanks. 🤗🤗
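
For instance, one sanity check might be to decode a collated batch and look at what the model is actually trained on (a sketch, reusing collate_fn, train_ds and processor from the script above):

batch = collate_fn([train_ds[0], train_ds[1]])

# full input as the model sees it (includes the <image> placeholder tokens)
print(processor.tokenizer.decode(batch["input_ids"][0]))

# only the positions that contribute to the loss
label_ids = batch["labels"][0]
print(processor.tokenizer.decode(label_ids[label_ids != -100]))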

@NickyNicky this is not really a conversational/multi-turn model, it's a single-turn model, and the newline is what conditions the model to generate the response here; that's also why the newline tokenization flag is needed during fine-tuning but not at inference. An eos token could maybe be added, but not heavy chat templates.

Can you please add documentation on this, and on how tokens are managed for training without depending on the Trainer wrapper?

Google org

Hello, we have made a few changes which also include API changes around preprocessing for finetuning, you can refer to this notebook: https://colab.research.google.com/drive/1x_OEphRK0H97DqqxEyiMewqsTiLD_Xmi?usp=sharing
