IndexError while using with language "japanese" (transcribe)

#16
by KKotaki - opened

An error occurs with the following code in rare cases.
Please let me know how to fix it.

The shape of the input was no different from that of other audio data that transcribes fine.

code:

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

device = 'cuda'
model_path = 'openai/whisper-medium'
sample_rate = 16_000

model = WhisperForConditionalGeneration.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path, language="Japanese", task="transcribe")

model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="ja", task="transcribe")
model.config.suppress_tokens = []
model.to(device)

inputs = processor.feature_extractor(
            audio_data,
            return_tensors="pt",
            sampling_rate=sample_rate,
).input_features.to(device)
print(inputs.shape)
# torch.Size([1, 80, 3000])

predicted_ids = model.generate(
            inputs,
            max_length=sample_rate * 30,
            forced_decoder_ids=model.config.forced_decoder_ids,
)
# error occurred!!
# I don't think forced_decoder_ids=... should be necessary, but without it the
# detected language is "fi" for some reason, so I specify it explicitly.

error:

Traceback (most recent call last):
  File "/xxx/infer.py", line 68, in transcribe
    predicted_ids = self._model.generate(
  File "/xxx/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/xxx/lib/python3.9/site-packages/transformers/generation/utils.py", line 1391, in generate
    return self.greedy_search(
  File "/xxx/lib/python3.9/site-packages/transformers/generation/utils.py", line 2189, in greedy_search
    next_token_logits = outputs.logits[:, -1, :]
IndexError: index -1 is out of bounds for dimension 1 with size 0
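
For context on what the traceback means: `greedy_search` slices out the logits for the last generated position with `outputs.logits[:, -1, :]`, and the `IndexError` says the sequence dimension of those logits has size 0, i.e. the model returned logits for zero positions. A stdlib-only stand-in for that failing slice (hypothetical, just to show the mechanism):

```python
# per_step_logits stands in for outputs.logits along dimension 1;
# indexing [-1] on a size-0 dimension fails the same way.
per_step_logits = []
try:
    per_step_logits[-1]
except IndexError as exc:
    print(type(exc).__name__)  # IndexError
```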

Hey @KKotaki ! Thanks for reporting this! It looks like there's an issue with the generation code. Could you open an issue in HF Transformers, sharing the code, audio file and full error trace? We'll then be able to discuss the issue with you and propose a fix 🤗 Thank you! Issue link here: https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.yml

Taking a closer look at your code, I notice that you have max_length set to 30 * 16000. It's worth noting that max_length corresponds to the maximum number of generated text tokens (not the number of audio samples), so you can set this to ~256.
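
To make the units concrete, here is a quick sanity check using the values from the snippet above (the 448 figure is Whisper's decoder positional limit, `max_target_positions`):

```python
sampling_rate = 16_000              # audio samples per second
audio_samples = 30 * sampling_rate  # what max_length was mistakenly set to
print(audio_samples)                # 480000

# max_length counts generated *text tokens*, not audio samples.
# Whisper's decoder supports at most 448 target positions, so a
# max_length of around 256 is more than enough for a 30 s segment.
max_length = 256
assert max_length <= 448
```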

See https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.max_length

With the latest version of transformers (4.29), you can omit forced_decoder_ids and pass language="japanese" directly to generate.

If these two suggestions do not work, I'd recommend opening an issue as described above!

@sanchit-gandhi
Thank you for your response!
I will try your suggestions and get back to you on the points you raised.
