
Latency observed in Embedding computation

#4
by RajaRamKankipati - opened

Hi Team,

I'm implementing MPNet for long documents that have more than 512 tokens, using the following approach:

  1. Get all the tokens from the tokenizer without truncation.
  2. Split the tokens into chunks of 512 tokens.
  3. Pass the chunks to the model as a single batch.

# Tokenize the full document without truncation
encoded_input = tokenizer(
    document,
    max_length=None,
    padding=True,
    truncation=False,
    return_tensors="pt",
).to(device)

# Split the tokenized document into 512-token chunks stacked as a batch (custom helper)
encoded_input = pre_processing_encoded_input(encoded_input, size=512)

# Compute token embeddings
with torch.no_grad():
    model_output = self.model(**encoded_input)

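pre_processing_encoded_input is not shown in the post; purely as an assumption about what it does (reshaping one long tokenized sequence into a batch of 512-token chunks), a minimal sketch of such a helper could look like this:

import torch

def pre_processing_encoded_input(encoded_input, size=512):
    # Hypothetical helper: split a single tokenized document of shape (1, seq_len)
    # into a batch of fixed-size chunks of shape (n_chunks, size).
    input_ids = encoded_input["input_ids"][0]
    attention_mask = encoded_input["attention_mask"][0]

    id_chunks, mask_chunks = [], []
    for start in range(0, input_ids.size(0), size):
        ids = input_ids[start:start + size]
        mask = attention_mask[start:start + size]
        pad_len = size - ids.size(0)
        if pad_len > 0:
            # Pad the last chunk (0 is illustrative; tokenizer.pad_token_id is the proper
            # pad id, though padded positions are masked out by the attention mask anyway).
            ids = torch.nn.functional.pad(ids, (0, pad_len), value=0)
            mask = torch.nn.functional.pad(mask, (0, pad_len), value=0)
        id_chunks.append(ids)
        mask_chunks.append(mask)

    return {
        "input_ids": torch.stack(id_chunks),        # (n_chunks, size)
        "attention_mask": torch.stack(mask_chunks), # (n_chunks, size)
    }

Note that a naive split like this does not re-insert special tokens at the chunk boundaries, and the whole batch of chunks goes through the model in one forward pass.
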
With a simple encoded_input of 512 tokens, the model takes around 230 ms to compute the embedding; with an input of shape (2, 512) it takes around 2000 ms, and the latency grows exponentially from there. Is there any way I can achieve low latency with this model for long documents?
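
On the timing itself, CUDA kernels launch asynchronously, so wall-clock measurements taken around the forward pass can be misleading unless the device is synchronized first. A small sketch of how the latency could be measured, assuming a CUDA device and the model/encoded_input from the snippet above:

import time
import torch

with torch.no_grad():
    torch.cuda.synchronize()   # make sure previously queued work has finished
    start = time.perf_counter()
    model_output = model(**encoded_input)
    torch.cuda.synchronize()   # wait for the forward pass itself to complete
    elapsed_ms = (time.perf_counter() - start) * 1000
print(f"forward pass: {elapsed_ms:.1f} ms")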
