Expected a one dimensional embeddings vector, got a multi-dimensional value

#11
by pmishra - opened

I am trying to get embedding vectors for an array of strings. I expected the output for one string to have dimension 1024, but I got an array of shape (1, n, 1024), where n varies between the different strings in the array. Can someone explain this behaviour?

WhereIsAI org

Which way did you use it? Could you attach your code here?

I guess you used the transformers way. The n represents the padded sequence length.

Sure. I used the Hugging Face inference API for the model embeddings. Here `const embedding` is the HfInference instance:
Screenshot 2024-01-14 at 10.17.01 AM.png

WhereIsAI org

@pmishra hi, the obtained embeddings with shape (1, n, 1024) are all tokens' embeddings.

You can use the first token's (i.e., the CLS token) embedding as the sentence embedding, as follows:

`vecs = val[:, 0, :]`
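To make the slicing concrete, here is a minimal sketch using NumPy. The `(1, 7, 1024)` array stands in for a real model output (7 is an arbitrary token count for illustration); the `[:, 0, :]` slice keeps only the first token's embedding for each item in the batch:

```python
import numpy as np

# Stand-in for the model output: all token embeddings for one input,
# shape (batch=1, n tokens, hidden size 1024). n varies with input length.
val = np.random.rand(1, 7, 1024)

# CLS pooling: keep only the first token's embedding as the sentence vector.
vecs = val[:, 0, :]           # shape (1, 1024)
sentence_embedding = vecs[0]  # shape (1024,) — one 1024-dim vector per string

print(sentence_embedding.shape)
```

Applied per string, this turns the varying `(1, n, 1024)` outputs into fixed-size 1024-dim vectors.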

Thank you for the response!
Is there a difference in the extent of information captured by the [CLS] token versus the rest of the sentence tokens? Will this be enough to carry out vector search?
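For the vector-search part: once each string is reduced to a single 1024-dim vector, a search is typically a cosine-similarity comparison against the indexed vectors. A minimal sketch (random vectors stand in for real CLS embeddings; `cosine_sim` is a hypothetical helper, not part of any library here):

```python
import numpy as np

# Stand-ins for CLS-pooled sentence embeddings (1024-dim each).
rng = np.random.default_rng(0)
query = rng.standard_normal(1024)
corpus = rng.standard_normal((3, 1024))  # three indexed sentences

def cosine_sim(a, b):
    """Cosine similarity between a vector and each row of a matrix."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return b @ a

scores = cosine_sim(query, corpus)  # shape (3,), one score per sentence
best = int(np.argmax(scores))       # index of the most similar sentence
print(best, scores.shape)
```

Whether the CLS embedding alone is a good search key depends on how the model was trained to pool, so it is worth checking the model card's recommended pooling strategy.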
