Tom Aarsen

tomaarsen

AI & ML interests

NLP: text embeddings, named entity recognition, few-shot text classification

tomaarsen's activity

replied to their post 3 days ago

Here's some evidence that these models work well across domains:
[attached image: benchmark results across domains]

posted an update 3 days ago
NuMind has just released 3 new state-of-the-art GLiNER models for Named Entity Recognition/Information Extraction. These GLiNER models allow you to specify any label that you want, and it'll find spans in the text corresponding to your label. It's been shown to work quite well on unusual domains, e.g. celestial entities in my picture.

There are 3 models released:
- numind/NuNER_Zero:
The primary model: state-of-the-art & able to detect arbitrarily long entities.
- numind/NuNER_Zero-span:
Slightly better performance than NuNER Zero, but can't detect entities longer than 12 tokens.
- numind/NuNER_Zero-4k:
Slightly worse than NuNER Zero, but has a context length of 4k tokens.

Some more details about these models in general:
- They are *really* small, orders of magnitude smaller than LLMs, which don't reach this level of performance.
- Because they're small, they're fast: <1s per sentence on free GPUs.
- They have an MIT license: free commercial usage.
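
Usage is straightforward; here's a minimal sketch with the gliner package (the text, labels & threshold are purely illustrative):

```python
# pip install gliner
from gliner import GLiNER

# Load the primary model
model = GLiNER.from_pretrained("numind/NuNER_Zero")

# Free-form labels; NuNER Zero expects lower-cased labels
labels = ["celestial entity", "distance"]
text = "The Andromeda Galaxy lies about 2.5 million light-years from the Milky Way."

# Find spans in the text that match the given labels
entities = model.predict_entities(text, labels, threshold=0.5)
for entity in entities:
    print(entity["text"], "=>", entity["label"])
```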

Try out the demo here: numind/NuZero
Or check out all of the models here: numind/nunerzero-zero-shot-ner-662b59803b9b438ff56e49e2

If I ever need to extract information from some text, I'll be using these. Great work @Serega6678 !
posted an update 10 days ago
I've just stumbled upon some excellent work on (🇫🇷 French) retrieval models by @antoinelouis . Kudos to him!

- French Embedding Models: https://huggingface.co/collections/antoinelouis/dense-single-vector-bi-encoders-651523c0c75a3d4c44fc864d
- French Reranker Models: antoinelouis/cross-encoder-rerankers-651523f16efa656d1788a239
- French Multi-vector Models: https://huggingface.co/collections/antoinelouis/dense-multi-vector-bi-encoders-6589a8ee6b17c06872e9f075
- Multilingual Models: https://huggingface.co/collections/antoinelouis/modular-retrievers-65d53d0db64b1d644aea620c

A lot of these models use the MS MARCO Hard Negatives dataset, which I'm currently reformatting to be more easily usable. Notably, the reformatted datasets should work out of the box, without any pre-processing, for training embedding models in the upcoming Sentence Transformers v3.
replied to albertvillanova's post 14 days ago

Oooh, Dataset.take should be very convenient. No more .select(range(...)) 🚀
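
If I understand the new API correctly, that's roughly:

```python
from datasets import load_dataset

ds = load_dataset("imdb", split="train")

subset = ds.select(range(1000))  # the old way
subset = ds.take(1000)           # the new way: same first 1000 rows
```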

replied to Sentdex's post 14 days ago

I'm concerned about the low training speed (10x slower). Do we know anything about the inference latency as well? I think that's key to figure out whether this is viable or not.

replied to fdaudens's post 21 days ago

Thanks for writing out this list! I try my best to keep up, but even I missed some of these.

replied to bwang0911's post 27 days ago

I quite enjoy the speed of these, well done.

replied to beomi's post 28 days ago

Nice job! What are your findings so far? Can you reasonably handle the lengths that they claim?

posted an update 28 days ago
🚀 Sentence Transformers v2.7.0 is out! Featuring a new loss function, easier Matryoshka model inference & evaluation, CrossEncoder improvements & Intel Gaudi2 Accelerator support. Details:

1️⃣ A new loss function: CachedGISTEmbedLoss
This loss function is a combination of CachedMultipleNegativesRankingLoss and the GISTEmbedLoss, both of which are already excellent. The caching mechanism allows for much higher batch sizes with constant memory usage, which boosts training performance. The GIST part introduces a guide model to guide the in-batch negative sample selection. This prevents false negatives, resulting in a stronger training signal.
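
Setting it up looks roughly like this (the model choices are just examples):

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("microsoft/mpnet-base")
# A small, strong embedding model guides the in-batch negative selection
guide = SentenceTransformer("all-MiniLM-L6-v2")

# mini_batch_size bounds memory usage; the actual training batch size
# (and thus the number of in-batch negatives) can be far larger
train_loss = losses.CachedGISTEmbedLoss(model, guide, mini_batch_size=32)
```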

2️⃣ Automatic Matryoshka model truncation
Matryoshka models produce embeddings that are still useful after truncation. However, this truncation always had to be done manually, until now! We've added a truncate_dim option to the SentenceTransformer constructor. This also allows truncation when using HuggingFaceEmbeddings from LlamaIndex or LangChain.
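
In code, that looks roughly like this (model name purely as an example):

```python
from sentence_transformers import SentenceTransformer

# Embeddings are automatically truncated to their first 128 dimensions
model = SentenceTransformer("all-MiniLM-L6-v2", truncate_dim=128)

embeddings = model.encode(["The weather is lovely today.", "It's sunny outside."])
print(embeddings.shape)  # (2, 128) instead of (2, 384)
```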

3️⃣ Additionally, you can now specify truncate_dim in evaluators to get the performance after truncation. (Hint: it's surprisingly good, even for models not trained with MatryoshkaLoss, and it can speed up e.g. clustering, retrieval, etc.)
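
For instance, a sketch with EmbeddingSimilarityEvaluator (dataset & dimension chosen as examples):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("all-MiniLM-L6-v2")

stsb = load_dataset("sentence-transformers/stsb", split="validation")
evaluator = EmbeddingSimilarityEvaluator(
    stsb["sentence1"],
    stsb["sentence2"],
    stsb["score"],
    truncate_dim=128,  # evaluate as if embeddings had only 128 dimensions
)
print(evaluator(model))
```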

4️⃣ CrossEncoder improvements
The CrossEncoder now supports push_to_hub to upload trained reranker models to the Hugging Face Hub. Additionally, CrossEncoder now supports trust_remote_code to load models with custom modelling code.
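
Roughly (the repository name is hypothetical):

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/stsb-distilroberta-base")
print(model.predict([("A man is eating food.", "A man eats something.")]))

# After training or finetuning a reranker, upload it to the Hub
# ("my-username/my-reranker" is a hypothetical repository)
model.push_to_hub("my-username/my-reranker")

# And for models with custom modelling code:
# model = CrossEncoder("some/custom-reranker", trust_remote_code=True)
```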

5️⃣ Inference on Intel Gaudi2
If you have an Intel Gaudi2 Accelerator, Sentence Transformers now uses it automatically for even faster inference. No changes are necessary to your code, the device is automatically detected!

Check out the release notes for all of the details: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.7.0

I'm very excited for the upcoming releases: I'm making great progress with a notable v3 refactor that should heavily improve the training process for embedding models!
replied to jamarks's post about 1 month ago

Awesome! I reckon this'll make it a lot easier to quickly share, save & load some annotation work.

replied to louisbrulenaudet's post about 1 month ago

Very glad to see more uses of embedding quantization, great job.

replied to trisfromgoogle's post about 1 month ago

RecurrentGemma is very intriguing to me. I'm looking forward to reading more about the RNN-based models when I have some more spare time.

replied to urchade's post about 1 month ago
replied to MoritzLaurer's post about 1 month ago

Looking forward to your blogpost! It's always exciting to see solid non-generative models.

posted an update about 2 months ago
🏅 Quantized Embeddings are here! Unlike model quantization, embedding quantization is a post-processing step for embeddings that converts e.g. float32 embeddings to binary or int8 embeddings. This saves 32x or 4x memory & disk space, and these embeddings are much easier to compare!

Our results show 25-45x speedups in retrieval compared to full-size embeddings, while keeping 96% of the performance!
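
The post-processing step itself is a single function call; a rough sketch (model name just as an example):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
embeddings = model.encode([
    "Who invented the telephone?",
    "The telephone was invented by Alexander Graham Bell.",
])

# float32 -> binary saves 32x; float32 -> int8 saves 4x
# (int8 ideally uses a larger calibration set to compute the ranges)
binary_embeddings = quantize_embeddings(embeddings, precision="binary")
int8_embeddings = quantize_embeddings(embeddings, precision="int8")
```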

Learn more about it in our blogpost in collaboration with mixedbread.ai: https://huggingface.co/blog/embedding-quantization
Or try out our demo where we use quantized embeddings to let you search all of Wikipedia (yes, 41,000,000 texts) in 1 second on a CPU Space: sentence-transformers/quantized-retrieval
posted an update 2 months ago
🎉 Today, the 5000th Sentence Transformer model was uploaded to Hugging Face! Embedding models are extremely versatile, so it's no wonder that they're still being trained.

Here are a few resources to get you started with them, plus a minimal usage sketch after the list:
- All Sentence Transformer models: https://huggingface.co/models?library=sentence-transformers&sort=trending
- Sentence Transformer documentation: https://sbert.net/
- Massive Text Embedding Benchmark (MTEB) Leaderboard: mteb/leaderboard
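
And using any of them takes just a few lines:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)

# Pairwise cosine similarities: the first two sentences score highest
print(util.cos_sim(embeddings, embeddings))
```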

The embedding space is extremely active right now, so if you're using an embedding model for retrieval, semantic similarity, reranking, classification, clustering, etc., then be sure to keep an eye on the trending Sentence Transformer models & new entries on MTEB.

Also, I'm curious if you've ever used Sentence Transformers via a third-party library, like a RAG framework or vector database. I'm quite interested in more integrations to bring everyone free, efficient & powerful embedding models!
replied to giux78's post 2 months ago
posted an update 2 months ago
I remember very well that about two years ago, 0-shot named entity recognition (i.e. where you can choose any labels on the fly) was completely infeasible. Fast forward a year, and Universal-NER/UniNER-7B-all surprised me by showing that 0-shot NER is possible! However, I had a bunch of concerns that prevented me from ever adopting it myself. For example, the model was 7B parameters, only worked with 1 custom label at a time, and it had a cc-by-nc-4.0 license.

Since then, a little-known research paper introduced GLiNER, a modified & finetuned variant of the microsoft/deberta-v3-base line of models. Notably, GLiNER outperforms UniNER-7B despite being almost two orders of magnitude smaller! It also allows for multiple labels at once, supports nested NER, and the models are Apache 2.0 licensed.

Very recently, the models were uploaded to Hugging Face, and I was inspired to create a demo for the English model. The demo runs on CPU, and it still computes labels efficiently and with great quality. I'm very impressed by these models.

There are two models right now:
* base (english): urchade/gliner_base
* multi (multilingual): urchade/gliner_multi

And my demo to experiment with the base model can be found here: https://huggingface.co/spaces/tomaarsen/gliner_base
replied to urchade's post 2 months ago
replied to their post 2 months ago

I've had the same idea before! I think this should work too, but I haven't had time to do the research myself. Perhaps @SeanLee97 is interested in trying this out?

posted an update 3 months ago
🤗 Sentence Transformers v2.4.0 for embedding models is now out! It introduces a lot of powerful features, such as:

1. Matryoshka Loss function - you can now train & perform inference on 🪆 Matryoshka Embedding models. See also our blogpost: https://huggingface.co/blog/matryoshka
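
A sketch of wrapping an existing loss (dimensions chosen as an example):

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("microsoft/mpnet-base")
base_loss = losses.MultipleNegativesRankingLoss(model)

# The same embedding is supervised at several nested dimensionalities
train_loss = losses.MatryoshkaLoss(
    model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64]
)
```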

2. CoSENTLoss & AnglELoss: state-of-the-art loss functions. These are quite interesting: as drop-in replacements, they outperform CosineSimilarityLoss on nearly all benchmarks! See also the docs: https://sbert.net/docs/package_reference/losses.html#cosentloss
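
As a drop-in replacement, switching is a one-line change:

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("microsoft/mpnet-base")
# Before: train_loss = losses.CosineSimilarityLoss(model)
train_loss = losses.CoSENTLoss(model)
```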

3. Prompt templates: Many popular models, such as intfloat/multilingual-e5-large and BAAI/bge-large-en-v1.5, prefix their texts with prompts. This release adds configuration options to include prompts automatically: model.encode(..., prompt_name="query") prepends the prompt configured under the name "query". More info in the docs: https://sbert.net/examples/applications/computing-embeddings/README.html#prompt-templates
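
For example (the prompt strings follow the E5 convention):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "intfloat/multilingual-e5-large",
    prompts={"query": "query: ", "passage": "passage: "},
)

# "query: " is prepended automatically before encoding
query_embeddings = model.encode(["What is the capital of France?"], prompt_name="query")
```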

4. Instructor support: Support for the INSTRUCTOR line of models, such as hkunlp/instructor-large. Learn how to use them here: https://sbert.net/docs/pretrained_models.html#instructor-models
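
Roughly (the instruction string is just an illustration):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("hkunlp/instructor-large")
embeddings = model.encode(
    ["Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity"],
    prompt="Represent the Science title: ",
)
print(embeddings.shape)
```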

5. Removed NLTK & sentencepiece dependencies: Should allow for a smaller installation & a slightly faster import!

6. Updated documentation: a new Loss Overview section: https://sbert.net/docs/training/loss_overview.html and more detailed loss functions: https://sbert.net/docs/package_reference/losses.html

And much more! See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.4.0

Some more very exciting updates are still on the horizon!
replied to Ali-C137's post 3 months ago

I've been working hard to get my HF Inbox down, but now my emails have started overflowing 🙃

replied to stas's post 3 months ago
replied to lbourdois's post 4 months ago

I did not expect so many datasets to have such notable issues! Very interesting, thanks for sharing.
I would also be interested in the data quality bot that you describe at the end; I think that would be quite useful.

replied to their post 4 months ago
posted an update 4 months ago
Sentence Transformers v2.3.0 has been released! It includes several bug fixes, enhanced model loading (including custom models & no more unnecessary file downloads), improved performance, a powerful new loss function, and much more!

Details:
⬆ Uploading Models to the Hub with save_to_hub.
⬇ Downloading Models from the Hub now downloads only necessary files.
⚙ Custom Models (such as jinaai/jina-embeddings-v2-base-de) can now be loaded with trust_remote_code=True.
🔍 Models can now be loaded at specific revisions (e.g. commit hashes or git branches).
🖥️ Various device fixes; models will now always operate on the device that you specify.
📉 A new "Cached" variant of the powerful Multiple Negatives Ranking Loss allows common hardware to reach performance previously only accessible on multi-GPU clusters; see the sketch after this list.
🐎 Computation time of Community Detection was decreased significantly (7x speedup at 500k sentences 🤯).
🪶 Removed the now unnecessary "torchvision" dependency for a smaller installation.
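
A rough sketch combining a few of these:

```python
from sentence_transformers import SentenceTransformer, losses

# Custom modelling code + a pinned revision (branch, tag, or commit hash)
model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-de",
    trust_remote_code=True,
    revision="main",
)

# Cached MNRL: large effective batch sizes with constant memory usage
train_loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)
```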

Check out the full changelog here: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.3.0

I'll be working on many more changes in the near future, so expect more exciting updates. If you encounter any issues, or have any questions or feature requests, don't hesitate to open an issue on the repository: https://github.com/UKPLab/sentence-transformers/issues