Tom Aarsen

tomaarsen

https://linkedin.com/in/tomaarsen

tomaarsen

AI & ML interests

NLP: text embeddings, named entity recognition, few-shot text classification

Articles

Blazing Fast SetFit Inference with 🤗 Optimum Intel on Xeon

Apr 3

• 6

Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval

Mar 22

• 34

Organizations

tomaarsen's activity

upvoted 2 articles 1 day ago

Article

PaliGemma – Google's Cutting-Edge Open Vision Language Model

3 days ago

• 79

Article

2024-04-22 - Hub Incident Post Mortem

1 day ago

• 11

upvoted 3 collections 1 day ago

MS MARCO Mined Triplets

Collection

These datasets contain MS MARCO Triplets gathered by mining hard negatives using various models. Each dataset has various subsets. • 14 items • Updated 1 day ago • 1

Parallel Sentences Datasets

Collection

These datasets all have "english" and "non_english" columns for numerous datasets. They can be used to make embedding models multilingual. • 10 items • Updated 1 day ago • 1

Embedding Model Datasets

Collection

A curated subset of the datasets that work out of the box with Sentence Transformers: https://huggingface.co/datasets?other=sentence-transformers • 49 items • Updated 1 day ago • 10

upvoted a paper 1 day ago

Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training

Paper • 2405.06932 • Published 5 days ago • 14

upvoted an article 2 days ago

Article

Hugging Face x LangChain : A new partner package in LangChain

3 days ago

• 44

upvoted a collection 3 days ago

NuNerZero - Zero Shot NER

Collection

The best compact Zero-Shot NER models with MIT license • 4 items • Updated 6 days ago • 11

upvoted an article 6 days ago

Article

Train Custom Models on Hugging Face Spaces with AutoTrain SpaceRunner

7 days ago

• 5

upvoted 3 articles 9 days ago

Article

⚗️ 🧑🏼‍🌾 Let's grow some Domain Specific Datasets together

17 days ago

• 25

Article

🧑‍⚖️ "Replacing Judges with Juries" using distilabel

13 days ago

• 14

Article

Synthetic data: save money, time and carbon with open source

Feb 16

• 21

upvoted 2 papers 10 days ago

What matters when building vision-language models?

Paper • 2405.02246 • Published 13 days ago • 70

Model Merging by Uncertainty-Based Gradient Matching

Paper • 2310.12808 • Published Oct 19, 2023 • 6

upvoted an article 10 days ago

Article

Jack of All Trades, Master of Some, a Multi-Purpose Transformer Agent

25 days ago

• 71

upvoted 2 collections 10 days ago

🇫🇷 Cross-encoder rerankers

Collection

A collection of cross-encoder reranking models in French. • 31 items • Updated 10 days ago • 2

🇫🇷 Single-vector dense bi-encoders

Collection

A collection of single-vector dense representation models in French. • 15 items • Updated 6 days ago • 2

upvoted a collection 11 days ago

Llama3-ChatQA-1.5

Collection

Llama3-ChatQA-1.5 models excel at conversational question answering (QA) and retrieval-augmented generation (RAG). • 6 items • Updated 13 days ago • 34

upvoted an article 16 days ago

Article

StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation

18 days ago

• 68

upvoted an article 20 days ago

Article

🦙⚗️ Using Llama3 and distilabel to build fine-tuning datasets

20 days ago

• 54

upvoted a collection 22 days ago

Arctic

Collection

A collection of pre-trained dense-MoE Hybrid transformer models • 2 items • Updated 22 days ago • 18

upvoted a paper 23 days ago

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Paper • 2404.14219 • Published 24 days ago • 230

upvoted a collection 23 days ago

Phi-3

Collection

Phi-3 family of models • 6 items • Updated 3 days ago • 196

upvoted a paper 23 days ago

LongEmbed: Extending Embedding Models for Long Context Retrieval

Paper • 2404.12096 • Published 28 days ago • 2

upvoted an article 28 days ago

Article

Welcome Llama 3 - Meta's new open LLM

29 days ago

• 238

upvoted a collection 29 days ago

Arctic-embed

Collection

A collection of text embedding models optimized for retrieval accuracy and efficiency • 5 items • Updated 29 days ago • 10

upvoted 3 articles about 1 month ago

Article

Mergoo: Efficiently Build Your Own MoE LLM

9 days ago

• 32

Article

Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval

Mar 22

• 34

Article

Hugging Face partners with Wiz Research to Improve AI Security

Apr 4

• 10

upvoted a collection about 1 month ago

Vector-io compatible Datasets

Collection

These datasets can be loaded into your vector database with a single line bash command • 14 items • Updated Mar 29 • 3

upvoted a paper 2 months ago

GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer

Paper • 2311.08526 • Published Nov 14, 2023 • 7

upvoted 4 papers 3 months ago

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Paper • 2402.17764 • Published Feb 27 • 566

2D Matryoshka Sentence Embeddings

Paper • 2402.14776 • Published Feb 22 • 5

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

Paper • 2402.13753 • Published Feb 21 • 104

BioLORD-2023: Semantic Textual Representations Fusing LLM and Clinical Knowledge Graph Insights

Paper • 2311.16075 • Published Nov 27, 2023 • 5

upvoted a collection 4 months ago

Canonical models

Collection

This collection lists all the historical (pre-"Hub") canonical model checkpoints, i.e. repos that were not under an org or user namespace • 68 items • Updated Feb 13 • 13

upvoted a paper 4 months ago

Improving Text Embeddings with Large Language Models

Paper • 2401.00368 • Published Dec 31, 2023 • 72

upvoted a collection 4 months ago

Zeroshot Classifiers

Collection

These are my current best zeroshot classifiers. Some of my older models are downloaded more often, but the models in this collection are newer/better. • 11 items • Updated Apr 3 • 76

upvoted 2 papers 5 months ago

Language Resources for Dutch Large Language Modelling

Paper • 2312.12852 • Published Dec 20, 2023 • 9

NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval

Paper • 2310.14282 • Published Oct 22, 2023 • 5

upvoted 2 papers 6 months ago

Developing a Named Entity Recognition Dataset for Tagalog

Paper • 2311.07161 • Published Nov 13, 2023 • 2

GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity Extraction Focused on Machine Learning Models and Datasets

Paper • 2311.09860 • Published Nov 16, 2023 • 5

upvoted a collection 6 months ago

State-of-the-Art NER models - General purpose

Collection

5 items • Updated Feb 27 • 3

upvoted a collection 7 months ago

NER in Spanish

Collection

Fine-tuned models to perform NER in Spanish using the framework SpanMarker and different encoders and datasets • 3 items • Updated 9 days ago • 4

upvoted a paper 7 months ago

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Paper • 2305.18290 • Published May 29, 2023 • 37

upvoted 2 papers 8 months ago

Effective Long-Context Scaling of Foundation Models

Paper • 2309.16039 • Published Sep 27, 2023 • 28

Qwen Technical Report

Paper • 2309.16609 • Published Sep 28, 2023 • 30

Articles

Blazing Fast SetFit Inference with 🤗 Optimum Intel on Xeon

Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval

🪆 Introduction to Matryoshka Embedding Models

SetFitABSA: Few-Shot Aspect Based Sentiment Analysis using SetFit

🕳️ Attention Sinks in LLMs for endless fluency
