MS MARCO Mined Triplets Collection These datasets contain MS MARCO Triplets gathered by mining hard negatives using various models. Each dataset has various subsets. • 14 items • Updated 1 day ago • 1
Parallel Sentences Datasets Collection Each of these datasets has "english" and "non_english" columns. They can be used to make embedding models multilingual. • 10 items • Updated 1 day ago • 1
Embedding Model Datasets Collection A curated subset of the datasets that work out of the box with Sentence Transformers: https://huggingface.co/datasets?other=sentence-transformers • 49 items • Updated 1 day ago • 10
Piccolo2: General Text Embedding with Multi-task Hybrid Loss Training Paper • 2405.06932 • Published 5 days ago • 14
NuNerZero - Zero Shot NER Collection The best compact Zero-Shot NER models with MIT license • 4 items • Updated 6 days ago • 11
Article Train Custom Models on Hugging Face Spaces with AutoTrain SpaceRunner By abhishek • 7 days ago • 5
Article ⚗️ 🧑🏼‍🌾 Let's grow some Domain Specific Datasets together By burtenshaw • 17 days ago • 25
Article 🧑‍⚖️ "Replacing Judges with Juries" using distilabel By alvarobartt • 13 days ago • 14
Article Jack of All Trades, Master of Some, a Multi-Purpose Transformer Agent • 25 days ago • 71
🇫🇷 Cross-encoder rerankers Collection A collection of cross-encoder reranking models in French. • 31 items • Updated 10 days ago • 2
🇫🇷 Single-vector dense bi-encoders Collection A collection of single-vector dense representation models in French. • 15 items • Updated 6 days ago • 2
Llama3-ChatQA-1.5 Collection Llama3-ChatQA-1.5 models excel at conversational question answering (QA) and retrieval-augmented generation (RAG). • 6 items • Updated 13 days ago • 34
Article StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation • 18 days ago • 68
Article 🦙⚗️ Using Llama3 and distilabel to build fine-tuning datasets By dvilasuero • 20 days ago • 54
Arctic Collection A collection of pre-trained dense-MoE Hybrid transformer models • 2 items • Updated 22 days ago • 18
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published 24 days ago • 230
LongEmbed: Extending Embedding Models for Long Context Retrieval Paper • 2404.12096 • Published 28 days ago • 2
Arctic-embed Collection A collection of text embedding models optimized for retrieval accuracy and efficiency • 5 items • Updated 29 days ago • 10
Article Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval • Mar 22 • 34
Vector-io compatible Datasets Collection These datasets can be loaded into your vector database with a single-line Bash command • 14 items • Updated Mar 29 • 3
GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer Paper • 2311.08526 • Published Nov 14, 2023 • 7
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits Paper • 2402.17764 • Published Feb 27 • 566
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper • 2402.13753 • Published Feb 21 • 104
BioLORD-2023: Semantic Textual Representations Fusing LLM and Clinical Knowledge Graph Insights Paper • 2311.16075 • Published Nov 27, 2023 • 5
Canonical models Collection This collection lists all the historical (pre-"Hub") canonical model checkpoints, i.e. repos that were not under an org or user namespace • 68 items • Updated Feb 13 • 13
Improving Text Embeddings with Large Language Models Paper • 2401.00368 • Published Dec 31, 2023 • 72
Zeroshot Classifiers Collection These are my current best zeroshot classifiers. Some of my older models are downloaded more often, but the models in this collection are newer/better. • 11 items • Updated Apr 3 • 76
Language Resources for Dutch Large Language Modelling Paper • 2312.12852 • Published Dec 20, 2023 • 9
NERetrieve: Dataset for Next Generation Named Entity Recognition and Retrieval Paper • 2310.14282 • Published Oct 22, 2023 • 5
Developing a Named Entity Recognition Dataset for Tagalog Paper • 2311.07161 • Published Nov 13, 2023 • 2
GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity Extraction Focused on Machine Learning Models and Datasets Paper • 2311.09860 • Published Nov 16, 2023 • 5
NER in Spanish Collection Fine-tuned models to perform NER in Spanish using the framework SpanMarker and different encoders and datasets • 3 items • Updated 9 days ago • 4
Direct Preference Optimization: Your Language Model is Secretly a Reward Model Paper • 2305.18290 • Published May 29, 2023 • 37