Hugging Face – Posts

All HF Hub posts

singhsidhukuldeep

posted an update about 5 hours ago

Post

267

How many times have you said Pandas is slow and still kept on using it? 🐼💨

Get ready to say Pandas can be fast but it's expensive 😂

🙌 Original Limitations:

💻 CPU-Bound Processing: Traditional pandas operations are CPU-bound (mostly single-threaded😰), leading to slower processing of large datasets.

🧠 Memory Constraints: Memory-intensive operations on large datasets can be inefficient or run into memory limits.

𝌣 Achievements with @nvidia RAPIDS cuDF:

🚀 GPU Acceleration: RAPIDS cuDF leverages GPU computing. Users switch to GPU-accelerated operations without modifying existing pandas code.

🔄 Unified Workflows: Seamlessly integrates GPU and CPU operations, falling back to CPU when necessary.

📈 Optimized Performance: By exploiting the massive parallelism of GPUs, cuDF achieves up to 150x speedup in data processing, as demonstrated on benchmarks such as DuckDB's.

😅New Limitations:

🎮 GPU Availability: Requires a GPU (not everything should need a GPU)

🔄 Library Compatibility: Still in its early stages; not all functionality can be ported yet.

🐢 Data Transfer Overhead: Moving data between CPU and GPU can introduce latency if not managed efficiently, since some operations still run on the CPU.

🤔 User Adoption: We already had vectorization support in Pandas; people just didn't use it because it was difficult to implement. We already had Dask for parallelization. It's not that solutions didn't exist.

Blog: https://developer.nvidia.com/blog/rapids-cudf-accelerates-pandas-nearly-150x-with-zero-code-changes/

For Jupyter Notebooks:

%load_ext cudf.pandas
import pandas as pd

For python scripts:

python -m cudf.pandas script.py
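
As a minimal sketch of what "zero code changes" means in practice, the script below is ordinary pandas code with no reference to cuDF. The file name `script.py`, the column names, and the data are made up for illustration. Run as `python script.py`, it uses regular CPU pandas; launched as `python -m cudf.pandas script.py` on a machine with RAPIDS cuDF installed, the same operations should be dispatched to the GPU where supported, with CPU fallback otherwise.

```python
# script.py (hypothetical file name): plain pandas, no cuDF-specific calls.
import pandas as pd

# Small made-up dataset; any real speedup only shows up on much larger frames.
df = pd.DataFrame(
    {
        "store": ["a", "b", "a", "c", "b", "a"],
        "sales": [10, 20, 30, 40, 50, 60],
    }
)

# A typical groupby aggregation, the kind of operation the post says
# can be accelerated without modifying the pandas code itself.
totals = df.groupby("store")["sales"].sum().sort_index()

print(totals.to_dict())
```

On a dataset this small the GPU path is unlikely to help, because the transfer overhead mentioned above dominates.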

kaisugi

posted an update about 5 hours ago

Post

220

🚀 Stockmark-100b

Stockmark Inc. has developed and released one of Japan's largest commercial-scale large language models (LLMs), with 100 billion parameters, named "Stockmark-LLM-100b". This model significantly reduces hallucinations and provides accurate responses to complex business-related queries. Developed from scratch with a focus on Japanese business data, the model aims to be reliable for high-stakes business environments. It's open-source and available for commercial use.

Key highlights:
- The model reduces hallucinations—incorrect confident responses that AI models sometimes generate.
- Stockmark-LLM-100b can answer basic business questions and specialized queries in industries like manufacturing.
- The model's performance surpasses GPT-4-turbo in accuracy for business-specific queries.
- Evaluation benchmarks (VicunaQA) show high performance.
- Fast inference speed, generating 100-character Japanese text in 1.86 seconds.

stockmark/stockmark-100b
stockmark/stockmark-100b-instruct-v0.1

Detailed press release (in Japanese): https://stockmark.co.jp/news/20240516

2 replies

leonardlin

posted an update about 6 hours ago

Post

214

llm-jp-eval is currently one of the most widely used benchmarks for Japanese LLMs and is half of WandB's comprehensive Nejumi LLM Leaderboard scoring. I was seeing some weirdness in results I was getting and ended up in a bit of a rabbit hole. Here's my article on evaling llm-jp-eval: https://huggingface.co/blog/leonardlin/llm-jp-eval-eval

I've set up a fork of Lightblue's Shaberi testing framework, which uses LLM-as-a-Judge style benchmarks, as something probably more representative of real-world LLM strength in Japanese. Here's how the new base model ablations are looking:

Fredtt3

posted an update about 9 hours ago

Post

290

New update to StockAI so that it uses CUDA: RiveraAI/StockAI 👾
🤖

mrfakename

posted an update about 11 hours ago

Post

354

Introducing StyleTTS 2 detector, an audio classification model to detect StyleTTS 2 vs human-generated content!

Dual-licensed under MIT/Apache 2.0.

Model Weights: mrfakename/styletts2-detector
Spaces: mrfakename/styletts2-detector

kadirnar

posted an update about 11 hours ago

Post

273

Midjourney + Custom SDXL-Lightning:

2 replies

not-lain

posted an update about 13 hours ago

Post

366

If you're a researcher or developing your own model 👀 you might want to take a look at Hugging Face's ModelHubMixin classes.
They are used to seamlessly integrate your AI model with Hugging Face and to save/load your model easily 🚀

1️⃣ make sure you're using the appropriate library version

pip install -qU "huggingface_hub>=0.22"

2️⃣ inherit from the appropriate class

from huggingface_hub import PyTorchModelHubMixin
from torch import nn

# Inheriting from PyTorchModelHubMixin adds save_pretrained, push_to_hub and from_pretrained
class MyModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, a, b):
        super().__init__()
        self.layer = nn.Linear(a, b)

    def forward(self, inputs):
        return self.layer(inputs)

3️⃣ initialize your model

first_model = MyModel(3,1)

4️⃣ push the model to the hub (or use the save_pretrained method to save it locally)

first_model.push_to_hub("not-lain/test")

5️⃣ Load and initialize the model from the hub using the original class

pretrained_model = MyModel.from_pretrained("not-lain/test")

Salama1429

posted an update about 16 hours ago

Post

471

📚 Introducing the 101 Billion Arabic Words Dataset

🌐 Exciting Milestone in Arabic Language Technology! #NLP #ArabicLLM #LanguageModels

🚀 Why It Matters:
1. 🌟 Large Language Models (LLMs) have brought transformative changes, primarily in English. It's time for Arabic to shine!
2. 🎯 This project addresses the critical challenge of bias in Arabic LLMs due to reliance on translated datasets.

🔍 Approach:
1. 💪 Undertook a massive data mining initiative focusing exclusively on Arabic from Common Crawl WET files.
2. 🧹 Employed state-of-the-art cleaning and deduplication processes to maintain data quality and uniqueness.

📈 Impact:
1. 🏆 Created the largest Arabic dataset to date with 101 billion words.
2. 📝 Enables the development of Arabic LLMs that are linguistically and culturally accurate.
3. 🌍 Sets a global benchmark for future Arabic language research.

🔗 Paper: https://lnkd.in/dGAiaygn
🔗 Dataset: https://lnkd.in/dGTMe5QV

- 🔄 Share your thoughts and let's drive the future of Arabic NLP together!

#DataScience #MachineLearning #ArtificialIntelligence #Innovation #ArabicData

akhaliq

posted an update about 21 hours ago

Post

968

Chameleon

Mixed-Modal Early-Fusion Foundation Models

Chameleon: Mixed-Modal Early-Fusion Foundation Models (2405.09818)

We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.

merve

posted an update about 22 hours ago

Post

1029

I got asked about PaliGemma's document understanding capabilities, so I built a Space that has all the PaliGemma fine-tuned doc models 📄📊📖
merve/paligemma-doc
