lhl (leonardlin)

Posts

Post
llm-jp-eval is currently one of the most widely used benchmarks for Japanese LLMs and makes up half of the scoring for WandB's comprehensive Nejumi LLM Leaderboard. I was seeing some weirdness in the results I was getting and ended up down a bit of a rabbit hole. Here's my article on evaling llm-jp-eval: https://huggingface.co/blog/leonardlin/llm-jp-eval-eval

I've set up a fork of Lightblue's Shaberi testing framework, which uses LLM-as-a-Judge style benchmarks that are probably more representative of real-world LLM strength in Japanese, and I've been running the new base model ablations through it.
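For anyone unfamiliar with the pattern, here's a minimal sketch of LLM-as-a-Judge scoring. It shows the general idea only, not Shaberi's actual prompts, rubric, or judge model; the OpenAI client usage and model name below are assumptions:

import re
from openai import OpenAI

# Hedged sketch of LLM-as-a-Judge scoring: ask a strong model to grade an
# answer on a 1-5 scale and parse the number out of its reply.
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, answer: str, model: str = "gpt-4-turbo") -> int:
    prompt = (
        "Rate the following Japanese assistant answer from 1 (useless) to 5 "
        "(excellent) for correctness and fluency. Reply with only the number.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    m = re.search(r"[1-5]", resp.choices[0].message.content)
    return int(m.group()) if m else 0

print(judge("日本で一番高い山は何ですか？", "富士山です。標高は3776メートルです。"))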
Post
I've been doing some evals and tuning, and this chat template repo maintained by @chujiezheng is great: https://github.com/chujiezheng/chat_templates
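Since the repo is just a collection of .jinja template files, wiring one onto a tokenizer is straightforward. A minimal sketch, assuming the repo is cloned locally and that chat_templates/chatml.jinja is the template you want (the exact filename and the whitespace stripping are assumptions about the repo layout):

from transformers import AutoTokenizer

# Hedged sketch: override a tokenizer's chat_template with one of the repo's
# .jinja files (path/filename assumed; pick the template that matches your model)
tokenizer = AutoTokenizer.from_pretrained("augmxnt/shisa-7b-v1")

with open("chat_templates/chatml.jinja") as f:
    template = f.read()

# the repo's templates are formatted across multiple lines for readability;
# flattening them avoids stray whitespace in the rendered prompt
tokenizer.chat_template = template.replace("    ", "").replace("\n", "")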

Here's also a simple script for checking what the output looks like:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("augmxnt/shisa-7b-v1")
messages = [
    {'role': 'user', 'content': 'This is the first user input.'},
    {'role': 'assistant', 'content': 'This is the first assistant response.'},
    {'role': 'user', 'content': 'This is the second user input.'},
]

# Show the raw Jinja chat template string stored on the tokenizer
print()
print('Chat Template:')
print(tokenizer.chat_template)
print()
print('---')
print()

# Render the conversation through the template (as text, not token IDs)
print(tokenizer.apply_chat_template(messages, tokenize=False))
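To see exactly what gets fed to the model at inference time, it's also worth rendering with the generation prompt appended (this relies on the template actually defining one):

print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))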
