Joao Gante

joaogante

https://github.com/gante

joao_gante

gante

AI & ML interests

None yet

Articles

Organizations

Posts 2

Post

Adding a long prompt can help you fight LLM hallucinations. However, if you know exactly how you want your LLM output constrained, there are much better strategies! 💪

Did you know you can force your LLM to ALWAYS generate a valid JSON file? Or to follow a well-defined answer template? You can do that and more with the 🤗 transformers-compatible outlines library.

Besides giving you control over your LLM, it also makes your text generation application faster! 🔥 The more constrained your text generation is, the bigger the speedups you'll see!
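To make the speedup claim concrete, here is a toy sketch of one way constraints can save model calls. It is not how outlines is implemented: the state machine `fsm`, the `choose` callback (a stand-in for the LLM), and the function name `constrained_generate` are all hypothetical, made up for this illustration. The idea is that whenever the constraint leaves exactly one valid continuation, the text can be emitted without asking the model at all.

```python
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical constraint format: state -> {allowed text piece: next state}.
Fsm = Dict[str, Dict[str, str]]


def constrained_generate(
    fsm: Fsm,
    choose: Callable[[List[str], List[str]], str],
    start: str = "start",
    end: str = "end",
) -> Tuple[List[str], int]:
    """Walk the state machine, calling `choose` only when there is a real choice."""
    output: List[str] = []
    state = start
    model_calls = 0
    while state != end:
        options = fsm[state]
        if len(options) == 1:
            # Only one valid continuation: emit it without a model call.
            piece = next(iter(options))
        else:
            # Several valid continuations: let the (stand-in) model pick one.
            piece = choose(output, sorted(options))
            model_calls += 1
        output.append(piece)
        state = options[piece]
    return output, model_calls


# A template that only allows {"answer": "yes"} or {"answer": "no"}.
answer_fsm: Fsm = {
    "start": {'{"answer": "': "value"},
    "value": {"yes": "close", "no": "close"},
    "close": {'"}': "end"},
}

# Stand-in for the LLM: always picks the first allowed option.
pieces, calls = constrained_generate(answer_fsm, lambda prefix, opts: opts[0])
parsed = json.loads("".join(pieces))
```

In this toy template the model is consulted once for three emitted pieces, and the joined output always parses as JSON. The tighter the template, the fewer steps need the model.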

Follow @remi and other outlines folks to stay on top of the constrained generation game 🧠

Post

Up to 3x faster LLM generation with no extra resources/requirements - ngram speculation has landed in 🤗 transformers! 🏎️💨

All you need to do is to add prompt_lookup_num_tokens=10 to your generate call, and you'll get faster LLMs 🔥

How does it work? 🤔

Start with assisted generation, where a smaller model generates candidate sequences that the main model then checks. The net result is a significant speedup if the main model agrees with the candidate sequences! However, we do require a smaller model trained similarly 😕
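The candidate check can be sketched as follows, assuming greedy decoding. The names `verify_candidates` and `toy_next_token` are hypothetical and not part of transformers. For readability the sketch queries the main model once per position; a real implementation would typically score all candidates in a single forward pass, which is where the speedup comes from.

```python
from typing import Callable, List


def verify_candidates(
    next_token: Callable[[List[int]], int],
    context: List[int],
    candidates: List[int],
) -> List[int]:
    """Return the tokens produced by one round of greedy candidate checking."""
    accepted: List[int] = []
    for candidate in candidates:
        expected = next_token(context + accepted)
        if expected != candidate:
            # First disagreement: keep the main model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(candidate)
    # Every candidate matched, so the main model adds one more token.
    accepted.append(next_token(context + accepted))
    return accepted


def toy_next_token(sequence: List[int]) -> int:
    """Stand-in for the main model: counts upward modulo 10."""
    return (sequence[-1] + 1) % 10


# The first two candidates agree with the toy model, the third does not.
partial = verify_candidates(toy_next_token, [1, 2, 3], [4, 5, 9, 7])
# All candidates agree, so one extra token comes for free.
full = verify_candidates(toy_next_token, [1, 2, 3], [4, 5])
```

Each round yields at least one token (the main model's own), so wrong candidates cost little, while correct ones yield several tokens per round.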

The idea introduced (and implemented) by Apoorv Saxena consists of gathering the candidate sequences from the input text itself. If the latest generated ngram is in the input, use the continuation therein as a candidate! No smaller model is required while still achieving significant speedups 🔥
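A minimal sketch of the lookup step, under a few assumptions: tokens are plain integers, the search covers the whole sequence so far (prompt plus generated tokens), and the most recent earlier match wins. The function name `find_prompt_candidates` and its parameters `ngram_size` and `num_candidates` are hypothetical and do not mirror the transformers internals.

```python
from typing import List


def find_prompt_candidates(
    tokens: List[int],
    ngram_size: int = 2,
    num_candidates: int = 10,
) -> List[int]:
    """Propose candidates by finding the trailing ngram earlier in the sequence."""
    if len(tokens) <= ngram_size:
        return []
    ngram = tokens[-ngram_size:]
    # Scan earlier start positions, most recent first, skipping the trailing ngram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == ngram:
            follow = start + ngram_size
            # The tokens that followed the earlier occurrence become the candidates.
            return tokens[follow:follow + num_candidates]
    return []


# The trailing ngram [5, 6] also appears at the start, followed by 7, 8, 9.
candidates = find_prompt_candidates([5, 6, 7, 8, 9, 5, 6], ngram_size=2, num_candidates=3)
# No earlier occurrence of [3, 4], so there is nothing to propose.
no_match = find_prompt_candidates([1, 2, 3, 4])
```

The proposed tokens would then be checked by the main model exactly like candidates from an assistant model. When nothing matches, generation simply proceeds one token at a time.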

In fact, the penalty of gathering and testing the candidates is so small that you should use this technique whenever possible!

Here is the code example that produces the outputs shown in the video: https://pastebin.com/bms6XtR4

Have fun 🤗

spaces 6

models 7

joaogante/tiny-random-gpt2-with-generation-config

Updated Mar 7 • 3.29k

joaogante/Mistral-7B-Instruct-v0.2-medusa-wikitext

Updated Jan 7 • 1

joaogante/TinyLlama-1.1B-Chat-v1.0-medusa-wikitext

Updated Jan 6 • 1

joaogante/Mistral-7B-Instruct-v0.2-medusa-vicuna

Updated Jan 5

joaogante/test_audio

Automatic Speech Recognition • Updated Sep 13, 2023 • 4

joaogante/test_text

Fill-Mask • Updated Jun 15, 2022 • 1

Articles

Code Llama: Llama 2 learns to code

Assisted Generation: a new direction toward low-latency text generation

Faster Text Generation with TensorFlow and XLA

spaces 6

Assisted Generation Demo

Medusa Maker

Assisted Generation Benchmarks

Generate Quality Improvement

Color Coded Text Generation

Tf Xla Generate Benchmarks

models 7

joaogante/tiny-random-gpt2-with-generation-config

joaogante/Mistral-7B-Instruct-v0.2-medusa-wikitext

joaogante/TinyLlama-1.1B-Chat-v1.0-medusa-wikitext

joaogante/Mistral-7B-Instruct-v0.2-medusa-vicuna

joaogante/test_audio

joaogante/test_text

joaogante/test_img

datasets 1

joaogante/assisted_generation
