philschmid HF staff committed on
Commit 94d995f • 1 Parent(s): d7024c3

Update README.md

Files changed (1)
  1. README.md +31 -3
README.md CHANGED
@@ -7,11 +7,39 @@ sdk: static
pinned: false
---

- Text-Generation-Inference is a Rust, Python and gRPC server for text generation inference. Used in production at [HuggingFace](https://huggingface.co) to power LLMs api-inference widgets.
+ Text-Generation-Inference is an open-source, purpose-built solution for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. Text Generation Inference is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative, and it implements optimizations for all supported model architectures, including:
+
+ - Tensor Parallelism and custom CUDA kernels
+ - Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures
+ - Quantization with bitsandbytes or GPTQ
+ - Continuous batching of incoming requests for increased total throughput
+ - Accelerated weight loading (start-up time) with safetensors
+ - Logits warpers (temperature scaling, top-k, repetition penalty, ...)
+ - Watermarking with A Watermark for Large Language Models
+ - Stop sequences, log probabilities
+ - Token streaming using Server-Sent Events (SSE) (see the sketch after this list)

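A running TGI server is queried over plain HTTP. The sketch below is illustrative rather than an official example: it assumes an instance is already listening at http://127.0.0.1:8080 (the URL and prompt are placeholders) and shows a synchronous call to `/generate` using a few of the logits-warper and stop-sequence parameters from the list above, followed by token streaming from `/generate_stream` via Server-Sent Events.

```python
import json

import requests

BASE_URL = "http://127.0.0.1:8080"  # placeholder address of a running TGI server

# One synchronous request to /generate. The `parameters` field carries the
# logits warpers (temperature, top_k, repetition_penalty) and stop sequences
# mentioned in the feature list above.
payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {
        "max_new_tokens": 64,
        "temperature": 0.7,
        "top_k": 50,
        "repetition_penalty": 1.1,
        "stop": ["\n\n"],
    },
}
resp = requests.post(f"{BASE_URL}/generate", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["generated_text"])

# Token streaming from /generate_stream: the server emits Server-Sent Events,
# one `data: {...}` line per generated token.
with requests.post(
    f"{BASE_URL}/generate_stream", json=payload, stream=True, timeout=60
) as stream:
    stream.raise_for_status()
    for line in stream.iter_lines():
        if line.startswith(b"data:"):
            event = json.loads(line[len(b"data:"):])
            print(event["token"]["text"], end="", flush=True)
```
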
  <img width="300px" src="https://huggingface.co/spaces/text-generation-inference/README/resolve/main/architecture.jpg" />
 
+ ## Currently optimized architectures
+
+ - [BLOOM](https://huggingface.co/bigscience/bloom)
+ - [FLAN-T5](https://huggingface.co/google/flan-t5-xxl)
+ - [Galactica](https://huggingface.co/facebook/galactica-120b)
+ - [GPT-NeoX](https://huggingface.co/EleutherAI/gpt-neox-20b)
+ - [Llama](https://github.com/facebookresearch/llama)
+ - [OPT](https://huggingface.co/facebook/opt-66b)
+ - [SantaCoder](https://huggingface.co/bigcode/santacoder)
+ - [StarCoder](https://huggingface.co/bigcode/starcoder)
+ - [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b)
+ - [Falcon 40B](https://huggingface.co/tiiuae/falcon-40b)
+
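Any architecture in this list is served and queried the same way. A minimal sketch, assuming the `text-generation` Python client shipped in the TGI repository is installed (`pip install text-generation`) and a local server is already running one of the listed checkpoints:

```python
from text_generation import Client

# Placeholder URL; assumes a TGI server is already serving one of the
# checkpoints listed above (e.g., bigcode/starcoder) at this address.
client = Client("http://127.0.0.1:8080")

response = client.generate("def fibonacci(n):", max_new_tokens=64)
print(response.generated_text)
```
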
## Check out the source code 👉
- the server backend: https://github.com/huggingface/text-generation-inference
- - the Chat UI: https://huggingface.co/spaces/text-generation-inference/chat-ui
+ - the Chat UI: https://huggingface.co/spaces/text-generation-inference/chat-ui
+
+ ## Check out examples
+
+ - [Introducing the Hugging Face LLM Inference Container for Amazon SageMaker](https://huggingface.co/blog/sagemaker-huggingface-llm)
+ - [Deploy LLMs with Hugging Face Inference Endpoints](https://huggingface.co/blog/inference-endpoints-llm)