Accelerate Transformers on State-of-the-Art Hardware
Hugging Face is partnering with leading makers of AI hardware accelerators to make state-of-the-art production performance accessible.
Meet the Hugging Face Hardware Partners
Optimum: the ML Optimization toolkit for production performance
Hardware-specific acceleration tools
1. Quantize
Make models faster with minimal impact on accuracy, leveraging post-training quantization, quantization-aware training and dynamic quantization from Intel® Neural Compressor.
huggingface@hardware:~
from transformers import AutoModelForQuestionAnswering
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel.neural_compressor import INCQuantizer, INCModelForQuestionAnswering
model_name = "distilbert-base-cased-distilled-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
# The directory where the quantized model will be saved
save_dir = "quantized_model"
# Load the quantization configuration detailing the quantization we wish to apply
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer = INCQuantizer.from_pretrained(model)
# Apply dynamic quantization and save the resulting model
quantizer.quantize(quantization_config=quantization_config, save_directory=save_dir)
# Load the resulting quantized model, which can be hosted on the HF hub or locally
loaded_model = INCModelForQuestionAnswering.from_pretrained(save_dir)
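To make the idea behind quantization concrete, here is a minimal, library-free sketch of symmetric int8 quantization applied to a short list of weights. It is not how Intel® Neural Compressor implements quantization; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical and chosen only for illustration.

```python
def quantize_int8(weights):
    """Map floats to int8 values using one symmetric scale for the whole list."""
    max_abs = max(abs(w) for w in weights)
    # Fall back to a scale of 1.0 for an all-zero list to avoid dividing by zero
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 values and the scale."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.05, 0.9]  # hypothetical sample weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# The rounding error per weight is at most half a quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as a small integer plus one shared scale, which is where the size and speed gains come from; `max_error` shows how little precision is lost on this toy input.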
2. Prune
Make models smaller with minimal impact on accuracy, using easy-to-use configurations to remove model weights with Intel® Neural Compressor.
huggingface@hardware:~
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from neural_compressor import WeightPruningConfig
from optimum.intel.neural_compressor import INCTrainer
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the pruning configuration detailing the pruning we wish to apply
# (the sparsity target and step range below are illustrative values)
pruning_config = WeightPruningConfig(pruning_type="magnitude", start_step=0, end_step=15, target_sparsity=0.2)
# training_args (a transformers.TrainingArguments instance) and train_dataset are assumed to be defined beforehand
trainer = INCTrainer(model=model, pruning_config=pruning_config, args=training_args, train_dataset=train_dataset, tokenizer=tokenizer)
# Train the model while applying pruning
trainer.train()
# Save the model and/or push to hub
trainer.save_model()
trainer.push_to_hub()
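As a rough illustration of what removing model weights means, the sketch below applies magnitude pruning to a short list of weights in plain Python: the weights with the smallest absolute values are set to zero until a target sparsity is reached. This is only one common pruning criterion and is not the Intel® Neural Compressor implementation; the function name `magnitude_prune` and the sample weights are hypothetical.

```python
def magnitude_prune(weights, target_sparsity):
    """Zero out the smallest-magnitude weights until the target sparsity is reached."""
    n_prune = int(len(weights) * target_sparsity)
    # Indices of the weights, sorted by ascending magnitude
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


weights = [0.42, -1.27, 0.05, 0.9, -0.01, 0.3]  # hypothetical sample weights
pruned = magnitude_prune(weights, target_sparsity=0.5)
# Fraction of weights that are now exactly zero
sparsity = pruned.count(0.0) / len(pruned)
```

In practice pruning is usually applied gradually during fine-tuning so the remaining weights can compensate, which is why the training loop and the pruning configuration go together.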
3. Train
Train models faster with Graphcore Intelligence Processing Units (IPUs), hardware built specifically for AI workloads, using the built-in IPUTrainer API to train or fine-tune Transformers models (coming soon).
huggingface@hardware:~
from optimum.graphcore import IPUConfig, IPUTrainer
from transformers import BertForPreTraining, BertTokenizer
# Allocate model and tokenizer as usual
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForPreTraining.from_pretrained("bert-base-cased")
# IPU configuration + Trainer
ipu_config = IPUConfig.from_pretrained("Graphcore/bert-base-ipu")
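# training_args (an optimum.graphcore.IPUTrainingArguments instance) and train_dataset are assumed to be defined beforehand
trainer = IPUTrainer(model, ipu_config=ipu_config, args=training_args, train_dataset=train_dataset)
# The trainer compiles the model for the IPUs in the background,
# so no manual compilation step is needed before training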
trainer.train()
# Save the model and/or push to hub
model.save_pretrained("...")
model.push_to_hub("...")