Unit 4: Going Further with Diffusion Models

Welcome to Unit 4 of the Hugging Face Diffusion Models Course! In this unit, we will look at some of the many improvements and extensions to diffusion models appearing in the latest research. It will be less code-heavy than previous units have been and is designed to give you a jumping-off point for further research.

Start this Unit :rocket:

Here are the steps for this unit:

  • Make sure you’ve signed up for this course so that you can be notified when additional units are added.
  • Read through the material below for an overview of the different topics covered in this unit.
  • Dive deeper into any specific topics with the linked videos and resources.
  • Explore the demo notebooks and then read the ‘What Next’ section for some project suggestions.

:loudspeaker: Don’t forget to join Discord, where you can discuss the material and share what you’ve made in the #diffusion-models-class channel.

Table of Contents

  • Faster Sampling via Distillation
  • Training Improvements
  • More Control for Generation and Editing
  • Video
  • Audio
  • New Architectures and Approaches - Towards ‘Iterative Refinement’
  • Hands-On Notebooks
  • Where Next?

Faster Sampling via Distillation

Progressive distillation is a technique for taking an existing diffusion model and using it to train a new version of the model that requires fewer steps for inference. The ‘student’ model is initialized from the weights of the ‘teacher’ model. During training, the teacher model performs two sampling steps and the student model tries to match the resulting prediction in a single step. This process can be repeated multiple times, with the previous iteration’s student model becoming the teacher for the next stage. The result is a model that can produce decent samples in far fewer steps (typically 4 or 8) than the original teacher model. The core mechanism is illustrated in this diagram from the paper that introduced the idea:

Progressive Distillation illustrated (from the paper)
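
To make the training objective more concrete, here is a minimal sketch of one progressive distillation step. It assumes both models predict the clean image x0 directly and that a hypothetical `schedule(t)` helper returns the (alpha, sigma) pair for a timestep; the loss weighting used in the paper is omitted for brevity:

```python
import torch

def ddim_step(z_t, x_pred, alpha_t, sigma_t, alpha_s, sigma_s):
    # Deterministic DDIM update from timestep t to an earlier timestep s,
    # given the model's prediction of the clean image x_pred.
    eps_pred = (z_t - alpha_t * x_pred) / sigma_t
    return alpha_s * x_pred + sigma_s * eps_pred

def progressive_distillation_loss(student, teacher, z_t, t, t_mid, t_next, schedule):
    a_t, s_t = schedule(t)      # (alpha, sigma) at the starting timestep
    a_m, s_m = schedule(t_mid)  # ... halfway point of the teacher's two steps
    a_n, s_n = schedule(t_next)  # ... where the teacher ends up

    with torch.no_grad():
        # Teacher takes two DDIM steps: t -> t_mid -> t_next
        z_mid = ddim_step(z_t, teacher(z_t, t), a_t, s_t, a_m, s_m)
        z_next = ddim_step(z_mid, teacher(z_mid, t_mid), a_m, s_m, a_n, s_n)
        # Clean-image target that would send the student from z_t to z_next in ONE step
        x_target = (z_next - (s_n / s_t) * z_t) / (a_n - (s_n / s_t) * a_t)

    # Student matches the two-step teacher result with a single prediction
    return torch.mean((student(z_t, t) - x_target) ** 2)
```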

The idea of using an existing model to ‘teach’ a new model can be extended to create guided models where the classifier-free guidance technique is used by the teacher model and the student model must learn to produce an equivalent output in a single step based on an additional input specifying the targeted guidance scale. This further reduces the number of model evaluations required to produce high-quality samples. This video gives an overview of the approach.

NB: A distilled version of Stable Diffusion can be used here.

Key references:

Training Improvements

Several additional tricks have been developed to improve diffusion model training. In this section we’ve tried to capture the core ideas from recent papers. There is a constant stream of research coming out with additional improvements, so if you see a paper you feel should be added here, please let us know!

Figure 2 from the ERNIE-ViLG 2.0 paper

Key training improvements:

  • Tuning the noise schedule, loss weighting and sampling trajectories for more efficient training. An excellent paper exploring some of these design choices is Elucidating the Design Space of Diffusion-Based Generative Models by Karras et al. (a minimal sketch of the noise-level schedule it proposes follows this list).
  • Training on diverse aspect ratios, as described in this video from the course launch event.
  • Cascaded diffusion models, training one model at low resolution and then one or more super-resolution models. Used in DALL-E 2, Imagen and more for high-resolution image generation.
  • Better conditioning, incorporating rich text embeddings (Imagen uses a large language model called T5) or multiple types of conditioning (eDiffi).
  • ‘Knowledge Enhancement’ - incorporating pre-trained image captioning and object detection models into the training process to create more informative captions and produce better performance (ERNIE-ViLG 2.0).
  • ‘Mixture of Denoising Experts’ (MoDE) - training different variants of the model (‘experts’) for different noise levels as illustrated in the image above from the ERNIE-ViLG 2.0 paper.
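
As an example of the first point above, here is a minimal sketch of the noise-level (sigma) schedule proposed in the Karras et al. paper, which spaces the noise levels in rho-th-root space so that more of the sampling budget is spent at low noise levels where fine detail is added (the default values below are the ones commonly quoted from the paper):

```python
import numpy as np

def karras_sigmas(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    # Interpolate between sigma_max and sigma_min in rho-th-root space,
    # then raise back to the rho-th power.
    ramp = np.linspace(0, 1, n_steps)
    min_inv_rho = sigma_min ** (1 / rho)
    max_inv_rho = sigma_max ** (1 / rho)
    return (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho

print(karras_sigmas(10))  # 10 decreasing noise levels from 80.0 down to 0.002
```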

Key references:

More Control for Generation and Editing

In addition to training improvements, there have been several innovations in the sampling and inference phase, including many approaches that can add new capabilities to existing diffusion models.

Samples generated by ‘paint-with-words’ (eDiffi)

The video ‘Editing Images with Diffusion Models’ gives an overview of the different methods being used to edit existing images with diffusion models. The available techniques can be split into four main categories:

1) Add noise and then denoise with a new prompt. This is the idea behind the img2img pipeline, which has been modified and extended in various papers.
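
As a concrete illustration of this first category, here is a minimal sketch using the diffusers StableDiffusionImg2ImgPipeline. The model ID and input image path are just placeholders, and argument names may differ slightly between diffusers versions:

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("sketch.png").resize((512, 512))  # placeholder input image

# `strength` controls how much noise is added before denoising with the new prompt:
# low values stay close to the input, values near 1.0 mostly ignore it.
result = pipe(
    prompt="An oil painting of a mountain village at sunset",
    image=init_image,
    strength=0.7,
    guidance_scale=7.5,
).images[0]
result.save("edited.png")
```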

The paper InstructPix2Pix: Learning to Follow Image Editing Instructions is notable in that it used some of the image editing techniques described above to build a synthetic dataset of image pairs and editing instructions (generated with GPT-3), which was then used to train a new model capable of editing images based on natural language instructions.

Video

Still frames from sample videos generated with Imagen Video

A video can be represented as a sequence of images, and the core ideas of diffusion models can be applied to these sequences. Recent work has focused on finding appropriate architectures (such as ‘3D UNets’ which operate on entire sequences) and on working efficiently with video data. Since high-frame-rate video involves a lot more data than still images, current approaches tend to first generate low-resolution and low-frame-rate video and then apply spatial and temporal super-resolution to produce the final high-quality video outputs.
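
To give a feel for how image architectures get extended to sequences, here is a minimal, hypothetical sketch of a factorised space-time block (an assumption about one common design pattern, not the exact layer used by any particular paper): a 2D operation is applied to every frame independently, then a 1D operation mixes information across frames at each spatial location.

```python
import torch
import torch.nn as nn

class FactorisedSpaceTimeBlock(nn.Module):
    """Illustrative sketch: per-frame spatial conv followed by per-location temporal conv."""

    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        # Spatial conv applied to every frame independently
        x = self.spatial(x.reshape(b * f, c, h, w)).reshape(b, f, c, h, w)
        # Temporal conv applied across frames at every spatial location
        x = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
        x = self.temporal(x)
        x = x.reshape(b, h, w, c, f).permute(0, 4, 3, 1, 2)
        return x

video = torch.randn(2, 8, 32, 16, 16)  # toy batch: 8 frames of 16x16 feature maps
out = FactorisedSpaceTimeBlock(32)(video)
print(out.shape)  # torch.Size([2, 8, 32, 16, 16])
```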

Key references:

Audio

A spectrogram generated with Riffusion (image source)

While there has been some work on generating audio directly using diffusion models (e.g. DiffWave), the most successful approach so far has been to convert the audio signal into something called a spectrogram, which effectively ‘encodes’ the audio as a 2D “image” that can then be used to train the kinds of diffusion models we’re used to using for image generation. The resulting generated spectrograms can then be converted into audio using existing methods. This approach is behind the recently released Riffusion, which fine-tuned Stable Diffusion to generate spectrograms conditioned on text - try it out here.
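
As a rough sketch of the audio-to-spectrogram round trip (Riffusion uses its own conversion code; this is just a generic illustration with torchaudio, using placeholder audio and settings):

```python
import torch
import torchaudio

sample_rate, n_fft, n_mels = 22050, 1024, 128
waveform = torch.randn(1, sample_rate * 5)  # placeholder: 5 seconds of "audio"

# Audio -> mel spectrogram: the 2D "image" a diffusion model can be trained on
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=n_fft, hop_length=256, n_mels=n_mels
)
mel = to_mel(waveform)  # shape: (1, n_mels, time_frames)

# Spectrogram -> audio: invert the mel scale, then estimate phase with Griffin-Lim
to_linear = torchaudio.transforms.InverseMelScale(
    n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sample_rate
)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=256)
reconstructed = griffin_lim(to_linear(mel))
print(mel.shape, reconstructed.shape)
```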

The field of audio generation is moving extremely quickly. Over the past week (at the time of writing) there have been at least 5 new advances announced, which are marked with a star in the list below:

Key references:

New Architectures and Approaches - Towards ‘Iterative Refinement’

Figure 1 from the Cold Diffusion paper

We are slowly moving beyond the original narrow definition of “diffusion” models and towards a more general class of models that perform iterative refinement, where some form of corruption (like the addition of Gaussian noise in the forward diffusion process) is gradually reversed to generate samples. The ‘Cold Diffusion’ paper demonstrated that many other types of corruption can be iteratively ‘undone’ to generate images (examples shown in the figure above), and recent transformer-based approaches have demonstrated the effectiveness of token replacement or masking as a noising strategy.
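
In the spirit of the Cold Diffusion paper’s improved sampler, a generic iterative refinement loop can be sketched as follows (the `degrade` and `restore` functions are assumptions standing in for an arbitrary corruption operator and its learned inverse):

```python
def iterative_refinement_sample(x_T, restore, degrade, T):
    # x_T: a maximally-corrupted starting point (e.g. pure noise, a blurred image, ...)
    # degrade(x0, t): applies the chosen corruption at severity t (degrade(x0, 0) = x0)
    # restore(x_t, t): learned model that predicts the clean sample from a corrupted one
    x = x_T
    for t in range(T, 0, -1):
        x0_hat = restore(x, t)
        # Step to a slightly milder corruption level, reusing the current estimate
        x = x - degrade(x0_hat, t) + degrade(x0_hat, t - 1)
    return x
```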

Pipeline from MaskGIT

The UNet architecture at the heart of many current diffusion models is also being replaced with alternatives, most notably various transformer-based architectures. In Scalable Diffusion Models with Transformers (DiT), a transformer is used in place of the UNet in a fairly standard diffusion model setup, with excellent results. The Recurrent Interface Networks paper applies a novel transformer-based architecture and training strategy in pursuit of additional efficiency. MaskGIT and MUSE use transformer models to work with tokenized representations of images, although the Paella model demonstrates that a UNet can also be applied successfully to these token-based regimes.
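
To make the DiT-style setup a little more concrete, here is a minimal, illustrative sketch of how a latent image can be turned into a token sequence for a transformer denoiser (channel count, patch size and embedding dimension are just plausible placeholders):

```python
import torch
import torch.nn as nn

class LatentPatchify(nn.Module):
    """Sketch of patchifying a VAE latent into transformer tokens, in the spirit of DiT."""

    def __init__(self, in_channels=4, patch_size=2, dim=512):
        super().__init__()
        # A strided conv both splits the latent into patches and embeds them
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, latents):
        # latents: (batch, channels, height, width), e.g. a 4x32x32 VAE latent
        tokens = self.proj(latents)               # (batch, dim, h/p, w/p)
        return tokens.flatten(2).transpose(1, 2)  # (batch, num_tokens, dim)

latents = torch.randn(1, 4, 32, 32)
tokens = LatentPatchify()(latents)
print(tokens.shape)  # torch.Size([1, 256, 512])
```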

With each new paper, more efficient approaches are being developed, and it may be some time before we see what peak performance looks like on these kinds of iterative refinement tasks. There is much more still to explore!

Key references:

Hands-On Notebooks

Chapter | Colab | Kaggle | Gradient | Studio Lab
--- | --- | --- | --- | ---
DDIM Inversion | Open In Colab | Kaggle | Gradient | Open In SageMaker Studio Lab
Diffusion for Audio | Open In Colab | Kaggle | Gradient | Open In SageMaker Studio Lab

We’ve covered a LOT of different ideas in this unit, many of which deserve much more detailed follow-on lessons in the future. For now, you can get hands-on with two of these topics via the notebooks we’ve prepared.

  • DDIM Inversion shows how a technique called inversion can be used to edit images using existing diffusion models.
  • Diffusion for Audio introduces the idea of spectrograms and shows a minimal example of fine-tuning an audio diffusion model on a specific genre of music.

Where Next?

This is the final unit of this course for now, which means that what comes next is up to you! Remember that you can always ask questions and chat about your projects on the Hugging Face Discord. We look forward to seeing what you create 🤗.