
The Power of Diffusion Models


For Fall 2024, I am taking an image processing class. In this assignment, I play around with diffusion models to create some optical illusions, and I also implement a diffusion model from scratch.

Part A
#

Project Overview
#

In this project, I explored the capabilities of diffusion models using DeepFloyd, a two-stage model trained by Stability AI. The project consisted of several experiments investigating different aspects of diffusion models, from basic sampling to creating optical illusions.

Part 1: Experiments with Diffusion Models
#

1.1 Sampling Parameters Investigation
#

I first investigated how the number of sampling steps affects image generation quality using DeepFloyd’s two-stage process:

20 steps for both stages: Produced the highest-quality results, with good prompt adherence and minimal artifacts

5 steps (stage 1), 20 steps (stage 2): Generated high-quality images but with less prompt fidelity

20 steps (stage 1), 5 steps (stage 2): Followed the prompt well but exhibited noticeable pattern artifacts

1.2 Forward Process and Denoising Experiments
#

Forward Process
#

I implemented the forward process to add controlled amounts of noise to images. Using the Berkeley Campanile as a test image, I demonstrated progressive noise addition:

Noise level 250
Noise level 500
Noise level 750
Progressive noise addition at t = [250, 500, 750]
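
Conceptually, the forward process is a single reparameterized step: scale the clean image down and mix in Gaussian noise according to the cumulative noise schedule at timestep t. A minimal sketch, assuming alphas_cumprod is the 1-D tensor of cumulative schedule products exposed by the model's scheduler:

import torch

def forward_process(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0): sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(x0)  # eps ~ N(0, I), same shape as the image
    return abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps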

Classical Denoising (Gaussian Blur)
#

Gaussian blur applied at each noise level shows only limited effectiveness:

Gaussian Blur t=250
Gaussian Blur t=500
Gaussian Blur t=750
Gaussian blur denoising results at t = [250, 500, 750]
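
The classical baseline is just low-pass filtering: blur the noisy image and hope the noise averages out while the signal survives. A sketch using torchvision (the kernel size and sigma here are illustrative, not the exact values I used):

import torchvision.transforms.functional as TF

def gaussian_denoise(x_noisy, kernel_size=5, sigma=2.0):
    # Blurring suppresses high-frequency noise, but it also destroys real detail,
    # which is why this approach breaks down at higher noise levels.
    return TF.gaussian_blur(x_noisy, kernel_size=kernel_size, sigma=sigma)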

One-Step Neural Denoising
#

One-step denoising with the pretrained UNet produced significantly better results than classical blurring:

One-step denoising t=250
One-step denoising t=500
One-step denoising t=750
One-step neural denoising results at t = [250, 500, 750]

Iterative Neural Denoising
#

Iterative denoising steps
Progressive improvement through iterative denoising

Method comparison
Comparison of different denoising methods
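
Iterative denoising walks back along the noise schedule one step at a time, using the UNet's noise estimate to move the current image toward a slightly cleaner one. The sketch below is a standard DDPM-style update rather than DeepFloyd's exact interpolation formula, and it assumes unet(x, t) returns the predicted noise:

import torch

@torch.no_grad()
def iterative_denoise(x_T, unet, alphas, alphas_cumprod, timesteps):
    # Reverse process: start from pure noise (or a noised image) and step toward t = 0.
    x = x_T
    for t in reversed(timesteps):
        eps = unet(x, t)  # predicted noise at this step
        a_t, abar_t = alphas[t], alphas_cumprod[t]
        # Remove the predicted noise, rescaled according to the schedule.
        x = (x - (1 - a_t) / (1 - abar_t).sqrt() * eps) / a_t.sqrt()
        if t > 0:
            x = x + (1 - a_t).sqrt() * torch.randn_like(x)  # re-inject fresh noise except at the last step
    return x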

1.3 Diffusion Model Sampling and CFG
#

Basic sampling results:

Iterative denoising samples
Five samples generated using iterative denoising

With Classifier-Free Guidance (CFG):

CFG samples
Five samples generated using CFG with γ=7
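
Classifier-free guidance runs the UNet twice per step, once with the text prompt and once with a null prompt, then extrapolates the conditional estimate away from the unconditional one; γ = 7 pushes samples strongly toward the prompt. A sketch, assuming the UNet accepts a prompt embedding:

def cfg_noise_estimate(unet, x_t, t, cond_emb, uncond_emb, gamma=7.0):
    eps_cond = unet(x_t, t, cond_emb)      # conditioned on the text prompt
    eps_uncond = unet(x_t, t, uncond_emb)  # conditioned on the empty prompt
    # gamma = 1 recovers the plain conditional estimate; gamma > 1 amplifies the prompt's influence.
    return eps_uncond + gamma * (eps_cond - eps_uncond)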

1.4 Image-to-Image Translation
#
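
These translations follow an SDEdit-style recipe: noise the input image to an intermediate timestep with the forward process, then run iterative denoising from there, so the model projects the image back onto the natural image manifold. The earlier the starting timestep, the further the result drifts from the original. A sketch reusing the illustrative helpers from the sections above:

def translate(x_orig, start_t, unet, alphas, alphas_cumprod):
    # Larger start_t means more noise is added, so the translation is looser and more creative.
    x_t = forward_process(x_orig, start_t, alphas_cumprod)
    return iterative_denoise(x_t, unet, alphas, alphas_cumprod, timesteps=range(start_t + 1))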

Basic Translation Examples
#

Original Campanile and its translation:

Original Campanile
Translated Campanile

Car translation:

Original Car
Translated Car

Selfie translation:

Original Selfie
Translated Selfie

Anime Character Translations
#

Pikachu:

Original Pikachu
Translated Pikachu

Conan:

Original Conan
Translated Conan

Doraemon:

Original Doraemon
Translated Doraemon

Hand-Drawn Image Translations
#

First drawing:

Hand-drawn 1
Translated Hand-drawn 1

Second drawing:

Hand-drawn 2
Translated Hand-drawn 2

1.5 Inpainting Results
#

Three different inpainting scenarios:

Inpainted Castle
Osaka Castle inpainting

Inpainted Campanile
Berkeley Campanile inpainting

Inpainted Selfie
Selfie inpainting
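
Inpainting reuses the same denoising loop with one extra step: after every update, pixels outside the mask are reset to an appropriately noised copy of the original image, so only the masked region is actually generated. A sketch, where the mask m is 1 wherever new content should be created (forward_process is the illustrative helper from earlier):

def inpaint_step(x_t, x_orig, m, t, alphas_cumprod):
    # Keep the generated content inside the mask; everywhere else, force the
    # forward-noised original image back in so the known pixels stay faithful.
    return m * x_t + (1 - m) * forward_process(x_orig, t, alphas_cumprod)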

1.6 Text-Conditional Image Translation
#

Progressive translations guided by text prompts:

Campanile to Rocket
Campanile gradually transformed into a rocket ship

Doraemon to Dog
Doraemon transformed into a realistic dog

Pikachu to Pencil
Pikachu transformed into a pencil sketch

1.7 Creative Applications
#

Visual Anagrams
#

Created four sets of optical illusions that show different images when flipped upside down:

Rocket and Dog Anagram
Anagram switching between rocket and dog

Coast and Man Anagram
Anagram switching between Amalfi coast and a man

Old Man and Fire Anagram
Anagram switching between old man and campfire

Diamond and Mountain Anagram
Anagram switching between a diamond and a mountain
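
Each anagram comes from averaging two noise estimates at every denoising step: one for the upright image with the first prompt, and one for the vertically flipped image with the second prompt, flipped back before averaging. A sketch, assuming the UNet takes a prompt embedding:

import torch

def anagram_noise_estimate(unet, x_t, t, emb_upright, emb_flipped):
    eps1 = unet(x_t, t, emb_upright)  # prompt seen right-side up
    eps2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, emb_flipped), dims=[-2])  # prompt seen upside down
    return (eps1 + eps2) / 2  # the average satisfies both prompts in their respective orientations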

Hybrid Images
#

Images that reveal different content at different viewing distances:

Skull and Waterfall
Hybrid image combining skull and waterfall

Rocket and Man
Hybrid image combining rocket and man

Cat and Tree with Hat
Hybrid image combining a cat and a tree
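
Hybrid images combine two noise estimates in the frequency domain: low frequencies come from the prompt meant to be seen from far away, high frequencies from the prompt meant to be seen up close. A sketch using a Gaussian blur as the low-pass filter (the kernel size and sigma are illustrative):

import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, emb_far, emb_near, kernel_size=33, sigma=2.0):
    eps_far = unet(x_t, t, emb_far)    # dominates at a distance
    eps_near = unet(x_t, t, emb_near)  # dominates up close
    lowpass = TF.gaussian_blur(eps_far, kernel_size=kernel_size, sigma=sigma)
    highpass = eps_near - TF.gaussian_blur(eps_near, kernel_size=kernel_size, sigma=sigma)
    return lowpass + highpass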

Technical Implementation Notes
#

Throughout the project, I implemented several key algorithms:

  • Forward process for noise addition
  • Iterative denoising with classifier-free guidance
  • Visual anagram generation using averaged noise estimates
  • Hybrid image creation using frequency-based noise combination

Conclusion
#

This project demonstrated the versatility and power of diffusion models in various image manipulation tasks. From basic denoising to creative applications like visual anagrams and hybrid images, the results showcase both the technical capabilities of these models and their potential for creative expression. The success of complex applications like inpainting and text-conditional image translation particularly highlights the model’s sophisticated understanding of image structure and content.

Part B: Training a Diffusion Model from Scratch
#

In this assignment, I build a diffusion model from scratch and train it to generate handwritten digits.


Part 1: Implementing the UNet
#

The UNet architecture is as follows:

UNet Architecture Diagram

Training Data
#

Training images are generated by adding variable amounts of Gaussian noise. For each training image x, the noisy image x_noisy is created as:

x_noisy = x + sigma * noise

where sigma in {0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0} and noise is sampled from a standard normal distribution.
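
In code, generating a training pair is a single line per batch; a minimal sketch:

import torch

def add_noise(x: torch.Tensor, sigma: float):
    noise = torch.randn_like(x)      # standard normal noise, same shape as the images
    return x + sigma * noise, noise  # return the noise too, since it is the training target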

Here is an example of varying noise levels:

Noise Levels

Training Procedure
#

The UNet is trained to minimize the L2 loss between the predicted noise and the actual noise added. Initially, the model is trained with a fixed sigma = 0.5.
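
A minimal sketch of the training loop, reusing the add_noise sketch above and assuming unet is the denoiser and loader yields batches of clean digit images (the hyperparameters are illustrative):

import torch
import torch.nn.functional as F

def train_denoiser(unet, loader, sigma=0.5, epochs=5, lr=1e-4, device="cuda"):
    opt = torch.optim.Adam(unet.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in loader:  # labels are unused for the unconditional denoiser
            x = x.to(device)
            x_noisy, noise = add_noise(x, sigma)     # forward process from above
            loss = F.mse_loss(unet(x_noisy), noise)  # L2 between predicted and true noise
            opt.zero_grad()
            loss.backward()
            opt.step()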

Training Results for sigma = 0.5
#

  • Loss Curve:

    Loss Curve

  • Sample Outputs:
    Epoch 1:

    Epoch 1 Results

    Epoch 5:
    Epoch 5 Results


Out-of-Distribution Testing
#

The trained model's performance degrades as the noise level sigma increases beyond the training value of 0.5:

Out-of-Distribution Results


Part 2: Training a Diffusion Model
#

In this section, I extend the UNet to condition on the timestep t, so that a single model can denoise at any noise level, enabling a full diffusion model.

Time-Conditioned UNet
#

The model is modified to include the timestep t as an additional conditioning input. t is normalized to the range [0, 1], embedded using a fully connected layer, and injected into the UNet’s architecture.

Time-Conditioned UNet Diagram
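
A minimal sketch of how the timestep can be injected, assuming a small FCBlock MLP; the resulting embedding is broadcast over the spatial dimensions and added to (or used to modulate) intermediate feature maps in the decoder. The layer sizes and activation are illustrative choices:

import torch
import torch.nn as nn

class FCBlock(nn.Module):
    # Maps the normalized timestep t in [0, 1] to a per-channel embedding.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, t):
        # t: (batch, 1) -> (batch, out_dim, 1, 1) so it broadcasts over H and W
        return self.net(t).unsqueeze(-1).unsqueeze(-1)

# e.g. inside the UNet's forward pass (illustrative):
# up1 = up1 + fc1_t(t)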

Results
#

  • Training Loss Curve:

    Loss Curve

  • Generated Samples Across Epochs:
    Epoch 1:

    Epoch 1 Samples

    Epoch 5:
    Epoch 5 Samples

    Epoch 10:
    Epoch 10 Samples

    Epoch 15:
    Epoch 15 Samples

    Epoch 20:
    Epoch 20 Samples


Class-Conditioned UNet
#

To further improve the model, I add class-conditioning, allowing the generation of specific digits. Class information is represented as a one-hot encoded vector and processed via a fully connected layer.

Example implementation of the modified architecture:

fc1_t = FCBlock(...)  # fully connected blocks for the timestep embedding
fc1_c = FCBlock(...)  # fully connected blocks for the class embedding
fc2_t = FCBlock(...)
fc2_c = FCBlock(...)

t1 = fc1_t(t)  # timestep information
c1 = fc1_c(c)  # class information (one-hot vector)
t2 = fc2_t(t)  # timestep information
c2 = fc2_c(c)  # class information

# Follow the diagram to get unflatten,
# then replace the original unflatten with the modulated version.
unflatten = c1 * unflatten + t1
# Follow the diagram to get up1,
...
# then replace the original up1 with the modulated version.
up1 = c2 * up1 + t2

Results
#

Training loss curve

Generated samples at several epochs

Here is a GIF of the diffusion process!

gif

Bonus: CS180 Mascots
#

Here are some mascots I generated for the class. The prompt was “a stuffed bear surfing while holding a camera”.

mascot
mascot
mascot