Featured image

For Fall 2024, I am taking an image processing class. This assignment is to play around with diffusion models and create some optical illusions, as well as implement a diffusion model from scratch.

Part A Link to heading

Project Overview Link to heading

In this project, I explored the capabilities of diffusion models using DeepFloyd IF, a two-stage text-to-image model trained by Stability AI. The project consisted of several experiments investigating different aspects of diffusion models, from basic sampling to creating optical illusions.

Part 1: Experiments with Diffusion Models Link to heading

1.1 Sampling Parameters Investigation Link to heading

I first investigated how the number of sampling steps affects image generation quality using DeepFloyd’s two-stage process:

  • 20 steps (both stages): Produced the highest quality results, with good prompt adherence and minimal artifacts
  • 5 steps (stage 1), 20 steps (stage 2): Generated high-quality images but with less prompt fidelity
  • 20 steps (stage 1), 5 steps (stage 2): Followed the prompt well but exhibited noticeable pattern artifacts
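For reference, here is a rough sketch of how the two stages can be run with different step counts through the Hugging Face diffusers pipelines. The model names and keyword arguments reflect the public diffusers API rather than the exact course notebook, so treat this as illustrative:

import torch
from diffusers import DiffusionPipeline

stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16)

prompt_embeds, negative_embeds = stage_1.encode_prompt("an oil painting of a snowy mountain village")

# Stage 1 generates a small image; stage 2 upsamples it. Each stage takes its own step count.
low_res = stage_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds,
                  num_inference_steps=20, output_type="pt").images
image = stage_2(image=low_res, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds,
                num_inference_steps=20, output_type="pil").images[0]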

1.2 Forward Process and Denoising Experiments Link to heading

Forward Process Link to heading

I implemented the forward process to add controlled amounts of noise to images. Using the Berkeley Campanile as a test image, I demonstrated progressive noise addition:

Progressive noise addition at t = [250, 500, 750]
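The forward process itself is a one-liner once the noise schedule is fixed. A minimal sketch, assuming alphas_cumprod holds the cumulative products of the schedule (the variable name is mine):

import torch

def forward(x0, t, alphas_cumprod):
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,  with eps ~ N(0, I)
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    return x_t, eps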

Classical Denoising (Gaussian Blur) Link to heading

Applied to different noise levels, showing limited effectiveness:

Gaussian blur denoising results at t = [250, 500, 750]
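The classical baseline is just a low-pass filter, so it cannot separate noise from signal and ends up blurring the content along with the noise. Something like the following, where the kernel size and sigma are illustrative:

import torchvision.transforms.functional as TF

blurred = TF.gaussian_blur(x_t, kernel_size=5, sigma=2.0)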

One-Step Neural Denoising Link to heading

Demonstrated significantly better results than classical methods:

One-step neural denoising results at t = [250, 500, 750]
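One-step denoising uses the pretrained UNet's noise estimate to solve the forward equation for the clean image directly. A sketch (the real DeepFloyd UNet also takes the text embedding, omitted here):

eps_hat = unet(x_t, t)                                        # predicted noise
a_bar = alphas_cumprod[t]
x0_hat = (x_t - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()  # estimate of the clean image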

Iterative Neural Denoising Link to heading

Progressive improvement through iterative denoising

Comparison of different denoising methods
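Instead of jumping to x0 in one step, iterative denoising walks down a strided list of timesteps, re-estimating the clean image at each step and blending it back in. A sketch of the loop, with the coefficients following the standard DDPM posterior mean and the added variance term omitted for brevity:

def iterative_denoise(x_t, strided_timesteps, unet, alphas_cumprod):
    for t, t_prev in zip(strided_timesteps[:-1], strided_timesteps[1:]):
        a_bar, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha = a_bar / a_bar_prev
        beta = 1 - alpha
        eps_hat = unet(x_t, t)                                          # predicted noise
        x0_hat = (x_t - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()    # clean-image estimate
        # Blend the clean estimate with the current noisy image to get the less-noisy image.
        x_t = (a_bar_prev.sqrt() * beta / (1 - a_bar)) * x0_hat \
            + (alpha.sqrt() * (1 - a_bar_prev) / (1 - a_bar)) * x_t
    return x_t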

1.3 Diffusion Model Sampling and CFG Link to heading

Basic sampling results: Five samples generated using iterative denoising

With Classifier-Free Guidance (CFG): Five samples generated using CFG with γ = 7
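CFG runs the UNet twice per step, once with the text prompt and once with a null (empty) prompt, and extrapolates past the conditional estimate:

eps_cond = unet(x_t, t, prompt_embeds)        # conditioned on the text prompt
eps_uncond = unet(x_t, t, null_embeds)        # conditioned on the empty prompt ""
eps_hat = eps_uncond + gamma * (eps_cond - eps_uncond)   # gamma = 7 for the samples above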

1.4 Image-to-Image Translation Link to heading

Basic Translation Examples Link to heading

Original Campanile and its translation: Original Campanile Translated Campanile

Car translation: Original Car Translated Car

Selfie translation: Original Selfie Translated Selfie

Anime Character Translations Link to heading

Pikachu: Original Pikachu Translated Pikachu

Conan: Original Conan Translated Conan

Doraemon: Original Doraemon Translated Doraemon

Hand-Drawn Image Translations Link to heading

First drawing: Hand-drawn 1 Translated Hand-drawn 1

Second drawing: Hand-drawn 2 Translated Hand-drawn 2

1.5 Inpainting Results Link to heading

Three different inpainting scenarios:

Inpainted Castle Osaka Castle inpainting

Inpainted Campanile Berkeley Campanile inpainting

Inpainted Selfie Selfie inpainting

1.6 Text-Conditional Image Translation Link to heading

Progressive translations guided by text prompts:

Campanile to Rocket Campanile gradually transformed into a rocket ship

Doraemon to Dog Doraemon transformed into a realistic dog

Pikachu to Pencil Pikachu transformed into a pencil sketch

1.7 Creative Applications Link to heading

Visual Anagrams Link to heading

Created four sets of optical illusions that show different images when flipped:

Rocket and Dog Anagram Anagram switching between rocket and dog

Coast and Man Anagram Anagram switching between Amalfi coast and a man

Old Man and Fire Anagram Anagram switching between old man and campfire

Diamond and Mountain Anagram Anagram switching between a diamond and a mountain

Hybrid Images Link to heading

Images that reveal different content at different viewing distances:

Skull and Waterfall Hybrid image combining skull and waterfall

Rocket and Man Hybrid image combining rocket and man

Cat and Tree with Hat Hybrid image combining a cat and a tree

Technical Implementation Notes Link to heading

Throughout the project, I implemented several key algorithms:

  • Forward process for noise addition
  • Iterative denoising with classifier-free guidance
  • Visual anagram generation using averaged noise estimates (sketched below)
  • Hybrid image creation using frequency-based noise combination (sketched below)
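The last two items boil down to how the noise estimates from two different prompts are combined at each denoising step. A rough sketch, reusing the unet call signature from above; embeds_a and embeds_b stand in for the two prompt embeddings:

import torch
import torchvision.transforms.functional as TF

# Visual anagram: average the estimate for prompt A with the flipped estimate for prompt B.
eps_a = unet(x_t, t, embeds_a)
eps_b = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, embeds_b), dims=[-2])
eps_anagram = (eps_a + eps_b) / 2

# Hybrid image: low frequencies from one estimate, high frequencies from the other.
low_a = TF.gaussian_blur(eps_a, kernel_size=33, sigma=2.0)
high_b = eps_b - TF.gaussian_blur(eps_b, kernel_size=33, sigma=2.0)
eps_hybrid = low_a + high_b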

Conclusion Link to heading

This project demonstrated the versatility and power of diffusion models in various image manipulation tasks. From basic denoising to creative applications like visual anagrams and hybrid images, the results showcase both the technical capabilities of these models and their potential for creative expression. The success of complex applications like inpainting and text-conditional image translation particularly highlights the model’s sophisticated understanding of image structure and content.

Part B: Training a Diffusion Model from Scratch Link to heading

In this assignment, I build a diffusion model from scratch and train it to generate handwritten digits.


Part 1: Implementing the UNet Link to heading

The UNet architecture is as follows:

UNet Architecture Diagram

Training Data Link to heading

Training images are generated by adding variable amounts of Gaussian noise. For each training image x, the noisy image x_noisy is created as:

x_noisy = x + sigma * noise

where sigma in {0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0} and noise is sampled from a standard normal distribution.

Here is an example of varying noise levels:

Noise Levels

Training Procedure Link to heading

The UNet is trained to minimize the L2 loss between the predicted noise and the actual noise added. Initially, the model is trained with a fixed sigma = 0.5.
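A condensed sketch of the training loop described above; the optimizer settings, loader, and epoch count are placeholders rather than the exact values I used:

import torch
import torch.nn as nn

optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)
criterion = nn.MSELoss()

for epoch in range(num_epochs):
    for x, _ in train_loader:                 # digit images; labels unused here
        noise = torch.randn_like(x)
        x_noisy = x + 0.5 * noise             # fixed sigma = 0.5
        pred_noise = unet(x_noisy)            # the UNet predicts the added noise
        loss = criterion(pred_noise, noise)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()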

Training Results for sigma = 0.5 Link to heading

  • Loss Curve:
    Loss Curve

  • Sample Outputs:
    Epoch 1:
    Epoch 1 Results
    Epoch 5:
    Epoch 5 Results


Out-of-Distribution Testing Link to heading

The trained model’s performance degrades as the noise level sigma moves away from the training value of 0.5:

Out-of-Distribution Results


Part 2: Training a Diffusion Model Link to heading

In this section, I extend the UNet model to handle time-conditional noise levels, enabling a full diffusion model.

Time-Conditioned UNet Link to heading

The model is modified to include the timestep t as an additional conditioning input. t is normalized to the range [0, 1], embedded using a fully connected layer, and injected into the UNet’s architecture.

Time-Conditioned UNet Diagram
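A minimal sketch of that conditioning path; the FCBlock layout and the injection point are my simplification of the course diagram, not its exact architecture:

import torch
import torch.nn as nn

class FCBlock(nn.Module):
    # Small MLP that maps a conditioning value to a feature vector.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))
    def forward(self, x):
        return self.net(x)

# Inside the UNet forward pass (schematic):
# normalize the integer timestep t to [0, 1], embed it, and add it to a feature map.
t_norm = t.float().view(-1, 1) / T                 # T = total number of diffusion steps
t_embed = fc1_t(t_norm)                            # fc1_t = FCBlock(1, D)
unflatten = unflatten + t_embed[:, :, None, None]  # broadcast over the spatial dims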

Results Link to heading

  • Training Loss Curve:
    Loss Curve

  • Generated Samples Across Epochs:
    Epoch 1:
    Epoch 1 Samples
    Epoch 5:
    Epoch 5 Samples
    Epoch 10:
    Epoch 10 Samples
    Epoch 15:
    Epoch 15 Samples
    Epoch 20:
    Epoch 20 Samples


Class-Conditioned UNet Link to heading

To further improve the model, I add class-conditioning, allowing the generation of specific digits. Class information is represented as a one-hot encoded vector and processed via a fully connected layer.

Example implementation of the modified architecture:

fc1_t = FCBlock(...) # fully connected blocks
fc1_c = FCBlock(...)
fc2_t = FCBlock(...)
fc2_c = FCBlock(...)

t1 = fc1_t(t) # timestep information
c1 = fc1_c(c) # class information
t2 = fc2_t(t) # timestep information
c2 = fc2_c(c) # class information

# Follow diagram to get unflatten.
# Replace the original unflatten with modulated unflatten.
unflatten = c1 * unflatten + t1
# Follow diagram to get up1.
...
# Replace the original up1 with modulated up1.
up1 = c2 * up1 + t2
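Here c is the one-hot class vector; assuming the integer digit labels come from the data loader, it can be built directly as:

c = torch.nn.functional.one_hot(labels, num_classes=10).float()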

Results Link to heading

Training loss curve

Generated samples from the class-conditioned model

Here is a gif of the diffusion process!

gif

Bonus: CS180 Mascots Link to heading

Here are some mascots I generated for the class. The prompt was “a stuffed bear surfing while holding a camera”.

Generated mascots