For Fall 2024, I am taking an image processing class. In this assignment, I play around with diffusion models to create some optical illusions, and then implement a diffusion model from scratch.
Part A Link to heading
Project Overview Link to heading
In this project, I explored the capabilities of diffusion models using DeepFloyd IF, a two-stage text-to-image model trained by Stability AI. The project consisted of several experiments investigating different aspects of diffusion models, from basic sampling to creating optical illusions.
Part 1: Experiments with Diffusion Models Link to heading
1.1 Sampling Parameters Investigation Link to heading
I first investigated how the number of sampling steps affects image generation quality using DeepFloyd’s two-stage process:
- 20 steps for both stages: Produced the highest quality results with good prompt adherence and minimal artifacts
- 5 steps (stage 1), 20 steps (stage 2): Generated high-quality images but with less prompt fidelity
- 20 steps (stage 1), 5 steps (stage 2): Showed good prompt following but exhibited noticeable pattern artifacts
1.2 Forward Process and Denoising Experiments Link to heading
Forward Process Link to heading
I implemented the forward process to add controlled amounts of noise to images. Using the Berkeley Campanile as a test image, I demonstrated progressive noise addition:
Progressive noise addition at t = [250, 500, 750]
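For reference, the forward process follows the standard DDPM formulation: the noisy image at timestep t is a weighted mix of the clean image and fresh Gaussian noise, with weights taken from the cumulative noise schedule. A minimal sketch, assuming `alphas_cumprod` is the scheduler's cumulative ᾱ table:

```python
import torch

def forward_process(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0): sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alphas_cumprod[t]          # cumulative product of the noise schedule at step t
    eps = torch.randn_like(x0)         # fresh standard Gaussian noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    return x_t, eps

# e.g. the three images above: [forward_process(campanile, t, alphas_cumprod)[0] for t in (250, 500, 750)]
```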
Classical Denoising (Gaussian Blur) Link to heading
Applied to different noise levels, showing limited effectiveness:
Gaussian blur denoising results at t = [250, 500, 750]
One-Step Neural Denoising Link to heading
Demonstrated significantly better results than classical methods:
One-step neural denoising results at t = [250, 500, 750]
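One-step denoising asks the UNet for its noise estimate at timestep t and then algebraically inverts the forward process above to recover a clean-image estimate. A sketch, where `eps_hat` is assumed to be the UNet's predicted noise:

```python
def one_step_denoise(x_t, t, eps_hat, alphas_cumprod):
    """Solve x_t = sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps for x0, using the predicted noise."""
    a_bar = alphas_cumprod[t]
    return (x_t - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
```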
Iterative Neural Denoising Link to heading
Progressive improvement through iterative denoising
Comparison of different denoising methods
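Iterative denoising repeats the one-step estimate along a strided list of timesteps, each iteration moving to a slightly less noisy image. The sketch below uses the standard DDPM posterior mean over strided timesteps, which is the usual formulation for this kind of loop; the variance term is simplified:

```python
import torch

def iterative_denoise_step(x_t, t, t_prev, x0_hat, alphas_cumprod):
    """Step from timestep t down to t_prev by interpolating between the current
    noisy image x_t and the one-step clean estimate x0_hat."""
    a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = a_bar_t / a_bar_prev       # effective alpha for this (strided) step
    beta = 1 - alpha
    x_prev = (a_bar_prev.sqrt() * beta / (1 - a_bar_t)) * x0_hat \
           + (alpha.sqrt() * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t
    return x_prev + beta.sqrt() * torch.randn_like(x_t)   # simplified variance term
```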
1.3 Diffusion Model Sampling and CFG Link to heading
Basic sampling results: Five samples generated using iterative denoising
With Classifier-Free Guidance (CFG): Five samples generated using CFG with γ=7
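CFG itself is a one-liner: run the UNet twice per step, once with the text prompt and once with the empty prompt, then extrapolate past the conditional estimate. Sketch (γ=7 as above):

```python
def cfg_noise_estimate(eps_cond, eps_uncond, gamma=7.0):
    """Classifier-free guidance: push the unconditional estimate toward (and past) the conditional one."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```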
1.4 Image-to-Image Translation Link to heading
Basic Translation Examples Link to heading
Original Campanile and its translation:
Car translation:
Selfie translation:
Anime Character Translations Link to heading
Pikachu:
Conan:
Doraemon:
Hand-Drawn Image Translations Link to heading
First drawing:
Second drawing:
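All of these translations use the same SDEdit-style recipe (my paraphrase; the exact noise levels vary per example): push the input image partway through the forward process, then run the iterative CFG denoising loop from that intermediate timestep, so some of the original structure survives. A sketch reusing the helpers above, where `denoise_from` is a hypothetical wrapper around the denoising loop:

```python
def image_to_image(x_orig, i_start, strided_timesteps, alphas_cumprod, denoise_from):
    """Noise the input to strided_timesteps[i_start], then denoise back to a clean image.
    The starting index controls how much of the original image survives."""
    t = strided_timesteps[i_start]
    x_t, _ = forward_process(x_orig, t, alphas_cumprod)   # forward_process from 1.2
    return denoise_from(x_t, i_start)                     # hypothetical: runs the CFG loop from i_start
```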
1.5 Inpainting Results Link to heading
Three different inpainting scenarios:
Osaka Castle inpainting
Berkeley Campanile inpainting
Selfie inpainting
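These follow the usual mask-based inpainting trick for diffusion models: after every denoising step, pixels outside the mask are overwritten with a freshly re-noised copy of the original image, so only the masked region is actually generated. A sketch, where `mask` is 1 where new content should appear and `forward_process` is the helper from 1.2:

```python
def inpaint_step(x_t, t, x_orig, mask, alphas_cumprod):
    """Keep generated content inside the mask; force the re-noised original everywhere else."""
    x_orig_t, _ = forward_process(x_orig, t, alphas_cumprod)   # original image noised to level t
    return mask * x_t + (1 - mask) * x_orig_t
```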
1.6 Text-Conditional Image Translation Link to heading
Progressive translations guided by text prompts:
Campanile gradually transformed into a rocket ship
Doraemon transformed into a realistic dog
Pikachu transformed into a pencil sketch
1.7 Creative Applications Link to heading
Visual Anagrams Link to heading
Created four sets of optical illusions that show different images when flipped (the noise-averaging step is sketched after the list):
Anagram switching between rocket and dog
Anagram switching between Amalfi coast and a man
Anagram switching between old man and campfire
Anagram switching between a diamond and mountain
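The trick behind these (also listed in the implementation notes below) is to average two noise estimates at every denoising step: one for the upright image under the first prompt, and one for the vertically flipped image under the second prompt, flipped back before averaging. A sketch, assuming a `unet(x, t, prompt)` call that returns a noise estimate:

```python
import torch

def visual_anagram_noise(unet, x_t, t, prompt_a, prompt_b):
    """Average the upright estimate (prompt_a) with the un-flipped estimate
    of the vertically flipped image (prompt_b)."""
    eps_a = unet(x_t, t, prompt_a)
    eps_b = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, prompt_b), dims=[-2])
    return (eps_a + eps_b) / 2
```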
Hybrid Images Link to heading
Images that reveal different content at different viewing distances:
Hybrid image combining skull and waterfall
Hybrid image combining rocket and man
Hybrid image combining a cat and a tree
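Hybrid images work the same way, except the two noise estimates are combined in frequency space instead of averaged: low frequencies come from one prompt, high frequencies from the other, so the low-frequency prompt dominates from far away. A sketch using a Gaussian blur as the low-pass filter (the kernel size and sigma below are illustrative):

```python
import torchvision.transforms.functional as TF

def hybrid_noise(unet, x_t, t, prompt_low, prompt_high, kernel_size=33, sigma=2.0):
    """Low-pass the estimate for prompt_low, high-pass the estimate for prompt_high, and sum."""
    eps_low = unet(x_t, t, prompt_low)
    eps_high = unet(x_t, t, prompt_high)
    low = TF.gaussian_blur(eps_low, kernel_size=kernel_size, sigma=sigma)
    high = eps_high - TF.gaussian_blur(eps_high, kernel_size=kernel_size, sigma=sigma)
    return low + high
```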
Technical Implementation Notes Link to heading
Throughout the project, I implemented several key algorithms:
- Forward process for noise addition
- Iterative denoising with classifier-free guidance
- Visual anagram generation using averaged noise estimates
- Hybrid image creation using frequency-based noise combination
Conclusion Link to heading
This project demonstrated the versatility and power of diffusion models in various image manipulation tasks. From basic denoising to creative applications like visual anagrams and hybrid images, the results showcase both the technical capabilities of these models and their potential for creative expression. The success of complex applications like inpainting and text-conditional image translation particularly highlights the model’s sophisticated understanding of image structure and content.
Part B: Training a Diffusion Model from Scratch Link to heading
In this assignment, I build a diffusion model from scratch and train it to generate handwritten digits.
Part 1: Implementing the UNet Link to heading
The UNet architecture is as follows:
Training Data Link to heading
Training images are generated by adding variable amounts of Gaussian noise. For each training image x, the noisy image x_noisy is created as:
x_noisy = x + sigma * noise
where sigma in {0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0} and noise is sampled from a standard normal distribution.
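A minimal sketch of how the noisy training pairs are built:

```python
import torch

def add_noise(x, sigma):
    """x_noisy = x + sigma * noise, with noise drawn from a standard normal."""
    return x + sigma * torch.randn_like(x)

sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
# e.g. noisy_versions = [add_noise(digit, s) for s in sigmas]
```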
Here is an example of varying noise levels:
Training Procedure Link to heading
The UNet is trained to minimize the L2 loss between the predicted noise and the actual noise added. Initially, the model is trained with a fixed sigma = 0.5.
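A sketch of the training loop under this setup (the optimizer, learning rate, and epoch count are illustrative, not my exact settings):

```python
import torch
import torch.nn as nn

def train_denoiser(unet, loader, sigma=0.5, epochs=5, lr=1e-4, device="cuda"):
    """Train the UNet to predict the added noise from the noisy image (L2 loss)."""
    opt = torch.optim.Adam(unet.parameters(), lr=lr)   # optimizer and lr are illustrative
    mse = nn.MSELoss()
    for epoch in range(epochs):
        for x, _ in loader:                            # labels are unused in this part
            x = x.to(device)
            noise = torch.randn_like(x)
            x_noisy = x + sigma * noise
            loss = mse(unet(x_noisy), noise)           # predicted vs. actual noise
            opt.zero_grad()
            loss.backward()
            opt.step()
```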
Training Results for sigma = 0.5 Link to heading
Loss Curve:
Sample Outputs:
Epoch 1:
Epoch 5:
Out-of-Distribution Testing Link to heading
The trained model’s performance degrades as the test noise level sigma moves away from the sigma = 0.5 it was trained on:
Part 2: Training a Diffusion Model Link to heading
In this section, I extend the UNet model to handle time-conditional noise levels, enabling a full diffusion model.
Time-Conditioned UNet Link to heading
The model is modified to include the timestep t as an additional conditioning input. t is normalized to the range [0, 1], embedded using a fully connected layer, and injected into the UNet’s architecture.
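A sketch of the conditioning path, assuming an FCBlock built from a small two-layer MLP (the exact block and injection points follow the architecture diagram, which I don't reproduce here):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Maps a conditioning scalar/vector to a per-channel vector (assumed two-layer MLP)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, x):
        return self.net(x)

# Inside the UNet's forward pass (illustrative):
# t_norm = (t.float() / num_timesteps).unsqueeze(-1)  # shape (B, 1), normalized to [0, 1]
# t1 = self.fc1_t(t_norm)                             # shape (B, C)
# unflatten = unflatten + t1[..., None, None]         # broadcast over the spatial dims
# up1 = up1 + self.fc2_t(t_norm)[..., None, None]
```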
Results Link to heading
Training Loss Curve:
Generated Samples Across Epochs:
Epoch 1:
Epoch 5:
Epoch 10:
Epoch 15:
Epoch 20:
Class-Conditioned UNet Link to heading
To further improve the model, I add class-conditioning, allowing the generation of specific digits. Class information is represented as a one-hot encoded vector and processed via a fully connected layer.
Example implementation of the modified architecture:
fc1_t = FCBlock(...) # fully connected blocks
fc1_c = FCBlock(...)
fc2_t = FCBlock(...)
fc2_c = FCBlock(...)
t1 = fc1_t(t) # timestep information
c1 = fc1_c(c) # class information
t2 = fc2_t(t) # timestep information
c2 = fc2_c(c) # class information
# Follow diagram to get unflatten.
# Replace the original unflatten with modulated unflatten.
unflatten = c1 * unflatten + t1
# Follow diagram to get up1.
...
# Replace the original up1 with modulated up1.
up1 = c2 * up1 + t2
Results Link to heading
Here is a gif of the diffusion process!
Bonus: CS180 Mascots Link to heading
Here are some mascots I generated for the class. The prompt was “a stuffed bear surfing while holding a camera”.