Project 5: Fun With Diffusion Models!

Part 0: Setup

I play around with some different prompts and embeddings, try out 20 vs 100 num inference steps, and use a random seed of 53. Is a happy berkeley student an oxymoron? :P 20 steps is the first row and 100 is the second The Berkeley student with greater inference steps has more high frequency details like the bricks on the ground. Only the oasis with more inference steps has water, and the color of the water is somewhat realistic. The witch with more inference steps also has more details in the hair and background.

a happy berkeley student

a photo of an oasis

an oil painting of a witch

Part 1: Sampling Loops

1.1 Implementing the Forward Process

I use torch.randn_like(image) to add noise distributed to a standard normal distribution that is the same shape as the image, and index into alphas_cumprod accordingly to get the alpha values at timestep t. By running forward(im, t), I am able to produce a noisy version of the image at timestep t, where greater values of t result in a more noisy image.

original

noise level 250

noise level 500

noise level 750

1.2 Classical Denoising

I try denoising using classical methods: with torchvision.transforms.functional.gaussian_blur to remove the higher frequency components. I use kernel_size of 10 and the default calculated sigma. This is pretty ineffective as shown below. This is why we’re moving beyond classical methods into diffusion models!

noise level 250

noise level 500

noise level 750

Gaussian blurred noise level 250

Gaussian blurred noise level 500

Gaussian blurred noise level 750

1.3 One-Step Denoising

The idea behind diffusion models is the following: it’s easy to add noise to an image, so our model aims to “reverse” this process by denoising the image. The idea behind one-step denoising is estimating the noise within an image using a pretrained diffusion model and removing it to obtain a less noisy image. We use the text embedding of “a high quality photo”. After estimating the noise, I rearrange the equation to solve for the clean image, which requires subtracting the scaled noise and dividing the result by the sqrt of alpha.

noise level 250

noise level 500

noise level 750

one step denoised noise level 250

one step denoised level 500

one step denoised level 750

1.4 Iterative Denoising

Later, using strided timesteps that are too large causes issues. But we don’t worry about that here! The idea behind strided timesteps is to save time by computing multiple steps at once. I use the strided_timesteps denoted in this problem with stride of 30. The iteratively denoised campanile seems much more clear, but also more deviated from the original. The one steps is more blurry, and the Gaussian blurred campanile still has a lot of colorful noise.

t=90

t=240

t=390

t=540

t=690

original

final iteratively denoised

one step denoised

gaussian blurred

1.5 Diffusion Model Sampling

To generate random photos, I use the prompt “a high quality photo” and generate random noise in the shape of the image torch.randn. Then, I use my iterative_denoise function to generate the following images. At the beginning, a lot of people were being generated, which freaked me out so I generated more images.

sample 1

sample 2

sample 3

sample 4

sample 5

1.6 Classifier-Free Guidance (CFG)

To get more high quality images, we generate an unconditional noise estimate in addition to a conditional noise estimate, calling UNet twice, and use a combination of these two to generate a noise estimate. You can tell that the images in this section are more high quality than the previous, and overall much more vibrant!

sample 1

sample 2

sample 3

sample 4

sample 5

1.7 Image-to-image Translation

I run iterative_denoise_cfg with different noise levels, [1, 3, 5, 7, 10, 20]. As the noise level increases, we go from the “high quality photo” closer and closer to the campanile. It is interesting to see how the general shape and background look good in the beginning, but the details get closer to the original image as we add more noise.

original image

noise 1

noise 3

noise 5

noise 7

noise 10

noise 20

original image

noise 1

noise 3

noise 5

noise 7

noise 10

noise 20

original image

noise 1

noise 3

noise 5

noise 7

noise 10

noise 20

1.7.1 Editing Hand-Drawn and Web Images

I now try using images from online or hand drawn images. It looks like the model may have included some nudity for the drawing of the guy with the spiky hair so I redacted one of the images :/ It’s interesting to see how we go from people to more mushroom shaped to my mushroom. For the dino, the evolution seems to be mostly in the colors, and it’s funny how we go from realistic people to a penguin.

spiky hair man

noise 1

noise 3 (redacted)

noise 5

noise 7

noise 10

noise 20

mushroom

noise 1

noise 3

noise 5

noise 7

noise 10

noise 20

dino

noise 1

noise 3

noise 5

noise 7

noise 10

noise 20

penguin

noise 1

noise 3

noise 5

noise 7

noise 10

noise 20

1.7.2 Inpainting

For inpainting, I make minor edits to the iterative_denoise_cfg function. I add extra code to compute the noised original at this timestep and another line to utilize the mask when constructing the new image. I tried the pumpkin image twice and it filled it in with a pumpkin twice! I have to alter the step size within strided timestpes to be 10 to achieve better results!

og campanile

mask

to replace

final image

og pumpkin

mask

to replace

final image

original image

mask

to replace

final image

1.7.3 Text-Conditional Image-to-image Translation

For this step, I use a text prompt to guide a given image to also look like a text prompt. I call iterative_denoise_cfg with a different value for the noise and the greater the noise, the closer the image is to the original image. Prompts I used: "a christmas themed dog", "a photo of an oasis", "an oil painting of a princess". It is interesting to observe how the positioning and coloring of the dog is more like a pumpkin, and how the background of the oasis resembles the oasis more. The princess and campanile interpretation is quite interesting, I'm not sure why the arms are still out though!

og image: pumpkin

dog noise 1

dog noise 3

dog noise 5

dog noise 7

dog noise 10

dog noise 20

og image: sunset

oasis noise 1

oasis noise 3

oasis noise 5

oasis noise 7

oasis noise 10

oasis noise 20

og image: campanile

princess noise 1

princess noise 3

princess noise 5

princess noise 7

princess noise 10

princess noise 20

1.8 Visual Anagrams

I create visual anagrams which look like two different prompts right side up or upside down by obtaining two noise estimates, each with different orientation and prompt. I then flip the second noise estimate which corresponds to the flipped image, and average these two noise estimates. Here’s my two results and prompts!

an oil painting of two cats

an oil painting of two chess pieces

an oil painting of an urban city

an oil painting of a witch

1.9 Hybrid Images

To create hybrid images, I use a similar approach to part 1.8. Except these two noise estimates are computed on the image right side up with different prompts. I then use torchvision.transforms.function.gaussian_blur to get the low frequency and high frequency components of the noise estimates (low by just calling gaussian_blur on the noise, and the high by subtracting the gaussian_blur value from the original noise estimate). Then, I add these two noise estimates together to get the final hybrid images!

near: an oil painting of two chess pieces

far: an oil painting of two cats

near: an oil painting of an urban city

far: an oil painting of a witch

Part B: Training a Diffusion Model

Part 1: Training a Single-Step Denoising UNet

1.1 Implementing the UNet

Alright! In this part, I build a UNet based on the architecture shown below. I build standard operations, that are pieced together to create our final architecture.

1.2 Using the UNet to Train a Denoiser

Now that our model is built, we must train it! To train our model, we use MNIST digits. To learn how to load the images from MNIST I used this video. I created a function to add noise based on a noise level for images–results are visualized below.

level 0

level 0.2

level 0.4

level 0.6

level 0.8

level 1

1.2.1 Training

While training with noise level 0.5, I add noise to each image, then use the model to predict the clean image corresponding to the noisy image. To calculate loss we take L2 loss over the clean image and the predicted clean image. We can see that epoch 5 has much more clear digits than epoch 1 which makes sense, because more training has happened by then!

loss curve

input image

epoch 1

epoch 5

input image

epoch 1

epoch 5

input image

epoch 1

epoch 5

1.2.2 Out-of-Distribution Testing

Here, we sample out-of-distribution noise levels after the model is trained. The results on the number 7 for these two examples seem mostly legible until noise levels 0.8 and 1, where the resulting image features artifacts in the 7.

level 0

level 0.2

level 0.4

level 0.6

level 0.8

level 1

denoised results

level 0

level 0.2

level 0.4

level 0.6

level 0.8

level 1

1.2.3 Denoising Pure Noise

We do an interesting experiment here, training the model with input images of pure noise. Because we’re minimizing mean squared error, we’re finding the “centroid” of all of our training images, or our average digit. This is a process of generating the average image because we start from noise. We can see that our average digit has a lot of curves, perhaps from lots of numbers having curves within them. Our image at 5 epochs looks smoother and it seems like we converge on the average image by then!

input image

epoch 1

epoch 5

input image

epoch 1

epoch 5

input image

epoch 1

epoch 5

Part 2: Training a Flow Matching Model

2.1 Adding Time Conditioning to UNet

Now, instead of predicting the noisy image from a clean image, we predict the flow, or the velocity from a noisy image to a clean image. We do this by adding time conditioning to UNet with FCBlocks. Here is a diagram showing the new time conditioned unet and the FCBlock that is used to accomplish this.

2.2 Training the UNet

Now that we have our time conditioned UNet, we have to train it! We use the following algorithm for training, where we randomly choose a training image and timestep from 0 to 1, add noise to get a noised image, and train our model to predict the flow at a certain timestep. I use a batch size of 64, an exponential learning rate decay, training for 10 epochs.

training curve

training algorithm

2.3 Sampling from the UNet

Here, I implement sampling with the following algorithm. You can see that at epoch 1 many of the digits seem like scribbles or blobs, but by epoch 10, most of the digits are legible! At first, I didn’t realize that I had to move my models to the device, and it was taking 5 hours. Later, it only took 6 minutes to train my model :’)

sampling algorithm

epoch 1

epoch 5

epoch 10

2.4 Adding Class-Conditioning to UNet

Instead of only conditioning on time, we can also add class-conditioning (on the classes for digits 0-9) to our architecture! We do this by adding 2 more FCBlocks. We create a one-hot vector for our class conditioning, and set it to 0 if we don’t want it to condition on the class. We also only use the class conditioning vector 90% of the time (or dropout 10% of the time). The dimensions here were kind of annoying but I figured it out eventually!

2.5 Training the UNet

Now we train with class conditioning as well as time conditioning. This is very similar to the last part, except we set the class to the zero vector (dropout) with p_cond.

training algorithm

training curve

2.6 Sampling from the UNet

Now, we sample from our class conditioned UNet! This is also pretty similar to the previous section, except we use classifier-free guidance to get our estimate for the next x_t. Results at epoch 1 are guessable, and results at epoch 10 are much crisper.

sampling algorithm

epoch 1

epoch 5

epoch 10

To get rid of the annoying learning rate scheduler, I first tried the average learning rate which came out to be 1e-4. The results were not quite as good, so I tried a larger learning rate (1e-3) which had results comparable to that of the scheduler. I didn’t want to do 1e-2 for fear it would be too large, as that is our starting learning rate.

epoch 1

epoch 5

epoch 10

This project helped solidify my understanding of training and sampling models from scratch! I thought it was cool to get a more technical understanding and look under a hood, after a more intuition based understanding of diffusion models presented in lecture.