Category Archives: GAN

GAN to Generate Images of Climate Change

Generative adversarial networks (GANs) are deep learning models that are used to generate images similar to real images. Images generated by GANs are both realistic and personalized. But to generate images of high quality, the network requires a huge amount of data. Their usability limits in case of a low quantity of data. In this blog, we will discuss how we can use simulated data to generate images of climate change using GANs in case we are having a scarcity of training data.

Introduction

Recently researchers with the Montreal Institute for Learning Algorithms (MILA) used generative adversarial networks to generate images of the world after the flood. They tried to show how the world would change if some calamity like a flood occurs. They hope people would work to avert future weather conditions if they can see these changes. Researchers used simulated data in combination with real images to train multimodel unsupervised image-to-image translation with some modification to architecture.

Data Collection

Real Dataset

Researchers have collected 2000 real images of flooded and non-flooded scenes taken in various weather conditions, seasons, time and viewpoints. These images were taken from publicly available datasets Mapilary and Flickr. They used this dataset and trained CycleGAN but the generated images were not sufficiently realistic. To cop up with this problem they used simulated data.

Simulated Dataset

To generate simulated dataset researchers used the Unity 3D game engine. They created different types of building in combination with urban and rural environments. As a starting point, they generated 1000 unique pairs of images with flooded and non-flooded domains.

Domain Adaption Technique

While using simulated data, authors have seen domain gap between training dataset made up of simulated data and testing data made up of real images. To bridge this gap they used domain adaption technique inspired by unsupervised semantic segmentation. This technique is being implemented by using an adversarial classifier within MUNIT architecture.

Network Architecture

Researchers have tried different image-to-image translation GANs like CycleGAN, InstaGAN, and MUNIT. CycleGAN and InstaGAN were not able to generate as realistic water texture as MUNIT was able to. Finally, they used MUNIT architecture with some modifications.

MUNIT architecture relies on two generators and two discriminators to disentangle the style and content of the images. Such that during the generation of the image only style changes and content remains the same. To make MUNIT architecture more compatible with climate change use case, researchers have made the following changes to the architecture:

  1. Restriction of Cycle COnsistency Loss: In image-to-image translation GANs, cycle consistency loss is used to make sure that translation is cycle consistent. Let say, If we translate from English to French and then translate back to English sentence, we should arrive at the original sentence. In this architecture, researchers have restricted the network’s cycle consistency loss such that this loss is only computed on those regions that are not likely to be flooded. To do this they have used the binary masks of the areas.
  2. Introduction of semantic consistency loss: This loss confirms that the semantic segmentation structure for the generated image is the same as the source image except for the areas where changes occurred like the road to the flooded area.

This approach uses both real and simulated data to perform image-to-image translation to show the effects of climate change. This approach clearly shows that simulated data helps in generating more realistic images. Researchers are still working on to improve the results of this model. They are also working to create an interactive website.

“Authors aim to develop an interactive website that, given a user-entered address, will query the Google Street View API (Anguelov et al., 2010) to get an image of the location and alter it to display a plausible image of its climate future based on the predictions of climate models. We hope this tool will help communicate effectively on climate change related risks.

Referenced Research Paper: Using Simulated data to generate images of climate change

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Information Maximizing Generative Adversarial Network (InfoGAN): Introduction and Implementation

InfoGAN is an extension to the generative adversarial networks. Generative adversarial networks are trained to generate new images that look similar to the original images. But they do not provide any control over the generation of the new images. Let’s say you have trained a GAN network to generate new faces that look similar to the given dataset. But there you will not have any control over these faces such as the colour of the eyes, hairstyles, etc. But with the help of InfoGAN, we can achieve these results because InfoGAN is able to learn the disentangled representation.

Introduction

A generative adversarial network consist of two networks – a generator and a discriminator. Both of these networks are trained in an adversarial manner. While the generator tries to generate images similar to original images, discriminator tries to differentiate between images generated by the generator and original images. Training continues until discriminator is fooled half the time by generator and generator is able to generate images similar to original images.

Control Variables

In a general GAN, a random input noise vector is given as input to the generator network which does not provide any information to the generator network i.e. in which manner outputs should be generated. While InfoGAN uses latent code along with noise vector to generate images accordingly. Input to the generator of the InfoGAN can be given in two parts:

  1. Continuous noise vector, z.
  2. Latent codes which can be both discrete and continuous, c.

Let say we have trained our InfoGAN on MNIST handwritten digit datasets. Here discrete latent codes (0-9) can be used to generate specific digits between 0-9. While continuous latent codes can be used to generate digits with varying thickness and orientation.

Mutual Information

InfoGAN stands for information maximizing GAN. To maximize information, InfoGAN uses mutual information. In information theory, the mutual information between X and Y, I(X; Y ), measures the “amount of information” learned from knowledge of random variable Y about the other random variable X. In InfoGAN there should be high mutual information between latent code c and generated images.

To maximize this mutual information, the InfoGAN model requires an extra network named as an auxiliary model. This auxiliary model shares all the weights from the discriminator network except the output layer. As the discriminator network has an output layer which predicts the given input image is real or fake, the auxiliary network predicts the latent codes.

So the InfoGAN will consist of three networks – Generator, Discriminator, and auxiliary network. Both the discriminator and auxiliary networks are used to improve the generator network. Here, the generation of real looking images by generator network is regularized by the discriminator network and maximization of mutual information is regularized by the auxiliary network.

Implementation

In this blog, we will implement InfoGAN using MNIST handwritten digit dataset. To maximize the information we will only use discrete codes to generate particular digits. In addition to this, you can also use two continuous variables to define the rotation and thickness of the generated digits.

Imports and Initialization

Generator Network

Input to the generator network consists of shape (110, 1), where 100 is the noise vector size and 10 is the latent code size. Here latent codes are one-hot encoded discrete number between 0-9. I have used deconvolutional layers to upsample and finally produce the shape of (28,28,1). Batch normalization is used to improve the quality of the trained network and for stabilization.

Discriminator and Auxiliary Network

As I have already told that auxiliary network shares all the weights of the discriminator network except the output layer there is no need to create two separate functions for this. Networks take images of shape (28, 28, 1) as input. convolutional, batch normalization and pooling layers are used to create the network. The output shape of the discriminator network is 1 as it only predicts the input image is real or fake. But the output shape of the auxiliary network is 10 as it predicts latent code.

Combined Model

A combined model is created to train the generator network. Here we do discriminator network as non-trainable as discriminator network is trained separately. The combined model takes random noise and latent code as input. This input is fed to the generator network and the generated image is fed to both discriminator and auxiliary network.

Training InfoGAN

Training a GAN model is always a difficult task. A careful hyperparameter tuning is always required. We will use the following steps to train the InfoGAN model.

  1. Normalize the input images from the MNIST dataset.
  2. Train the discriminator model using real images from the MNIST dataset.
  3. Train the discriminator model using real images and corresponding labels.
  4. Train the discriminator model using fake images generated from the generator network.
  5. Train the auxiliary network using fake images generated from the generator and random latent codes.
  6. Train the generator network using a combined model without training the discriminator.
  7. Repeat the steps from 2-6 for some iterations. I have trained it for 60000 iterations.

Generation

Now we will generate images from the trained gan model. The generator will be provided with random noise and one hot encoded input digit between 0-9 whichever digit we want to generate.

Here are the generated results from the model:

Referenced Research Paper: InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

An Introduction To The Progressive Growing of GANs

Generative adversarial networks are famous for generating images. But generating images with high resolution was quite difficult until the introduction of a new training methodology known as the progressive growing of GANs. Progressive growing GANs architecture was proposed by NVIDIA in the paper published in 2017 titled as ” PROGRESSIVE GROWING OF GANS FOR IMPROVED QUALITY, STABILITY, AND VARIATION“. This architecture starts with low-resolution images such as 4×4 size and then add up the layers progressively to generate images of high resolution such as 1024×1024.

Traditional GANs were facing some real issues to generate images of high quality. Here I am listing some of the major problems:

  1. Discriminator will be easily able to differentiate b/w real and fake if generated images are large.
  2. Generating such high-quality images also requires large GPU memory due to higher computational cost.
  3. Due to the high memory requirement if we take less batch size it will also make an unstable GAN model.
  4. It was also difficult to produce both large and fine detailed images.

In contrast to these difficulties, progressive growing of GANs removed some obstacles coming in between creating high-quality images. Some of them are:

  1. It reduces training time.
  2. The model becomes more stable since we can train it with a mini-batch of efficient size.

Generally, a generative adversarial network consists of two network generator and discriminator. The generator takes a latent vector as input and produces a generated image. And discriminator discriminates between these generated images with original as real vs fake. Training of this model will proceed until images generated from the generator are not able to fool the discriminator half the time. Similarly, progressive GAN architecture consists of both generator and discriminator networks where both networks are a mirror image of each other.

The Network Architecture

Both generator and discriminator starts with very small image 4×4. Original images are transformed into 4×4 size to train the model. Since these images are quite small training will be fast. Once we have fed enough 4×4 images to the discriminator network, we will progressively add up new layers for an 8×8 size image and similarly 16×16 until we reach image resolution to 1024×1024. Nearest neighbor interpolation and average pooling were used for doubling and halving the size of the image. The transition from a 4×4 network to an 8×8 network was done smoothly by fading in new layers.

1. Fading in new Layer

We will see this fading by using an example of the transition from 16×16 to 32×32 resolution images.

In this example, the current resolution of the image is 16×16. Firstly the model will be trained on 16×16 images. The original images are transformed into 16×16 size for training. After training it on a sufficient number of images we will progressively add new layers. In generator nearest neighbor filtering is used to upsample the image while in discriminator average pooling is applied to downsample the image size.

Now to increase the network progressively a residual block for 32×32 resolution is added. During training, this new block is not directly added but faded in. This block consists of two convolution layers and 1 upsampling layer in the generator network. While two convolution layer and one average pooling layer in discriminator network. The new block is multiplied with α and the previous(16×16) is multiplied with (1-α). This α value linearly increases from 0 to 1. Even after fading in of new layers all previous layers in the model will remain trainable.

Similarly, if you want to produce an image of higher resolution more layers will be added progressively. 1×1 convolution layer is added to the last layer in the generator to convert into an RGB image. Similarly, a 1×1 convolution layer is added at top of the discriminator network to get from the RGB image (or generated image).

During the training of progressive GAN, the network starts from 4×4 size and adds up layer progressively to reach the size of 1024×1024. Leaky relu is used for training the model. To train the model it took 4 days on 8 Tesla V100 GPUs.

Source

2. Minibatch Standard Deviation

Generative adversarial networks has a tendency to capture only little variation from training data. Sometimes all input noise vectors generate similar looking images. This problem is also known as ‘mode collapse’. To add a little variation to generated images, authors of the progressive gans have used minibatch standard deviation.

Here standard deviation of each feature in the activation map is calculated and then averaged over the minibatch. Through this new activation, maps are created. These new activation maps are added at the end of the discriminator network.

3. Equalized Learning Rate

In this progressive GAN architecture, authors have not initialized weights carefully. But they are scaling weights dynamically at run time. Here wˆi = wi/c, where wi are the weights and c is the per-layer normalization constant. In general, due to modern initializer, some parameters have larger dynamic range which causes them to converge later than some other parameters. This can cause both low and high learning rates at the same time. But the equalized learning rate ensures the learning rate the same for all weight parameters.

4. Pixel-wise Feature Vector Normalization in Generator

Generally in the generative adversarial network, batch normalization is used after the convolutional layer. But here in progressive GAN, feature vector in each pixel is normalized to unit length after the convolution layer. Also, this normalization is done only in the generator network, not in discriminator network. This technique prevents the escalation of signal magnitude effectively.

This new architecture with some interesting idea of minibatch standard deviation, equalized learning rate, fading in a new layer, and pixel-wise normalization has shown very promising results. With the help of progressively growing of GAN, the model is able to generate a high-quality image. Also, the training is quite stable. This GAN is able to generate high-resolution images with photo-realistic synthetic images.

Referenced Research Paper: Progressive Growing of GANs for Improved Quality, Stability, and Variation

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Style Generative Adversarial Network (StyleGAN)

Generative adversarial network( GAN ) generates synthetic images that are indistinguishable from authentic images. A GAN network consists of a generator network and a discriminator network. Generator network tries to generate new images from a noise vector and discriminator network discriminate these generated images from the original dataset. While training the GAN model, the generator network tries to fool the discriminator and discriminator to improve itself to differentiate between real and fake images. This training will continue until the discriminator model is fooled half the time and the generator is not able to generate data similar to original data distribution.

Since the introduction of generative adversarial networks in 2014, there has been many improvements in its architecture. Deep convolutional GAN, semisupervised GAN, conditional GAN, CycleGAN and many more. These variants of GAN mainly focuses on improving the discriminator architecture and the generator model continues to operate as the black box.

The style generative adversarial network proposed an alternative generator architecture that can control the specific features of the output image such as pose, identity, hairs, freckles( when trained on face dataset ) even without compromising the image quality.

Baseline Architecture

The baseline architecture for StyleGAN is taken from another recently introduced GAN variant: Progressive GAN. In progressive GAN, both generator and discriminator grow progressively: starting from low resolution, It adds up layers to the model which can extract very fine details. In progressive GAN images start from 4×4 and generate images up to 1024×1024 size. This progressively growing architecture speeds up and stabilizes the training process which helps in generating such high-quality images.

StyleGAN Architecture

Progressive GAN was able to generate high-quality images but to control the specific features of the generated image was difficult with its architecture. To control the features of the output image some changes were made into Progressive GAN’s generator architecture and StyleGAN was created. Here is the architecture of the generator for the StyleGAN.

Along with the generator’s architecture, the above figure also differentiates between a traditional generator network and a Style-based generator network. To develop StyleGAN’s generator network, there are some modifications done in the progressive GAN. We will discuss these modifications one by one.

1. Removal of Traditional Input Layer

In traditional generator networks, a latent vector is provided through an input layer. This latent vector must follow the probability density of training data which ca leads to some degree of entanglement. Let’s say if training data consist of one type of image greater than other variations, then it can lead to producing images with features more related to that large type of data. So instead of a traditional input layer, the synthesis network( generator network) starts with a 4 × 4 × 512 constant tensor.

2. Mapping Network and AdaIN

Mapping network embeds the input latent code to intermediate latent space which can be used as style and incorporated at each block of synthesis network. As you can see in the above generator’s architecture, latent code is fed to 8 fully connected layers and an intermediate latent space W is generated.

This intermediate latent space W is passed through a convolutional layer “A” (shown in the architecture) and specializes in styles ( y = ( ys , yb )) to transform and incorporate into each block of the generator network. To incorporate this into each block of the generator network, first, the feature maps (xi) from each block are normalized separately and then scaled and biased using corresponding styles. This is also known as adaptive instance normalization (AdaIN).

This AdaIN operation is added to each block of the generator network which helps in deciding the features in the output layer.

3. Bilinear Upsampling

This generator network grows progressively. Usually upsampling in a generator network one uses transposed convolutional network. But here in StyleGAN, it uses bilinear upsampling to upsample the image instead of using the transposed convolution layer.

4. Noise Layers

As you can see in the architecture of the StyleGAN, noise layers are added after each block of the generator network( synthesis network ). This noise consists of uncorrelated Gaussian noise which is first broadcasted using a layer “B” to the shape of feature maps from each convolutional block. Using this addition of noise StyleGAN can add stochastic variations to the output.

There are many stochastic features in the human face like hairs, stubbles, freckles, or skin pores. In traditional generator, there was only a single source of noise vector to add these stochastic variations to the output which was not quite fruitful. But with adding noise at each block of synthesis network in the generator architecture make sure that it only affects stochastic aspects of the face.

5. Style Mixing

This is basically a regularization technique. During training, images are generated using two latent codes. It means two latent codes z1 and z2 are taken to produce w1 and w2 styles using a mapping network. In the Synthesis network a split point is selected and w1 style is applied up to that point and w2 style is applied after that point and the network is trained in this way.

In the synthesis network, these styles are added at each block. Due to this network can assume that these adjacent styles are correlated. But style mixing can prevent the network from assuming these adjacent styles are correlated.

Source

These were the basic changes made in baseline architecture to improve it and create a StyleGAN architecture. Other things like, generator architecture, mini-batch sizes, Adam hyperparameters and moving an exponential average of the generator are the same as baseline architecture.

Summary

StyleGAN has proven to be promising at producing high-quality realistic images also gives control to generate images with particular features. It was clearly seen that traditional generators lag far behind than this improved generator network. Concepts like mapping network and AdaIN can really be very helpful in GAN architecture and other research work.

Referenced Research Paper: 1. A Style-Based Generator Architecture for Generative Adversarial Networks 2. Progressive Growing of GANs for Improved Quality, Stability, and Variation

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Implementation of CycleGAN for Image-to-image Translation

CycleGAN is a variant of a generative adversarial network and was introduced to perform image translation from domain X to domain Y without using a paired set of training examples. In the previous blog, I have already described CycleGAN in detail. In this blog, we will implement CycleGAN to translate apple images to orange images and vice-versa with the help of Keras library. Here are some recommended blogs that you should refer before implementing CycleGAN:

  1. Cycle-Consistent Generative Adversarial Networks (CycleGAN)
  2. Image to Image Translation Using Conditional GAN
  3. Implementation of Image-to-image translation using conditional GAN

Load the Dataset And Preprocess

CycleGAN does not require any paired dataset as compared to other image translation algorithms. Hence here we will use two sets of datasets. One consists of apple images and the other consists of orange images. Both the datasets are not paired with each other. Here are some images from the dataset:

You can download the dataset from this link. Or run the following command from your terminal.

Dataset consists of four folders: trainA, trainB, testA, and testB. ‘A’ dataset consists of apple images and the ‘B’ dataset consist of orange images. Training set consists of approx 1000 images for each type and the test set consists of approx 200 images corresponding to each type.

So, let’s first import all the required libraries:

Dataset is a little preprocessed as it contains all images of equal size (256, 256, 3). Other preprocessing steps that we are going to use are normalization and random flipping. Here we are normalizing every image between -1 to 1 and randomly flipping horizontally. Here is the code:

Now load the training images from the directory into a list.

Build the Generator

The network architecture that I have used is very similar to the architecture used in image-to-image translation with conditional GAN. The major difference is the loss function. In CycleGAN two more losses have been introduced. One is cycle consistency loss and the other is identity loss.

Here generator network is a U-net architecture. This U-net architecture consists of the encoder-decoder model with a skip connection between encoder and decoder. Here we will use two generator networks. One will translate from apple to orange (G: X -> Y) and the other will translate from orange to apple (F: Y -> X). Each generator network is consists of encoder and decoder. Each encoder block is consist of three layers (Conv -> BatchNorm -> Leakyrelu). And each block in decoder network is consist of four layers (Transposed Conv -> BatchNorm -> Dropout -> Relu). The generator will take an image as input and outputs a generated image. Both images will have a size of (256, 256, 3). Here is the code:

Build the Discriminator

Discriminator network is a patchGAN pretty similar to the one used in the code for image-to-image translation with conditional GAN. Here two discriminators will be used. One discriminator will discriminate between images generated by generator A and orange images. And another discriminator is used to discriminate between image generated by generator B and apple images.

This patchGAN is nothing but a convolution network. The difference between patchGAN and normal convolution network is that instead of producing output as single scalar vector it generates an NxN array. This NxN array maps to the patch from the input images. And then takes an average to classify the whole image as real or fake.

Combined Network

Now we will create a combined network to train the generator model. Here both discriminators will be non-trainable. To train the generator network we will also use cycle consistency loss and identity loss.

Cycle consistency says that if we translate an English sentence to a french sentence and then translate back it to English sentence we should arrive at the original sentence. To calculate the cycle consistency loss first pass the input image A to generator A and then pass the predicted output to the generator B. Now calculate the loss between image generated from generator B and input image B. Same goes while taking image B as input to the generator B.

In case of identity loss, If we are passing image from domain A to generator A and trying to generate image looking similar to image from domain B then identity loss makes sure that even if we pass image from domain B to generator A it should generate image from domain B. Here is the code for combined model.

Loss, Optimizer and Compile the Models

Here we are using mse loss for the discriminator networks and mae loss for the generator network. Optimizer use here is Adam. The batch size for the network is 1 and the total number of epochs is 200.

Train the Network

  1. Generate image from generator A using image from domain A, Similarly generate an image from generator B using image from domain B.
  2. Train discriminator A on batch using images from domain A and images generated from generator B as real and fake image respectively.
  3. Train discriminator B on batch using images from domain B and images generated from generator A as real and fake image respectively.
  4. Train generator on batch using the combined model.
  5. Repeat steps from 1 to 4 for every image in the training dataset and then repeat this process for 200 epochs.

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Implementation of Image-to-image translation using conditional GAN

In the previous blog, we have learned what is an image-to-image translation. Also, we discussed how it can be performed using conditional GAN. Conditional GAN is a type of generative adversarial network where discriminator and generator networks are conditioned on some sort of auxiliary information. In image-to-image translation using conditional GAN, we take an image as a piece of auxiliary information. With the help of this information, the generator tries to generate a new image. Let’s say we want to translate the edge image of a shoe to a real looking image of a shoe. Here we can condition our GAN with the edge image.

To know more about conditional GAN and its implementation from scratch, you can read these blog:

  1. Conditional Generative Adversarial Networks (CGAN): Introduction and Implementation
  2. Image to Image Translation Using Conditional GAN

Next, in this blog, we will implement image-to-image translation from scratch using Keras functional API.

Dataset and Preprocessing

To implement an image-to-image translation model using conditional GAN, we need a paired dataset as shown in the below image.

Center for Machine Perception (CMP) at the Czech Technical University in Prague provides rich source of the paired dataset for image-to-image translation which we can use here for our model. In this blog, we will use edges to shoe dataset provided by this link. This dataset consists of a train and validation set. The training set is consist of 49825 images and validation set is consist of 200 images. This dataset consist of some preprocessed images which contains edge and shoe in a single image as shown below:

These images have the size of (256, 512, 3) where 256 is the height, 512 is the width and the number of channels is 3. Now to bifurcate this image into input and output image, we can just slice this image from mid. After segregating we also need to normalize the image. These images consist of values b/w 0 to 255 and to make training faster and reducing the chances of getting stuck in local minima we need to normalize these images. we will normalize these images between -1 to 1. Here is the code to preprocess the image.

In the preprocessing step we have only used the normalization technique. To preprocess the images we can also do some random jittering and random mirroring as mentioned in the paper. To perform random jittering you just need to upscale the image to 286×286 and then randomly crop to 256×256. To perform random mirroring you need to flip the image horizontally.

Generator Network

Generator network for this conditional GAN architecture is a modified U-net architecture. This U-net architecture consists of an encoder-decoder network with skip connections between encoder and decoder. Each encoder block is consist of three layers (Conv -> BatchNorm -> Leakyrelu). Downsampling in the encoder layer is performed using the strided convolutional layers. Each block in decoder network is consist of four layers (Transposed Conv -> BatchNorm -> Dropout -> Relu). Dropout is only applied for the first three blocks in the decoder network. The input shape for the network is (256, 256, 3). Output shape is also (256, 256, 3) which will be a generated image.

Normally in a generative adversarial network, input to a generator is a noise vector. But here we will use a combination of noise vector and edge image as input to the generator. We will take a noise vector of size 100 and then use a dense layer and then reshape it to concatenate with image input. Here is the code for the generator network. The model looks a little lengthy but don’t worry these are just repeated U-net blocks for encoder and decoder.

Discriminator Network

Here discriminator is a patchGAN. A patchGAN is basically a convolutional network where the input image is mapped to an NxN array instead of a single scalar vector. For this conditional GAN, the discriminator takes two inputs. One is edge image and the other is the shoe image. Both inputs are of shape 9256, 256, 3). The output shape of this network is (30, 30, 1). Here each 30×30 output patch classifies the 70×70 portion of the input image.

Here each block in the discriminator is consist of 3 layers (Conv -> BatchNorm -> LeakyRelu). I have used the Gaussian Blurring layer to reduce the dominance of discriminator while training. Here is the full code.

Combined Network

Now we will create a combined network to train the generator model. Firstly this network takes noise vector and edge image as input and generates a new image using a generator network. Now the output from the generator network and edge image is fed to the discriminator network to get the output. But here discriminator will be non-trainable. Here is the network code.

Training

I have used binary cross-entropy loss for the discriminator network. For the generator network, I have coupled the binary cross-entropy loss with mae loss. This is because, for image-to-image translation, the generator’s duty is not only to fool the discriminator but also to generate real-looking images. I have used Adam optimizer for both generator and discriminator but the only difference is that I have kept a low learning rate for the discriminator to make it less dominant while training. I have used a batch size of 1. Here are the steps to train the explained conditional GAN.

  1. Train the discriminator model with real output images with patch labels of values 1.
  2. Train the discriminator model with images generated from a generator with patch labels of values 0.
  3. Train the generator network using the combined model.
  4. Repeat the steps from 1 to 3 for each image in the training dataset and then repeat all this for some number of epochs.

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Cycle-Consistent Generative Adversarial Networks (CycleGAN)

In this blog, we will learn how to perform an image-to-image translation using CycleGAN. The image-to-image translation is a type of computer vision problem where the image is transformed from one domain to another domain. Let say edges to a photo.

An image-to-image translation generally requires a paired set of images to train a model. We can see this type of translation using conditional GANs. In those cases paired set of images is required. Take a look into paired set of images for translating edges to photo:

But for many cases, collecting paired set of training data is quite difficult. Let say we want an object transfiguration model where we want to translate an image of a horse to an image of zebra and vice versa.

For these types of tasks, even the desired output is not well defined then how we can collect a paired set of images. To solve this problem authors have proposed an approach called CycleGAN to transfer an image from X domain to Y domain without paired set of examples.

Cycle Consistent GAN

A CycleGAN captures special characteristics of one image domain and figures out how these image characteristics could be translated to another image domain, all without paired training examples. Let’s look at some unpaired training dataset.

Problem with these translations: In the case of paired training examples, the network has supervision power with corresponding label images. But in the case of the unpaired training dataset, we need to supervise at a set level where sets are X domain and Y domain. Now to train such network we need to find a mapping G: X → Y such that outputs from G(X) are indistinguishable from the Y domain. The possibility of such G mappings is infinite which does not guarantee meaningful input and output image pairs. Sometimes this type of network causes mode collapse. Mode collapse occurs when all input images map to the same output image.

Cycle Consistent: To cop up with the problem stated above the authors of the paper proposed that translation should be “Cycle Consistent”. For example, if we translate an English sentence to a french sentence and then translate back it to English sentence we should arrive at the original sentence. Similarly, in case of image if we translate image from X domain to Y domain using a mapping G and then again translate this G(X) to X using mapping F we should arrive back at the same image.

So here, CycleGAN consists of two GAN network. Both of which have a generator and a discriminator network. To train the network it has two adversarial losses and one cycle consistency loss. Let’s see its mathematical formulation.

Mathematical Formulation of CycleGAN

Let say we are having two image domains X and Y. Now our model includes two mappings G: X → Y and F: Y → X. And we are having two adversarial losses DX and DY. DX will discriminate between F(Y) and X domain images. Similarly, DY will discriminate between G(X) and Y domain images. We will also have a cycle consistency loss to prevent a contradiction between learned mapping G and F.

In above figure (a), you can see the two different mappings G and F. Also figure (b) and (c) defines the forward cycle consistency loss ( x → G(x) → F(G(x)) ≈ x ) and backward consistency loss ( y → F(y) → G(F(y)) ≈ y ) respectively.

Network Architecture

There are two different architectures each for generator and discriminator network.

Generator network follows encoder-decoder architecture with three main parts:

  1. Encoder
  2. Transformer
  3. Decoder
Source

The encoder consists of three convolutional layers. An input image is passed through this encoder network and features volumes are taken as output. The transformer consists of 6 residual blocks. It takes feature volumes generated from the encoder layer as input and gives the output. And finally, the decoder layer which works as deconvolutional layers. It takes output from the transformer and generates a new image.

A Discriminator network is a simple network. It takes image as input and predicts whether it is part of real dataset or fake generated image dataset.

Source

This discriminator network is basically a patchGAN. A patchGAN is a simple convolutional network whereas the only difference is instead of mapping the input image to single scalar output, it maps input image to an NxN array output. Every individual in NxN output maps to a patch in the input image. In cycleGAN, it maps to 70×70 patches of the image. Finally, we take the mean of this output and optimize it to find the real of fake image. The advantage of using a patchGAN over a normal GAN discriminator is, it has fewer parameters than normal discriminator also it can work with arbitrary sized images.

Loss Function

Adversarial loss is applied to both mapping G and F with adversarial losses as DX and DY. These discriminator losses makes sure that the model is trained to generate data indistinguishable from real data for both image domains.

Adversarial losses alone can not guarantee that learned function can map individual input x to desired output y. Thus we need to use cycle consistency loss also. Cycle consistency loss makes sure that the image translation cycle is able to bring back x to the original image, i.e., x → G(x) → F(G(x)) ≈ x. Now full loss can be written as follows:

L(G, F, DX, DY ) =LGAN(G, DY , X, Y ) + LGAN(F, DX, Y, X) + λLcyc(G, F)

First, two arguments in the loss function are adversarial losses for both mappings. The last parameter is for cycle consistency loss. λ here defines the importance of the respective loss. Originally authors have used it as 10.

CycleGAN has produced compelling results in many cases but it also has some limitations. That’s all for CycleGAN introduction. In the next blog we will implement this algorithm in keras.

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

Image to Image Translation Using Conditional GAN

The image-to-image translation is a well-known problem in the field of image processing, computer graphics, and computer vision. Some of the problems are converting labels to street scenes, labels to facades, black&white to a color photo, aerial images to maps, day to night and edges to photo. Take a look into these conversions:

Earlier each of these tasks is performed separately. But with the help of convolutional neural networks (CNNs), communities are taking big steps in this field. Because of CNN, most of the work is automatic as we train the model in an end to end fashion. But still, we need to define a loss function that tries to achieve the target we want. Most of us take the loss function lightly but this is the most important thing that you should always give your attention to when training deep learning models. For instance, if we take euclidean distance as our loss function for image-to-image translation, it would produce blurred images because it minimizes by averaging all outputs. Thus we need a meaningful loss function corresponding to each task and this is something that is always painful. This is where the generative adversarial network (GAN) comes.

GANs learn a loss that tries to classify if the output image is real or fake, while simultaneously training a generative model to minimize this loss. Blurry images will not be tolerated since they look obviously fake. Because GANs learn a loss that adapts to the data, they can be applied to a multitude of tasks that traditionally would require very different kinds of loss functions.

Now with the help of GANs, we can generate a realistic-looking image. But in image-to-image translation, we do not just want to generate a realistic-looking image but also output image should be translated from the input image. To perform this type of task we need a conditional GAN, so you must first understand this before moving forward (To know in detail about conditional GAN you can follow this blog).

In image-to-image translation with conditional GAN, the generator is provided with the input image and a noise vector both. Now generator will generate an image that is translated from the input image and indistinguishable from original data (Discriminator will be fooled). To train this model we need some paired training examples as shown below:

Network Architecture

Here the network architecture consists of two models, generator and discriminator. First, take a look into the generator model.

Generally, a generator network in GAN architecture takes noise vector as input and generates an image as output. But here input consists of both noise vector and an image. So the network will be taking image as input and producing an image as output. In these types of problems generally, an encoder-decoder model is being used.

In an encoder-decoder network, first, the input is being down-sampled till a bottleneck layer and then upsampled to generate image again. In our problem of image-to-image translation, input and output differ in surface appearance but both have the same structure. So to make this encoder-decoder network-rich, the low-level information is shared between the input and output. For this, skip connections are added which forms an U-net architecture as shown in the above figure.

Here the discriminator model is a patchGAN. A patchGAN is nothing but a conv net. The only difference is that instead of mapping an input image to a single scalar vector, it maps to an NxN array. Where each individual element in NxN array maps to a patch in the input image. Finally, averaging is done to find the full input image is real or fake.

Reason for using patchGAN: The generator model is being trained using discriminator loss and also the L1 loss. It is well known that L1 losses produce blurry images. L1 losses fail to capture high frequencies in images while in many cases they are able to capture low frequencies. Now the task for discriminator will be only to capture high frequency. By straining the model’s attention to local image patches using patchGAN, it clearly helped in capturing high frequencies in the image.

Loss Function

Generally, loss function for a conditional GAN can be stated as follows:

Here generator G tries to minimize this loss function whereas discriminator D tries to maximize it. In the paper, authors have coupled it with L1 loss function such that the generator task is to not only fool the discriminator but also to generate ground truth near looking images. So final loos function would be:

Paper has suggested that this is a really promising approach in many image-to-image translation tasks but it always requires a paired training dataset which is sometimes difficult to get. That’s all for this blog, in the next blog we will implement its application (pix2pix) using keras.

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: Image-to-Image Translation with Conditional Adversarial Networks

Creating a Deep Convolutional Generative Adversarial Networks (DCGAN)

In this tutorial, we will learn how to generate images of handwritten digits using the deep convolutional generative adversarial network.

What are GANs?

GANs are one of the most interesting ideas in deep learning today. In GANs two networks work adversarially. One is generator network which tries to generate new images which looks similar to original image dataset. Another is discriminator network which discriminates between real images (images from the dataset) and fake images (images generated from generator network).

During training, generator progressively becomes better at generating images that can not be distinguishable from real images and discriminator become more accurate at discriminating them. Training gets completed when discriminator can no longer discriminate between images generated by generator and real images.

I would recommend you to go through this blog to learn more about generative adversarial networks. Now we will implement Deep convolutional adversarial Networks using MNIST handwritten digits dataset.

Import All Libraries

Initialization

Generator Network

Generator network takes random noise as input and generates meaningful images which looks similar to real images. Inputs have a shape of vector size 100. Output images have shape of (28, 28, 1) which is same as images shape in MNIST dataset.

In generator network we use deconvolutional layers to upsample the input to image size. In convolutional layers network tries to extract some useful features while in deconvolutional layers, the network tries to add some interesting features to upsample an image. To know more about deconvolution you can read this blog. I have also added batch normalization layers to improve the quality of model and stabilizing the training process. For this network, I have used cross-entropy loss and Adam optimizer. Here is the code.

Discriminator Network

Discriminator network discriminates between real and fake images. So it is a binary classification network. This network consists of

  1. the input layer of shape (28, 28, 1),
  2. Three hidden layers of 16, 32 and 64 filters and
  3. the output layer of shape 1.

I have also used batch normalization layer after every conv layer to stabilize the network. To downsample, I have used average pooling instead of max pooling. Finally compiled the model with cross entropy loss and Adam optimizer. Here is the code.

Combined Model

After creating generator and discriminator network, we need to create a combined model of both to train the generator network. This combined model takes the random noise as input, generates images from generator and predict label from discriminator. The gradients generated from this are used to train the generator network. In this model, we do not train the discriminator network. Here is the code.

Training of GAN model:

To train a GAN network we first normalize the inputs between -1 and 1. Then we train this model for a large number of iterations using the following steps.

  1. Take random input data from MNIST normalized dataset of shape equal to half the batch size and train the discriminator network with label 1 (real images).
  2. Generate samples from generator network equal to half the batch size to train the discriminator network with label 0 (fake images).
  3. Generate the random noise of size equal to batch size and train the generator network using the combined model.
  4. Repeat steps from 1 to 3 for some number of iterations. Here I have used 30000 iterations.

Generating the new images from trained generator network

Now our model has been trained and we can discard the discriminator network and use the generator network to generate the new images. We will take random noise as input and generate the images. After generating the images we need to rescale them to show the outputs. Here is the code.

So, this was the implementation of DCGAN using MNIST dataset. In the next blogs we will learn other GAN variants.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Conditional Generative Adversarial Networks (CGAN): Introduction and Implementation

Generative adversarial networks (GANs) are trained to generate new images that look similar to original images. Let say we have trained a GAN network on MNIST digit dataset that consists of 0-9 handwritten digits. Now if we generate images from this trained GAN network, it will randomly generate images which can be any digit between 0 to 9. But if we want to generate images only for a particular digit, it will be difficult. One way is to find a mapping between random noise given as input to generator and images generated by the network. But with the variations in random input noise, it is really difficult to find the mapping. Here comes the conditional GANs.

A GAN network will be a conditional GAN if we train both the discriminator and generator conditioned on some sort of auxiliary information. This information can be class labels, black&white images, and other modalities. In this blog, we will learn how to generate images from a conditional GANs (cGAN) conditioned on the class label.

After the introduction of conditional GANs in 2014, there has been a wide range of applications developed based on this network. Some of them are:

  1. Image to Image Translation: With the use of cGAN there has been a various implementation of image to image translations like translation from day to night, translation from black and white to color, translation from sketches to color photographs, etc.


  2. Face Aging: Uses conditional GANs to generate face photographs with different ages, from younger to older.


  3. Text to Image: Inspired by the idea of conditional GANs, generates images given text explaining the image.


That’s enough for the introduction now we will implement a conditional GANs to generate handwritten digits conditioned on class labels.

Here we will use MNIST digits dataset to train this conditional GAN. This dataset consists of images of digits ranging from 0-9 and corresponding labels. Create a cgan.py file and insert the following code:

Line 1 imports all the required layers from keras. Line 2 and 3 imports Model and optimizer respectively. Line 4 imports required MNIST dataset from keras. If you haven’t done it earlier it will download the data first. Line 5 imports the numpy package.

As we have imported all the necessary packages, next we will create our cGAN architecture. To create this network, first, we will create a class and initialize all the necessary variables.

In the above code, Line 1 creates a class named as GAN. Line 2 defines an init function which is used to initialize all the required variables. Line 4 loads the data which consists of training and test data both with their labels. Line 5-9 initializes hyperparameters required for the network. Line 10-12 call the functions generator, the discriminator and combined model which we will later define in this class.

After initializing all the required variables we will next define the generator function of class GAN.

In generator we are taking two inputs, one is random noise of shape (100,) and another is class label of shape (1,) which will be an integer between 0-9. This extra input taken as class label will be our condition to GAN. During test time we will use this class label as a condition to generate images for that specific class only.

In the above code, Line 3-6 is for our input of class label. Here we have added Embedding layer to this conditional input which consists of weights and will be trained during the generator training. This embedding layer converts positive integers to a dense vector of fixed size. Here we have taken embedding of size 50. After this embedding layer we have added a dense layer and then reshaped it to make compatible during concatenation with random noise.

Line 8-9 creates an input layer for random noise and reshape it. Line 11 and 12 concatenate both the inputs after reshaping and then applied the batch norm. Batch normalization is really helpful in improving the quality of the model and stabilizing the training process.

Line 13-15 are for two upsampling layers (deconvolutional layers) with added batch normalization layer. Line 16 is an output layer with shape equals real images (28, 28, 1). Line 17, we create a generator model. Line-18 is for compiling the model where loss is cross-entropy and optimizer is Adam optimizer.

This GAN class is also consist of discriminator network which is also conditioned on class labels.

In the above code, line 3-6 are doing the same for converting class label input to embedding as we have seen in the case of generator network except for reshaping it to (28, 28, 1) instead of reshaping it to (7, 7, 1). Line 8 describes the second input layer which is an image (either real or fake). then in Line 10 we concatenate both the inputs to make it compatible with our discriminator network.

Line 11-19 is basically a combination of conv layer -> batch norm layer -> average pooling layer. Convolution layers are having filter size of 16, 32 and 64. Here we have used the average pooling layer instead of using max pooling layer as it is recommended to not use max pooling layers with GAN architectures.

Finally, from line 20-21 we first flatten the output from the previous layer and added a fully connected layer with shape 1 which is treated as output layer for our discriminator model. This model will discriminate between real and fake image. Line 22-23 we created discriminator model which takes two inputs with one output and then compiled the model with cross-entropy loss and Adam optimizer.

This was our discriminator model, now we will create a combined model which consists of both discriminator and generator to train the generator network.

In the above code, we created a combined model which takes two inputs one is random noise of shape (100, ) and another is the class label of shape (1, ). Generator model takes these two inputs and generates the new image which is then fed to the discriminator model to predict the output. Here, only the generator is being trained and the discriminator is made non-trainable.

Next we will train the whole GAN networks using these networks.

In the above code, from line 3-4, first, we first normalize the input image in the range of -1 to 1 and then reshape it to (28,28, 1). From line 9-11 we randomly select the real images and their corresponding labels equals to half the batch size. Line 13, we train the discriminator network using these real images conditioned on real class labels.

Then Line 15 we select the random labels between 0-9 of half the batch size for the input to the generator because during training we can not have the class labels for random noise to the generator. Then Line 16-17 we take random noise of shape (half_batch_size, 100) and generate the images from generator network which will be fake input images to the discriminator. Then Line 19 we train the discriminator network with these fake generated images which is conditioned on random class labels.

Finally, in line 21-22, we train our generator network using the combined model. Here we take the random noise and random class labels as input to the combined model.

We train this network for some number of iterations until our network is not able to fool the discriminator network. Finally, after training this network we can discard the discriminator network and use the generator network to generate new images conditioned on class labels.

Above code is used to test our trained cGAN. Here are the outputs generated from the network.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

References: