Cycle-Consistent Generative Adversarial Networks (CycleGAN)

In this blog, we will learn how to perform an image-to-image translation using CycleGAN. The image-to-image translation is a type of computer vision problem where the image is transformed from one domain to another domain. Let say edges to a photo.

An image-to-image translation generally requires a paired set of images to train a model. We can see this type of translation using conditional GANs. In those cases paired set of images is required. Take a look into paired set of images for translating edges to photo:

But for many cases, collecting paired set of training data is quite difficult. Let say we want an object transfiguration model where we want to translate an image of a horse to an image of zebra and vice versa.

For these types of tasks, even the desired output is not well defined then how we can collect a paired set of images. To solve this problem authors have proposed an approach called CycleGAN to transfer an image from X domain to Y domain without paired set of examples.

Cycle Consistent GAN

A CycleGAN captures special characteristics of one image domain and figures out how these image characteristics could be translated to another image domain, all without paired training examples. Let’s look at some unpaired training dataset.

Problem with these translations: In the case of paired training examples, the network has supervision power with corresponding label images. But in the case of the unpaired training dataset, we need to supervise at a set level where sets are X domain and Y domain. Now to train such network we need to find a mapping G: X → Y such that outputs from G(X) are indistinguishable from the Y domain. The possibility of such G mappings is infinite which does not guarantee meaningful input and output image pairs. Sometimes this type of network causes mode collapse. Mode collapse occurs when all input images map to the same output image.

Cycle Consistent: To cop up with the problem stated above the authors of the paper proposed that translation should be “Cycle Consistent”. For example, if we translate an English sentence to a french sentence and then translate back it to English sentence we should arrive at the original sentence. Similarly, in case of image if we translate image from X domain to Y domain using a mapping G and then again translate this G(X) to X using mapping F we should arrive back at the same image.

So here, CycleGAN consists of two GAN network. Both of which have a generator and a discriminator network. To train the network it has two adversarial losses and one cycle consistency loss. Let’s see its mathematical formulation.

Mathematical Formulation of CycleGAN

Let say we are having two image domains X and Y. Now our model includes two mappings G: X → Y and F: Y → X. And we are having two adversarial losses DX and DY. DX will discriminate between F(Y) and X domain images. Similarly, DY will discriminate between G(X) and Y domain images. We will also have a cycle consistency loss to prevent a contradiction between learned mapping G and F.

In above figure (a), you can see the two different mappings G and F. Also figure (b) and (c) defines the forward cycle consistency loss ( x → G(x) → F(G(x)) ≈ x ) and backward consistency loss ( y → F(y) → G(F(y)) ≈ y ) respectively.

Network Architecture

There are two different architectures each for generator and discriminator network.

Generator network follows encoder-decoder architecture with three main parts:

  1. Encoder
  2. Transformer
  3. Decoder
Source

The encoder consists of three convolutional layers. An input image is passed through this encoder network and features volumes are taken as output. The transformer consists of 6 residual blocks. It takes feature volumes generated from the encoder layer as input and gives the output. And finally, the decoder layer which works as deconvolutional layers. It takes output from the transformer and generates a new image.

A Discriminator network is a simple network. It takes image as input and predicts whether it is part of real dataset or fake generated image dataset.

Source

This discriminator network is basically a patchGAN. A patchGAN is a simple convolutional network whereas the only difference is instead of mapping the input image to single scalar output, it maps input image to an NxN array output. Every individual in NxN output maps to a patch in the input image. In cycleGAN, it maps to 70×70 patches of the image. Finally, we take the mean of this output and optimize it to find the real of fake image. The advantage of using a patchGAN over a normal GAN discriminator is, it has fewer parameters than normal discriminator also it can work with arbitrary sized images.

Loss Function

Adversarial loss is applied to both mapping G and F with adversarial losses as DX and DY. These discriminator losses makes sure that the model is trained to generate data indistinguishable from real data for both image domains.

Adversarial losses alone can not guarantee that learned function can map individual input x to desired output y. Thus we need to use cycle consistency loss also. Cycle consistency loss makes sure that the image translation cycle is able to bring back x to the original image, i.e., x → G(x) → F(G(x)) ≈ x. Now full loss can be written as follows:

L(G, F, DX, DY ) =LGAN(G, DY , X, Y ) + LGAN(F, DX, Y, X) + λLcyc(G, F)

First, two arguments in the loss function are adversarial losses for both mappings. The last parameter is for cycle consistency loss. λ here defines the importance of the respective loss. Originally authors have used it as 10.

CycleGAN has produced compelling results in many cases but it also has some limitations. That’s all for CycleGAN introduction. In the next blog we will implement this algorithm in keras.

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

Leave a Reply