Tag Archives: CNN

Image to Image Translation Using Conditional GAN

Image-to-image translation is a well-known problem in image processing, computer graphics, and computer vision. Typical tasks include converting labels to street scenes, labels to facades, black-and-white photos to color, aerial images to maps, day scenes to night, and edges to photos. Take a look at these conversions:

Earlier, each of these tasks was handled separately. With the help of convolutional neural networks (CNNs), the community has taken big steps in this field. With a CNN, most of the work is automatic because we train the model end to end. But we still need to define a loss function that captures the target we want. Many of us take the loss function lightly, yet it deserves careful attention whenever you train a deep learning model. For instance, if we take Euclidean distance as the loss function for image-to-image translation, it produces blurred images because the loss is minimized by averaging all plausible outputs. Thus we need a meaningful loss function for each task, and designing one is always painful. This is where the generative adversarial network (GAN) comes in.
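To make the averaging argument concrete, here is a minimal sketch in plain Python. The two target values are hypothetical: they stand for one pixel that could plausibly be dark or bright for the same input. The prediction with the lowest expected squared error turns out to be their average, which is neither of the plausible answers.

```python
# Two equally plausible "ground truth" values for one pixel (hypothetical numbers):
# for the same input, the pixel could be dark (0.0) or bright (1.0).
targets = [0.0, 1.0]

def expected_squared_error(prediction, targets):
    """Mean squared (Euclidean) error of one prediction against all plausible targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)

# Scan candidate predictions from 0.00 to 1.00 and keep the one with the lowest error.
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=lambda p: expected_squared_error(p, targets))

print(best)  # the minimizer is the average of the targets
```

Applied to every pixel of an image, this pull toward the mean is what shows up as blur.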

GANs learn a loss that tries to classify if the output image is real or fake, while simultaneously training a generative model to minimize this loss. Blurry images will not be tolerated since they look obviously fake. Because GANs learn a loss that adapts to the data, they can be applied to a multitude of tasks that traditionally would require very different kinds of loss functions.

With the help of GANs, we can generate realistic-looking images. But in image-to-image translation, we do not just want a realistic-looking image; the output image should also be a translation of the input image. To perform this type of task we need a conditional GAN, so you should understand that first before moving forward (to learn about conditional GANs in detail, you can follow this blog).

In image-to-image translation with a conditional GAN, the generator is given both the input image and a noise vector. The generator then produces an image that is translated from the input image and is indistinguishable from the original data (so the discriminator is fooled). To train this model we need paired training examples, as shown below:

Network Architecture

Here the network architecture consists of two models, a generator and a discriminator. First, take a look at the generator model.

Generally, the generator network in a GAN takes a noise vector as input and generates an image as output. Here, however, the input consists of both a noise vector and an image, so the network takes an image as input and produces an image as output. For this type of problem, an encoder-decoder model is generally used.

In an encoder-decoder network, the input is first down-sampled until a bottleneck layer and then up-sampled to generate an image again. In image-to-image translation, the input and output differ in surface appearance but share the same underlying structure. So, to enrich this encoder-decoder network, low-level information is shared between the input and output. For this, skip connections are added, which form a U-Net architecture as shown in the above figure.

Here the discriminator model is a PatchGAN. A PatchGAN is simply a convolutional network. The only difference is that instead of mapping an input image to a single scalar, it maps it to an N×N array, where each element corresponds to a patch of the input image. Finally, the elements are averaged to decide whether the full input image is real or fake.
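The final averaging step can be sketched with NumPy as follows. The 3×3 score map and the function name `patchgan_decision` are made up for illustration; in a real PatchGAN the scores would come from the last convolutional layer of the discriminator.

```python
import numpy as np

def patchgan_decision(patch_scores):
    """Average an N x N map of per-patch 'real' probabilities into one image-level score."""
    patch_scores = np.asarray(patch_scores, dtype=float)
    return float(patch_scores.mean())

# Hypothetical 3 x 3 discriminator output: each entry scores one patch of the input image.
scores = np.array([[0.9, 0.8, 0.7],
                   [0.6, 0.5, 0.4],
                   [0.3, 0.2, 0.1]])

image_score = patchgan_decision(scores)
print(image_score)
```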

Reason for using PatchGAN: The generator is trained using both the discriminator loss and an L1 loss. It is well known that L1 losses produce blurry images: they fail to capture high frequencies, although in many cases they capture low frequencies accurately. The discriminator therefore only needs to model high-frequency structure. Restricting the discriminator’s attention to local image patches with a PatchGAN helps it capture the high frequencies in the image.

Loss Function

Generally, the loss function for a conditional GAN can be stated as follows:
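Following the referenced paper, and writing x for the input image, y for the target image, and z for the noise vector, the conditional GAN objective is:

```latex
\mathcal{L}_{cGAN}(G, D) =
    \mathbb{E}_{x,y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big]
```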

Here the generator G tries to minimize this loss function, whereas the discriminator D tries to maximize it. In the paper, the authors couple it with an L1 loss so that the generator’s task is not only to fool the discriminator but also to generate images close to the ground truth. So the final loss function would be:
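In the referenced paper, the L1 term and the combined objective are written as below, with x the input image, y the target image, z the noise vector, and λ a weight that balances the two terms:

```latex
\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_1\big]

G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G)
```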

The paper suggests that this is a promising approach for many image-to-image translation tasks, but it always requires a paired training dataset, which is sometimes difficult to get. That’s all for this blog. In the next blog we will implement its application (pix2pix) using Keras.

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: Image-to-Image Translation with Conditional Adversarial Networks

Creating a CRNN model to recognize text in an image (Part-2)

In the previous blog, we saw how to create the training and validation datasets for our recognition model (Download and preprocess). In this blog, we will create our model architecture and train it with the preprocessed data.

You can find full code here.

Model = CNN + RNN + CTC loss

Our model consists of three parts:

  1. A convolutional neural network to extract features from the image
  2. A recurrent neural network to predict sequential output per time step
  3. A CTC loss function, the transcription layer used to predict the output for each time step

Model Architecture

Here is the model architecture that we used:

This network architecture is inspired by this paper. Let’s see the steps that we used to create the architecture:

  1. The input to our architecture is an image of height 32 and width 128.
  2. We used seven convolution layers, of which six have kernel size (3,3) and the last one has kernel size (2,2). The number of filters increases from 64 to 512 layer by layer.
  3. Two max-pooling layers of size (2,2) are added, followed by two max-pooling layers of size (2,1), to extract features with a larger width so that long texts can be predicted.
  4. We also used batch normalization layers after the fifth and sixth convolution layers, which accelerates the training process.
  5. Then we used a lambda function to squeeze the output from the last conv layer and make it compatible with the LSTM layer.
  6. Then we used two bidirectional LSTM layers, each with 128 units. This RNN layer gives an output of size (batch_size, 31, 63), where 63 is the total number of output classes including the blank character.
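To see where the 31 time steps come from, the shape arithmetic of the steps above can be traced in plain Python. This is only a sketch under assumptions that the list does not state: "same" padding for the 3×3 convolutions, "valid" padding for the final 2×2 convolution, and the particular interleaving of convolution and pooling layers shown in `layers`.

```python
def conv_out(size, kernel, padding):
    """Output length of a stride-1 convolution along one axis."""
    return size if padding == "same" else size - kernel + 1

def pool_out(size, pool):
    """Output length of a non-overlapping max-pool along one axis."""
    return size // pool

height, width = 32, 128  # input image size from step 1

# (layer type, (height param, width param), padding).
# Padding modes and the exact layer ordering are assumptions made for this sketch.
layers = [
    ("conv", (3, 3), "same"), ("pool", (2, 2), None),
    ("conv", (3, 3), "same"), ("pool", (2, 2), None),
    ("conv", (3, 3), "same"), ("conv", (3, 3), "same"), ("pool", (2, 1), None),
    ("conv", (3, 3), "same"), ("conv", (3, 3), "same"), ("pool", (2, 1), None),
    ("conv", (2, 2), "valid"),
]

for kind, (kh, kw), padding in layers:
    if kind == "conv":
        height, width = conv_out(height, kh, padding), conv_out(width, kw, padding)
    else:
        height, width = pool_out(height, kh), pool_out(width, kw)

time_steps = width  # after squeezing the height axis of size 1
print(height, width, time_steps)
```

Under these assumptions the feature map ends up with height 1 and width 31, so squeezing the height axis leaves a sequence of 31 feature vectors for the LSTM layers, which matches the (batch_size, 31, 63) output size.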

Let’s see the code for this architecture:

Loss Function

Now that we have prepared the model architecture, the next thing is to choose a loss function. In this text recognition problem, we will use the CTC loss function.

CTC loss is very helpful in text recognition problems. It saves us from annotating each time step, and it handles the problem of a single character spanning multiple time steps, which would need further processing if we did not use CTC. If you want to know more about CTC (Connectionist Temporal Classification), please follow this blog.

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

A CTC loss function requires four arguments to compute the loss: the predicted outputs, the ground truth labels, the input sequence length to the LSTM, and the ground truth label length. To supply these, we need to create a custom loss function and pass it to the model. To make it compatible with our model, we will create a model that takes these four inputs and outputs the loss. This model will be used for training; for testing we will use the model that we created earlier, “act_model”. Let’s see the code:

Compile and Train the Model

To train the model we will use the Adam optimizer. We can also use Keras callbacks to save the weights of the best model on the basis of validation loss.

In model.compile(), you can see that I have used only y_pred and ignored y_true. This is because the labels have already been passed to the model as an input.

Now train your model on 135000 training images and 15000 validation images.

Test the Model

Our model is now trained with 135000 images. Now it’s time to test the model. We cannot use our training model because it also requires labels as input, and at test time we do not have labels. So to test the model we will use the “act_model” that we created earlier, which takes only one input: the test images.

As our model predicts the probability of each class at each time step, we need a transcription function to convert these probabilities into actual text. Here we will use the CTC decoder to get the output text. Let’s see the code:
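As a rough, library-free illustration of what a greedy CTC decoder does (this is not the Keras decoder from the post), the sketch below picks the most likely class at each time step, merges consecutive repeats, and drops the blank. The two-letter alphabet, the blank index, and the probabilities are all hypothetical.

```python
def ctc_greedy_decode(probs, alphabet, blank_index):
    """Greedy CTC decoding: best class per time step, merge repeats, drop blanks."""
    best_path = [max(range(len(step)), key=step.__getitem__) for step in probs]
    decoded = []
    previous = None
    for index in best_path:
        # Emit a character only when the class changes and it is not the blank.
        if index != previous and index != blank_index:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)

alphabet = "ab"   # hypothetical character set; index 2 is the blank
blank = 2
probs = [
    [0.8, 0.1, 0.1],  # a
    [0.7, 0.2, 0.1],  # a again: merged with the previous step
    [0.1, 0.1, 0.8],  # blank: separates the two runs of "a"
    [0.6, 0.3, 0.1],  # a: a new character because a blank came before it
    [0.1, 0.8, 0.1],  # b
]

text = ctc_greedy_decode(probs, alphabet, blank)
print(text)
```

The blank between the two runs of "a" is what lets CTC output a doubled letter instead of collapsing it into one.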

Here are some results from the trained model:

Pretty good, yeah! Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Capsule Networks

Since the introduction of AlexNet in 2012, convolutional neural networks (CNNs) have been the default tool for a wide range of image problems. They perform really well in image classification, object detection, semantic segmentation, and many more tasks.

Image Classification

But are CNNs the best solution for image problems? Do they make use of all the features present in the image when predicting the output?

Problems with Convolutional Neural Networks:

  1. CNNs use pooling layers to shrink the feature maps, which reduces the parameters needed in later layers and speeds up computation. In that process they lose some useful feature information.
  2. CNNs also require a huge amount of training data; otherwise they will not give high accuracy on the test dataset.
  3. CNNs basically try to achieve “viewpoint invariance”: changing the input a little bit does not change the output. Also, CNNs do not store the relative spatial relationships between features.
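The first and third points can be seen in a toy one-dimensional max-pooling example. The inputs below are hypothetical: a single strong feature (the value 9) that moves by one position.

```python
def max_pool_1d(values, size):
    """Non-overlapping 1-D max pooling."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]

original = [0, 9, 0, 0]
shifted = [9, 0, 0, 0]   # the same feature moved one position to the left

pooled_original = max_pool_1d(original, 2)
pooled_shifted = max_pool_1d(shifted, 2)

print(pooled_original, pooled_shifted)
```

Both inputs pool to the same output, so the small shift is invisible after pooling (invariance), and the exact position of the feature inside the pooling window is lost.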

To solve these problems we need a better solution, and that is where the capsule network comes in: a network that has given an early indication that it can solve the problems associated with convolutional neural networks. Geoffrey E. Hinton and his co-authors published a paper named “Dynamic Routing Between Capsules”, in which they introduced the capsule network and the dynamic routing algorithm.

What is a Capsule Network?

A capsule is a group of neurons that uses a vector to represent an object or object part. The length of the vector represents the presence of the object, and the orientation of the vector represents its pose (size, position, orientation, etc.). A group of these capsules forms a capsule layer, and these layers together form a capsule network. It has some advantages over a CNN:

  1. A capsule network tries to achieve “equivariance”: changing the input a little bit also changes the output, but the length of the vector stays the same, so it still predicts the presence of the same object.
  2. Capsule networks also require less training data because they preserve the spatial relationships between features.
  3. Capsule networks do not use pooling layers, which removes the problem of losing useful feature information.

How Does a Capsule Network Work?

Usually in CNNs we deal with layers, i.e. one layer passes information to the subsequent layer and so on. CapsNet follows the same flow, as shown below.

The diagram above shows the network architecture used in the paper for the MNIST dataset. The initial layer uses convolution to extract low-level features from the image and passes them to a primary capsule layer.

A primary capsule layer reshapes the output of the previous convolution layer into capsules containing vectors of equal dimension. The length of each vector represents the probability that an object is present, which is why we also need a non-linear “squashing” function to bring the length of every vector between 0 and 1.
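The squashing function, as given in the referenced paper, can be written as:

```latex
v_j = \frac{\lVert s_j \rVert^{2}}{1 + \lVert s_j \rVert^{2}} \, \frac{s_j}{\lVert s_j \rVert}
```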

Here s_j is the input vector, ||s_j|| is its norm, and v_j is the output vector, which is the output of the primary capsule layer. Capsules in the next layer are generated using the dynamic routing algorithm, described below.

Routing Algorithm:

The main feature of the routing algorithm is agreement between capsules. Lower-level capsules send their values to the higher-level capsules that they agree with.

Let’s take the example of an image of a face. Suppose there are four capsules in a lower layer, representing the mouth, nose, left eye, and right eye respectively. If all four agree on the same face position, they send their values to the output-layer capsule that signals the presence of a face.

To produce the output of the routing capsules (capsules in the higher layer), the output from the lower layer (u) is first multiplied by a weight matrix W, and the result is then weighted by a coupling coefficient c. This c determines which capsules from the lower layer send their output to which capsule in the higher layer.

The coupling coefficient c is determined iteratively. The sum of all the c values for a capsule ‘i’ in the lower layer is equal to 1. This maintains the probabilistic nature of the vector, whose length represents the probability of the presence of an object. c is determined by applying a softmax to the weights b, whose initial values are set to zero.

The routing agreement is computed by updating the weights b: the scalar product between the current capsule in the higher layer and the prediction from the capsule in the lower layer is added to the previous b (shown in line 7 of the algorithm below).
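The routing steps described above can be sketched with NumPy as follows. This is an illustrative sketch, not the paper's reference code: the function names, the random prediction vectors, and the choice of three iterations are assumptions, and `u_hat` stands for the lower-layer outputs u already multiplied by W.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Shrink vectors so their length lies between 0 and 1, keeping direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, iterations=3):
    """u_hat: prediction vectors (W times u), shape (num_lower, num_higher, dim)."""
    num_lower, num_higher, _ = u_hat.shape
    b = np.zeros((num_lower, num_higher))               # routing weights start at zero
    for _ in range(iterations):
        c = softmax(b, axis=1)                          # couplings of each lower capsule sum to 1
        s = np.sum(c[:, :, None] * u_hat, axis=0)       # weighted sum for each higher capsule
        v = squash(s)                                   # higher-capsule outputs, (num_higher, dim)
        b = b + np.sum(u_hat * v[None, :, :], axis=-1)  # agreement: scalar product added to b
    return v, c

# Hypothetical sizes: 4 lower capsules, 2 higher capsules, 8-dimensional vectors.
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(4, 2, 8))
v, c = dynamic_routing(u_hat)
print(v.shape, c.sum(axis=1))
```

Each row of `c` sums to 1, and every output vector in `v` has length below 1, so its length can be read as a probability.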

Further, to improve the capsule layer’s estimation, the authors added a decoder network. The decoder tries to reconstruct the original image from the output of the digit capsule layer. It simply adds a few fully connected layers on top of the output of the 16-dimensional capsule layer.

Now we have seen the basic concepts of a capsule network. The best way to gain more in-depth knowledge about capsule networks is to implement one in code, which you can see in the next blog.

The Next Blog : Implementing Capsule Network in Keras

Referenced Research Paper: Dynamic Routing Between Capsules

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Densely Connected Convolutional Networks – DenseNet

When we see a machine learning problem related to images, the first thing that comes to mind is the CNN (convolutional neural network). Different convolutional networks such as LeNet, AlexNet, VGG16, VGG19, and ResNet are used to solve different problems, whether supervised (classification) or unsupervised (image generation). Over the years, deeper and deeper CNN architectures have been used: as problems become more complex, deeper convolutional networks are preferred. But with deeper networks, the vanishing gradient problem arises.

To solve this problem, Gao Huang et al. introduced Dense Convolutional Networks (DenseNets). DenseNets have several compelling advantages:

  1. alleviate the vanishing-gradient problem
  2. strengthen feature propagation
  3. encourage feature reuse, and substantially reduce the number of parameters.

How Does DenseNet Work?

Recent architectures like ResNet also try to solve the vanishing gradient problem. ResNet passes information from one layer to another via identity connections. In ResNet, features are combined through summation before being passed to the next layer.

DenseNet, in contrast, introduces connections from each layer to all its subsequent layers in a feed-forward fashion (as shown in the figure below). These connections use concatenation, not summation.

source: DenseNet
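The difference between summation and concatenation is easy to see in the tensor shapes. The sketch below uses NumPy with made-up feature maps of shape (height, width, channels); the sizes are only for illustration.

```python
import numpy as np

features_in = np.ones((8, 8, 16))         # hypothetical feature map entering a layer
layer_output = np.full((8, 8, 16), 2.0)   # hypothetical output of that layer

# ResNet style: element-wise sum, the channel count stays the same.
resnet_style = features_in + layer_output

# DenseNet style: concatenate along the channel axis, the channel count grows.
densenet_style = np.concatenate([features_in, layer_output], axis=-1)

print(resnet_style.shape, densenet_style.shape)
```

With summation the earlier features are blended into the new ones, while with concatenation they are kept unchanged next to the new ones for later layers to reuse.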

The ResNet architecture preserves information explicitly through identity connections, yet recent variations of ResNet show that many layers contribute very little and can in fact be randomly dropped during training. The DenseNet architecture explicitly differentiates between information that is added to the network and information that is preserved.

In DenseNet, each layer has direct access to the gradients from the loss function and to the original input signal, leading to an improved flow of information and gradients throughout the network. DenseNets also have a regularizing effect, which reduces overfitting on tasks with smaller training set sizes.

An important difference between DenseNet and existing network architectures is that DenseNet can have very narrow layers, e.g., k = 12. The paper refers to the hyperparameter k as the growth rate of the network. It means each layer in a dense block produces only k feature maps. These k feature maps are concatenated with the feature maps of the previous layers and given as input to the next layer.
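The effect of the growth rate on channel counts is simple arithmetic, sketched below in plain Python. The helper name `dense_block_channels` and the 24 input channels are hypothetical; only k = 12 comes from the text.

```python
def dense_block_channels(input_channels, growth_rate, num_layers):
    """Return the input channels seen by each layer and the block's output channels."""
    per_layer_inputs = [input_channels + growth_rate * i for i in range(num_layers)]
    output_channels = input_channels + growth_rate * num_layers
    return per_layer_inputs, output_channels

# Hypothetical block: 24 input channels, growth rate k = 12, 4 layers.
inputs, out = dense_block_channels(input_channels=24, growth_rate=12, num_layers=4)
print(inputs, out)
```

Each layer adds only k channels, so the input to a layer grows linearly with its depth in the block rather than doubling.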

DenseNet Architecture

The best way to illustrate an architecture is with code, so I have implemented the DenseNet architecture in Keras using the MNIST dataset.

A DenseNet consists of dense blocks, and each dense block consists of convolution layers. After each dense block, a transition layer is added before the next dense block (as shown in the figure below).

Every layer in a dense block is directly connected to all its subsequent layers. Consequently, each layer receives the feature maps of all preceding layers.

Each convolution layer consists of three consecutive operations: batch normalization (BN), followed by a rectified linear unit (ReLU) and a 3 × 3 convolution (Conv). Dropout can also be added, depending on your architecture requirements.

An essential part of convolutional networks is the down-sampling layers that change the size of the feature maps. To facilitate down-sampling, the DenseNet architecture divides the network into multiple densely connected dense blocks (as shown in the figure earlier).

The layers between blocks are transition layers, which perform convolution and pooling. A transition layer consists of a batch normalization layer and a 1×1 convolutional layer followed by a 2×2 average pooling layer.
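A transition layer can be sketched with NumPy as below. This is a simplified illustration, not the Keras implementation: batch normalization is left out, the 1×1 convolution is written as a matrix product over the channel axis, and the input and weights are made-up values.

```python
import numpy as np

def conv_1x1(x, weights):
    """1x1 convolution: mix channels at each pixel. x: (H, W, C_in), weights: (C_in, C_out)."""
    return x @ weights

def avg_pool_2x2(x):
    """2x2 average pooling with stride 2. Assumes H and W are even."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

# Hypothetical 4x4 feature map with 2 channels, compressed to 1 channel.
x = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
weights = np.ones((2, 1))   # illustrative weights: simply sum the two channels

y = avg_pool_2x2(conv_1x1(x, weights))
print(y.shape)
```

The 1×1 convolution changes only the number of channels, and the pooling halves the height and width, so the next dense block starts from a smaller feature map.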

DenseNets can scale naturally to hundreds of layers while exhibiting no optimization difficulties. Because of their compact internal representations and reduced feature redundancy, DenseNets may be good feature extractors for various computer vision tasks that build on convolutional features.

The full code can be found here.

Referenced research paper: Densely Connected Convolutional Networks

Hope you enjoy reading. If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.