
Implementing Capsule Network in Keras

In the last blog we saw what a capsule network is and how it can overcome the problems associated with convolutional neural networks. In this blog we will implement a capsule network in Keras.

You can find full code here.

Here we will use the MNIST handwritten digit dataset and train the capsule network to classify the digits. MNIST consists of grayscale images of size 28*28.

The capsule network architecture is similar to a convolutional neural network, except for the capsule layers. We can break the implementation of a capsule network into the following steps:

  1. Initial convolutional layer
  2. Primary capsule layer
  3. Digit capsule layer
  4. Decoder network
  5. Loss Functions
  6. Training and testing of model

Initial Convolution Layer:

First we use a convolution layer to detect low-level features of the image. It uses 256 filters of size 9*9 with stride 1 and ReLU activation. The input image is 28*28, so with no padding the output of this layer is 20*20*256.

Primary Capsule Layer:

The output of the previous layer is passed through 256 filters of size 9*9 with a stride of 2, which produces an output of size 6*6*256. This output is then reshaped into 8-dimensional vectors, giving 6*6*32 = 1152 capsules, each of which is 8-dimensional. Each capsule then passes through a non-linear function (squash) so that the length of its output vector stays between 0 and 1.
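The shape arithmetic and the squash step above can be checked with a small standard-library sketch. It is independent of the Keras listing in the full code; the helper names `conv_output_size` and `squash` are my own, and unpadded ("valid") convolutions are assumed.

```python
import math

def conv_output_size(size, kernel, stride):
    """Spatial size after an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1

def squash(vector, eps=1e-9):
    """Shrink a vector so its length lies in [0, 1) while keeping its direction."""
    squared_norm = sum(x * x for x in vector)
    norm = math.sqrt(squared_norm)
    scale = squared_norm / (1.0 + squared_norm) / (norm + eps)
    return [scale * x for x in vector]

conv1 = conv_output_size(28, kernel=9, stride=1)       # initial convolution layer
primary = conv_output_size(conv1, kernel=9, stride=2)  # primary capsule convolution
num_capsules = primary * primary * 256 // 8            # 256 channels split into 8-D capsules

# An 8-dimensional capsule with length 5 before squashing.
squashed = squash([3.0, 4.0] + [0.0] * 6)
length = math.sqrt(sum(x * x for x in squashed))

print(conv1, primary, num_capsules, round(length, 4))
```

Long vectors end up with a length close to 1 and short vectors close to 0, which is what lets the length act as a presence probability.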

Digit Capsule Layer:

The logic and algorithm used for this layer are explained in the previous blog. Here we will see what we need to do in code to implement it: we need to write a custom layer in Keras. It takes an input of size 1152*8 and produces an output of size 10*16, that is, 10 capsules, one per output class, each a 16-dimensional vector. A lambda layer then converts each of these 10 capsules into a single value, the length of its vector, to predict the output class.

Decoder Network:

To further improve the pose parameters learned by the digit capsule layer, we can add a decoder network that reconstructs the input image. The decoder is fed an input of size 10*16 (the digit capsule layer output) and reconstructs the original image of size 28*28. It consists of 3 dense layers with 512, 1024 and 784 nodes.

During training, the input to the decoder is the output of the digit capsule layer masked with the original labels. This means that every vector except the one corresponding to the correct label is multiplied by zero, so the decoder is trained only on the correct digit capsule. At test time the input to the decoder is the same digit capsule output, but masked so that only the longest vector in that layer is kept. Let’s see the code.
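The masking step can be sketched in plain Python as follows. This is only an illustration, not the Keras layer from the full code; `mask_capsules` and the toy capsule values are hypothetical.

```python
import math

def capsule_lengths(capsules):
    """Euclidean length of each capsule vector."""
    return [math.sqrt(sum(x * x for x in capsule)) for capsule in capsules]

def mask_capsules(capsules, label=None):
    """Zero every capsule except one and flatten the result for the decoder.

    Training: pass the true label. Testing: leave label as None so the
    longest capsule is kept instead.
    """
    if label is None:
        lengths = capsule_lengths(capsules)
        label = lengths.index(max(lengths))
    masked = []
    for index, capsule in enumerate(capsules):
        keep = 1.0 if index == label else 0.0
        masked.extend(keep * x for x in capsule)
    return label, masked

# Toy digit capsule output: 10 capsules with 16 dimensions each.
digit_caps = [[0.01 * (k + 1)] * 16 for k in range(10)]
digit_caps[3] = [0.2] * 16  # make the capsule for digit 3 the longest

train_label, train_input = mask_capsules(digit_caps, label=7)  # masked with the true label
test_label, test_input = mask_capsules(digit_caps)             # masked with the longest capsule
print(test_label, len(test_input))
```

In both cases the decoder receives a flat vector of 160 values in which only 16 can be non-zero.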

Loss Functions:

It uses two loss functions: a probabilistic loss (the margin loss of the paper), used for classifying the digit images, and a reconstruction loss, which is the mean squared error. Let’s look at the probabilistic loss, which is simple to understand once you see the following code.
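A minimal sketch of the probabilistic (margin) loss for a single image is given below. The constants 0.9, 0.1 and 0.5 are the values used in the referenced paper; the function name and the toy capsule lengths are my own, and the Keras version in the full code works on batched tensors instead.

```python
def margin_loss(lengths, label, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss for one example, given the length of each class capsule."""
    total = 0.0
    for k, length in enumerate(lengths):
        if k == label:
            # The correct class should have a long capsule (at least m_plus).
            total += max(0.0, m_plus - length) ** 2
        else:
            # Every other class should have a short capsule (at most m_minus).
            total += lam * max(0.0, length - m_minus) ** 2
    return total

confident = [0.05, 0.05, 0.05, 0.95, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
loss_correct = margin_loss(confident, label=3)  # prediction matches the label
loss_wrong = margin_loss(confident, label=5)    # long capsule belongs to the wrong class
print(loss_correct, round(loss_wrong, 5))
```

The loss is zero when the correct capsule is long and all others are short, and it grows when a wrong capsule is long or the correct one is short.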

Training and Testing of model:

Now we define the training and testing models and train on the MNIST digit dataset.

On the test dataset the model achieved 99.09% accuracy. Pretty good! The reconstructed images also look good. Here are the reconstructed images generated by the decoder network.

Capsule networks come with promising results but are yet to be explored thoroughly. There are various bits and pieces where they can be explored further. Research on capsule networks is still at an early stage, but it has given a clear indication that it is worth exploring.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Capsule Networks

Since the introduction of AlexNet in 2012, convolutional neural networks (CNNs) have been the go-to approach for a wide range of image problems. They perform really well in image classification, object detection, semantic segmentation and many more tasks.

Image Classification

But are CNNs the best solution for image problems? Do they make use of all the features present in the image when predicting the output?

Problems with Convolutional Neural Networks:

  1. CNNs use pooling layers to reduce the number of parameters and speed up computation. In that process they lose some useful feature information.
  2. CNNs also require a huge amount of data to train; otherwise they will not give high accuracy on the test dataset.
  3. CNNs basically try to achieve “viewpoint invariance”, meaning that changing the input a little does not change the output. Also, CNNs do not store the relative spatial relationships between features.

To solve these problems we need a better solution, and that is where capsule networks come in: a network that has given an early indication that it can solve the problems associated with convolutional neural networks. Recently, Sara Sabour, Nicholas Frosst and Geoffrey E. Hinton published a paper named “Dynamic Routing Between Capsules”, in which they introduced the capsule network and the dynamic routing algorithm.

What is a Capsule Network?

A capsule is a group of neurons that uses a vector to represent an object or object part. The length of the vector represents the presence of the object and the orientation of the vector represents its pose (size, position, orientation, etc.). A group of these capsules forms a capsule layer, and these layers together form a capsule network. It has some advantages over a CNN.

  1. A capsule network tries to achieve “equivariance”. This means that changing the input a little also changes the output, but the length of the vector stays the same, so it still predicts the presence of the same object.
  2. Capsule networks also require less training data because they preserve the spatial relationships between features.
  3. Capsule networks do not use pooling layers, which removes the problem of losing useful feature information.

How does a Capsule Network work?

In CNNs we usually deal with layers, i.e. one layer passes information to the subsequent layer and so on. CapsNet follows the same flow, as shown below.

The diagram above shows the network architecture used in the paper for the MNIST dataset. The initial layer uses convolution to extract low-level features from the image and passes them to a primary capsule layer.

A primary capsule layer reshapes the output of the previous convolution layer into capsules containing vectors of equal dimension. The length of each of these vectors represents the probability that an object is present, which is why we also need a non-linear “squashing” function to bring the length of every vector between 0 and 1.
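The squashing function from the referenced paper can be written as:

```latex
v_j = \frac{\lVert s_j \rVert^{2}}{1 + \lVert s_j \rVert^{2}} \, \frac{s_j}{\lVert s_j \rVert}
```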

Here s_j is the input vector, ||s_j|| is its norm and v_j is the output vector. This is the output of the primary capsule layer. Capsules in the next layer are generated using the dynamic routing algorithm, described next.

Routing Algorithm:

The main feature of the routing algorithm is agreement between capsules. Lower-level capsules send their values to a higher-level capsule if they agree with each other.

Let’s take the example of an image of a face. Suppose there are four capsules in a lower layer, representing the mouth, nose, left eye and right eye respectively. If all four agree on the same face position, they send their values to the output layer capsule that signals the presence of a face.

To produce the output of the routing capsules (the capsules in the higher layer), the output of the lower layer (u) is first multiplied by a weight matrix W and then weighted by a coupling coefficient c. This c determines which capsules from the lower layer send their output to which capsule in the higher layer.

The coupling coefficient c is determined iteratively. The sum of all the c for a capsule ‘i’ in the lower layer is equal to 1. This keeps the probabilistic interpretation, in which the length of a vector represents the probability that an object is present. c is obtained by applying softmax to the weights b, whose initial values are set to zero.

The routing agreement is determined by updating the weights b: the scalar product between the current capsule in the higher layer and the capsule in the lower layer is added to the previous b (shown in line 7 of the algorithm below).

To further improve the capsule layer estimation, the authors added a decoder network. The decoder tries to reconstruct the original image from the output of the digit capsule layer. It simply adds some fully connected layers on top of the output of the 16-dimensional capsule layer.

We have now seen the basic concepts of a capsule network. The best way to gain more in-depth knowledge is to implement it in code, which you can see in the next blog.
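The routing steps described above can be sketched in plain Python. This is a toy illustration under my own naming (`dynamic_routing`, `u_hat`), with the prediction vectors assumed to be already multiplied by W and three routing iterations assumed; it is not the custom Keras layer from the implementation blog.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def squash(vector, eps=1e-9):
    squared_norm = sum(x * x for x in vector)
    scale = squared_norm / (1.0 + squared_norm) / (math.sqrt(squared_norm) + eps)
    return [scale * x for x in vector]

def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the prediction of lower capsule i for higher capsule j."""
    num_lower, num_higher, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * num_higher for _ in range(num_lower)]  # routing weights start at zero
    for _ in range(iterations):
        c = [softmax(row) for row in b]  # coupling coefficients, each row sums to 1
        v = []
        for j in range(num_higher):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(num_lower))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Agreement update: add the scalar product of prediction and output to b.
        for i in range(num_lower):
            for j in range(num_higher):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v, c

# Three lower capsules agree on higher capsule 0 and disagree on higher capsule 1.
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
v, c = dynamic_routing(u_hat)
lengths = [math.sqrt(sum(x * x for x in capsule)) for capsule in v]
print([round(x, 3) for x in lengths], [round(row[0], 3) for row in c])
```

Because the predictions for higher capsule 0 agree, its coupling coefficients grow with each iteration and its output vector ends up longer than that of capsule 1.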

The Next Blog : Implementing Capsule Network in Keras

Referenced Research Paper: Dynamic Routing Between Capsules


Compression of data using Autoencoders

In the last blog, we discussed what autoencoders are. In this blog, we will learn how autoencoders can be used to compress data and reconstruct the original data.

Here I have used the MNIST dataset. First, I downloaded the MNIST dataset, which contains images of the digits 0 to 9 and has a total size of 45 MB. Let’s see the code to download the data using Python.

Since we want to compress the dataset and reconstruct it back into the original data, we first have to create a convolutional autoencoder. Let’s see the code:

From this autoencoder model, I created an encoder model and a decoder model. The encoder model compresses the data and the decoder model is used to reconstruct the original data. Then I trained the autoencoder model.

Using the encoder model we can save the compressed data into a text file, which has a size of 18 MB (much less than the original 45 MB).

The next question is how to reconstruct this compressed data when the original data is needed. The simple solution is to save the decoder model and its weights, which can then be used to reconstruct the compressed data. Let’s save the decoder model and its weights.

Finally, we have our compressed data and the decoder model. Let’s see in code how we can reconstruct the data using these two.

Above is the output from the decoder model.

It looks fascinating to compress data to a smaller size and get the same data back when we need it, but there is a real problem with this method.

The problem is that autoencoders do not generalize: they can only reconstruct images similar to those they were trained on. But with the advancement of deep learning, the day is not far away when you will use this type of compression.


Sparse Autoencoders

In the last blog we saw autoencoders and their applications. In this blog we will learn about one of their variants, sparse autoencoders.

In every autoencoder, we try to learn a compressed representation of the input. Let’s take the example of a simple autoencoder with an input vector of dimension 1000, compressed into 500 hidden units and reconstructed back into 1000 outputs. The hidden units will learn correlated features present in the input. But what if the input features are completely random? Then it will be difficult for the hidden units to learn any interesting structure in the data. In that situation, what we can do is increase the number of hidden units and add some sparsity constraints. Now the question is: what are sparsity constraints?

When a sparsity constraint is added to a hidden layer, only some units (those with large activation values) are activated and the rest are driven to zero. So even if we have a large number of hidden units (as in the above example), only some of them will fire and learn useful structure present in the data.

The simplest implementation of sparsity constraints can be done in Keras. You can simply add an activity_regularizer to a layer (see line 11) and it will do the rest.

But if you want to add sparsity constraints by writing your own function, you can follow the reference given below.
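To make the effect of such a regularizer concrete, here is a small sketch of an L1 activity penalty, which I am assuming is the kind of penalty attached through activity_regularizer. The function name, the rate of 0.01 and the toy activations are hypothetical.

```python
def l1_activity_penalty(activations, rate=0.01):
    """Penalty added to the loss: rate times the sum of absolute activations."""
    return rate * sum(abs(a) for a in activations)

# Two hidden codes: one with every unit mildly active, one with a single active unit.
dense_code = [0.5] * 8
sparse_code = [3.0] + [0.0] * 7

dense_penalty = l1_activity_penalty(dense_code)
sparse_penalty = l1_activity_penalty(sparse_code)
print(dense_penalty, sparse_penalty)
```

The penalty is added to the reconstruction loss during training, so the optimizer is pushed toward codes in which most hidden units stay at zero.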

References: Sparse Autoencoders


Is the deconvolution layer the same as a convolutional layer?

Isn’t this an interesting topic? If you have worked on image classification problems (e.g. classifying cats and dogs) or image generation problems (e.g. GANs, autoencoders), you have surely encountered convolution and deconvolution layers. But what if someone says a deconvolution layer is the same as a convolution layer?

This paper proposes an efficient subpixel convolution layer which works the same as a deconvolution layer. To understand this, let’s first understand the convolution layer, the transposed convolution layer and the subpixel convolution layer.

Convolution Layer

In every convolutional neural network, the convolution layer is the most important part. A convolution layer consists of a number of independent filters, each of which convolves with the input and produces output for the next layer. Let’s see how a filter convolves with the input.
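How a single filter slides over the input can be sketched in a few lines of plain Python. This assumes a square single-channel image, a square kernel and no padding; like deep learning libraries, it does not flip the kernel. The function name and the toy values are my own.

```python
def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over a square image (no padding, no kernel flip)."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for row in range(out_size):
        output_row = []
        for col in range(out_size):
            total = 0.0
            # Multiply the kernel with the patch under it and sum the products.
            for i in range(k):
                for j in range(k):
                    total += image[row * stride + i][col * stride + j] * kernel[i][j]
            output_row.append(total)
        output.append(output_row)
    return output

image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0],
    [0, -1],
]
feature_map = convolve2d(image, kernel)  # 3 x 3 output
print(feature_map)
```

Each output value here is a pixel minus its lower-right neighbour, so this particular filter responds to diagonal intensity changes.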

Transposed and sub pixel Convolution Layer

Transposed convolution works in the opposite direction to convolution: it restores the spatial shape of the input rather than its exact values. In a convolution layer you try to extract useful features from the input, while in a transposed convolution you try to add useful features to upscale an image. A transposed convolution has learnable weights, which are learnt using backpropagation. Let’s see how a transposed convolution works visually.

Similarly, a subpixel convolution is also used for upsampling an image. It applies fractional strides to the input (the input is padded with zero pixels in between) and outputs an upsampled image. Let’s see this visually.

An efficient sub pixel convolution Layer

In this paper the authors argue that upsampling with a deconvolution layer isn’t really necessary, and they came up with the following idea. Instead of inserting zero pixels in between the pixels of the input image, they do more convolutions on the lower-resolution image and then apply periodic shuffling to produce an upscaled image.

Source
r denotes the up scaling ratio

The authors show that a deconvolution layer with kernel size (o, i, k*r, k*r) is the same as a convolution layer with kernel size (o*r*r, i, k, k) in LR space, where the dimensions are (output channels, input channels, kernel width, kernel height). Let’s take an example of the proposed efficient subpixel convolution layer.

Source

In the above figure, the input image shape is (1, 4, 4) and the upscaling ratio (r) is 2. To get an image of size (1, 8, 8), the input image is first convolved with a kernel of size (4, 1, 2, 2), which produces an output of shape (4, 4, 4), and then periodic shuffling is applied to get the required upscaled image of shape (1, 8, 8). So instead of using a deconvolution layer with kernel size (1, 1, 4, 4), the same can be done with this efficient subpixel convolution layer.
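The periodic shuffling step of this example can be sketched in plain Python with nested lists. The channel ordering used here (channel c*r*r + i*r + j goes to sub-pixel position (i, j)) is an assumption on my part and may differ from the ordering in the referenced GitHub code.

```python
def periodic_shuffle(feature_maps, r):
    """Rearrange nested lists of shape (C*r*r, H, W) into shape (C, H*r, W*r)."""
    channels = len(feature_maps) // (r * r)
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    output = [[[0] * (width * r) for _ in range(height * r)] for _ in range(channels)]
    for c in range(channels):
        for h in range(height):
            for w in range(width):
                for i in range(r):
                    for j in range(r):
                        source = c * r * r + i * r + j  # assumed channel ordering
                        output[c][h * r + i][w * r + j] = feature_maps[source][h][w]
    return output

# Output of the (4, 1, 2, 2) convolution: shape (4, 4, 4); channel k is filled with k.
low_res = [[[channel] * 4 for _ in range(4)] for channel in range(4)]
high_res = periodic_shuffle(low_res, r=2)  # shape (1, 8, 8)

# Both kernels hold the same number of weights.
deconv_weights = 1 * 1 * 4 * 4
conv_weights = 4 * 1 * 2 * 2
print(high_res[0][0], high_res[0][1], deconv_weights == conv_weights)
```

Every 2×2 block of the upscaled image is filled from the four low-resolution channels at the same spatial position, so no zero pixels ever need to be inserted.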

Implementation

I have also implemented an autoencoder (using the MNIST dataset) with the efficient subpixel convolution layer. Let’s see the code for efficient subpixel convolution.

The periodic shuffling code above is taken from this GitHub link. I then applied the autoencoder layers to generate images. To upsample the image in the decoder layers, I first convolved the encoded images and then used periodic shuffling.

This type of subpixel convolution layer can be very helpful in problems like image generation (autoencoders, GANs) and image enhancement (super resolution). There is also more to find out about what this efficient subpixel convolution layer can offer.

By now you should have some feel for the efficient subpixel convolution layer.


Referenced Research Paper : Is the deconvolution layer the same as a convolutional layer?

Referenced GitHub Link : Subpixel

On Calibration of Modern Neural Networks

Nowadays neural networks have vast applicability and are trusted to make complex decisions in applications such as medical diagnosis, speech recognition, object recognition and optical character recognition. Thanks to more and more research in deep learning, the accuracy of neural networks has improved dramatically.

Along with the improvement in accuracy, a neural network should also indicate when it is likely to be incorrect. For example, if the confidence given by a neural network for a disease diagnosis is low, control should be passed to human doctors.

Now what is a confidence score in a neural network? It is the probability estimate produced by the network. Let’s say you are working on a multi-class classification task. After applying the softmax layer you find that a particular class has the highest probability, with a value of 0.7. It means that the network is 70% confident that this is the actual output.

Intuitively we mean that, for 100 predictions with an average confidence score of 0.8, 80 should be correctly classified. But modern neural networks are poorly calibrated. As you can see in the figure, there is a larger gap between average confidence score and accuracy for ResNet and a smaller one for LeNet.
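The gap between average confidence and accuracy can be computed directly, as in the sketch below. The confidence scores and correctness flags are made-up numbers for illustration, not results from the paper.

```python
def confidence_accuracy_gap(confidences, correct):
    """Return average confidence, accuracy and their difference."""
    average_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return average_confidence, accuracy, average_confidence - accuracy

# Highest softmax probability for five predictions and whether each was right.
confidences = [0.9, 0.8, 0.95, 0.85, 0.9]
correct = [1, 0, 1, 0, 1]

average_confidence, accuracy, gap = confidence_accuracy_gap(confidences, correct)
print(round(average_confidence, 2), accuracy, round(gap, 2))
```

A well-calibrated model has a gap close to zero; a positive gap like this one means the model is overconfident.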

Source

In the paper, the authors address the following:

  1. Which methods alleviate the poor calibration problem in neural networks.
  2. A simple and straightforward solution to reduce this problem.

Observing Miscalibration:

Some recent changes that came with the advancement of deep neural networks are responsible for miscalibration.

  1. Model Capacity: Although increasing the depth and width of neural networks may reduce classification error, the paper observes that these increases negatively affect model calibration.
  2. Batch Normalization: Batch Normalization improves training time, reduces the need for additional regularization, and can in some cases improve the accuracy of networks. It has been observed that models trained with Batch Normalization tend to be more miscalibrated.
  3. Weight Decay: It has been found that training with less weight decay has a negative impact on calibration.

Temperature Scaling:

Temperature scaling works well for calibrating computer vision models. It is the simplest extension of Platt scaling. To understand temperature scaling we will first look at Platt scaling.

Platt Scaling: This method is used for calibrating models. It uses logistic regression to return the calibrated probabilities of a model. Let’s say you are working on a multi-class classification task and have trained a network on some training data. Platt scaling takes the logits (the output of the trained network before the softmax layer, computed on the validation dataset) as input to a logistic regression model. This model is trained on the validation dataset, learns scalar parameters a, b ∈ R and outputs q = σ(az + b) as the calibrated probability (where z are the logits).

Temperature scaling is an extension of Platt scaling with a single trainable parameter T > 0 for all classes. T is called the temperature. T is trained on the validation dataset, not on the training dataset, because if we trained T during training, the network would learn to make the temperature as low as possible so that it can be very confident on the training dataset.

The temperature is applied directly before the softmax layer by dividing the logits by T (z/T), and T is then trained on the validation dataset. After adjusting the temperature on the validation dataset we obtain the trained parameter T, which we can use at test time to divide the logits before applying softmax to get calibrated probabilities. Now, let’s see some simple TensorFlow code to implement temperature scaling.

Simple techniques can effectively remedy the miscalibration phenomenon in neural networks. Temperature scaling is the simplest, fastest, and most straightforward of the methods, and surprisingly is often the most effective.
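The idea can also be sketched without TensorFlow. In the sketch below T is chosen by a simple grid search that minimizes the negative log-likelihood on validation logits; the grid search, the function names and the toy logits are my own assumptions, whereas a TensorFlow version would normally optimize T with a gradient-based optimizer.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, label in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[label])
    return total / len(labels)

def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL."""
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))

# Toy validation logits; the second example is confidently wrong.
val_logits = [[8.0, 1.0, 0.0], [7.0, 2.0, 0.0], [1.0, 6.0, 0.0], [6.0, 0.0, 1.0]]
val_labels = [0, 1, 1, 0]

candidates = [0.5 + 0.1 * step for step in range(96)]  # 0.5, 0.6, ..., 10.0
fitted_T = fit_temperature(val_logits, val_labels, candidates)

before = max(softmax(val_logits[0]))            # confidence without scaling
after = max(softmax(val_logits[0], fitted_T))   # confidence after dividing logits by T
print(round(fitted_T, 1), round(before, 3), round(after, 3))
```

Dividing by T does not change which class has the highest probability, so accuracy is unchanged; only the confidence values move.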

Referenced Research Paper : On Calibration of Modern Neural Networks   

GitHub: Temperature Scaling  


Densely Connected Convolutional Networks – DenseNet

When we see a machine learning problem related to images, the first thing that comes to mind is the CNN (convolutional neural network). Different convolutional networks like LeNet, AlexNet, VGG16, VGG19, ResNet, etc. are used to solve different problems, whether supervised (classification) or unsupervised (image generation). Over the years, deeper and deeper CNN architectures have been used: as more complex problems arise, deeper convolutional networks are preferred. But with deeper networks, the problem of vanishing gradients arises.

To solve this problem Gao Huang et al. introduced Dense Convolutional networks. DenseNets have several compelling advantages:

  1. alleviate the vanishing-gradient problem
  2. strengthen feature propagation
  3. encourage feature reuse, and substantially reduce the number of parameters.

How does DenseNet work?

Recent research such as ResNet also tries to solve the problem of vanishing gradients. ResNet passes information from one layer to another via identity connections. In ResNet, features are combined through summation before being passed into the next layer.

DenseNet, in contrast, introduces connections from each layer to all its subsequent layers in a feed-forward fashion (as shown in the figure below). These connections are made using concatenation, not summation.

source: DenseNet

The ResNet architecture preserves information explicitly through identity connections, and recent variations of ResNet show that many layers contribute very little and can in fact be randomly dropped during training. The DenseNet architecture explicitly differentiates between information that is added to the network and information that is preserved.

In DenseNet, each layer has direct access to the gradients from the loss function and to the original input signal, leading to an improved flow of information and gradients throughout the network. DenseNets also have a regularizing effect, which reduces overfitting on tasks with smaller training set sizes.

An important difference between DenseNet and existing network architectures is that DenseNet can have very narrow layers, e.g., k = 12. The paper refers to the hyperparameter k as the growth rate of the network. It means each layer in a dense block produces only k feature maps. These k feature maps are concatenated with the feature maps of the previous layers and given as input to the next layer.
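The effect of the growth rate on the number of feature maps can be worked out with a short sketch. The 24 input channels and the 4 layers per block are hypothetical values chosen for illustration.

```python
def dense_block_channels(input_channels, growth_rate, num_layers):
    """Feature maps seen by each layer of a dense block, and the block output size."""
    # Layer l receives the block input plus k feature maps from each earlier layer.
    inputs_per_layer = [input_channels + growth_rate * layer for layer in range(num_layers)]
    output_channels = input_channels + growth_rate * num_layers
    return inputs_per_layer, output_channels

inputs_per_layer, output_channels = dense_block_channels(
    input_channels=24, growth_rate=12, num_layers=4
)
print(inputs_per_layer, output_channels)
```

Each layer adds only k = 12 new feature maps, yet the later layers still see everything produced before them because of the concatenation.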

DenseNet Architecture

The best way to illustrate any architecture is with the help of code. So, I have implemented the DenseNet architecture in Keras using the MNIST dataset.

A DenseNet consists of dense blocks. Each dense block consists of convolution layers. After a dense block, a transition layer is added to proceed to the next dense block (as shown in the figure below).

Every layer in a dense block is directly connected to all its subsequent layers. Consequently, each layer receives the feature maps of all preceding layers.

Each convolution layer consists of three consecutive operations: batch normalization (BN), followed by a rectified linear unit (ReLU) and a 3 × 3 convolution (Conv). Dropout can also be added, depending on your architecture requirements.

An essential part of convolutional networks is down-sampling layers that change the size of feature maps. To facilitate down-sampling, the DenseNet architecture divides the network into multiple densely connected dense blocks (as shown in the earlier figure).

The layers between blocks are transition layers, which do convolution and pooling. A transition layer consists of a batch normalization layer and a 1×1 convolutional layer followed by a 2×2 average pooling layer.

DenseNets can scale naturally to hundreds of layers while exhibiting no optimization difficulties. Because of their compact internal representations and reduced feature redundancy, DenseNets may be good feature extractors for various computer vision tasks that build on convolutional features.

The full code can be found here.

Referenced research paper: Densely Connected Convolutional Networks


Variational Autoencoders

Variational autoencoders are an extension of autoencoders and are used as generative models. You can generate data like text, images and even music with the help of variational autoencoders.

Autoencoders are neural networks used to reconstruct their original input. To know more about autoencoders please go through this blog. They have certain applications, such as denoising and dimensionality reduction for data visualization. But apart from that, they are fairly limited.

To overcome this limitation, variational autoencoders come into play. A plain autoencoder learns a function that is not trained to generate images from a particular distribution. Also, if you try to create a generative model using autoencoders, you do not want to generate data identical to the input. You want output data with some variations that mostly looks like the input data.

Variational Autoencoder Model

A variational autoencoder has encoder and decoder parts that are mostly the same as in an autoencoder. The difference is that instead of creating a compact representation from its encoder, it learns a latent variable model. These latent variables are used to create a probability distribution from which the input for the decoder is generated. Another difference is that instead of using a mean squared or cross entropy loss function (as in autoencoders), it has its own loss function.

I will not go further into the mathematics behind it. Let’s jump into the code, which will give more understanding of variational autoencoders. To know more about the mathematics, please go through this tutorial.

I have implemented a variational autoencoder in Keras using the MNIST dataset. So let’s first download the data.

Now create an encoder model, as in a plain autoencoder.

Latent Distribution Parameters and Function

Now encode the output of the encoder into latent distribution parameters. Here, I have created two parameters, mu and sigma, which represent the mean and standard deviation of the distribution.

Here I have taken a latent space dimension of 2. This is the bottleneck, which means we are squeezing all of our data into just two variables. If we increase the latent space dimension to 5, 10 or higher, we can get better results in the output, but this will create more data in the bottleneck.

Now create a Gaussian distribution function with a mean of zero and a standard deviation of 1. This distribution gives variation in the input to the decoder, which helps to get variation in the output. The decoder then predicts the output using this distribution.
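One common way to combine mu, sigma and the unit Gaussian is z = mu + sigma * epsilon, with epsilon drawn from the unit Gaussian. The sketch below assumes the encoder outputs sigma directly (many implementations output the log-variance instead); the function name and the numbers are hypothetical.

```python
import random

def sample_latent(mu, sigma, rng):
    """Draw z = mu + sigma * epsilon, with epsilon from a unit Gaussian."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
mu, sigma = [0.5, -1.0], [0.1, 0.2]  # 2-dimensional latent space

samples = [sample_latent(mu, sigma, rng) for _ in range(5000)]
mean_first = sum(sample[0] for sample in samples) / len(samples)
print(len(samples[0]), round(mean_first, 2))
```

Because the randomness sits in epsilon and not in mu or sigma, gradients can still flow back into the encoder during training.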

Loss Function

For the loss function, a variational autoencoder uses the sum of two losses. One is the generative loss, a binary cross entropy loss that measures how accurately the image is predicted. The other is the latent loss, a KL divergence loss that measures how closely the latent variables match a Gaussian distribution. This KL divergence makes sure that the distribution generated by the encoder does not drift away from the origin. Then train the model.

Our model is ready and we can generate images from it very easily. All we need to do is sample a latent variable from the distribution and pass it to the decoder. Let’s test with the following code:
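The two loss terms can be sketched for a single example as below. The KL term is the closed form for a diagonal Gaussian against a unit Gaussian; treating sigma as a standard deviation, the function names and the toy numbers are assumptions of this sketch.

```python
import math

def binary_cross_entropy(target, predicted, eps=1e-7):
    """Generative loss: summed binary cross entropy over all pixels."""
    total = 0.0
    for t, p in zip(target, predicted):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total

def kl_divergence(mu, sigma):
    """Latent loss: KL divergence between N(mu, sigma^2) and N(0, 1)."""
    return -0.5 * sum(1.0 + math.log(s * s) - m * m - s * s for m, s in zip(mu, sigma))

def vae_loss(target, predicted, mu, sigma):
    return binary_cross_entropy(target, predicted) + kl_divergence(mu, sigma)

kl_at_prior = kl_divergence([0.0, 0.0], [1.0, 1.0])  # encoder output equals the prior
kl_shifted = kl_divergence([1.0, 0.0], [1.0, 1.0])   # mean moved away from the origin
total = vae_loss([1.0, 0.0], [0.9, 0.1], [0.0], [1.0])
print(abs(kl_at_prior), kl_shifted, round(total, 4))
```

The KL term is zero only when mu is 0 and sigma is 1, and it grows as the encoder output moves away from that.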

Here is the output generated from sampled distribution in the above code.

The full code can be found here.

Hope you understand the basics of variational autoencoders.

Referenced papers: Auto-Encoding Variational Bayes, Tutorial on Variational Autoencoders

Denoising Autoencoders

In my previous blog, we discussed what an autoencoder is, its applications and a simple implementation in Keras. In this blog, we will look at a variant of the autoencoder: the denoising autoencoder.

A denoising autoencoder is an extension of the autoencoder. An autoencoder tries to learn the identity function (output equal to input), which puts it at risk of not learning useful features. One way to overcome this problem is to use denoising autoencoders.

To train a denoising autoencoder, we need noisy input data, so we add some noise to the original images. The amount of corruption depends on the amount of information present in the data. Usually, 25-30% of the data is corrupted. This can be higher if your data contains less information. Let’s see how you can add noise to data in code:

To calculate the loss, the output of the denoising autoencoder is compared to the original input instead of the corrupted one. Such a loss function trains the model to learn interesting features rather than the identity function.

I have implemented a denoising autoencoder in Keras using MNIST data, which will give you an overview of how a denoising autoencoder works.

Following is the result of the denoising autoencoder.

The full code can be found here.

Hope you understand the usefulness of denoising autoencoders. In the next blog, we will feature variational autoencoders.
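One simple way to corrupt the input is masking noise, where a fixed fraction of pixels is set to zero. The original listing may instead have added Gaussian noise, so treat the sketch below, its function name and its 25% setting as assumptions.

```python
import random

def corrupt(pixels, fraction, rng):
    """Return a copy of the image with `fraction` of its pixels set to zero."""
    noisy = list(pixels)
    count = int(round(fraction * len(pixels)))
    # Pick distinct pixel positions to corrupt.
    for index in rng.sample(range(len(pixels)), count):
        noisy[index] = 0.0
    return noisy

rng = random.Random(42)  # fixed seed so the sketch is repeatable
clean = [1.0] * 784      # a flattened 28*28 image with every pixel on
noisy = corrupt(clean, fraction=0.25, rng=rng)
print(noisy.count(0.0), len(noisy))
```

The noisy copy is what goes into the network, while the clean image stays untouched and serves as the training target.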

Autoencoders

Let’s start with a simple definition of autoencoders: ‘Autoencoders are neural networks trained to reconstruct their original input’.

Now, you might be wondering what the use of reconstructing the same data is. Let me give you an example. Suppose you want to transfer gigabytes of data: if you can compress it into megabytes and then reconstruct it back to the original size, isn’t that a better way to transfer it? This is one of the applications of autoencoders.

Autoencoders generally consist of two parts: an encoder and a decoder. The encoder downscales the data to a smaller number of features and the decoder upscales the extracted features back to the original size.

There are some practical applications of autoencoders:

  1. Dimensionality reduction for data visualization
  2. Image Denoising
  3. Generative Models

Visualizing a 10-dimensional vector is difficult. To overcome this problem we need to reduce that 10-dimensional vector to 2-D or 3-D. A famous algorithm, PCA (Principal Component Analysis), tries to solve this problem. PCA uses linear transformations, while autoencoders can use both linear and non-linear transformations for dimensionality reduction. This lets autoencoders learn more complex and interesting features than PCA.

Autoencoders can be used to remove the noise present in an image. They can also be used to generate new images required for a specific task. We will see more about these two applications in the next blog.

Now, let’s start with the simple implementation of autoencoders in Keras using MNIST data. First, let’s download MNIST training and test data and reshape it.

Encoder

MNIST data consists of images of digits, so it is better to use a convolutional neural network in our encoder and decoder. In the encoder, I have used conv and max-pooling layers to extract the compressed representation. The encoder output is then flattened to 32 features, which are the input to the decoder.

Decoder

In the decoder, we need to upsample the extracted 32 features to the original size of the image. To achieve this, I have used Conv2DTranspose layers from Keras. The final layer of the decoder gives the reconstructed output, which will be similar to the original input.

To minimize the reconstruction loss, we train the network with a large dataset and update the weights. Now that our model is created, the next thing is to compile and train it.

Below are the results from the autoencoder trained above. The first line of digits shows the original input (test images) while the second line shows the reconstructions from the model.

The full code can be found here.

Hope you understand the basics of autoencoders, where they can be used and how a simple autoencoder can be implemented. In the next blog, we will see how to denoise an image using autoencoders.

Referenced Research Paper: http://proceedings.mlr.press/v27/baldi12a/baldi12a.pdf