
Creating a CRNN model to recognize text in an image (Part-2)

In the previous blog, we saw how to create the training and validation datasets for our recognition model (Download and preprocess). In this blog, we will create our model architecture and train it with the preprocessed data.

You can find the full code here.

Model = CNN + RNN + CTC loss

Our model consists of three parts:

  1. A convolutional neural network (CNN) to extract features from the image
  2. A recurrent neural network (RNN) to predict a sequential output per time-step
  3. A CTC loss function, which acts as the transcription layer that converts the per-time-step predictions into the output label sequence

Model Architecture

Here is the model architecture that we used:

This network architecture is inspired by this paper. Let’s see the steps that we used to create the architecture:

  1. The input to the architecture is an image of height 32 and width 128.
  2. We used seven convolution layers, of which six have kernel size (3,3) and the last one has kernel size (2,2). The number of filters increases from 64 to 512 layer by layer.
  3. Two max-pooling layers of size (2,2) are added, and then two max-pooling layers of size (2,1). The (2,1) pools preserve the width of the feature maps, so we can extract features with a larger width and predict long texts.
  4. We also used batch normalization layers after the fifth and sixth convolution layers, which accelerates the training process.
  5. Then we used a Lambda layer to squeeze the height dimension out of the convolutional output and make it compatible with the LSTM layers.
  6. Then we used two bidirectional LSTM layers, each with 128 units. This RNN block gives an output of size (batch_size, 31, 63), where 63 is the total number of output classes including the blank character.

Let’s see the code for this architecture:
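The following is a minimal Keras sketch of the architecture described above, assuming a grayscale input of shape (32, 128, 1) and 63 output classes; the linked repository has the authoritative version.

from tensorflow.keras import backend as K
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Conv2D, MaxPool2D, Lambda,
                                     BatchNormalization, Bidirectional, LSTM, Dense)

# Input image: height 32, width 128, single (grayscale) channel -- an assumption.
inputs = Input(shape=(32, 128, 1))

# Seven convolution layers; filters grow from 64 to 512.
conv_1 = Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)
pool_1 = MaxPool2D(pool_size=(2, 2))(conv_1)
conv_2 = Conv2D(128, (3, 3), activation='relu', padding='same')(pool_1)
pool_2 = MaxPool2D(pool_size=(2, 2))(conv_2)
conv_3 = Conv2D(256, (3, 3), activation='relu', padding='same')(pool_2)
conv_4 = Conv2D(256, (3, 3), activation='relu', padding='same')(conv_3)
# (2, 1) pooling keeps the width so long texts still get enough time-steps.
pool_4 = MaxPool2D(pool_size=(2, 1))(conv_4)
conv_5 = Conv2D(512, (3, 3), activation='relu', padding='same')(pool_4)
batch_norm_5 = BatchNormalization()(conv_5)
conv_6 = Conv2D(512, (3, 3), activation='relu', padding='same')(batch_norm_5)
batch_norm_6 = BatchNormalization()(conv_6)
pool_6 = MaxPool2D(pool_size=(2, 1))(batch_norm_6)
conv_7 = Conv2D(512, (2, 2), activation='relu')(pool_6)  # last conv, kernel (2, 2)

# Squeeze the height dimension (now 1) so the output fits the LSTM layers.
squeezed = Lambda(lambda x: K.squeeze(x, 1))(conv_7)

# Two bidirectional LSTM layers with 128 units each.
blstm_1 = Bidirectional(LSTM(128, return_sequences=True, dropout=0.2))(squeezed)
blstm_2 = Bidirectional(LSTM(128, return_sequences=True, dropout=0.2))(blstm_1)

# 63 output classes (characters + 1 CTC blank); output shape (batch_size, 31, 63).
outputs = Dense(63, activation='softmax')(blstm_2)

# This model is the one used for prediction ("act_model" in the text).
act_model = Model(inputs, outputs)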

Loss Function

Now that we have prepared the model architecture, the next thing is to choose a loss function. For this text recognition problem, we will use the CTC loss function.

CTC loss is very helpful in text recognition problems. It saves us from annotating each time step, and it handles the problem where a single character can span multiple time steps, which would otherwise need further processing. If you want to know more about CTC (Connectionist Temporal Classification), please follow this blog.

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

A CTC loss function requires four arguments to compute the loss: the predicted outputs, the ground-truth labels, the length of the input sequence to the LSTM, and the ground-truth label length. To get these, we need to create a custom loss function and pass it to the model. To make it compatible with our model, we will create a model that takes these four inputs and outputs the loss. This model will be used for training; for testing we will use the “act_model” that we created earlier. Let’s see the code:
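Here is a sketch of that wrapper model using Keras’ built-in CTC batch cost; max_label_len is assumed to be the maximum label length computed during preprocessing in Part 1.

# Extra inputs needed by CTC: the labels and the two sequence lengths.
# max_label_len is assumed to come from the preprocessing step (Part 1).
labels = Input(name='the_labels', shape=[max_label_len], dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')

def ctc_lambda_func(args):
    # Keras computes the CTC loss per batch element.
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

# Wrap the loss computation in a Lambda layer so the "loss" is the model output.
loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')(
    [outputs, labels, input_length, label_length])

# Training model: takes the four inputs and outputs the CTC loss.
model = Model(inputs=[inputs, labels, input_length, label_length], outputs=loss_out)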

Compile and Train the Model

To train the model we will use the Adam optimizer. Also, we can use Keras’ callbacks functionality to save the weights of the best model on the basis of the validation loss.
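A sketch of the compile step and the checkpoint callback might look like this (the weight-file path is an arbitrary choice):

from tensorflow.keras.callbacks import ModelCheckpoint

# The CTC loss is already computed inside the model, so we just pass y_pred through.
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer='adam')

# Save the weights of the best model based on validation loss.
checkpoint = ModelCheckpoint(filepath='best_model.hdf5', monitor='val_loss',
                             verbose=1, save_best_only=True, mode='min')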

In model.compile(), you can see that I have only taken y_pred and ignored y_true. This is because we have already taken the labels as an input to the model earlier.

Now train your model on the 135,000 training images and 15,000 validation images.
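A sketch of the training call follows; the array names (training_img, train_padded_txt, train_input_length, train_label_length and their validation counterparts) are assumed to come from the preprocessing in Part 1, and the batch size and epoch count are arbitrary choices.

import numpy as np

batch_size = 256  # arbitrary choice
epochs = 10       # arbitrary choice

# The dummy zero targets are ignored: the model's output already is the loss.
model.fit(x=[training_img, train_padded_txt, train_input_length, train_label_length],
          y=np.zeros(len(training_img)),
          batch_size=batch_size, epochs=epochs,
          validation_data=([valid_img, valid_padded_txt, valid_input_length,
                            valid_label_length], np.zeros(len(valid_img))),
          callbacks=[checkpoint])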

Test the model

Our model is now trained with 135,000 images. Now it’s time to test the model. We cannot use the training model because it also requires the labels as input, and at test time we do not have the labels. So to test the model we will use the “act_model” that we created earlier, which takes only one input: the test images.

As our model predicts the probability of each class at each time step, we need to use some transcription function to convert these probabilities into actual text. Here we will use the CTC decoder to get the output text. Let’s see the code:
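Here is a sketch using Keras’ greedy CTC decoder; test_img and char_list are assumed to come from the earlier preprocessing, and the weight file matches the checkpoint above.

# Load the best saved weights into the prediction model.
act_model.load_weights('best_model.hdf5')

# Per-time-step class probabilities, shape (num_images, 31, 63).
prediction = act_model.predict(test_img)

# Greedy CTC decoding: collapse repeated characters and drop blanks.
decoded = K.get_value(K.ctc_decode(prediction,
                                   input_length=np.ones(prediction.shape[0]) * prediction.shape[1],
                                   greedy=True)[0][0])

# Map class indices back to characters; char_list is assumed to be the character
# set built during preprocessing, and -1 marks padding in the decoded output.
for seq in decoded:
    print(''.join(char_list[int(p)] for p in seq if int(p) != -1))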

Here are some results from the trained model:

Pretty good, yeah! Hope you enjoyed reading.

If you have any doubts or suggestions, please feel free to ask, and I will do my best to help or improve myself. Goodbye until next time.

Neural Arithmetic Logic Units

In this tutorial, you will learn about neural arithmetic logic units (NALU).

You can see the full TensorFlow implementation of neural arithmetic logic units in my GitHub Repository.

These days, neural networks have a wide application area, from simple classification problems to complex self-driving cars. And neural networks are doing very well in these fields. But would you believe that a neural network can’t count? Even animals as simple as bees can do that.

The problem is that a neural network cannot perform numerical extrapolation outside its training data. It will not even be able to learn a scalar identity function outside its training set. Recently, DeepMind researchers released a paper in which they propose modules that try to solve this problem.

Failure of Neural Networks to Learn a Scalar Identity Function

The problem of neural nets not being able to learn identity relations is not new. But in the paper, they demonstrate it with an example.

They used an autoencoder of 3 layers, each of 8 units, and tried to learn the identity relation: for example, if the input is 4, then the output should also be 4. They tried different non-linear activation functions in this network, like sigmoid and tanh, but all of them fail to extrapolate the identity relation outside the training data set.

They also saw that some highly linear activations like PReLU are able to reduce the error. So even though neural nets contain functions that are capable of extrapolation, they fail to learn to use them.
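As a concrete illustration (this is my own minimal sketch, not the paper’s exact setup), here is a small tanh MLP trained on the identity function over [-5, 5] and then evaluated far outside that range:

import numpy as np
import tensorflow as tf

# Train on the identity function over a narrow range...
x_train = np.random.uniform(-5, 5, size=(10000, 1)).astype('float32')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='tanh', input_shape=(1,)),
    tf.keras.layers.Dense(8, activation='tanh'),
    tf.keras.layers.Dense(8, activation='tanh'),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(x_train, x_train, epochs=10, batch_size=64, verbose=0)

# ...then test far outside it. Typically only the in-range input stays accurate;
# the error grows with distance from the training range, showing the failure
# to extrapolate.
x_test = np.array([[1.0], [10.0], [100.0]], dtype='float32')
print(model.predict(x_test, verbose=0))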

To solve this problem they proposed two models:

  1. NAC (Neural Accumulator)
  2. NALU (Neural Arithmetic Logic Units)

NAC (Neural Accumulator)

The neural accumulator is able to solve addition and subtraction problems.

NAC is a special case in which the transformation matrix W of a layer consists only of the values -1, 0, and 1. This makes the output of W an addition or subtraction of rows of the input vector, rather than the arbitrary rescaling produced by non-linear activation functions. As an example, if our input layer consists of X1 and X2, then the output of the NAC will be a linear combination of the input vectors. This keeps the scale of the numbers consistent throughout the model, no matter how many operations are applied.

Since W has the hard constraint that every element must be one of {-1, 0, 1}, learning is difficult: the hard constraint creates difficulty in updating the weights during backpropagation. To solve this, they proposed a continuous and differentiable parameterization of W:

W = tanh(w_hat) * sigmoid(m_hat)

w_hat and m_hat are randomly initialized weights and are convenient to learn with gradient descent. This parameterization guarantees that every element of W lies in the range [-1, 1] and is biased towards {-1, 0, 1}. Here ” * ” means element-wise multiplication.
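In TensorFlow, a NAC cell under this parameterization can be sketched as follows (the function and variable names here are my own, not necessarily those of the linked repository):

import tensorflow as tf

def nac(x, out_units, name='nac'):
    """Neural accumulator: a = matmul(x, W), with W = tanh(w_hat) * sigmoid(m_hat)."""
    in_units = int(x.shape[-1])
    w_hat = tf.Variable(tf.random.normal([in_units, out_units], stddev=0.02),
                        name=name + '_w_hat')
    m_hat = tf.Variable(tf.random.normal([in_units, out_units], stddev=0.02),
                        name=name + '_m_hat')
    # Elements of W are pushed towards {-1, 0, 1} but stay differentiable.
    W = tf.tanh(w_hat) * tf.sigmoid(m_hat)
    return tf.matmul(x, W)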

NALU (Neural Arithmetic Logic Units)

NAC is able to solve addition/subtraction problems, but to also solve multiplication/division, they came up with NALU, which consists of two NAC sub-cells: one responsible for addition/subtraction and the other for multiplication/division.

It consists of these five equations:
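Using the same notation as above (” * ” is element-wise multiplication), they are:

  1. W = tanh(w_hat) * sigmoid(m_hat)
  2. a = matmul(x, W)
  3. m = exp( matmul( log(|x| + ϵ), W ) )
  4. g = sigmoid( matmul(x, G) )
  5. y = g * a + (1 - g) * m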

Where,

  1. w_hat, m_hat, and G are randomly initialized weights,
  2. ϵ is used to avoid the problem of log(0),
  3. x and y are the input and the output respectively,
  4. g is the gate, whose values lie between 0 and 1.

Here the concept of a gate enters through the variable g: if the addition/subtraction sub-cell is on (g = 1), then the multiplication/division sub-cell is weighted by 0 (off), and vice-versa.

For addition and subtraction (a = matmul(x, W)), the cell is identical to the original NAC, while for multiplication/division the NAC operates in log space and is thereby capable of learning to multiply and divide (m = exp(matmul(log(|x| + ϵ), W))).

So, the NALU is capable of both extrapolation and interpolation.
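Putting the five equations together, a NALU cell can be sketched in TensorFlow as follows (again, function and variable names are my own; see the GitHub Repository linked above for the full implementation):

import tensorflow as tf

def nalu(x, out_units, epsilon=1e-7, name='nalu'):
    """NALU: gated mix of an additive NAC and a multiplicative (log-space) NAC."""
    in_units = int(x.shape[-1])
    w_hat = tf.Variable(tf.random.normal([in_units, out_units], stddev=0.02),
                        name=name + '_w_hat')
    m_hat = tf.Variable(tf.random.normal([in_units, out_units], stddev=0.02),
                        name=name + '_m_hat')
    G = tf.Variable(tf.random.normal([in_units, out_units], stddev=0.02),
                    name=name + '_G')

    W = tf.tanh(w_hat) * tf.sigmoid(m_hat)
    a = tf.matmul(x, W)                                         # add/subtract path
    m = tf.exp(tf.matmul(tf.math.log(tf.abs(x) + epsilon), W))  # multiply/divide path
    g = tf.sigmoid(tf.matmul(x, G))                             # gate in (0, 1)
    return g * a + (1 - g) * m

# Example call: y = nalu(tf.constant([[2.0, 3.0]]), out_units=1)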

Experiments performed with NAC and NALU models

In the paper, they also applied these concepts to different tasks to test the abilities of NAC and NALU. They found NALU to be very useful in problems like:

  1. Learning simple arithmetic functions (x+y, x-y, x-y+x, x*y, etc.)
  2. A counting task using a recurrent network, in which images of different digits are fed to the model and the output should count the number of digits of each type.
  3. A language-to-number translation task, in which an expression like ”five hundred fifteen” is fed to the network and the output should be ”515”. Here NALU is applied with an LSTM model in the output layer.
  4. NALU is also used with reinforcement learning to track time in a grid-world environment.

Summary

We have seen that NAC and NALU can be applied to overcome the failure of numerical representations to generalize outside the range observed in the training data. If you have gone through this blog, you have seen that the NAC and NALU concepts are very easy to grasp and apply. However, it cannot be said that NALU will be perfect for every task, so we have to see where it gives good results.

Referenced Research Paper: Neural Arithmetic Logic Units