In the previous blog, we had an overview of the text recognition step. There we discussed that, in order to avoid character segmentation, two major techniques have been adopted: one is CTC-based and the other is attention-based. So, in this blog, let’s first discuss the intuition behind the CTC algorithm: why we even need it and where it is used. Then, in the next blog, we will discuss the algorithm in detail. Here, we will understand it using the text recognition case. Let’s get started.
As we have already discussed, in text recognition we are given a segmented image and our task is to recognize what text is present in that segment. Thus, for the text recognition problem, the input is an image while the output is text, as shown below.
So, in order to solve the text recognition problem, we need to develop a model that takes an image as input and outputs the recognized text. If you have ever taken a deep learning class, you will know that Convolutional Neural Networks (CNNs) are good at handling image data, while for sequence data such as text, Recurrent Neural Networks (RNNs) are preferred.
So, for the text recognition problem, an obvious choice is to use a combination of a Convolutional Neural Network and a Recurrent Neural Network. Now, let’s discuss how to combine a CNN and an RNN for the text recognition task. Below is one such architecture that combines the two. It is taken from the famous CRNN paper.
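To make this CNN-plus-RNN idea concrete, here is a minimal PyTorch sketch. The layer sizes and depths here are made up for illustration and are much shallower than the actual CRNN paper; only the overall shape of the pipeline (convolutions, column-wise feature vectors, bi-directional LSTM, per-timestep scores) follows the text above.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Minimal CRNN sketch: conv features -> column sequence -> BiLSTM."""
    def __init__(self, num_classes, img_height=32):
        super().__init__()
        # Convolutional feature extractor (illustrative, not the paper's stack)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),                       # H/2, W/2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),                       # H/4, W/4
        )
        feat_height = img_height // 4
        # Bi-directional LSTM runs over the width (time) dimension
        self.rnn = nn.LSTM(128 * feat_height, 128,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):                   # x: (batch, 1, H, W)
        f = self.cnn(x)                     # (batch, C, H/4, W/4)
        b, c, h, w = f.shape
        # Each single-pixel-wide column of the feature maps becomes one
        # feature vector in the sequence fed to the LSTM
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # (batch, W/4, C*H/4)
        seq, _ = self.rnn(f)                # (batch, W/4, 256)
        return self.fc(seq)                 # per-timestep scores over charset

model = CRNN(num_classes=37)   # e.g. 26 letters + 10 digits + 1 extra symbol
out = model(torch.zeros(2, 1, 32, 100))
print(out.shape)               # torch.Size([2, 25, 37])
```

Note how a 100-pixel-wide input yields 25 time steps here: the two pooling layers shrink the width by 4, so each output column summarizes a strip of the input image.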
In this architecture, the input image is first fed through a number of convolutional layers to extract feature maps. These feature maps are then divided into a sequence of feature vectors, as shown in blue. These are obtained by dividing the feature maps into columns of single-pixel width. Now, a question might come to your mind: why are we dividing the feature maps by columns? The answer lies in the concept of the receptive field. The receptive field is defined as the region in the input image that a particular feature in a CNN’s feature map is looking at. For instance, for the above input image, the receptive field of each feature vector corresponds to a rectangular region in the input image, as shown below.
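To make the receptive field concrete, here is a small helper (my own illustration, not from the CRNN paper) that walks through a hypothetical stack of conv and pooling layers and computes how wide a region of the input one output unit sees:

```python
def receptive_field(layers):
    """Receptive field size of one output unit, given a list of
    (kernel_size, stride) pairs from input to output."""
    rf, jump = 1, 1          # jump = spacing of this layer's units in input pixels
    for k, s in layers:
        rf = rf + (k - 1) * jump
        jump = jump * s
    return rf

# Hypothetical stack: conv3x3 -> pool2x2 -> conv3x3 -> pool2x2
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2)]))  # 10
```

So even a single-pixel-wide column in the final feature map corresponds to a strip several pixels wide in the original image, which is why each feature vector describes a rectangular region.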
Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.
Each of these rectangular regions is ordered from left to right. Thus, each feature vector can be considered an image descriptor of that rectangular region. These feature vectors are then fed to a bi-directional LSTM. Followed by a softmax activation, this LSTM layer outputs a probability distribution over the character set at each time step. To obtain the per-timestep output, we can take the max of the probability distribution at each time step, or apply some other decoding method.
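The simplest such decoding, taking the argmax of the distribution at every time step, can be sketched like this (the character set and the probabilities here are made up for illustration):

```python
import numpy as np

# Hypothetical softmax outputs over a tiny 4-character set, one row per timestep
charset = ["S", "T", "A", "E"]
probs = np.array([
    [0.90, 0.05, 0.03, 0.02],   # t=0: most likely "S"
    [0.80, 0.10, 0.05, 0.05],   # t=1: "S" again (same stroke seen twice)
    [0.10, 0.70, 0.10, 0.10],   # t=2: most likely "T"
])

# Greedy per-timestep decoding: pick the most probable character at each step
per_timestep = [charset[i] for i in probs.argmax(axis=1)]
print(per_timestep)  # ['S', 'S', 'T']
```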
But as you might have noticed in the above image, these feature vectors sometimes may not contain a complete character. For instance, see the image below, where the two feature vectors marked in red each contain some part of the character “S”.
Thus, in the LSTM output, we may get repeated characters, as shown below by the red box. We call these per-frame or per-timestep predictions.
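A naive cleanup of such per-frame predictions would be to collapse adjacent repeats, for instance:

```python
from itertools import groupby

# Hypothetical per-frame predictions with repeats, as in the red box above
per_frame = ["S", "S", "T", "T", "A", "T", "T", "E", "E"]

# Collapse runs of identical adjacent characters into one
collapsed = "".join(ch for ch, _ in groupby(per_frame))
print(collapsed)  # "STATE"
```

Note that this naive collapsing would also wrongly merge genuine double letters (the “LL” in “HELLO” would become a single “L”); CTC’s blank symbol, discussed in the next blog, is what resolves this.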
Now, here comes the problem. As we have already discussed, for text recognition the training data consists of images and the corresponding text, as shown below.
Thus, we only know the final output; we don’t know the per-timestep predictions. Now, in order to train this network, we either need to know the per-timestep output for each input image, or we need a mechanism to convert the per-timestep output to the final output, or vice versa.
So, the problem is: how do we align the final output with the per-timestep predictions in order to train this network?
Approach 1
One thing we can do is devise a rule like “one character corresponds to a fixed number of time steps”. For instance, for the above image, if we have 10 timesteps, we can repeat “State” as “S, S, T, T, A, A, T, T, E, E” (each character twice) and then train the network. But this rule is easily violated by different fonts, writing styles, etc.
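The fixed rule above can be sketched as a toy helper (for illustration only, not something used in practice):

```python
def naive_align(text, timesteps):
    """Stretch each character over an (almost) equal share of the timesteps,
    as in the fixed-rule alignment of Approach 1."""
    reps, extra = divmod(timesteps, len(text))
    out = []
    for i, ch in enumerate(text):
        # Earlier characters absorb any leftover timesteps
        out.extend([ch] * (reps + (1 if i < extra else 0)))
    return out

print(naive_align("STATE", 10))  # ['S','S','T','T','A','A','T','T','E','E']
```

The weakness is visible right in the signature: the alignment depends only on the text length and timestep count, not on where the characters actually sit in the image.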
Approach 2
Another approach is to manually annotate the data for each time step, as shown below, and then train the network using this annotated data. The problem with this approach is that it will be very time-consuming for any reasonably sized dataset.
Clearly, both of the above naïve approaches have serious downsides. So, isn’t there a more efficient way to solve this? This is where CTC comes into the picture.
Connectionist Temporal Classification (CTC)
CTC was introduced in 2006 and is used for training deep networks where alignment is a problem. With CTC, we need not worry about alignments or per-timestep predictions. CTC takes care of all the alignments, and we can train the network using only the final outputs. So, in the next blog, let’s discuss how the CTC algorithm works. Till then, have a great time. Hope you enjoyed reading.
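As a small preview, modern frameworks ship CTC as a ready-made loss. A minimal PyTorch sketch (all sizes and the random data below are placeholders) shows the key point: the loss takes per-timestep predictions on one side and only the final target text on the other, with no alignment supplied.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 25 timesteps, batch of 2, 37 classes (index 0 = blank)
T, N, C = 25, 2, 37
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)
targets = torch.randint(1, C, (N, 5))                # final outputs, length 5
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 5, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # trainable end-to-end, no per-timestep labels needed
```

How CTC sums over all valid alignments internally is exactly what the next blog covers.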
If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.