Tag Archives: ocr

Optical Character Recognition Pipeline: Text Recognition

In the previous blogs, we covered the OCR text detection step. Now, it’s time to move on to the OCR’s next pipeline component, which is Text Recognition. So, let’s get started.

Text Recognition

As you might remember, in the text detection step, we segmented out the text regions. Now, it’s time to recognize what text is present in those segments. This is known as Text Recognition. For instance, see the below image where we have segments on the left and the recognized text on the right. This is what we want, i.e. recognize the text present in the segments.

So, what we will do is, pass each segment one-by-one to our text recognition model that will output the recognized text. In general, the Text Recognition step outputs a text file that contains each segment’s bounding box coordinates along with the recognized text. For instance, see the below image(right) that contains 3 columns i.e. the segment name, coordinates, and the recognized text.

Now, you may ask Why coordinates? This will become clear when we will discuss Restructuring (the next step).

Similar to text detection, text recognition has also been a long-standing research topic in computer vision. Traditional text recognition methods generally consist of 3 main steps

  • Image pre-processing
  • character segmentation
  • character recognition

That is they mainly work at a character level. But when we deal with images having a complex background, font, or other distortions, character segmentation becomes a really challenging task. Thus, to avoid character segmentation, two major techniques are adopted

  • Connectionist Temporal Classification (CTC) based
  • Attention-based

In the next blog, let’s understand in detail, what is CTC and how it is used in Text Recognition. Then we will move to the attention-based algorithms. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Creating a CRNN model to recognize text in an image (Part-1)

In the earlier blogs, we learned various stages of optical character recognition pipeline. In this blog, we will create a convolutional recurrent neural network with CTC (Connectionist Temporal Classification) loss to implement our recognition model.

We will use the following steps to create our text recognition model.

  • Collecting Dataset
  • Preprocessing Data
  • Creating Network Architecture
  • Defining Loss function
  • Training model
  • Decoding outputs from prediction

Dataset

In this blog, we will use data provided by Visual Geometry Group. This is a huge dataset total of 10 GB images. Here I have used only 135000 images for the training set and 15000 images for validation dataset. This data contains text image segments which look like images shown below:

To download the dataset either you can directly download from this link or use the following commands to download the data and unzip.

Preprocessing

Now we are having our dataset, to make it acceptable for our model we need to use some preprocessing. We need to preprocess both the input image and output labels. To preprocess our input image we will use followings:

  • Read the image and convert into a gray-scale image
  • Make each image of size (128,32) by using padding
  • Expand image dimension as (128,32,1) to make it compatible with the input shape of architecture
  • Normalize the image pixel values by dividing it with 255.

To preprocess the output labels use the followings:

  • Read the text from the name of the image as the image name contains text written inside the image.
  • Encode each character of a word into some numerical value by creating a function( as ‘a’:0, ‘b’:1 …….. ‘z’:26 etc ). Let say we are having the word ‘abab’ then our encoded label would be [0,1,0,1]
  • Compute the maximum length from words and pad every output label to make it of the same size as the maximum length. This is done to make it compatible with the output shape of our RNN architecture.

In preprocessing step we also need to create two other lists: one is label length and other is input length to our RNN. These two lists are important for our CTC loss( we will see later ). Label length is the length of each output text label and input length is the same for each input to the LSTM layer which is 31 in our architecture.

Note: For more details on the Optical Character Recognition , please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Following is the code for our preprocessing step:

Now you might have got some feeling about the training and validation data generation for our recognition model. In the next blog, we will use this data to train and test our neural network.

Next Blog: Creating a CRNN model to recognize text in an image (Part-2)

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Creating a CRNN model to recognize text in an image (Part-2)

In the previous blog, we have seen how to create training and validation dataset for our recognition model( Download and preprocess ). In this blog, we will create our model architecture and train it with the preprocessed data.

You can find full code here.

Model = CNN + RNN + CTC loss

Our model consists of three parts:

  1. The convolutional neural network to extract features from the image
  2. Recurrent neural network to predict sequential output per time-step
  3. CTC loss function which is transcription layer used to predict output for each time step.

Model Architecture

Here is the model architecture that we used:

This network architecture is inspired by this paper. Let’s see the steps that we used to create the architecture:

  1. Input shape for our architecture having an input image of height 32 and width 128.
  2. Here we used seven convolution layers of which 6 are having kernel size (3,3) and the last one is of size (2.2). And the number of filters is increased from 64 to 512 layer by layer.
  3. Two max-pooling layers are added with size (2,2) and then two max-pooling layers of size (2,1) are added to extract features with a larger width to predict long texts.
  4. Also, we used batch normalization layers after fifth and sixth convolution layers which accelerates the training process.
  5. Then we used a lambda function to squeeze the output from conv layer and make it compatible with LSTM layer.
  6. Then used two Bidirectional LSTM layers each of which has 128 units. This RNN layer gives the output of size (batch_size, 31, 63). Where 63 is the total number of output classes including blank character.

Let’s see the code for this architecture:

Loss Function

Now we have prepared model architecture, the next thing is to choose a loss function. In this text recognition problem, we will use the CTC loss function.

CTC loss is very helpful in text recognition problems. It helps us to prevent annotating each time step and help us to get rid of the problem where a single character can span multiple time step which needs further processing if we do not use CTC. If you want to know more about CTC( Connectionist Temporal Classification ) please follow this blog.

Note: For more details on the Optical Character Recognition , please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

A CTC loss function requires four arguments to compute the loss, predicted outputs, ground truth labels, input sequence length to LSTM and ground truth label length. To get this we need to create a custom loss function and then pass it to the model. To make it compatible with our model, we will create a model which takes these four inputs and outputs the loss. This model will be used for training and for testing we will use the model that we have created earlier “act_model”. Let’s see the code:

Compile and Train the Model

To train the model we will use Adam optimizer. Also, we can use Keras callbacks functionality to save the weights of the best model on the basis of validation loss.

In model.compile(), you can see that I have only taken y_pred and neglected y_true. This is because I have already taken labels as input to the model earlier.

Now train your model on 135000 training images and 15000 validation images.

Test the model

Our model is now trained with 135000 images. Now its time to test the model. We can not use our training model because it also requires labels as input and at test time we can not have labels. So to test the model we will use ” act_model ” that we have created earlier which takes only one input: test images.

As our model predicts the probability for each class at each time step, we need to use some transcription function to convert it into actual texts. Here we will use the CTC decoder to get the output text. Let’s see the code:

Here are some results from the trained model:

Pretty good Yeah! Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline

In the previous blog, we discussed what is OCR with some real-life applications. But we didn’t get into the detail of how the OCR works. So, in this blog, let’s understand the general pipeline used by most OCR systems. Let’s get started.

OCR Pipeline

The general OCR pipeline is shown below.

OCR Pipeline

As you might have noticed, this is almost similar to how we humans recognize the text. For instance, given an image containing text, we first try to locate the text and then recognize it. This is done so fast by our eye and brain combo that we hardly even notice it.

Now, let’s try to understand each pipeline component in detail, although, it’s pretty clear from their names. Let’s take the following image as an example and see what happens at each component.

Test image for OCR

Image Pre-processing

If you have ever done any computer vision task, then you must know how important this pre-processing step is. This simply means making the image more suitable for further tasks. For instance, the input image may be corrupted with noise or is skewed or rotated. In any of these cases, the next pipeline components may give erroneous results and all your hard work goes in vain. Thus, it is always necessary to pre-process the image to remove such deformities.

As an example, I’ve corrupted the below image with some salt and pepper noise and also added some rotation. If this image is passed as it is, this will give erroneous results in further steps. So, before passing we need to correct it for noise and rotation. This corrected image is shown on the right. Don’t worry, we will discuss in detail how this correction is done.

OCR image pre-processing

Text Detection

Text detection, as clear from the name, simply means finding the regions in the image where text can be present. This is clearly illustrated below. See how the green color bounding boxes are drawn around the detected text regions.

Text detection

Text Detection has been an active research topic in computer vision. Most of the text detection methods developed so far can be divided into conventional (e.g. MSER) and deep-learning based (e.g. EAST, CTPN, etc.). Don’t worry, if you have never heard about these. We will be covering everything in detail in this series.

Text Recognition

In the previous step, we segmented out the text regions. Now, we will recognize what text is present in those segments. This is known as Text Recognition. So, what we will do is, pass each segment one-by-one to our text recognition model that will output the recognized text. Also, we keep a track of each segment bounding box coordinates. This will be helpful while we do restructuring.

In general, this step outputs a text file that contains each segment’s bounding box coordinates along with the recognized text. See the below image(right) that contains 3 columns i.e. the segment name, coordinates, and the recognized text.

Text recognition OCR

Similar to text detection, this has also been an active research topic in computer vision. Several approaches have been developed for text recognition. In this series, we will be focussing mainly on the deep-learning based approaches which can be further divided into CTC-based and Attention-based. Again, don’t worry if you haven’t heard about these terms. We will be discussing these in detail in this series.

Restructuring

In the last step, we got the recognized text along with its position in the input image. Now, it’s time to restructure it. Restructuring simply means placing the text (according to the coordinates) similar to how it was in the input image. Simply iterate over each bounding box coordinate and put the recognized text. Take a look at the below image. Compare the structure of both the restructured and the original image. Both look almost similar.

Restructuring OCR

Most of you might be wondering why do we need to do this or what’s the use of restructuring. So, let’s take a simple example to understand this. Suppose we want to extract the name from the below image.

To do this, we can simply tell the computer to extract the words following the word “Name:”. This can be easily done using Regex or any NLP technique. But what if you haven’t restructured the text. In that case, this would become cumbersome as it would involve iterating over the coordinates, first finding the word “Name:” coordinates then finding the next word coordinates that lie in the same line, and then extract the corresponding word. And if the name contains 2 or 3 words, this would take even more effort.

Hope you understand this, but if not, no worries, this will become more clear when we will discuss this in detail later.

So, this completes the OCR pipeline. Now, you can do anything with the extracted text. You can search, edit, translate it, or even convert it to speech. From the next blog, we will start discussing each of these pipeline components in detail. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.