Optical Character Recognition Pipeline: Text Detection and Segmentation Part-II

In the last blog, we saw what text detection is and the different types of algorithms used to perform it. In this blog, we will learn more about text detection algorithms.

Efficient and Accurate Scene Text Detector (EAST)

EAST is a deep learning text detection method that has two stages: a fully convolutional network (FCN) and a non-max suppression (NMS) merging stage. The FCN is a U-shaped network that directly produces text regions at either the word level or the text-line level. Here is the diagram of the FCN used in the algorithm.

The U-shaped FCN takes features from different layers of PVANet and merges them to produce the outputs. The yellow boxes are different layers of PVANet and the green boxes are the merging layers for the features extracted from PVANet. The purpose of this merging branch is to produce outputs for both small and large word regions: low-level features help in finding small word regions, while high-level features help in finding large word regions. The network outputs geometries either as RBOX (5 values: the distances from a location to the top, right, bottom and left edges of the rotated box, plus a rotation angle) or as QUAD (the coordinates of the 4 corners of a quadrilateral), together with a score map that gives the confidence that text is present.

In the second stage, the NMS merging step thresholds the score map and merges overlapping geometries to produce the most accurate geometries for the text regions.
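To make the merging idea concrete, here is a minimal Python sketch of plain score-thresholding plus IoU-based non-max suppression on axis-aligned boxes. It is only an illustration under simplified assumptions; EAST itself uses a locality-aware NMS variant that also handles rotated geometries.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter + 1e-9)

def nms(boxes, scores, score_thresh=0.8, iou_thresh=0.2):
    """Keep high-scoring boxes and drop any box that overlaps an already kept box too much."""
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    for i in order:
        if scores[i] < score_thresh:
            continue
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]
```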

To implement it in our OCR pipeline, we can use its GitHub repository. To get it working, use the following steps:

  1. Clone the repository into your directory: git clone https://github.com/argman/EAST.git
  2. Download its pretrained model and put it inside the EAST directory.
  3. Before testing, you need to compile lanms.
  4. To test the model, go to your EAST directory and run the evaluation command given in the repository's README from the terminal.

You can also train this model on your own dataset, either from scratch or starting from the pre-trained model provided above. To train the model you need to provide the dataset path; the dataset should consist of training images with corresponding text files containing the coordinates of the text present in each image.

Connectionist Text Proposal Network (CTPN)

CTPN is a deep learning method that accurately predicts text lines in a natural image. It is an end-to-end trainable model consisting of both CNN and RNN layers. In general, the length of a text line varies a lot. To handle this, the authors of the paper treat a text line as a sequence of fine-scale text proposals, where each proposal has a fixed width of 16 pixels and a varying height. Let's see the image below.

In the above figure, each vertical rectangular box is a fine-scale text proposal. The model's architecture is shown in the figure below:

The input image is fed to a VGG-16 model, and the feature map output by its conv5 layer (the last convolutional layer before the fully connected layers) is taken. A 3×3 sliding window is moved over these features, and the resulting feature sequence is fed to an RNN consisting of a 256-D bidirectional LSTM. This LSTM layer is connected to a 512-D fully connected layer, which then produces the outputs.
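As a rough illustration, here is a minimal Keras sketch of this trunk together with the three prediction heads discussed below. The fixed 512×512 input, the block5_conv3 layer name and the use of TimeDistributed are my choices for readability; this is not the authors' original implementation.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

k = 10  # anchors per sliding-window position, as in the paper

# VGG-16 trunk up to conv5_3; a 512x512 input gives a 32x32x512 feature map
backbone = VGG16(include_top=False, weights='imagenet', input_shape=(512, 512, 3))
conv5 = backbone.get_layer('block5_conv3').output

# the 3x3 "sliding window" over the conv5 feature map
x = layers.Conv2D(512, (3, 3), padding='same', activation='relu')(conv5)

# run a 256-D bidirectional LSTM along each row of the feature map
x = layers.TimeDistributed(
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)))(x)   # (H, W, 256)

# 512-D fully connected layer applied at every position
x = layers.Dense(512, activation='relu')(x)

# prediction heads: one set of values per anchor at every position
vertical_coords = layers.Dense(2 * k, name='vertical_coords')(x)  # center-y and height
scores          = layers.Dense(2 * k, name='scores')(x)           # text / non-text
side_refinement = layers.Dense(k, name='side_refinement')(x)      # horizontal offsets

model = models.Model(backbone.input, [vertical_coords, scores, side_refinement])
model.summary()
```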

Now let's see how the output is generated by this algorithm.

  • This algorithm uses anchor boxes to detect text of different heights. Say we use k anchor boxes; then the output consists of three main parts.
  • One is 2k vertical coordinates: for each anchor box, the y coordinate of its center and its height (the parameterization is sketched right after this list).
  • The second is 2k text/non-text scores, and
  • the third is k side-refinement offsets.
  • The authors used 10 anchor boxes with heights varying between 11 and 273 pixels. The horizontal location and width are fixed, and only the vertical position and height are predicted.
  • On the basis of the text/non-text scores, sequential text proposals are merged and text lines are formed. The side-refinement offsets are used to refine the two end points of a text line.
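As a small worked example of the vertical-coordinate part, the sketch below encodes and decodes the relative center-y and height of a proposal with respect to an anchor, following the parameterization given in the CTPN paper; the anchor and ground-truth numbers are made up.

```python
import math

def encode_vertical(cy, h, cy_anchor, h_anchor):
    """Relative vertical targets for a proposal, as parameterized in the CTPN paper."""
    vc = (cy - cy_anchor) / h_anchor     # relative offset of the box center (y)
    vh = math.log(h / h_anchor)          # relative height, in log space
    return vc, vh

def decode_vertical(vc, vh, cy_anchor, h_anchor):
    """Recover the predicted center y and height from the network outputs."""
    return vc * h_anchor + cy_anchor, math.exp(vh) * h_anchor

# a made-up anchor (center y = 120, height 33) and ground truth (center y = 128, height 40)
vc, vh = encode_vertical(128, 40, 120, 33)
print(vc, vh)                            # ~0.242, ~0.192
print(decode_vertical(vc, vh, 120, 33))  # (128.0, 40.0)
```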

To implement it in our OCR pipeline, we can use its GitHub repository. To get it working, use the following steps:

  • Clone the repository into your directory: git clone https://github.com/eragonruan/text-detection-ctpn.git
  • Go to the "text-detection-ctpn-banjin-dev" directory
  • Run the setup commands given in the repository one by one
  • Download the pretrained checkpoint from Google Drive
  • Extract it and put checkpoints_mlt/ in text-detection-ctpn/
  • Now put your test images in data/demo; the output will be in data/res
  • Now run the demo command given in the repository to check the outputs

You can also train this model using your own data; just follow the steps provided in the GitHub repository.

A Single-Shot Oriented Scene Text Detector (TextBoxes++)

TextBoxes++ is an end-to-end trainable, fast scene text detector that can even detect oriented text in an image. It does not require any post-processing except non-maximum suppression. The basic idea is taken from the object detection algorithm SSD (Single Shot Detector). SSD aims to detect general objects in an image, but it does not work well for text. TextBoxes++ was introduced to improve on this for text datasets. Let's see the model's architecture:

The first 13 layers come from the VGG-16 model. Then the 2 fully connected layers of VGG-16 are converted into convolution layers, which are followed by 8 convolution layers. Finally, 6 Text-Box layers are connected to 6 different intermediate convolution layers of the model. These 6 Text-Box layers are the output layers, and at test time non-maximum suppression is applied to merge their results and keep the best predictions.

Text-Box layers are the key component of TextBoxes++. They are also convolutional layers, and they predict both the presence of text and the bounding-box coordinates, including both oriented bounding boxes and minimum horizontal bounding boxes. Text-Box layers are designed to tackle the problem of variable-length words.

You can find its GitHub repository here. The repository also implements CRNN (Convolutional Recurrent Neural Network) to recognize the text detected by TextBoxes++. To implement it, you can follow the GitHub directions. Here are some results of TextBoxes++.

Source

That's enough for text detection; in the next blog, we will learn about text recognition. Hope you enjoy reading.

Next Blog: Optical Character Recognition Pipeline: Text Recognition

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline: Text Recognition

In the previous blogs, we covered the OCR text detection step. Now, it’s time to move on to the OCR’s next pipeline component, which is Text Recognition. So, let’s get started.

Text Recognition

As you might remember, in the text detection step, we segmented out the text regions. Now, it’s time to recognize what text is present in those segments. This is known as Text Recognition. For instance, see the below image where we have segments on the left and the recognized text on the right. This is what we want, i.e. recognize the text present in the segments.

So, what we will do is, pass each segment one-by-one to our text recognition model that will output the recognized text. In general, the Text Recognition step outputs a text file that contains each segment's bounding box coordinates along with the recognized text. For instance, see the below image (right) that contains 3 columns, i.e. the segment name, coordinates, and the recognized text.

Now, you may ask: why coordinates? This will become clear when we discuss Restructuring (the next step).

Similar to text detection, text recognition has also been a long-standing research topic in computer vision. Traditional text recognition methods generally consist of 3 main steps:

  • Image pre-processing
  • Character segmentation
  • Character recognition

That is, they mainly work at the character level. But when we deal with images having a complex background, unusual fonts, or other distortions, character segmentation becomes a really challenging task. Thus, to avoid character segmentation, two major techniques are adopted:

  • Connectionist Temporal Classification (CTC) based
  • Attention-based

In the next blog, let's understand in detail what CTC is and how it is used in text recognition. Then we will move on to the attention-based algorithms. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Creating a CRNN model to recognize text in an image (Part-1)

In the earlier blogs, we learned about the various stages of the optical character recognition pipeline. In this blog, we will create a convolutional recurrent neural network (CRNN) with CTC (Connectionist Temporal Classification) loss to implement our recognition model.

We will use the following steps to create our text recognition model.

  • Collecting Dataset
  • Preprocessing Data
  • Creating Network Architecture
  • Defining Loss function
  • Training model
  • Decoding outputs from prediction

Dataset

In this blog, we will use data provided by the Visual Geometry Group. This is a huge dataset, about 10 GB of images in total. Here I have used only 135,000 images for the training set and 15,000 images for the validation set. This data contains text image segments which look like the images shown below:

To download the dataset, you can either download it directly from this link or use the following to download and unzip the data.
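Here is a minimal Python download-and-extract sketch; the URL and archive name are placeholders that you should replace with the actual archive link from the dataset page.

```python
import tarfile
import urllib.request

# Placeholder URL: replace it with the actual archive link from the VGG dataset page.
DATASET_URL = "https://example.com/word-dataset.tar.gz"
ARCHIVE = "word-dataset.tar.gz"

urllib.request.urlretrieve(DATASET_URL, ARCHIVE)   # the full dataset is roughly 10 GB
with tarfile.open(ARCHIVE, "r:gz") as tar:
    tar.extractall("data")                         # extract the word images into ./data
```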

Preprocessing

Now that we have our dataset, we need to apply some preprocessing to make it acceptable for our model. We need to preprocess both the input images and the output labels. To preprocess each input image, we will do the following:

  • Read the image and convert it into a gray-scale image
  • Make each image of size (128, 32) by using padding
  • Expand the image dimensions to (128, 32, 1) to make it compatible with the input shape of the architecture
  • Normalize the image pixel values by dividing them by 255.

To preprocess the output labels, do the following:

  • Read the text from the name of the image, as the image name contains the text written inside the image.
  • Encode each character of a word into a numerical value using a mapping (e.g. 'a': 0, 'b': 1, ..., 'z': 25). Say we have the word 'abab'; then our encoded label would be [0, 1, 0, 1].
  • Compute the maximum length over all words and pad every output label to that same maximum length. This is done to make it compatible with the output shape of our RNN architecture.

In the preprocessing step we also need to create two other lists: one with the label lengths and one with the input lengths to our RNN. These two lists are important for our CTC loss (we will see why later). The label length is the length of each output text label, and the input length is the same for every input to the LSTM layer, which is 31 in our architecture.

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Following is the code for our preprocessing step:
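Below is a minimal sketch of such a preprocessing step, written under the assumptions described above. The character set, the directory layout (data/train) and the file-naming convention used to read the ground-truth word are placeholders that depend on how you stored the dataset; the validation set would be prepared the same way.

```python
import os
import string

import cv2
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Example character set (62 symbols); the real vocabulary depends on the dataset labels.
char_list = string.ascii_letters + string.digits

def encode_to_labels(txt):
    """Encode each character of a word as its index in char_list."""
    return [char_list.index(ch) for ch in txt]

def preprocess_image(path):
    """Read a word image as gray-scale, pad it to height 32 / width 128, add a channel axis, normalize."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    h, w = img.shape
    if h > 32 or w > 128:                              # keep the sketch simple: shrink large images
        img = cv2.resize(img, (128, 32))
        h, w = 32, 128
    padded = np.full((32, 128), 255, dtype=np.uint8)   # white canvas
    padded[:h, :w] = img
    padded = np.expand_dims(padded, axis=-1)           # shape (32, 128, 1)
    return padded.astype(np.float32) / 255.0

training_img, training_txt = [], []
train_label_length, train_input_length = [], []
max_label_len = 0

# Assumed layout and naming: the ground-truth word sits between underscores in the file name.
for fname in os.listdir('data/train'):
    word = fname.split('_')[1]
    training_img.append(preprocess_image(os.path.join('data/train', fname)))
    training_txt.append(encode_to_labels(word))
    train_label_length.append(len(word))
    train_input_length.append(31)                      # number of time steps our CRNN produces
    max_label_len = max(max_label_len, len(word))

# Pad every label to the maximum label length; the pad value is ignored thanks to label_length.
train_padded_txt = pad_sequences(training_txt, maxlen=max_label_len,
                                 padding='post', value=len(char_list))
training_img = np.array(training_img)
```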

By now you should have a good idea of how the training and validation data are generated for our recognition model. In the next blog, we will use this data to train and test our neural network.

Next Blog: Creating a CRNN model to recognize text in an image (Part-2)

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Creating a CRNN model to recognize text in an image (Part-2)

In the previous blog, we saw how to create the training and validation datasets for our recognition model (download and preprocess). In this blog, we will create our model architecture and train it with the preprocessed data.

You can find full code here.

Model = CNN + RNN + CTC loss

Our model consists of three parts:

  1. A convolutional neural network to extract features from the image
  2. A recurrent neural network to predict a sequential output per time-step
  3. A CTC loss function, which is the transcription layer used to predict the output for each time step.

Model Architecture

Here is the model architecture that we used:

This network architecture is inspired by this paper. Let’s see the steps that we used to create the architecture:

  1. The input to our architecture is an image of height 32 and width 128.
  2. We use seven convolution layers, of which 6 have kernel size (3, 3) and the last one has size (2, 2). The number of filters increases from 64 to 512 layer by layer.
  3. Two max-pooling layers of size (2, 2) are added, followed by two max-pooling layers of size (2, 1); these are used to extract features with a larger width so that longer texts can be predicted.
  4. We also use batch normalization layers after the fifth and sixth convolution layers, which accelerates the training process.
  5. Then we use a lambda function to squeeze the output of the conv layers and make it compatible with the LSTM layer.
  6. Finally, we use two bidirectional LSTM layers, each with 128 units. This RNN part gives an output of size (batch_size, 31, 63), where 63 is the total number of output classes, including the blank character.

Let’s see the code for this architecture:
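Here is a minimal Keras sketch that follows the six steps above. The exact padding choices and the 63-class output (62 characters plus one CTC blank) are assumptions for illustration:

```python
from tensorflow.keras import backend as K
from tensorflow.keras.layers import (BatchNormalization, Bidirectional, Conv2D,
                                     Dense, Input, Lambda, LSTM, MaxPool2D)
from tensorflow.keras.models import Model

# input image of height 32 and width 128 with a single gray-scale channel
inputs = Input(shape=(32, 128, 1))

x = Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)
x = MaxPool2D(pool_size=(2, 2))(x)                  # 16 x 64

x = Conv2D(128, (3, 3), activation='relu', padding='same')(x)
x = MaxPool2D(pool_size=(2, 2))(x)                  # 8 x 32

x = Conv2D(256, (3, 3), activation='relu', padding='same')(x)
x = Conv2D(256, (3, 3), activation='relu', padding='same')(x)
x = MaxPool2D(pool_size=(2, 1))(x)                  # 4 x 32

x = Conv2D(512, (3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = Conv2D(512, (3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = MaxPool2D(pool_size=(2, 1))(x)                  # 2 x 32

x = Conv2D(512, (2, 2), activation='relu')(x)       # 1 x 31 x 512

# squeeze the height dimension so the LSTMs see a sequence of 31 feature vectors
x = Lambda(lambda t: K.squeeze(t, axis=1))(x)       # (31, 512)

x = Bidirectional(LSTM(128, return_sequences=True))(x)
x = Bidirectional(LSTM(128, return_sequences=True))(x)

# 62 characters + 1 CTC blank = 63 output classes
outputs = Dense(63, activation='softmax')(x)        # (31, 63)

# prediction model used at test time
act_model = Model(inputs, outputs)
act_model.summary()
```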

Loss Function

Now that we have prepared the model architecture, the next thing is to choose a loss function. For this text recognition problem, we will use the CTC loss function.

CTC loss is very helpful in text recognition problems. It removes the need to annotate each time step and handles the problem that a single character can span multiple time steps, which would otherwise need further processing. If you want to know more about CTC (Connectionist Temporal Classification), please follow this blog.

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

The CTC loss function requires four arguments to compute the loss: the predicted outputs, the ground-truth labels, the input sequence length to the LSTM, and the ground-truth label lengths. To get these, we need to create a custom loss function and then pass it to the model. To make it compatible with our model, we will create a model that takes these four inputs and outputs the loss. This model will be used for training; for testing we will use the model that we created earlier, "act_model". Let's see the code:
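A minimal sketch of this wiring is shown below; it reuses inputs, outputs and max_label_len from the earlier snippets and relies on Keras' built-in K.ctc_batch_cost:

```python
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input, Lambda
from tensorflow.keras.models import Model

# extra inputs required by the CTC loss; max_label_len comes from the preprocessing step
labels = Input(name='the_labels', shape=[max_label_len], dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')

def ctc_lambda_func(args):
    """Wrap Keras' built-in CTC batch cost so it can be used inside a Lambda layer."""
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

# 'outputs' and 'inputs' come from the architecture snippet above
loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')(
    [outputs, labels, input_length, label_length])

# training model: takes images, labels and lengths, and outputs the CTC loss itself
model = Model(inputs=[inputs, labels, input_length, label_length], outputs=loss_out)
```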

Compile and Train the Model

To train the model we will use the Adam optimizer. We can also use Keras' callbacks functionality to save the weights of the best model on the basis of validation loss.

In model.compile(), you can see that I have only taken y_pred and neglected y_true. This is because I have already taken labels as input to the model earlier.

Now train your model on 135,000 training images and 15,000 validation images.
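Putting the pieces together, a training sketch could look like the following. The batch size, number of epochs and checkpoint file name are arbitrary choices, and valid_img, valid_padded_txt, valid_input_length and valid_label_length are assumed to be prepared exactly like their training counterparts:

```python
import numpy as np
from tensorflow.keras.callbacks import ModelCheckpoint

# the Lambda layer already outputs the loss, so the compiled loss just passes y_pred through
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer='adam')

# save the weights of the best model on the basis of validation loss
checkpoint = ModelCheckpoint('best_model.hdf5', monitor='val_loss',
                             save_best_only=True, mode='min', verbose=1)

model.fit(
    x=[training_img, train_padded_txt,
       np.array(train_input_length), np.array(train_label_length)],
    y=np.zeros(len(training_img)),              # dummy targets; ignored by the lambda loss
    batch_size=256,
    epochs=10,
    validation_data=([valid_img, valid_padded_txt,
                      np.array(valid_input_length), np.array(valid_label_length)],
                     np.zeros(len(valid_img))),
    callbacks=[checkpoint])
```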

Test the model

Our model is now trained with 135,000 images. Now it's time to test the model. We cannot use our training model because it also requires the labels as input, and at test time we do not have labels. So to test the model we will use the "act_model" that we created earlier, which takes only one input: the test images.

As our model predicts a probability for each class at each time step, we need a transcription function to convert these probabilities into the actual text. Here we will use the CTC decoder to get the output text. Let's see the code:
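Here is a minimal decoding sketch using Keras' built-in greedy CTC decoder; test_img is assumed to be a batch of preprocessed test images, and char_list is the character set from the preprocessing step:

```python
import numpy as np
from tensorflow.keras import backend as K

# load the best weights saved during training into the prediction model
act_model.load_weights('best_model.hdf5')

# class probabilities for a batch of preprocessed test images: shape (num_images, 31, 63)
prediction = act_model.predict(test_img)

# greedy CTC decoding: collapse repeated characters and drop the blank class
decoded = K.get_value(
    K.ctc_decode(prediction,
                 input_length=np.ones(prediction.shape[0]) * prediction.shape[1],
                 greedy=True)[0][0])

for seq in decoded:
    text = ''.join(char_list[int(c)] for c in seq if int(c) != -1)   # -1 marks padding
    print(text)
```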

Here are some results from the trained model:

Pretty good Yeah! Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline: Generating Dataset

The first step to create any deep learning model is to generate the dataset. In continuation of our optical character recognition pipeline, in this blog, we will see how we can get our training and test data.

In our OCR pipeline, we first need data for both segmentation and recognition (text). For the segmentation part, the data will consist of images and corresponding files containing the coordinates of the words present in each image. Let's see an example.

For the recognition part, the data will consist of images and their corresponding text files. Here each segmented image will contain a single word.

Image and Text

Open Source Dataset:

There are some open-source datasets available for our pipeline. For the segmentation part, here are some useful open-source datasets.

Now let's see some of the open-source datasets for text recognition (images and their corresponding texts).

Synthetic Data:

In some cases, training your OCR model with synthetic data can also be useful. You can create your own synthetic data using a Python script, and you can also add some geometric transformations to simulate real-world distortions in the data. As an example, here is a script to generate synthetic data for text recognition:
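Below is a small illustrative script using Pillow; the word list and the .ttf font file names are placeholders you would replace with your own:

```python
import os
import random

from PIL import Image, ImageDraw, ImageFont

# Hypothetical inputs: a word list and a few .ttf font files available on your system.
words = ['hello', 'world', 'text', 'recognition', 'synthetic']
fonts = ['arial.ttf', 'times.ttf', 'calibri.ttf']
os.makedirs('synthetic_data', exist_ok=True)

for i, word in enumerate(words):
    font = ImageFont.truetype(random.choice(fonts), 14)      # font size 14
    left, top, right, bottom = font.getbbox(word)             # size of the rendered word
    img = Image.new('L', (right - left + 10, bottom - top + 10), color=255)
    ImageDraw.Draw(img).text((5, 5), word, font=font, fill=0)
    img.save(f'synthetic_data/{i}.jpg')
    with open(f'synthetic_data/{i}.txt', 'w') as f:           # corresponding ground truth
        f.write(word)
```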

In the above code, I have generated images of English words and corresponding text files using different font types with a font size of 14. The segmented images will look like the ones below:

Five Segmented Images generated from above code

Annotation Tools and Manual Data:

Another way to create a text segmentation dataset is by using annotation tools. In this case, you need to collect images manually, or you can take images from the internet, and then manually annotate the text in the images (bounding boxes). Annotation tools like labelImg work well for this.

That's all about generating the dataset. In the next blog, we will see the image preprocessing steps to apply to these datasets. Hope you enjoy reading.

Next Blog: Optical Character Recognition Pipeline: Image Preprocessing

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline

In the previous blog, we discussed what is OCR with some real-life applications. But we didn’t get into the detail of how the OCR works. So, in this blog, let’s understand the general pipeline used by most OCR systems. Let’s get started.

OCR Pipeline

The general OCR pipeline is shown below.

OCR Pipeline

As you might have noticed, this is almost similar to how we humans recognize the text. For instance, given an image containing text, we first try to locate the text and then recognize it. This is done so fast by our eye and brain combo that we hardly even notice it.

Now, let’s try to understand each pipeline component in detail, although, it’s pretty clear from their names. Let’s take the following image as an example and see what happens at each component.

Test image for OCR

Image Pre-processing

If you have ever done any computer vision task, then you must know how important this pre-processing step is. This simply means making the image more suitable for further tasks. For instance, the input image may be corrupted with noise or is skewed or rotated. In any of these cases, the next pipeline components may give erroneous results and all your hard work goes in vain. Thus, it is always necessary to pre-process the image to remove such deformities.

As an example, I’ve corrupted the below image with some salt and pepper noise and also added some rotation. If this image is passed as it is, this will give erroneous results in further steps. So, before passing we need to correct it for noise and rotation. This corrected image is shown on the right. Don’t worry, we will discuss in detail how this correction is done.

OCR image pre-processing

Text Detection

Text detection, as clear from the name, simply means finding the regions in the image where text can be present. This is clearly illustrated below. See how the green color bounding boxes are drawn around the detected text regions.

Text detection

Text Detection has been an active research topic in computer vision. Most of the text detection methods developed so far can be divided into conventional (e.g. MSER) and deep-learning based (e.g. EAST, CTPN, etc.). Don’t worry, if you have never heard about these. We will be covering everything in detail in this series.

Text Recognition

In the previous step, we segmented out the text regions. Now, we will recognize what text is present in those segments. This is known as Text Recognition. So, what we will do is, pass each segment one-by-one to our text recognition model that will output the recognized text. Also, we keep track of each segment's bounding box coordinates. This will be helpful while we do restructuring.

In general, this step outputs a text file that contains each segment's bounding box coordinates along with the recognized text. See the below image (right) that contains 3 columns, i.e. the segment name, coordinates, and the recognized text.

Text recognition OCR

Similar to text detection, this has also been an active research topic in computer vision. Several approaches have been developed for text recognition. In this series, we will be focussing mainly on the deep-learning based approaches which can be further divided into CTC-based and Attention-based. Again, don’t worry if you haven’t heard about these terms. We will be discussing these in detail in this series.

Restructuring

In the last step, we got the recognized text along with its position in the input image. Now, it’s time to restructure it. Restructuring simply means placing the text (according to the coordinates) similar to how it was in the input image. Simply iterate over each bounding box coordinate and put the recognized text. Take a look at the below image. Compare the structure of both the restructured and the original image. Both look almost similar.

Restructuring OCR

Most of you might be wondering why we need to do this, or what the use of restructuring is. So, let's take a simple example to understand this. Suppose we want to extract the name from the below image.

To do this, we can simply tell the computer to extract the words following the word "Name:". This can easily be done using a regex or any NLP technique, as in the small example below. But what if you haven't restructured the text? In that case, this would become cumbersome, as it would involve iterating over the coordinates: first finding the coordinates of the word "Name:", then finding the coordinates of the next word lying on the same line, and then extracting the corresponding word. And if the name contains 2 or 3 words, this would take even more effort.
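For instance, once the text is restructured, the extraction is a one-liner with a regex (the sample text below is made up):

```python
import re

# Restructured OCR output (made-up example text)
restructured_text = """Name: John A. Smith
Date of Birth: 01/02/1990
Address: 221B Baker Street"""

# Grab everything after "Name:" up to the end of that line
match = re.search(r'Name:\s*(.+)', restructured_text)
if match:
    print(match.group(1))   # -> John A. Smith
```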

Hope you understand this, but if not, no worries, this will become more clear when we will discuss this in detail later.

So, this completes the OCR pipeline. Now, you can do anything with the extracted text. You can search, edit, translate it, or even convert it to speech. From the next blog, we will start discussing each of these pipeline components in detail. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.