
Optical Character Recognition Pipeline: Text Detection and Segmentation Part-II

In the last blog, we saw what text detection is and the different types of algorithms used to perform it. In this blog, we will learn more about text detection algorithms.

Efficient and Accurate Scene Text Detector (EAST)

EAST is a deep learning text detection method with two stages: a fully convolutional network (FCN) and a non-max suppression (NMS) merging stage. The FCN is a U-shaped network that directly produces text regions at either the word level or the text-line level. Here is the diagram of the FCN used in the algorithm.

The U-shaped FCN takes features from different layers of PVANet and then merges them to produce the outputs. The yellow boxes are different layers of PVANet and the green boxes are the merging layers for the features extracted from PVANet. The reason behind this merging branch is to produce outputs for both small and large word regions: low-level features help in finding small word regions, while high-level features help in finding large word regions. The network outputs geometries either in the form of RBOX (5 values: the 4 distances from a pixel to the top, right, bottom and left edges of the box, plus a rotation angle) or QUAD (the coordinates of the 4 corner points of a quadrilateral, i.e. 8 values), along with a score map giving the confidence that text is present.

In the second stage, NMS merging, thresholding is used to filter out overlapping geometries and produce the most accurate geometries for the text regions.

To use it in our OCR pipeline, we can use its GitHub repository. To make it workable, follow these steps:

  1. Clone the repository into your directory: git clone https://github.com/argman/EAST.git
  2. Download its pretrained model and put it inside the EAST directory.
  3. Before testing, you need to compile the lanms module.
  4. To test the model, go to your EAST directory and run the evaluation command given in the repository's README from a terminal.

You can also train this model on your own dataset, either from scratch or starting from the pre-trained model provided above. To train it, you need to provide the dataset path; the dataset should consist of training images with corresponding text files containing the coordinates of the text present in each image.

Connectionist Text Proposal Network (CTPN)

CTPN is a deep learning method that accurately predicts text lines in a natural image. It is an end-to-end trainable model consisting of both CNN and RNN layers. In general, the length of a text line varies considerably. To solve this problem, the authors of the paper treat a text line as a sequence of fine-scale text proposals, where each proposal has a fixed width of 16 pixels and a varying height. Let's see the image below.

In the above figure, each vertical rectangular box is a fine-scale text proposal. The model's architecture is shown in the figure below:

The input image is fed to a VGG-16 model, and the features from its conv5 layer (the last layer before the fully connected layers) are taken. A 3x3 sliding window is moved over these VGG-16 features, and the resulting feature vectors are fed sequentially to an RNN consisting of a 256D bi-directional LSTM. This LSTM layer is connected to a 512D fully connected layer, which then produces the outputs.

Now let's see how the output is generated by this algorithm.

  • This algorithm uses anchor boxes to detect text of different heights. Say we use k anchor boxes; then the output consists of three main parts.
  • The first is 2k vertical coordinates: for each anchor box, the y coordinate of its center and the height of the box.
  • The second is 2k text/non-text scores, and
  • the third is k side-refinement offsets.
  • The authors used 10 anchor boxes with heights varying between 11 and 273 pixels. The horizontal location is fixed, and only the vertical coordinates are predicted.
  • On the basis of the text/non-text scores, sequential text proposals are merged to form text lines. The side-refinement offsets are then used to refine the two endpoints of a text line.

To use it in our OCR pipeline, we can use its GitHub repository. To make it workable, follow these steps:

  • Clone the repository into your directory: git clone https://github.com/eragonruan/text-detection-ctpn.git
  • Go to the "text-detection-ctpn-banjin-dev" directory
  • Run the setup commands from the repository's README one by one to build the required utilities
  • Download the pretrained checkpoint from the Google Drive link given in the repository
  • Extract it and put checkpoints_mlt/ in text-detection-ctpn/
  • Now put your image files in data/demo; the output will be written to data/res
  • Finally, run the demo script from the repository to check the outputs

You can also train this model using your own data; just follow the steps provided in the GitHub repository.

A Single Shot Oriented Scene Text Detector (TextBoxes++)

It is an end-to-end trainable, fast scene text detector that can even detect oriented text in an image. It does not require any post-processing except non-maximum suppression. The basic idea is taken from the object detection algorithm SSD (Single Shot Detector). SSD aims to detect general objects in an image but performs poorly on text. TextBoxes++ was introduced to improve on this for text datasets. Let's see the model's architecture:

The first 13 layers are from the VGG-16 model. The two fully connected layers of VGG-16 are then converted into convolution layers, which are followed by 8 more convolution layers. Finally, 6 Text-Box layers are connected to 6 different intermediate convolution layers of the model. These 6 Text-Box layers are the output layers, and at test time non-maximum suppression is applied to merge their results and keep the best predictions.

Text-Box layers are the key component of TextBoxes++. They are also convolutional layers, and they predict both the presence of text and the bounding box coordinates, including both oriented bounding boxes and minimum horizontal bounding boxes. Text-Box layers are designed to tackle the problem of variable-length words.

You can find its GitHub repository here. In the repository, they have also implemented CRNN (Convolutional Recurrent Neural Network) to recognize the text detected by TextBoxes++. To implement it, you can follow the directions in the GitHub repository. Here are some results of TextBoxes++.


That's enough about text detection; in the next blog, we will learn about text recognition. Hope you enjoy reading.

Next Blog: Optical Character Recognition Pipeline: Text Recognition

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Creating a CRNN model to recognize text in an image (Part-1)

In the earlier blogs, we learned about the various stages of an optical character recognition pipeline. In this blog, we will create a convolutional recurrent neural network (CRNN) with CTC (Connectionist Temporal Classification) loss to implement our recognition model.

We will use the following steps to create our text recognition model.

  • Collecting Dataset
  • Preprocessing Data
  • Creating Network Architecture
  • Defining Loss function
  • Training model
  • Decoding outputs from prediction

Dataset

In this blog, we will use the data provided by the Visual Geometry Group. This is a huge dataset, about 10 GB of images in total. Here I have used only 135000 images for the training set and 15000 images for the validation set. This data contains text image segments which look like the images shown below:

To download the dataset, you can either download it directly from this link or use commands along the lines of the following to download and unzip the data.
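The original commands are not shown here; a minimal Python sketch that downloads and extracts the archive could look like this (the URL is assumed to be the MJSynth archive listed on the VGG data page and should be checked there before use):

```python
# A minimal sketch, assuming the MJSynth archive URL from the VGG data page
# (check https://www.robots.ox.ac.uk/~vgg/data/text/ for the current link).
import urllib.request
import tarfile

DATA_URL = 'https://www.robots.ox.ac.uk/~vgg/data/text/mjsynth.tar.gz'  # ~10 GB archive

urllib.request.urlretrieve(DATA_URL, 'mjsynth.tar.gz')   # download the archive
with tarfile.open('mjsynth.tar.gz', 'r:gz') as tar:      # extract it
    tar.extractall('data')
```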

Preprocessing

Now that we have our dataset, we need to apply some preprocessing to make it acceptable to our model. We need to preprocess both the input images and the output labels. To preprocess our input images we will do the following:

  • Read the image and convert it into a gray-scale image
  • Make each image of size (128, 32) (width, height) by using padding
  • Expand the image dimensions to (32, 128, 1) to make it compatible with the input shape of our architecture
  • Normalize the image pixel values by dividing them by 255.

To preprocess the output labels, do the following:

  • Read the text from the name of the image, as the image name contains the text written inside the image.
  • Encode each character of a word into a numerical value using a mapping (e.g. 'a':0, 'b':1, …, 'z':25, and so on for the remaining characters). Say we have the word 'abab'; then our encoded label would be [0,1,0,1].
  • Compute the maximum length over all words and pad every output label to the same size as this maximum length. This is done to make the labels compatible with the output shape of our RNN architecture.

In the preprocessing step we also need to create two other lists: one for the label lengths and the other for the input lengths to our RNN. These two lists are important for our CTC loss (we will see why later). The label length is the length of each output text label, and the input length is the same for every input to the LSTM layer, which is 31 in our architecture.

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Following is the code for our preprocessing step:
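The original listing is not reproduced here; the sketch below follows the steps described above, with the folder name, the character list and the file-name parsing being assumptions rather than the blog's exact code:

```python
import os
import string

import cv2
import numpy as np
from keras.preprocessing.sequence import pad_sequences

char_list = string.ascii_letters + string.digits          # 62 characters; the blank is added by CTC

def encode_to_labels(txt):
    # map every character of the word to its index in char_list
    return [char_list.index(ch) for ch in txt]

training_img, training_txt = [], []
train_input_length, train_label_length = [], []
max_label_len = 0

for fname in os.listdir('train_images'):                  # hypothetical folder of word images
    img = cv2.imread(os.path.join('train_images', fname), cv2.IMREAD_GRAYSCALE)

    # bring every image to width 128 and height 32 (pad small images, resize large ones)
    h, w = img.shape
    if h > 32 or w > 128:
        img = cv2.resize(img, (128, 32))
    else:
        img = np.pad(img, ((0, 32 - h), (0, 128 - w)), 'constant', constant_values=255)

    img = np.expand_dims(img, axis=2) / 255.0              # shape (32, 128, 1), normalized

    # MJSynth file names look like "<id>_<WORD>_<tag>.jpg", so the label sits between the underscores
    word = fname.split('_')[1]
    max_label_len = max(max_label_len, len(word))

    training_img.append(img)
    training_txt.append(encode_to_labels(word))
    train_input_length.append(31)                          # time steps fed into the LSTM
    train_label_length.append(len(word))

# pad every encoded label to the maximum label length
train_padded_txt = pad_sequences(training_txt, maxlen=max_label_len,
                                 padding='post', value=len(char_list))
training_img = np.array(training_img)
train_input_length = np.array(train_input_length)
train_label_length = np.array(train_label_length)
```

The validation images and labels are prepared in exactly the same way.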

By now you should have a good idea of how the training and validation data for our recognition model are generated. In the next blog, we will use this data to train and test our neural network.

Next Blog: Creating a CRNN model to recognize text in an image (Part-2)

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Creating a CRNN model to recognize text in an image (Part-2)

In the previous blog, we saw how to create the training and validation datasets for our recognition model (download and preprocess). In this blog, we will create our model architecture and train it with the preprocessed data.

You can find full code here.

Model = CNN + RNN + CTC loss

Our model consists of three parts:

  1. The convolutional neural network to extract features from the image
  2. Recurrent neural network to predict sequential output per time-step
  3. The CTC loss function, which acts as the transcription layer and converts the per-time-step predictions into the final output.

Model Architecture

Here is the model architecture that we used:

This network architecture is inspired by this paper. Let’s see the steps that we used to create the architecture:

  1. The input to our architecture is an image of height 32 and width 128.
  2. Here we used seven convolution layers, of which 6 have kernel size (3,3) and the last one has size (2,2). The number of filters is increased from 64 to 512 layer by layer.
  3. Two max-pooling layers of size (2,2) are added, and then two max-pooling layers of size (2,1) are added to retain more width in the feature map so that longer texts can be predicted.
  4. Also, we used batch normalization layers after the fifth and sixth convolution layers, which accelerates the training process.
  5. Then we used a lambda function to squeeze the output from the conv layers and make it compatible with the LSTM layer.
  6. Then we used two bidirectional LSTM layers, each with 128 units. This RNN part gives an output of size (batch_size, 31, 63), where 63 is the total number of output classes including the blank character.

Let’s see the code for this architecture:
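The original listing is not shown here; the sketch below follows the description above (seven conv layers, pooling of sizes (2,2) and (2,1), batch normalization after the fifth and sixth conv layers, a squeeze, two bidirectional LSTMs and a softmax over 63 classes). The layer names and the char_list definition are assumptions.

```python
import string

from keras.layers import (Input, Conv2D, MaxPool2D, BatchNormalization,
                          Lambda, Bidirectional, LSTM, Dense)
from keras.models import Model
import keras.backend as K

char_list = string.ascii_letters + string.digits            # same 62-character list as in Part 1

# input: grayscale word images of height 32 and width 128
inputs = Input(shape=(32, 128, 1))

conv_1 = Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)
pool_1 = MaxPool2D(pool_size=(2, 2), strides=2)(conv_1)

conv_2 = Conv2D(128, (3, 3), activation='relu', padding='same')(pool_1)
pool_2 = MaxPool2D(pool_size=(2, 2), strides=2)(conv_2)

conv_3 = Conv2D(256, (3, 3), activation='relu', padding='same')(pool_2)
conv_4 = Conv2D(256, (3, 3), activation='relu', padding='same')(conv_3)
pool_4 = MaxPool2D(pool_size=(2, 1))(conv_4)                 # (2,1) pooling keeps the width large

conv_5 = Conv2D(512, (3, 3), activation='relu', padding='same')(pool_4)
batch_norm_5 = BatchNormalization()(conv_5)

conv_6 = Conv2D(512, (3, 3), activation='relu', padding='same')(batch_norm_5)
batch_norm_6 = BatchNormalization()(conv_6)
pool_6 = MaxPool2D(pool_size=(2, 1))(batch_norm_6)

conv_7 = Conv2D(512, (2, 2), activation='relu')(pool_6)      # output shape (1, 31, 512)

# squeeze out the height dimension so the tensor becomes a 31-step sequence
squeezed = Lambda(lambda x: K.squeeze(x, 1))(conv_7)

blstm_1 = Bidirectional(LSTM(128, return_sequences=True, dropout=0.2))(squeezed)
blstm_2 = Bidirectional(LSTM(128, return_sequences=True, dropout=0.2))(blstm_1)

# 62 characters + 1 CTC blank = 63 output classes
outputs = Dense(len(char_list) + 1, activation='softmax')(blstm_2)

# model used at test time (image in, per-time-step class probabilities out)
act_model = Model(inputs, outputs)
act_model.summary()
```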

Loss Function

Now that we have prepared the model architecture, the next thing is to choose a loss function. For this text recognition problem, we will use the CTC loss function.

CTC loss is very helpful in text recognition problems. It saves us from annotating every time step and gets rid of the problem of a single character spanning multiple time steps, which would otherwise need further processing. If you want to know more about CTC (Connectionist Temporal Classification), please follow this blog.

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

A CTC loss function requires four arguments to compute the loss: the predicted outputs, the ground-truth labels, the input sequence lengths to the LSTM and the ground-truth label lengths. To get these, we need to create a custom loss function and then pass it to the model. To make it compatible with our model, we will create a model which takes these four inputs and outputs the loss. This model will be used for training, while for testing we will use the model that we created earlier, "act_model". Let's see the code:
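The original listing is not shown; a sketch of this wrapper model, assuming the max_label_len variable from the preprocessing step and the inputs/outputs tensors from the architecture above, could be:

```python
from keras.layers import Input, Lambda
from keras.models import Model
import keras.backend as K

# extra inputs needed only to compute the CTC loss during training
labels = Input(name='the_labels', shape=[max_label_len], dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # Keras' built-in batched CTC cost
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')(
    [outputs, labels, input_length, label_length])

# training model: four inputs in, CTC loss out
model = Model(inputs=[inputs, labels, input_length, label_length], outputs=loss_out)
```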

Compile and Train the Model

To train the model we will use the Adam optimizer. Also, we can use Keras's callback functionality to save the weights of the best model on the basis of validation loss.

In model.compile(), you can see that I have only taken y_pred and neglected y_true. This is because I have already taken labels as input to the model earlier.

Now train your model on 135000 training images and 15000 validation images.
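A sketch of the compile and fit calls described above; the array names are the ones assumed in the Part 1 sketch (with the validation arrays prepared the same way), and the batch size and number of epochs are placeholders:

```python
import numpy as np
from keras.callbacks import ModelCheckpoint

# the loss is already computed inside the model, so the loss function just passes y_pred through
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer='adam')

# save the weights of the best model on the basis of validation loss
checkpoint = ModelCheckpoint(filepath='best_model.hdf5', monitor='val_loss',
                             verbose=1, save_best_only=True, mode='min')

model.fit(x=[training_img, train_padded_txt, train_input_length, train_label_length],
          y=np.zeros(len(training_img)),            # dummy targets; the loss ignores them
          batch_size=256, epochs=10,
          validation_data=([valid_img, valid_padded_txt, valid_input_length,
                            valid_label_length], np.zeros(len(valid_img))),
          callbacks=[checkpoint])
```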

Test the model

Our model is now trained with 135000 images, so it is time to test it. We cannot use our training model because it also requires the labels as input, and at test time we do not have labels. So to test the model we will use "act_model", which we created earlier and which takes only one input: the test images.

As our model predicts a probability for each class at each time step, we need a transcription function to convert these probabilities into actual text. Here we will use the CTC decoder to get the output text. Let's see the code:
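A sketch of the prediction and greedy CTC decoding step; the test_img array, the checkpoint file name and char_list follow the assumptions made above:

```python
import numpy as np
import keras.backend as K

# load the best weights saved during training into the prediction model
act_model.load_weights('best_model.hdf5')

prediction = act_model.predict(test_img)             # shape (num_images, 31, 63)

# greedy CTC decoding: collapse repeats and remove blanks
decoded = K.get_value(K.ctc_decode(prediction,
                                   input_length=np.ones(prediction.shape[0]) * prediction.shape[1],
                                   greedy=True)[0][0])

for seq in decoded:
    text = ''.join(char_list[int(p)] for p in seq if int(p) != -1)   # -1 marks padding
    print(text)
```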

Here are some results from the trained model:

Pretty good Yeah! Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline: Generating Dataset

The first step to create any deep learning model is to generate the dataset. In continuation of our optical character recognition pipeline, in this blog, we will see how we can get our training and test data.

In our OCR pipeline we first need to get data for both segmentation and recognition (text). For the segmentation part, the data will consist of images and corresponding files containing the coordinates of the words present in each image. Let's see an example.

For the recognition part, the data will consist of images and their corresponding text files. Here each segmented image will contain a single word.

Image and Text

Open Source Dataset:

There are some open-source datasets available for our pipeline. For the segmentation part, here are some useful open-source datasets.

Now let's see some of the open-source datasets for text recognition (images and their corresponding texts).

Synthetic Data:

In some cases, training your OCR model with synthetic data can also be useful. You can create your own synthetic data using a Python script, and you can add geometric transformations to simulate real-world distortions in the data. As an example, here is a script to generate synthetic data for text recognition:
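The original script is not reproduced here. Below is a minimal PIL-based sketch in the same spirit; the word list, font files and output folder are placeholders, and the font size of 14 matches the description that follows:

```python
import os
from PIL import Image, ImageDraw, ImageFont

words = ['hello', 'world', 'python', 'vision', 'learner']     # hypothetical word list
fonts = ['arial.ttf', 'times.ttf']                            # hypothetical font files
os.makedirs('synthetic_data', exist_ok=True)

for i, word in enumerate(words):
    font = ImageFont.truetype(fonts[i % len(fonts)], 14)      # font size 14
    img = Image.new('L', (128, 32), color=255)                # white 128x32 canvas
    ImageDraw.Draw(img).text((5, 8), word, fill=0, font=font) # draw the word in black
    img.save(os.path.join('synthetic_data', f'{i}_{word}.jpg'))
    # corresponding ground-truth text file
    with open(os.path.join('synthetic_data', f'{i}_{word}.txt'), 'w') as f:
        f.write(word)
```

Geometric transformations (rotation, perspective warps, blur) can then be applied on top of these images to simulate real-world distortion.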

In the above code, I have generated images of English words and their corresponding text files, using different font types with a font size of 14. The segmented images will look like the ones below:

Five Segmented Images generated from above code

Annotation Tools and Manual Data:

Another way to create a text segmentation dataset is to use annotation tools. In this case, you collect images manually or from the internet, and then you manually annotate the text in the images (bounding boxes). Annotation tools like labelImg work well for this.

That's all for generating the dataset. In the next blog, we will see the image preprocessing steps to apply to these datasets. Hope you enjoy reading.

Next Blog: Optical Character Recognition Pipeline: Image Preprocessing

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Argparse and command line arguments in Python

In this blog, we will learn what command line arguments are, why we use them and how to use them. We will also discuss argparse, the recommended command-line parsing module in the Python standard library.

Let’s start with a very simple example that calculates the area of a rectangle. Let’s name it rect.py.
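The original listing is not shown; a minimal version consistent with the description (the length and width hard-coded as 2) could be:

```python
# rect.py
def calculate_area(length, width):
    return length * width

print('Area of the rectangle:', calculate_area(2, 2))
```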

Running the above script, we get the area (4) printed for the hard-coded length and width.

Problem:

Now, if we wanted to calculate the area of a rectangle with a different length and width, we would have to change the function arguments. The above code is small, so one might not mind changing 2 numbers. But the problem arises when the code is large and one needs to change values at multiple locations.

Solution:

This problem can be solved if we provide the arguments at runtime. For example, what we want is to be able to run something like python rect.py --length 2 --width 2 from the command line.

We want the 2's to somehow be assigned to the length and width variables through the command line so that we don't need to change the script every time. This can be done easily using the Python argparse module.

Before discussing the argparse module, I hope you are now able to answer the 2 questions: what are command line arguments, and why do we use them?

So, the parameters/arguments given to a program at runtime (for example, the 2's passed on the command line above) are known as command-line arguments.

Using command line arguments we can give our program different inputs or additional information without changing the code.

Now let’s see how to parse and access the command line argument. For this, we will use the argparse module.

Let's again take the above rect.py and add a few lines (Line 3 to Line 6) to it, as shown below.
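The original listing is not shown; the sketch below is laid out so that the argparse calls fall on Lines 3 to 6, matching the references that follow, and the dest names Length and Width are taken from the later discussion of args.Length and args.Width:

```python
import argparse

ap = argparse.ArgumentParser(description='Calculate the area of a rectangle')
ap.add_argument('-l', '--length', dest='Length', required=True, type=int, help='length of the rectangle')
ap.add_argument('-w', '--width', dest='Width', required=True, type=int, help='width of the rectangle')
args = ap.parse_args()

def calculate_area(length, width):
    return length * width

# run it, for example, as: python rect.py --length 2 --width 2
print('Area of the rectangle:', calculate_area(args.Length, args.Width))
```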

Before running any program from the command line, the best practice is to first learn about the program. This can be done using the -h or --help flag as shown below.

So, using the -h or --help flag, we get an overview of the program without even looking into the code.

So, let's discuss what the code lines we added above actually mean, and how to parse and assign the command-line arguments to length and width.

On Line 3, we instantiate the ArgumentParser object as ap. We can think of it as a container to hold the arguments. The description tells the main purpose of the program (it is optional).

Then on Lines 4 and 5, we add the optional arguments length and width. We specify both the shorthand (-l) and longhand (--length) versions. To run the program from the command line, specify the flag (either shorthand or longhand, or a mix) followed by the value, as shown below.

The required=True makes the optional argument compulsory. For example, if we run rect.py without the width argument, it will throw an error as shown below.

If we omit required=True, the value None will be assigned to that flag. For example, omit required=True from the width argument (Line 5) and re-run rect.py without the width argument; None will be assigned to the width flag.

By default, the arguments are treated as strings, so we explicitly defined the type as int for our problem. The help string gives additional information about the argument.

On Line 6, we parse the command line arguments using the parse_args() method. This returns an object with the attributes specified in the add_argument() calls. The name of each attribute is given by the dest argument of add_argument(). For example, in rect.py, ap.parse_args() returns an object with the 2 attributes Length and Width.

To access the value of a command line argument, we simply use args.attribute_name, where attribute_name is the name specified in the dest argument, for example args.Length and args.Width in the rect.py example shown above.

So, this is how you can use the argparse module for parsing command line arguments. I hope you now will feel comfortable when dealing with the command line arguments.

This blog contains only a gentle introduction to the argparse module. If you want to explore more, Python documentation is the way to go.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Histogram Matching (Specification)

In the previous blog, we discussed histogram equalization, which tries to produce an output image that has a uniform histogram. This approach is good, but in some cases it does not work well. One such case is a skewed image histogram, i.e. a large concentration of pixels at either end of the greyscale.

One reasonable approach is to manually specify the transformation function that preserves the general shape of the original histogram but has a smoother transition of intensity levels in the skewed areas.

So, in this blog, we will learn how to transform an image so that its histogram matches a specified histogram. This is also known as histogram matching or histogram specification.

Histogram Equalization is a special case of histogram matching where the specified histogram is uniformly distributed.

First let’s understand the main idea behind histogram matching.

We will first equalize both the original and the specified histogram using the histogram equalization method. Since this transformation function is invertible, by inverting it we can get the mapping from the original to the specified histogram. The whole operation is shown in the image below.

For example, suppose the pixel value 10 in the original image gets mapped to 20 in the equalized image. Then we will see what value in Specified image gets mapped to 20 in the equalized image and let’s say that this value is 28. So, we can say that 10 in the original image gets mapped to 28 in the specified image.

Most of you might be wondering why both the original and the specified histogram converge to the same uniform histogram on equalization.

This is true only if we assume continuous intensity values. But in reality, intensity values are discrete, so the original and specified histograms may not map to exactly the same histogram on equalization. That's why histogram matching is not able to perfectly match the specified histogram.

Let’s take an example where we want to match the original image with the specified image, both histograms are shown below.

Here, I am taking the original image from the histogram equalization blog. All the steps of equalization are explained in that blog; here, I will only show the final table.

Original Image Histogram Equalization

Specified Image Histogram Equalization

After equalizing both images, we need to perform a mapping from the original, through the equalized values, to the specified image. For that, we only need the round columns of the original and the specified image, as shown below.

Pick the values one by one from the round column of the original image, find each in the round column of the specified image and note down the index. For example, for 3 in the round original column we have 3 in the round specified column (with index 1), so we map it to 1.

If the value doesn't exist, find the index of its nearest value. For example, for 0 in the round original column, 1 is the nearest value in the round specified column (with index 0), so we map it to 0.

If multiple equally near values exist, pick the one that is greater than the value. For example, for 2 in the round original column there are 2 closest values in the round specified column, 1 and 3, so we pick 3 (with index 1) and map it to 1.

After obtaining the Map column, replace the values in the original image with the map values. This is the final result.

The matched histogram (shown on the left) approximately matches the specified histogram (shown on the right), as shown below

Now, let’s see how to perform Histogram matching using OpenCV-Python

Code
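The original listing is not shown; a minimal NumPy/OpenCV sketch of the mapping described above (the image file names are placeholders):

```python
import cv2
import numpy as np

def hist_match(original, specified):
    # histograms and cumulative distribution functions over the full 0-255 range
    orig_hist, _ = np.histogram(original.ravel(), 256, [0, 256])
    spec_hist, _ = np.histogram(specified.ravel(), 256, [0, 256])
    orig_cdf = np.cumsum(orig_hist) / original.size
    spec_cdf = np.cumsum(spec_hist) / specified.size

    # equalization mappings (the "round" columns): round(cdf * 255)
    orig_map = np.around(orig_cdf * 255)
    spec_map = np.around(spec_cdf * 255)

    # for every original level, find the specified level with the closest equalized value;
    # on ties, prefer the larger level, as described above
    lookup = np.zeros(256, dtype=np.uint8)
    for level in range(256):
        diff = np.abs(spec_map - orig_map[level])
        lookup[level] = np.where(diff == diff.min())[0][-1]

    return lookup[original]          # apply the mapping to every pixel

original = cv2.imread('original.jpg', 0)     # hypothetical file names, read as grayscale
specified = cv2.imread('specified.jpg', 0)
matched = hist_match(original, specified)
cv2.imwrite('matched.jpg', matched)
```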

Note: Specified image can have different dimensions as compared to the original image.

The output looks like this

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Floating Point Arithmetic Limitations in Python

Before moving forward, just to clarify: the floating point arithmetic issue is not particular to Python. Almost all languages like C, C++, Java, etc. often won't display the exact decimal number you expect.

First we will discuss the limitations of floating point arithmetic. Then we will see why this happens and what we can do to compensate for it.

I hope you all have some idea of different base representations like base 10, base 2, etc. Let's get started.

To understand the problem, let's look at the expressions below.
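The original expressions are not reproduced here; a pair of interpreter expressions consistent with the discussion that follows is:

```python
>>> 0.1 + 0.1 + 0.1 == 0.3
False
>>> 0.1 + 0.1 + 0.1 + 0.1 == 0.4
True
```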

Let’s take the first expression and understand why this happens

We know that computers operate in binary (base-2 fractions). For example, the decimal fraction 0.125 is represented as 0.001 in binary.

There is no problem with integers as we can exactly represent these in binary. But unfortunately, most decimal fractions can’t be represented exactly as binary fractions.

Whenever we divide by a number that has a prime factor which isn't a factor of the base, we get a non-terminating representation.

Thus for base 10, any x=p/q where q has prime factors other than 2 or 5 will have a non-terminating representation.

The simplest example is representing 1/3 in decimal (0.3, 0.33, …, 0.33333…). No matter how many digits (or 3's) we use, we still can't represent it exactly.

Similarly, 0.1 is the simplest and most commonly used example of a decimal number that can't be represented exactly in binary. In base 2, 1/10 is an infinitely repeating fraction in which the block 0011 repeats forever, as shown below

0.0001100110011001100110011001100110011001100110011…

So, Python strives to convert 0.1 to the closest fraction using 53 bits of precision (IEEE-754 double precision).

Thus, when we write the decimal 0.1, it is stored in the computer not exactly but as a close approximation. And since 0.1 is not exactly 1/10, summing three values of 0.1 does not yield exactly 0.3 either: 0.1 + 0.1 + 0.1 evaluates to 0.30000000000000004, so the first comparison above returns False.

But this seems to contradict our second expression mentioned above, where 0.1 + 0.1 + 0.1 + 0.1 == 0.4 evaluates to True.

This is again down to IEEE-754 double precision, which uses 53 bits of precision. To fit each intermediate result into this 53-bit limit, bits have to be dropped off the end, which causes rounding; sometimes the accumulated error is visible in the comparison, and sometimes the rounding happens to land exactly on the value we expect.

For use cases which require exact decimal representation, you can use the decimal or fractions module; SciPy or even simple rounding can also help.
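For example (a small illustration, not from the original post):

```python
from decimal import Decimal
from fractions import Fraction

print(Decimal('0.1') + Decimal('0.1') + Decimal('0.1') == Decimal('0.3'))      # True
print(Fraction(1, 10) + Fraction(1, 10) + Fraction(1, 10) == Fraction(3, 10))  # True
print(round(0.1 + 0.1 + 0.1, 10) == round(0.3, 10))                            # True
```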

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Monitoring Training in Keras: Callbacks

Keras is a high-level API that can run on top of TensorFlow, CNTK and Theano. Keras is preferable because it is easy and fast to learn. In this blog we will learn about a set of functions called callbacks, which are used during training in Keras.

Callbacks provide some advantages over normal training in Keras. Here I will explain the important ones.

  • A callback can terminate training when a NaN loss occurs.
  • A callback can save the model after every epoch; you can also save only the best model.
  • Early stopping: a callback can stop training when a monitored quantity such as validation accuracy stops improving.

Terminate training when a NaN loss occurs

Let's see the code for terminating training when a NaN loss occurs:
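The original snippet is not shown; a minimal sketch using Keras' built-in TerminateOnNaN callback (the model and the data variables are assumed to be defined already):

```python
from keras.callbacks import TerminateOnNaN

nan_terminator = TerminateOnNaN()

# training stops as soon as the loss becomes NaN
model.fit(x_train, y_train,
          epochs=10,
          validation_data=(x_valid, y_valid),
          callbacks=[nan_terminator])
```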

Saving Model using Callbacks

To save the model after every epoch in Keras, we need to import ModelCheckpoint from keras.callbacks. Let's see the code below, which saves the model whenever the validation loss decreases.
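The original snippet is not shown; a sketch consistent with the parameter description that follows (model and data variables are assumed to exist):

```python
from keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(filepath='best_model.hdf5',   # where to save the checkpoint
                             monitor='val_loss',           # quantity to monitor
                             verbose=1,
                             save_best_only=True,          # keep only the best model
                             mode='min')                   # "best" means minimum val_loss

callbacks_list = [checkpoint]
model.fit(x_train, y_train,
          epochs=10,
          validation_data=(x_valid, y_valid),
          callbacks=callbacks_list)
```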

In the above code, we first create a ModelCheckpoint object by passing its required parameters.

  • filepath defines the path where the checkpoints will be saved. If you want to keep only the best model, pass a fixed file name such as "best_model.hdf5", which will overwrite the previously saved checkpoint.
  • monitor decides which quantity you want to monitor while training.
  • save_best_only saves the model only when the monitored quantity improves (here, when the validation loss decreases).
  • mode is one of {auto, min, max}; with max, the model is saved when the monitored quantity reaches a new maximum (use this for accuracy), and with min when it reaches a new minimum (use this for loss).
Then finally make a callbacks list and pass it into model.fit() with the callbacks parameter.

Early Stopping

Callbacks can stop training when a monitored quantity has stopped improving. Let's see how:
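A sketch using Keras' EarlyStopping callback with the parameters discussed below (the values are placeholders, and the model and data variables are assumed to exist):

```python
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss',
                           min_delta=0.001,   # smallest change that counts as an improvement
                           patience=3,        # stop after 3 epochs without improvement
                           mode='auto',
                           baseline=None)

model.fit(x_train, y_train,
          epochs=50,
          validation_data=(x_valid, y_valid),
          callbacks=[early_stop])
```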

  • min_delta: the minimum change in the monitored quantity that qualifies as an improvement.
  • patience: the number of epochs with no improvement after which training will be stopped.
  • mode: in auto mode, the direction (min or max) is inferred from the name of the monitored quantity.
  • baseline: a baseline value for the monitored quantity; training stops if the model shows no improvement over this baseline.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Set Camera Timer using OpenCV-Python

Most of you must have clicked a photograph with a timer. This feature sets a countdown before clicking the photograph. In this tutorial, we will do the same, i.e. create our own camera timer using OpenCV-Python. Sounds interesting, so let's get started.

The main idea is that whenever a particular key is pressed (Here, I have used ‘q’), the countdown will begin and a photo will be clicked and saved at the desired location. Otherwise the video will continue streaming.

Here, we will be using the cv2.putText() function for drawing the countdown on the video. This function has the following arguments.
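The argument list from the original post is not reproduced; here is a small illustration of the call, with each argument commented (the frame is just a blank image):

```python
import cv2
import numpy as np

frame = np.zeros((480, 640, 3), dtype=np.uint8)   # blank frame to draw on

cv2.putText(frame,                       # image to draw on
            '30',                        # text string (our countdown value)
            (250, 250),                  # bottom-left corner of the text
            cv2.FONT_HERSHEY_SIMPLEX,    # font face
            3,                           # font scale
            (0, 255, 255),               # colour (B, G, R)
            5,                           # thickness
            cv2.LINE_AA)                 # line type

cv2.imwrite('countdown_preview.jpg', frame)
```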

This function draws the text on the input image at the specified position. If the specified font is unable to render any character, it is replaced by a question mark.

Now let’s see how to do this

Steps:

  • Open the camera using cv2.VideoCapture()
  • Until the camera is open
    • Read the frame and display it using cv2.imshow()
    • Set the countdown. Here, I have taken this as 30, and I update the displayed number every 10 frames so that it is easily visible; otherwise, it would change too fast. You can set it to anything you wish
    • Set a key for the countdown to begin
    • If the key is pressed, show the countdown on the video using cv2.putText(). As the countdown finishes, save the frame at the desired location.
    • Otherwise, the video will continue streaming
  • On pressing ‘Esc’ the video will stop streaming.

Code:
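The original listing is not shown; the sketch below follows the steps above (a countdown value of 30 decremented every 10 frames, 'q' to start, 'Esc' to quit; the output file name is a placeholder):

```python
import cv2

cap = cv2.VideoCapture(0)

count = 30          # countdown value
frames = 0
counting = False

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    key = cv2.waitKey(1) & 0xFF
    if key == ord('q'):          # 'q' starts the countdown
        counting = True
    elif key == 27:              # 'Esc' stops streaming
        break

    if counting:
        frames += 1
        if frames % 10 == 0:     # decrement every 10 frames so the number is readable
            count -= 1
        if count == 0:           # countdown finished: save the (clean) frame
            cv2.imwrite('captured_photo.jpg', frame)
            counting, count, frames = False, 30, 0
        else:
            cv2.putText(frame, str(count), (250, 250),
                        cv2.FONT_HERSHEY_SIMPLEX, 3, (0, 255, 255), 5, cv2.LINE_AA)

    cv2.imshow('frame', frame)

cap.release()
cv2.destroyAllWindows()
```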

The output looks like this

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Downloading Video from YouTube using Python

YouTube is a rich source of videos, and downloading videos from YouTube is a little difficult. There are some extensions and downloaders available, but those are sometimes not recommended. In Python, however, it is quite easy to download a video from YouTube.

So, in this blog we will learn how to download videos from YouTube directly using video link.

  • To download videos from YouTube, we will use a Python library named 'pytube'. First you need to install 'pytube', for example with pip install pytube.
  • Now we have all the libraries required for this task, so import the required module from the 'pytube' library.
  • The only thing that you will need from the video is its link.
  • Then create an object of the imported 'YouTube' class using this video link.
  • From this object we get the streams, and from those we take the first result, which will be our required video.
  • Now that we have our video stream, downloading it is just a call to its download function, as shown in the sketch below.
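A minimal sketch of the steps above, assuming a recent pytube release (the video link is a placeholder):

```python
# install first with: pip install pytube
from pytube import YouTube

link = 'https://www.youtube.com/watch?v=XXXXXXXXXXX'   # hypothetical video link

yt = YouTube(link)              # create the YouTube object from the link
stream = yt.streams.first()     # take the first available stream
stream.download()               # saves the video in the current working directory
```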

The above code downloads the video into the current working directory, with the same name the video has on YouTube. To download it to a specific folder with a specific name, you need to pass two arguments in the above code, as shown below.
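With a recent pytube release, the two arguments are the output folder and the file name (both are placeholders here):

```python
stream.download(output_path='/path/to/folder', filename='my_video')
```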

That was a simple code to download videos from YouTube. Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.