
Optical Character Recognition Pipeline: Text Detection and Segmentation

One of the most important modules in an optical character recognition (OCR) pipeline is text detection and segmentation, also called text localization. In the previous blog, we saw various techniques for pre-processing the input image, which can help improve OCR accuracy. In this blog, we will learn how to localize text in an image so that we can crop out the text regions and feed them to our text recognition module.

What is text detection and segmentation?

It is the process of localizing every occurrence of text in the image as meaningful units such as characters, words, or text lines, and then cutting out a segment for each of these units.

Character-based detection first detects individual characters and then groups them into words. One way to do this is to locate characters by classifying extremal regions (such as MSER) and then group the detected characters using an exhaustive search.
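To make the grouping step concrete, here is a minimal sketch. Note that it swaps the exhaustive search mentioned above for a much simpler greedy rule: characters on one text line are merged into the same word while the horizontal gap between them stays small. The function name, the (x, y, width, height) box format, and the max_gap value are all assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), hypothetical format


def group_characters_into_words(char_boxes: List[Box], max_gap: int = 5) -> List[Box]:
    """Merge character boxes lying on one text line into word boxes.

    Characters are visited left to right; a new word starts whenever the
    horizontal gap to the current word exceeds max_gap pixels.
    """
    words: List[Box] = []
    for x, y, w, h in sorted(char_boxes):
        if words:
            wx, wy, ww, wh = words[-1]
            if x - (wx + ww) <= max_gap:
                # Grow the current word box so that it covers this character.
                left, top = min(wx, x), min(wy, y)
                right = max(wx + ww, x + w)
                bottom = max(wy + wh, y + h)
                words[-1] = (left, top, right - left, bottom - top)
                continue
        words.append((x, y, w, h))
    return words


chars = [(10, 5, 8, 12), (20, 5, 8, 12), (30, 6, 8, 11), (60, 5, 8, 12), (70, 5, 8, 12)]
print(group_characters_into_words(chars))
# -> [(10, 5, 28, 12), (60, 5, 18, 12)]
```

The gap threshold would normally depend on the font size, so in practice it is better to derive it from the character heights than to hard-code it.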

Word-based detection usually works in a similar fashion to object detection. You can use algorithms such as Faster R-CNN or YOLO for this.

Text-line based detection detects text lines and then breaks them into individual words.

There are basically two types of text images that are fed to the text recognition module. One is scanned documents and the other is natural scene text such as street signs, storefront text, etc.

Scanned Documents

Scanned documents generally contain hundreds or thousands of words. We can apply deep neural networks like Faster R-CNN and YOLO to localize the words in such documents. But sometimes these may not localize all the text in the image, because these algorithms are generally trained to detect a smaller number of objects per image. In that case, we need to apply some post-processing after the deep nets to pick up the remaining text.

Another method we can use for scanned documents is Maximally Stable Extremal Regions (MSER), which is available in OpenCV.

MSER is a method used for blob detection in images. Using it, we can get the coordinates of the text regions and then generate bounding boxes around each word in the image. Cropping these boxes gives us the input images required by our text recognition module.
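The step from region coordinates to bounding boxes can be sketched in a few lines of plain Python. In OpenCV, the regions would typically come from cv2.MSER_create().detectRegions(gray), which returns each region as a set of pixel coordinates. Here we assume the regions are already available as lists of (x, y) points; the helper names and the min_area filter are hypothetical.

```python
from typing import Iterable, List, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (x, y, width, height)


def bounding_box(region: Iterable[Point]) -> Box:
    """Return the smallest upright box that contains every point of a region."""
    xs, ys = zip(*region)
    x_min, y_min = min(xs), min(ys)
    return (x_min, y_min, max(xs) - x_min + 1, max(ys) - y_min + 1)


def regions_to_boxes(regions: Iterable[Iterable[Point]], min_area: int = 4) -> List[Box]:
    """Convert regions to boxes and drop tiny boxes that are probably noise."""
    boxes = [bounding_box(region) for region in regions]
    return [box for box in boxes if box[2] * box[3] >= min_area]


regions = [
    [(2, 3), (3, 3), (2, 4), (4, 5)],  # a small blob
    [(0, 0)],                          # a single pixel, filtered out
]
print(regions_to_boxes(regions))
# -> [(2, 3, 3, 3)]
```

Each returned box can then be used to crop the original image before passing the crop to the recognition module.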

Natural Scenes

Natural scenes contain fewer words but come with other problems such as distortion, occlusion, directional blur, cluttered backgrounds, etc. To overcome these problems, we need deep learning algorithms designed specifically for natural scene text that are robust to such distortions. Several robust open-source algorithms are available, such as EAST, CTPN, TextBoxes++, and PixelLink. These algorithms can also be used to localize text in scanned documents, but then you need some post-processing to detect all the text in the image, as mentioned earlier.

So far, we have seen what text segmentation is and the different algorithms for localizing text in an image. In the next blog, we will dive deep into these algorithms and figure out how to use them in our OCR pipeline.

Next Blog: Optical Character Recognition Pipeline: Text Detection and Segmentation Part-II

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline: Text Detection and Segmentation Part-II

Efficient and Accurate Scene Text Detector(EAST)

EAST is a deep learning text detection method with two stages: a fully convolutional network (FCN) and a non-max suppression (NMS) merging stage. The FCN is a U-shaped network that directly produces text regions at either the word level or the text-line level. Here is the diagram of the FCN used in the algorithm.

The U-shaped FCN takes features from different layers of PVANet and merges them to produce the outputs. The yellow boxes are the different layers of PVANet, and the green boxes are the merging layers for the features extracted from PVANet. The reason for this merging branch is to produce outputs for both small and large word regions: low-level features help in finding small word regions, and high-level features help in finding large word regions. The network outputs a score map, which gives the confidence that text is present at each location, together with geometries in one of two forms: RBOX (5 values: 4 distances from the pixel to the top, right, bottom, and left edges of the box, plus 1 rotation angle) or QUAD (8 values: the coordinate offsets from the pixel to the 4 corners of a quadrilateral).
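To see how an RBOX prediction turns into an actual box, here is a minimal sketch. It assumes the encoding described in the EAST paper, where each text pixel predicts its distances to the top, right, bottom, and left edges of the box plus one rotation angle. The function name and the rotation convention (a standard 2D rotation about the pixel, in image coordinates) are assumptions; the repository may use a different sign convention.

```python
import math
from typing import List, Tuple


def rbox_to_corners(x: float, y: float, d_top: float, d_right: float,
                    d_bottom: float, d_left: float,
                    angle: float) -> List[Tuple[float, float]]:
    """Decode one RBOX prediction at pixel (x, y) into four corner points.

    The corners are first laid out around the pixel as an upright box and
    then rotated about the pixel by `angle` (radians, assumed convention).
    Order: top-left, top-right, bottom-right, bottom-left.
    """
    offsets = [(-d_left, -d_top), (d_right, -d_top),
               (d_right, d_bottom), (-d_left, d_bottom)]
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [(x + dx * cos_a - dy * sin_a, y + dx * sin_a + dy * cos_a)
            for dx, dy in offsets]


# With a zero angle the result is simply the upright box around the pixel.
for corner in rbox_to_corners(50, 20, d_top=5, d_right=30, d_bottom=5, d_left=10, angle=0.0):
    print(corner)
```

Whatever the angle, the box keeps its width (d_left + d_right) and height (d_top + d_bottom), because the corners are only rotated, never scaled.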

In the second stage, NMS merging, the predicted geometries are thresholded and overlapping geometries are merged or discarded, leaving the most accurate geometry for each text region.
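The idea behind this stage can be illustrated with standard greedy NMS on upright boxes. This is a simplified sketch: EAST works on rotated geometries and its repository ships its own compiled implementation (lanms), so the code below only shows the general technique. The function names, the (x1, y1, x2, y2) box format, and both threshold values are assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two upright boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: Sequence[Box], scores: Sequence[float],
                        score_threshold: float = 0.8,
                        iou_threshold: float = 0.3) -> List[int]:
    """Return the indices of the boxes that survive thresholding and NMS."""
    # Step 1: drop low-confidence boxes, then visit the rest best-first.
    candidates = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                        key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in candidates:
        # Step 2: keep a box only if it does not overlap a better kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(10, 10, 110, 40), (12, 12, 112, 42), (200, 10, 300, 40), (10, 100, 60, 130)]
scores = [0.95, 0.90, 0.85, 0.40]
print(non_max_suppression(boxes, scores))
# -> [0, 2]
```

Here box 1 is suppressed because it overlaps the higher-scoring box 0, and box 3 is dropped by the score threshold before NMS even starts.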

To implement it in our OCR pipeline, we can use its GitHub repository. To get it working, follow these steps:

Clone the repository into your directory: git clone https://github.com/argman/EAST.git
Download its pretrained model and put it inside the EAST directory.
Before testing it, you need to compile lanms.
To test the model, go to your EAST directory and run the following command from the terminal:

python eval.py --test_data_path=/demo_images/ --gpu_list=0 --checkpoint_path=/east_icdar2015_resnet_v1_50_rbox/


You can also train this model on your own dataset, either from scratch or starting from the pre-trained model provided earlier. To train it, you need to provide the dataset path; the dataset should consist of training images, each with a corresponding text file containing the coordinates of the text in that image.

python multigpu_train.py --gpu_list=0 --input_size=512 --batch_size_per_gpu=14 --checkpoint_path=/tmp/east_icdar2015_resnet_v1_50_rbox/ --text_scale=512 --training_data_path=/data/ocr/icdar2015/ --geometry=RBOX --learning_rate=0.0001 --num_readers=24 --pretrained_model_path=/tmp/resnet_v1_50.ckpt
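Since the command above points at an ICDAR 2015 folder, the annotation text files are presumably in the ICDAR 2015 style, where each line holds the eight corner coordinates of a quadrilateral followed by the transcription. Under that assumption, here is a minimal sketch of a parser you could use to sanity-check your own annotation files before training. The function name is hypothetical and this is not the repository's loader.

```python
from typing import List, Tuple

Quad = List[Tuple[int, int]]


def parse_annotation_line(line: str) -> Tuple[Quad, str]:
    """Parse 'x1,y1,x2,y2,x3,y3,x4,y4,text' into four corners and the text.

    The transcription itself may contain commas, so everything after the
    eighth value is joined back together.
    """
    # Some annotation files start with a byte-order mark; strip it if present.
    parts = line.strip().lstrip("\ufeff").split(",")
    coords = [int(value) for value in parts[:8]]
    quad = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    text = ",".join(parts[8:])
    return quad, text


quad, text = parse_annotation_line("377,117,463,117,465,130,378,130,Genaxis Theatre")
print(quad)  # -> [(377, 117), (463, 117), (465, 130), (378, 130)]
print(text)  # -> Genaxis Theatre
```

In ICDAR 2015, regions whose transcription is "###" are marked as unreadable, so a training loader would usually treat them as "don't care" regions.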


Connectionist Text Proposal Network(CTPN)

CTPN is a deep learning method that accurately predicts text lines in a natural image. It is an end-to-end trainable model consisting of both CNN and RNN layers. In general, the length of a text line varies a lot. To handle this, the authors of the paper treat a text line as a sequence of fine-scale text proposals, where each proposal has a fixed width of 16 pixels and a varying height. Let’s see the image below.

In the above figure, each vertical rectangular box is a fine-scale text proposal. For the model’s architecture, see the figure below:

The input image is passed to a VGG-16 model, and the feature maps from its conv5 layer (the last convolutional layer, just before the fully connected layers) are taken. A 3x3 sliding window is moved over these feature maps, and the resulting windows are fed sequentially to an RNN, which consists of a 256D bi-directional LSTM. This LSTM layer is connected to a 512D fully connected layer, which is followed by the output layer.

Now let’s see how the outputs are generated.

This algorithm uses anchor boxes to detect text of different heights. Let’s say we use k anchor boxes; then the output consists of three main parts.
The first is 2k vertical coordinates: for each anchor box, the y coordinate of the box center and the box height.
The second is 2k text/non-text scores.
The third is k side-refinement offsets.
The authors use 10 anchor boxes with heights varying from 11 to 273 pixels. The horizontal location of each proposal is fixed, so only the vertical position and height are predicted.
On the basis of the text/non-text scores, sequential text proposals are merged to form text lines. The side-refinement offsets are used to refine the two end points of a text line.
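The merging of sequential proposals into text lines can be sketched as follows. This is a simplified greedy version, not the repository's implementation: proposals with a low text score are dropped, and a proposal joins an existing line when it is horizontally close to the last proposal of that line and overlaps it vertically. The (x, y_top, y_bottom, score) tuple format, the function names, and all threshold values are assumptions for illustration, and side refinement is left out.

```python
from typing import List, Tuple

Proposal = Tuple[int, int, int, float]  # (x, y_top, y_bottom, score)
Line = Tuple[int, int, int, int]        # (x1, y1, x2, y2)


def vertical_overlap(a: Proposal, b: Proposal) -> float:
    """Overlap of two vertical extents, as intersection over union."""
    inter = min(a[2], b[2]) - max(a[1], b[1])
    union = max(a[2], b[2]) - min(a[1], b[1])
    return max(0, inter) / union if union > 0 else 0.0


def build_text_lines(proposals: List[Proposal], min_score: float = 0.7,
                     max_gap: int = 50, min_overlap: float = 0.7,
                     width: int = 16) -> List[Line]:
    """Chain fixed-width proposals into text lines, scanning left to right."""
    lines: List[List[Proposal]] = []
    for p in sorted(q for q in proposals if q[3] >= min_score):
        for line in lines:
            last = line[-1]
            if p[0] - last[0] < max_gap and vertical_overlap(last, p) > min_overlap:
                line.append(p)
                break
        else:
            lines.append([p])  # no suitable line found, start a new one
    return [(line[0][0], min(q[1] for q in line),
             line[-1][0] + width, max(q[2] for q in line)) for line in lines]


proposals = [(0, 10, 30, 0.9), (16, 10, 30, 0.9), (32, 10, 30, 0.9), (48, 10, 30, 0.9),
             (0, 100, 125, 0.8), (16, 100, 125, 0.8), (64, 10, 30, 0.2)]
print(build_text_lines(proposals))
# -> [(0, 10, 64, 30), (0, 100, 32, 125)]
```

The last proposal has a score of only 0.2, so it is discarded and the first text line ends at x = 64 instead of x = 80.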

To implement it in our OCR pipeline, we can use its GitHub repository. To get it working, follow these steps:

Clone the repository into your directory: git clone https://github.com/eragonruan/text-detection-ctpn.git
Go to the “text-detection-ctpn-banjin-dev” directory.
Run the following commands one by one:

cd utils/bbox
chmod +x make.sh
./make.sh


Download the pretrained checkpoint from Google Drive.
Extract it and put checkpoints_mlt/ in text-detection-ctpn/.
Now put your test images in data/demo; the output will be saved in data/res.
Now run the following command to check the outputs:

python ./main/demo.py


You can also train this model on your own data; just follow the steps provided in the GitHub repository.

A Single Shot Oriented Scene Text Detector(TextBoxes++)

TextBoxes++ is a fast, end-to-end trainable scene text detector that can even detect oriented text in an image. It does not require any post-processing except non-maximum suppression. The basic idea is taken from the object detection algorithm SSD (Single Shot Detector). SSD is designed to detect general objects in an image, and it does not perform well on text. TextBoxes++ was introduced to improve on this for text datasets. Let’s see the model’s architecture:

The first 13 layers are from the VGG-16 model. The 2 fully connected layers of VGG-16 are converted into convolutional layers, which are followed by 8 more convolutional layers. Finally, 6 Text-Box layers are connected to 6 different intermediate convolutional layers of the model. These 6 Text-Box layers are the output layers, and at test time non-max suppression is applied to merge their results and keep the best predictions.

Text-Box layers are the key component of TextBoxes++. They are also convolutional layers, and they predict both the presence of text and the bounding box coordinates. The predictions include both oriented bounding boxes and minimum horizontal bounding boxes. Text-Box layers are designed to tackle the problem of variable-length words.
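The relation between the two kinds of boxes is easy to show in code. Given an oriented box as four corner points, the minimum horizontal bounding box is simply the smallest upright rectangle that contains all four corners. This is only an illustrative sketch; the function name and the point format are assumptions, not part of the TextBoxes++ code.

```python
from typing import Sequence, Tuple


def min_horizontal_box(quad: Sequence[Tuple[float, float]]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) enclosing an oriented quadrilateral."""
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return (min(xs), min(ys), max(xs), max(ys))


# A slightly rotated word box, corners listed clockwise from the top-left.
oriented = [(10, 20), (50, 10), (55, 30), (15, 40)]
print(min_horizontal_box(oriented))
# -> (10, 10, 55, 40)
```

The more the text is rotated, the more background the horizontal box includes, which is why the oriented box is the more useful output for tilted text.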

You can find its GitHub repository here. The repository also includes an implementation of CRNN (convolutional recurrent neural network) to recognize the text detected by TextBoxes++. To use it, you can follow the directions in the repository. Here are some results of TextBoxes++.

That’s enough for text detection. In the next blog, we will learn about text recognition. Hope you enjoy reading.

Next Blog: Optical Character Recognition Pipeline: Text Recognition

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline


In the previous blog, we discussed what OCR is, along with some real-life applications. But we didn’t get into the details of how OCR works. So, in this blog, let’s understand the general pipeline used by most OCR systems. Let’s get started.

OCR Pipeline

The general OCR pipeline is shown below.

As you might have noticed, this is very similar to how we humans recognize text. For instance, given an image containing text, we first try to locate the text and then recognize it. This is done so fast by our eye and brain combo that we hardly even notice it.

Now, let’s try to understand each pipeline component in detail, although it’s pretty clear from their names. Let’s take the following image as an example and see what happens at each component.

Image Pre-processing

If you have ever done any computer vision task, then you must know how important the pre-processing step is. It simply means making the image more suitable for the later tasks. For instance, the input image may be corrupted with noise, or it may be skewed or rotated. In any of these cases, the next pipeline components may give erroneous results and all your hard work goes to waste. Thus, it is always necessary to pre-process the image to remove such deformities.

As an example, I’ve corrupted the image below with some salt and pepper noise and also added some rotation. If this image is passed as is, it will give erroneous results in the later steps. So, before passing it on, we need to correct it for noise and rotation. The corrected image is shown on the right. Don’t worry, we will discuss in detail how this correction is done.
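As a small preview of the noise part, here is a minimal sketch of a 3x3 median filter, one common way to remove salt and pepper noise. This is an assumed choice for illustration, not necessarily the method used to produce the corrected image above, and rotation correction is not covered here. The image is a plain list of rows of grayscale values, and the function name is hypothetical; in practice, OpenCV’s cv2.medianBlur does the same job much faster.

```python
from typing import List


def median_filter_3x3(image: List[List[int]]) -> List[List[int]]:
    """Replace every pixel by the median of its 3x3 neighbourhood.

    Near the borders the neighbourhood is clipped to the image, so the
    window there holds 4 or 6 pixels instead of 9.
    """
    height, width = len(image), len(image[0])
    output = [row[:] for row in image]
    for r in range(height):
        for c in range(width):
            window = sorted(
                image[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
            )
            output[r][c] = window[len(window) // 2]
    return output


# A flat gray image with one "salt" pixel (255) and one "pepper" pixel (0).
noisy = [[100] * 5 for _ in range(5)]
noisy[2][2] = 255
noisy[1][3] = 0
cleaned = median_filter_3x3(noisy)
print(all(value == 100 for row in cleaned for value in row))
# -> True
```

The median works well here because an isolated outlier never reaches the middle of the sorted window, whereas a mean filter would smear it over the neighbouring pixels.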

Text Detection

Text detection, as is clear from the name, simply means finding the regions in the image where text may be present. This is illustrated below. See how the green bounding boxes are drawn around the detected text regions.

Text Detection has been an active research topic in computer vision. Most of the text detection methods developed so far can be divided into conventional (e.g. MSER) and deep-learning based (e.g. EAST, CTPN, etc.). Don’t worry, if you have never heard about these. We will be covering everything in detail in this series.

Text Recognition

In the previous step, we segmented out the text regions. Now, we will recognize what text is present in those segments. This is known as Text Recognition. So, what we will do is pass each segment one by one to our text recognition model, which will output the recognized text. We also keep track of the bounding box coordinates of each segment. This will be helpful when we do the restructuring.

In general, this step outputs a text file that contains each segment’s bounding box coordinates along with the recognized text. See the image below (right), which contains 3 columns: the segment name, the coordinates, and the recognized text.

Similar to text detection, this has also been an active research topic in computer vision. Several approaches have been developed for text recognition. In this series, we will be focussing mainly on the deep-learning based approaches which can be further divided into CTC-based and Attention-based. Again, don’t worry if you haven’t heard about these terms. We will be discussing these in detail in this series.

Restructuring

In the last step, we got the recognized text along with its position in the input image. Now, it’s time to restructure it. Restructuring simply means placing the text (according to the coordinates) in the same layout as in the input image. Simply iterate over each bounding box coordinate and put the recognized text there. Take a look at the image below and compare the structure of the restructured image with the original. Both look almost the same.
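Here is a minimal sketch of the idea, assuming each recognized segment is available as an (x, y, text) tuple holding the top-left corner of its bounding box. Segments whose y coordinates are close are treated as one line, and each line is then read from left to right. The function name, the tuple format, and the line_tolerance value are assumptions for illustration.

```python
from typing import List, Tuple

Segment = Tuple[int, int, str]  # (x, y, recognized text), hypothetical format


def restructure(segments: List[Segment], line_tolerance: int = 10) -> str:
    """Rebuild the reading order of the text from segment positions."""
    lines: List[Tuple[int, List[Tuple[int, str]]]] = []
    # Visit segments from top to bottom and group them into text lines.
    for x, y, text in sorted(segments, key=lambda s: (s[1], s[0])):
        if lines and abs(y - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Inside each line, order the words from left to right.
    return "\n".join(" ".join(text for _, text in sorted(words))
                     for _, words in lines)


segments = [(120, 52, "John"), (20, 50, "Name:"), (180, 49, "Smith"),
            (20, 90, "Age:"), (120, 91, "42")]
print(restructure(segments))
# -> Name: John Smith
#    Age: 42
```

This sketch only restores the reading order. To reproduce the original spacing as well, you could pad each word according to its x coordinate.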

Most of you might be wondering why we need to do this, or what the use of restructuring is. So, let’s take a simple example to understand this. Suppose we want to extract the name from the image below.

To do this, we can simply tell the computer to extract the words following the word “Name:”. This can easily be done using a regex or any NLP technique. But what if you haven’t restructured the text? In that case, this becomes cumbersome, as it involves iterating over the coordinates: first finding the coordinates of the word “Name:”, then finding the coordinates of the next word that lies on the same line, and then extracting the corresponding word. And if the name contains 2 or 3 words, this takes even more effort.
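On restructured text, the regex approach is only a few lines. The sketch below uses Python’s built-in re module; the function name and the sample text are made up for illustration.

```python
import re
from typing import Optional


def extract_name(text: str) -> Optional[str]:
    """Return whatever follows 'Name:' on the same line, or None if absent."""
    # [ \t]* skips spaces after the colon without jumping to the next line,
    # and (.+) captures the rest of that line ('.' does not match newlines).
    match = re.search(r"Name:[ \t]*(.+)", text)
    return match.group(1).strip() if match else None


restructured = "Name: John Smith\nAge: 42"
print(extract_name(restructured))
# -> John Smith
```

Because the whole line is captured, a name with 2 or 3 words needs no extra handling.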

Hope you understand this, but if not, no worries, this will become clearer when we discuss it in detail later.

So, this completes the OCR pipeline. Now, you can do anything with the extracted text. You can search, edit, translate it, or even convert it to speech. From the next blog, we will start discussing each of these pipeline components in detail. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

TheAILearner

Mastering Artificial Intelligence