
CTC – Problem Statement

In the previous blog, we had an overview of the text recognition step. There we discussed that two major techniques have been adopted to avoid character segmentation: one is CTC-based and the other is attention-based. In this blog, let’s first build the intuition behind the CTC algorithm: why do we need it, and where is it used? In the next blog, we will discuss the algorithm in detail. Here, we will use text recognition as the running example. Let’s get started.

As we have already discussed, in text recognition we are given a segmented image and our task is to recognize the text present in that segment. Thus, for the text recognition problem, the input is an image while the output is text, as shown below.

So, to solve the text recognition problem, we need a model that takes the image as input and outputs the recognized text. If you have ever taken a deep learning class, you will know that Convolutional Neural Networks (CNNs) are good at handling image data, while for sequence data such as text, Recurrent Neural Networks (RNNs) are preferred.

So, for the text recognition problem, an obvious choice is to combine a Convolutional Neural Network with a Recurrent Neural Network. Now, let’s discuss how to combine the two for the text recognition task. Below is one such architecture, taken from the famous CRNN paper.

Here, the input image is first fed through a number of convolutional layers to extract feature maps. These feature maps are then divided into a sequence of feature vectors, shown in blue, by splitting the feature maps into columns of single-pixel width. Now, you might ask why we divide the feature maps by columns. The answer lies in the receptive field concept. The receptive field is the region of the input image that a particular element of a CNN feature map is looking at. For instance, for the above input image, the receptive field of each feature vector corresponds to a rectangular region in the input image, as shown below.
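To make the column-wise split concrete, here is a minimal NumPy sketch. It is only an illustration: the feature-map shape (512 channels, height 1, width 26) is an assumed example, and the function name map_to_sequence is hypothetical rather than taken from the CRNN code.

```python
import numpy as np

def map_to_sequence(feature_maps):
    """Split CNN feature maps of shape (channels, height, width) into
    one feature vector per single-pixel-wide column, ordered left to right."""
    channels, height, width = feature_maps.shape
    # Put the width axis first, then flatten channels and height together.
    return feature_maps.transpose(2, 0, 1).reshape(width, channels * height)

# Assumed example: 512 feature maps of height 1 and width 26.
feature_maps = np.random.rand(512, 1, 26)
sequence = map_to_sequence(feature_maps)
print(sequence.shape)  # (26, 512): 26 time steps, one 512-D vector each
```

Each row of the result is the descriptor of one column, so the rows can be fed to a recurrent layer one time step at a time.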

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

These rectangular regions are ordered from left to right, so each feature vector can be considered the image descriptor of its rectangular region. These feature vectors are then fed to a bi-directional LSTM. Because of the softmax activation function, this LSTM layer outputs a probability distribution over the character set at each time step. To obtain the per-timestep output, we can either take the most probable character at each time step or apply any other method.

But as you might have noticed in the above image, these feature vectors sometimes do not contain a complete character. For instance, see the below image, where the 2 feature vectors marked in red each contain only a part of the character “S”.

Thus, in the LSTM output, we may get repeated characters as shown below by the red box. We call these per-frame or per-timestep predictions.
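The sketch below shows how taking the most probable character at every time step naturally produces such repeated characters. The toy character set and the hand-made probabilities are assumptions made up for this example, not real network outputs.

```python
import numpy as np

def per_timestep_prediction(probs, charset):
    """Pick the most probable character at every time step.
    probs has shape (timesteps, len(charset)); each row is a softmax output."""
    best = np.argmax(probs, axis=1)
    return [charset[i] for i in best]

charset = "SATE"  # toy character set, assumed for this example
# Hand-made distributions for 6 time steps (each row sums to 1).
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],  # S
    [0.6, 0.1, 0.2, 0.1],  # S again: the character spans two columns
    [0.1, 0.1, 0.7, 0.1],  # T
    [0.1, 0.7, 0.1, 0.1],  # A
    [0.1, 0.1, 0.7, 0.1],  # T
    [0.1, 0.1, 0.1, 0.7],  # E
])
frames = per_timestep_prediction(probs, charset)
print(frames)  # ['S', 'S', 'T', 'A', 'T', 'E']
```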

Now, here comes the problem. As we have already discussed, for text recognition the training data consists of images and the corresponding text, as shown below.

Training Data for text recognition

Thus, we only know the final output and we don’t know the per-timestep predictions. Now, in order to train this network, we either need to know the per-timestep output for each input image or we need to develop a mechanism to convert either per-timestep output to final output or vice-versa.

So, the problem is how to align the final output with the per-timestep predictions in order to train this network?

Approach 1

One thing we can do is devise a rule like “one character corresponds to a fixed number of time steps”. For instance, for the above image, if we have 10 timesteps, then we can repeat “State” as “S, S, T, T, A, A, T, T, E, E” (each character repeated twice) and then train the network. But this rule is easily violated, because character widths vary with fonts, writing styles, etc.
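Here is a small sketch of this fixed rule, together with a count of how many other ways the same 10 timesteps could be shared among the 5 characters. Both function names are hypothetical, and the count assumes every character occupies at least one time step and every time step belongs to some character.

```python
from math import comb

def fixed_rule_labels(text, timesteps):
    """Approach 1: give every character the same number of time steps."""
    if timesteps % len(text) != 0:
        raise ValueError("timesteps must be a multiple of the text length")
    repeat = timesteps // len(text)
    return [ch for ch in text for _ in range(repeat)]

def count_alignments(text_length, timesteps):
    """Number of ways to split the time steps among the characters, in order,
    when every character gets at least one time step."""
    return comb(timesteps - 1, text_length - 1)

labels = fixed_rule_labels("STATE", 10)
print(labels)  # ['S', 'S', 'T', 'T', 'A', 'A', 'T', 'T', 'E', 'E']
print(count_alignments(5, 10))  # 126
```

The fixed rule picks just one of these 126 splits, which is why it breaks as soon as the characters in an image are not equally wide.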

Approach 2

Another approach can be to manually annotate the data for each time step as shown below. Then train the network using this annotated data. The problem with this approach is that this will be very time consuming for a reasonably sized dataset.

Annotation for each timestep

Clearly, both of the above naïve approaches have serious downsides. So, isn’t there a more efficient way to solve this? This is where CTC comes into the picture.

Connectionist Temporal Classification (CTC)

CTC was introduced in 2006 and is used for training deep networks where alignment is a problem. With CTC, we need not worry about the alignments or the per-timestep predictions. CTC takes care of all the alignments, so we can train the network using only the final outputs. In the next blog, let’s discuss how the CTC algorithm works. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Text Recognition Datasets

In the previous blog, we built our own text recognition system from scratch using the very famous CNN+RNN+CTC based approach. As you might remember, we got pretty decent results. To fine-tune our model further, one thing we can do is train it more. But for that, we need more training data. So, in this blog, let’s discuss some of the open-source text recognition datasets available and how to create synthetic data for text recognition. Let’s get started.

Open Source Datasets

Below are some of the open source text recognition datasets available.

  • The ICDAR datasets: ICDAR stands for International Conference on Document Analysis and Recognition, which is held every 2 years. It has produced a series of scene text datasets that have shaped the research community, for instance the ICDAR-2013 and ICDAR-2015 datasets.
  • MJSynth Dataset: This synthetic word dataset is provided by the Visual Geometry Group, University of Oxford. It consists of 9 million synthetically generated images covering 90k English words and includes the training, validation, and test splits used in the authors’ work.
  • IIIT 5K-word dataset: This is one of the largest and most challenging recognition datasets available. It contains 5000 cropped word images from scene text and born-digital images. A lexicon of more than 0.5 million dictionary words is also provided with this dataset.
  • The Street View House Numbers (SVHN) Dataset: This dataset contains cropped images of house numbers in natural scenes collected from Google Street View images. It is usually used for digit recognition. You can also use the MNIST handwritten digit dataset.


Synthetic Data

As with text detection, labeled data for text recognition is not abundant. Thus, synthetic data can help to further train or fine-tune the model. So, let’s discuss how to create synthetic data containing different fonts using Python. Here, we will use the famous PIL library. Let’s first import the libraries that will be used.

Then, we will create a list of characters that will be used in creating the dataset. This can be easily done using the string library as shown below.

Similarly, create a list of fonts that you want to use. Here, I have used 10 different types of fonts as shown below.

Now, we will generate images corresponding to each font. Here, for each font, for each character in the char list, we will generate words. For this, first we choose a random word size as shown below.

Then, we will create a word of length word_size and starting with the current character as shown below.
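A minimal sketch of the character list and the word generation steps could look like this. The exact character set and the word size range (1 to 10) are assumptions, and make_word is a hypothetical helper name.

```python
import random
import string

# Character list built with the string module; the exact set is an assumption.
char_list = list(string.ascii_letters + string.digits)

def make_word(first_char, min_size=1, max_size=10, rng=random):
    """Return a random word that starts with first_char.
    The size range (1 to 10) is an assumed choice."""
    word_size = rng.randint(min_size, max_size)
    rest = [rng.choice(char_list) for _ in range(word_size - 1)]
    return first_char + "".join(rest)

random.seed(0)  # only to make the example repeatable
words = [make_word(ch) for ch in char_list]
print(len(words), words[:3])
```

Starting one word with every character guarantees that each character of the list appears at least once per font.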

Now, we need to draw that word onto the image. For that, we will first create a font object for the given font and size. Here, I’ve used a font size of 14.

Now, we will create a new image of size (110, 20) with a white background (255, 255, 255). Then we will create a drawing context and draw the text at (5, 0) in black (0, 0, 0), as shown below.

Finally, save the image and the corresponding text file as shown below.
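The drawing and saving steps can be sketched with PIL roughly as follows. The font file name, the output folder and the file naming scheme are hypothetical, and the sketch falls back to PIL’s built-in font when the font file cannot be opened.

```python
import os
import tempfile
from PIL import Image, ImageDraw, ImageFont

def render_word(word, font_path=None, font_size=14):
    """Draw a word in black on a 110x20 white image."""
    font = ImageFont.load_default()
    if font_path is not None:
        try:
            font = ImageFont.truetype(font_path, font_size)
        except OSError:
            pass  # keep the default font if the file cannot be opened
    img = Image.new("RGB", (110, 20), (255, 255, 255))
    draw = ImageDraw.Draw(img)
    draw.text((5, 0), word, font=font, fill=(0, 0, 0))
    return img

def save_sample(img, word, out_dir, name):
    """Save the image and a text file holding its label side by side."""
    os.makedirs(out_dir, exist_ok=True)
    img.save(os.path.join(out_dir, name + ".png"))
    with open(os.path.join(out_dir, name + ".txt"), "w") as f:
        f.write(word)

out_dir = tempfile.mkdtemp()
img = render_word("State", font_path="arial.ttf")  # hypothetical font file
save_sample(img, "State", out_dir, "sample_0")
print(sorted(os.listdir(out_dir)))  # ['sample_0.png', 'sample_0.txt']
```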

Below is the full code

Some of the generated images are shown below.

To make the data more realistic and challenging, you can add some geometric transformations (such as rotation, skew, etc.), add some noise, or even change the background color.
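As a rough sketch of such augmentations, the snippet below rotates an image slightly and adds Gaussian pixel noise using PIL and NumPy. The angle, the noise level and the helper name augment are illustrative assumptions that you would tune for your own data.

```python
import numpy as np
from PIL import Image

def augment(img, angle=3.0, noise_std=10.0, seed=0):
    """Rotate the image slightly and add Gaussian pixel noise.
    The parameter values are illustrative assumptions."""
    # Fill the corners exposed by the rotation with white.
    rotated = img.rotate(angle, fillcolor=(255, 255, 255))
    arr = np.asarray(rotated).astype(np.float32)
    rng = np.random.default_rng(seed)
    arr = arr + rng.normal(0.0, noise_std, size=arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

clean = Image.new("RGB", (110, 20), (255, 255, 255))
noisy = augment(clean)
print(noisy.size, noisy.mode)  # (110, 20) RGB
```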

Now, using any of the above datasets, we can further fine-tune our recognition model. That’s all for this blog. Hope you enjoy reading.


Implementation of EAST

In the previous blog, we discussed the theory behind the EAST algorithm. If you remember, we stated that this algorithm is both accurate and efficient. So, in this blog, let’s find out whether that holds. For this, we will first run the EAST algorithm using its GitHub repository, and then analyze the results. So, let’s get started. Here, I’m using a Linux system.

Clone the Repository

First, search “EAST GitHub” in the browser. You will find several EAST implementations, but in this blog we will use the one provided by argman. So, open this and clone the repository. To clone the repository, you can either use git or download it as a zip file. To install git, you can run the following command.

Once you have installed git, clone the repository using the following command.

This will clone the repository into your system as shown below.

Compile lanms

As you might remember, in the previous blog we discussed that the EAST algorithm uses a Locality-Aware NMS (lanms) instead of the standard NMS. Now, you need to compile lanms. Why? Because this GitHub implementation contains the lanms code written in C++ (see the lanms folder). So, to make it work with Python, we need to generate an adaptor.so file. This can be done as follows.

First, we need to install the g++ compiler in order to compile the adaptor.cpp file. This can be done using the following command.

This contains the essential tools for building most other packages from source (e.g. C/C++ compiler, libc, and make).


Next, open the __init__.py file present inside the lanms folder and comment out the if condition as shown below.

Again open the terminal and change the directory to the lanms folder. After this, run the make command as shown below. This will generate the required adaptor.so file in the lanms folder.

Test the model

Now, to test the model, we either need to train it first or find some pre-trained weights, if available. Luckily, pre-trained weights are available. You can download them from here. These are trained on the ICDAR-2013 and ICDAR-2015 datasets.

After downloading the pre-trained weights, extract them and place them inside the EAST folder. Now, to test the model, open the terminal and change the directory to the EAST folder. Also activate the virtual environment if any. Then type the following command by giving the arguments.

For arguments, first, we need to specify the test images path as a “test_data_path” argument. Second, we need to specify the recently downloaded checkpoints path as a “checkpoint_path” argument. And lastly, we need to specify the output directory path as an “output_dir” argument as shown below. This will automatically create the output directory if not present.
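For illustration, here is one way to assemble such a command in Python. It is a sketch under assumptions: the script name eval.py and the --name=value flag style are assumed, the three paths are hypothetical placeholders, and the snippet only builds and prints the command rather than running it.

```python
import shlex
import sys

def build_eval_command(test_data_path, checkpoint_path, output_dir):
    """Assemble the command line for the evaluation script (assumed: eval.py)."""
    return [
        sys.executable, "eval.py",
        f"--test_data_path={test_data_path}",
        f"--checkpoint_path={checkpoint_path}",
        f"--output_dir={output_dir}",
    ]

# Hypothetical paths; replace them with your own.
cmd = build_eval_command("/tmp/images/", "/tmp/east_checkpoint/", "/tmp/output/")
print(shlex.join(cmd[1:]))
# To actually run it from inside the EAST folder:
#   import subprocess; subprocess.run(cmd, check=True)
```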

This will run the EAST algorithm on the test images we provided. An output image is shown below.

In the next blog, we will explore different text detection datasets that are available. We will also learn how we can create our own text detection dataset. This will help us with training and fine-tuning our EAST model further. Till then, have a great time. Hope you enjoy reading.


Optical Character Recognition Pipeline: Text Detection

In the previous blogs, we discussed different pre-processing techniques such as noise removal, skew correction, etc. The main objective of this pre-processing step was to make the image suitable for the next pipeline components such as text detection, and recognition. Now, in this blog, let’s understand the text detection step in detail.

Text Detection

Text detection simply means finding the regions in the image where the text can be present. For instance, see the below image where green colored bounding boxes are drawn around the detected text.

While performing text detection, you may encounter two types of cases

  • Images with Structured text: This refers to the images that have a clean/uniform background with regular font. Text is mostly dense with proper row structure and uniform text color. For instance, see the below image.
  • Images with Unstructured text: This refers to images with sparse text on a complex background. The text can have different colors, sizes, fonts, and orientations and can be present anywhere in the image. Performing text detection on these images is known as scene text detection. For instance, see the below image.

Now, which of the above two cases looks more challenging? Obviously, scene text detection, due to the various complexities discussed above. That’s why it is an active research topic in computer vision.


While performing text detection, you have 3 options. Either you do

  • Character-by-Character detection
  • Word-by-Word detection
  • Line-by-Line detection

All three are shown below.

Nowadays, we mostly prefer word or line detection. This is because character detection is generally slow and somewhat more challenging than the other two.

Mostly, the text detection methods can be broadly classified into 2 categories

  • Conventional methods
  • Deep-learning based methods

Conventional methods rely on manually designed features. For instance, Stroke Width Transform (SWT) and Maximally Stable Extremal Regions (MSER) based methods generally extract character candidates via edge detection or extremal region extraction. In deep learning based methods, on the other hand, features are learned from the training data. These methods are generally better than the conventional ones in terms of both accuracy and adaptability in challenging scenarios.

Further, the deep learning based methods can be classified into

  • Multi-step methods
  • Simplified pipeline

To understand these, take a look at the below image where the pipeline of several state-of-the-art text detection methods is shown. The first 3 methods (a,b,c) fall into the multi-step category (each box denotes 1 step) while the last 2 (d,e) are the ones with a simplified pipeline.

In this series, we will mainly focus on the methods with the simplified pipeline. By the way, the last 2 methods (d,e) shown above are known as Connectionist Text Proposal Network (CTPN) and Efficient and Accurate Scene Text Detector (EAST) respectively. Both are very famous text detection methods!

In the next blog, let’s discuss the EAST algorithm in detail. Till then, have a great time. Hope you enjoy reading.


Implementation of Efficient and Accurate Scene Text Detector (EAST)

In the previous blog, we discussed the EAST algorithm, its architecture and its usage. In this blog, we will see how to run EAST using its GitHub repository. We will do this on a Linux system.

Clone the Repository

First, you need to clone its GitHub repository on your system and change your directory to the EAST folder by using the following command.

Download Pretrained Checkpoints

Now, to test this EAST model, you first need to download the pretrained checkpoints trained on the ICDAR 2013 and ICDAR 2015 datasets. You can download the checkpoints from the following link:

Google Drive Link

Test the Model

After downloading pretrained checkpoints and cloning the GitHub repository, you are ready to test the model using the following command:

In the above command, you need to specify some directory paths. First, you need to specify your test image dataset path as a “test_data_path” argument. Second, you need to specify your recently downloaded checkpoints path as a “checkpoint_path” argument. And lastly, you need to specify your output directory path as an “output_dir” argument.

Sometimes you may run into the common adaptor and lanms errors shown in the following figure.

To solve these errors, you just need to use the following links or you can just google them.

  1. can not compile lanms
  2. running eval.py; undefined symbol: _Py_ZeroStruct

Running EAST using WEB

We can also run a demo by using the run_demo_server.py file provided by the GitHub repository. We just need to run the following command:

As you can see, the demo server is running on the default port number 8769. Now you just need to open your web browser and go to the following URL:

http://localhost:8769/

Then upload the image and click on the submit button. After processing, you will see results that look something like this.

Hope you enjoy reading.


Referenced GitHub Repository: EAST: An Efficient and Accurate Scene Text Detector

Efficient and Accurate Scene Text Detector (EAST)

Before deep learning was introduced to text detection, most text segmentation approaches struggled in challenging scenarios. Conventional approaches use manually designed features, while deep learning methods learn effective features from training data. Conventional approaches are also usually multi-staged, which tends to lower their overall performance. In this blog, we will learn about a deep learning-based algorithm (EAST) that detects text with a single neural network, eliminating the multi-stage pipeline.

Introduction

The EAST algorithm uses a single neural network to predict word-level or line-level text. It can detect text in arbitrary orientations with quadrilateral shapes. In 2017, this algorithm outperformed the state-of-the-art methods of the time. It consists of a fully convolutional network followed by a non-max suppression (NMS) merging stage. The fully convolutional network localizes text in the image, and the NMS stage merges the many imprecise detected text boxes into a single bounding box for every text region (word or text line).

EAST Network Architecture

The EAST architecture was designed with different sizes of word regions in mind. Detecting large word regions requires features from the later stages of the network, while detecting small word regions requires low-level features from the initial stages. To build this network, the authors combined three branches into a single neural network.

EAST

1. Feature Extractor Stem

This branch of the network extracts features from different layers of the network. The stem can be a convolutional network pretrained on the ImageNet dataset. The authors of EAST used both PVANet and VGG16 in their experiments. In this blog, we will look at the EAST architecture with the VGG16 network only. Let’s see the architecture of the VGG16 model.

VGG16

The stem takes the outputs of the VGG16 model after the pool2, pool3, pool4, and pool5 layers.

2. Feature Merging Branch

This branch of the EAST network merges the feature outputs from different layers of the VGG16 network. The input image is passed through the VGG16 model and the outputs of four different layers are taken. Merging all of these feature maps at once would be computationally expensive. That’s why EAST uses a U-Net style architecture to merge the feature maps gradually (see the EAST architecture figure). First, the outputs after the pool5 layer are upsampled to twice their size, so that they match the size of the outputs from the pool4 layer, and the two are then concatenated. Then Conv 1×1 and Conv 3×3 are applied to fuse the information and produce the output of this merging stage.

Similarly outputs from other layers of the VGG16 model are concatenated and finally, a Conv 3×3 layer is applied to produce the final feature map layer before the output layer.
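The shape bookkeeping of one merging step can be sketched with NumPy as follows. This is only an illustration under assumptions: the input is taken to be 512×512, nearest-neighbour repetition stands in for the actual upsampling layer, and the 1×1 and 3×3 convolutions that follow the concatenation are left out.

```python
import numpy as np

def upsample2x(x):
    """Double height and width of a (channels, h, w) array by repeating pixels.
    Nearest-neighbour repetition is a stand-in for the real upsampling layer."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def merge(deep, shallow):
    """Upsample the deeper map and concatenate it with the shallower one
    along the channel axis."""
    return np.concatenate([upsample2x(deep), shallow], axis=0)

# VGG16 pooling outputs for an assumed 512x512 input: (channels, h, w).
pool5 = np.zeros((512, 16, 16))
pool4 = np.zeros((512, 32, 32))
merged = merge(pool5, pool4)
print(merged.shape)  # (1024, 32, 32), before the 1x1 and 3x3 convolutions
```

The same step is then repeated with the pool3 and pool2 outputs, each time doubling the spatial size.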

3. Output Layer

The output layer consists of a score map and a geometry map. The score map gives the probability of text at each location, while the geometry map defines the boundary of the text box. This geometry map can be either a rotated box (RBOX) or a quadrangle (QUAD). A rotated box is described by five values per pixel: the distances from that pixel to the top, right, bottom and left edges of the box, plus the rotation angle. A quadrangle is described by eight values: the coordinate offsets from the pixel to its four corner vertices.


Loss Function

The loss function used in the EAST algorithm combines a score map loss and a geometry loss.
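In the notation of the EAST paper, the overall loss can be written as below, where L_s is the score map loss, L_g is the geometry loss, and λ_g is the weight that this blog simply calls λ.

```latex
L = L_{s} + \lambda_{g} L_{g}
```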

The two losses are combined with a weight λ, which controls the relative importance of the two losses. In the EAST paper, the authors set it to 1.

Non-max Suppression Merging Stage

The geometries predicted by the fully convolutional network are first thresholded. The remaining geometries are then suppressed using a locality-aware NMS. A naive NMS runs in O(n²). To approach O(n), the authors merge the geometries row by row, and while processing a row, each geometry is iteratively merged with the last merged one. This makes the algorithm fast in most cases, but the worst-case time complexity is still O(n²).
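To make the row-by-row merging idea concrete, here is a simplified sketch. It is not the lanms code from the repository: it uses axis-aligned boxes (x1, y1, x2, y2, score) instead of rotated quadrangles, and the 0.3 overlap threshold is an assumed value.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, score)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def weighted_merge(a, b):
    """Average the coordinates weighted by score; the scores add up."""
    total = a[4] + b[4]
    coords = [(a[i] * a[4] + b[i] * b[4]) / total for i in range(4)]
    return (*coords, total)

def standard_nms(boxes, thresh):
    """Plain O(n^2) NMS: keep the best box, drop everything overlapping it."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= thresh for k in kept):
            kept.append(box)
    return kept

def locality_aware_nms(boxes, thresh=0.3):
    """boxes must be in row-by-row order, as they come off the score map."""
    merged, last = [], None
    for box in boxes:
        if last is not None and iou(box, last) > thresh:
            last = weighted_merge(box, last)  # keep merging into the last one
        else:
            if last is not None:
                merged.append(last)
            last = box
    if last is not None:
        merged.append(last)
    return standard_nms(merged, thresh)

boxes = [
    (10, 10, 50, 30, 0.9), (12, 10, 52, 30, 0.8), (11, 11, 51, 31, 0.7),
    (100, 10, 140, 30, 0.95),
]
result = locality_aware_nms(boxes)
print(len(result))  # 2
```

Because neighbouring pixels tend to predict almost the same box, most candidates are absorbed in the single linear pass, and the quadratic standard NMS only sees the few merged boxes that remain.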

This was all about the Efficient and Accurate Scene Text Detector (EAST) algorithm. In the next blog, we will run this algorithm using its GitHub repository. Hope you enjoy reading.


Referenced Research Paper: EAST: An Efficient and Accurate Scene Text Detector

Implementation of Connectionist Text Proposal Network (CTPN)

In the previous blog, we learnt about the CTPN algorithm and its architecture in detail. In this blog, we will run this algorithm using its GitHub repository to localize text in an image. We will use the Linux operating system to do this.

Clone the Repository

Open a terminal window and clone the CTPN GitHub repo using the following command:

Build the Required Library

The non-max suppression (NMS) and bounding box (bbox) utilities are written in Cython. We need to generate .so files for these so that they can be loaded as a library. We first need to change the current directory to “/text-detection-ctpn/utils/bbox” using the following commands:

Now run the following commands to build the library.

These commands will generate nms.so and bbox.so in the current directory.

Test the model

Now we can test the CTPN model. To test the model, we first need to download the checkpoints. These checkpoints are already provided by the GitHub repository for testing the model. You can download the checkpoints from Google Drive. Now use the following steps:

  1. Unzip the downloaded checkpoints.
  2. Place the unzipped folder “checkpoints_mlt” in the directory “/text-detection-ctpn”.
  3. Put your testing images in the /data/demo/ folder; your outputs will be generated in the /data/res folder.
  4. Your folder structure will look as follows.

Now run the following command from the terminal to test your input images. Change your directory to “/text-detection-ctpn” first.

Your output will be generated in the data/res folder. Some of the inputs and results are shown below.

Hope you enjoy reading.


Referenced Research Paper: Detecting Text in Natural Image with Connectionist Text Proposal Network

Referenced GitHub Code: text-detection-ctpn

Connectionist Text Proposal Network (CTPN)

Nowadays, thousands of organizations worldwide rely on optical character recognition (OCR) systems to extract machine-readable text from printed paper documents. These OCR systems are widely used in various applications such as ID card reading, automatic data entry from documents, number plate recognition from vehicles, etc.

Text localization is an important aspect of building such OCR systems. In this blog, we will learn a deep learning algorithm to localize text in an image.

Introduction

CTPN stands for Connectionist Text Proposal Network. The algorithm gets this name because it detects a text line as a sequence of fine-scale text proposals. If you are wondering what these fine-scale text proposals are, don’t worry, we will discuss them in detail later in this blog. CTPN is an end-to-end trainable deep learning model. It is also really helpful in localizing extremely ambiguous text.

There are many problems associated with text localization in natural scene images. Some of them are highly cluttered backgrounds, large variance in text patterns, occlusions, distortion, and the orientation of the text.

Researchers have been working for many years to overcome these challenges. There are two basic approaches: the conventional approach and the modern deep learning approach, which includes the CTPN algorithm.

The conventional approaches consist of a multi-stage pipeline and basically work bottom-up. They start with low-level character detection and then go through multiple stages such as non-text component filtering, text line construction, and verification. These approaches depend heavily on every stage in their pipeline. With deep learning, we can replace these multiple stages with an end-to-end trainable model.

Researchers have also tried to use object detection algorithms like Faster R-CNN to detect text in an image. But these algorithms are difficult to apply to scene text detection, because it requires more accurate localization.

CTPN Algorithm

Now we will look into the CTPN algorithm in detail. First, we will see all the stages in the following CTPN network architecture and then see them in detail.

  1. First, the input image is passed through a pretrained VGG16 model (trained on the ImageNet dataset).
  2. The features output by the last convolutional maps of the VGG16 model are taken.
  3. A 3×3 spatial window is slid over these outputs.
  4. The outputs of the 3×3 spatial window are then passed through a 256-D bi-directional Recurrent Neural Network (RNN).
  5. The recurrent output is then fed to a 512-D fully connected layer.
  6. Finally comes the output layer, which produces 3 different outputs: 2k vertical coordinates, 2k text/non-text scores, and k side-refinement values.
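The stages above can be summarized as a small shape calculation. This sketch assumes that the last VGG16 convolutional maps have 512 channels and are 1/16 of the input size in each dimension, and that k = 10 anchors are used; the function name and the example image size are hypothetical.

```python
def ctpn_output_shapes(height, width, k=10, stride=16):
    """Per-image output sizes for the CTPN stages, assuming the last VGG16
    convolutional maps are 1/16 of the input size in each dimension."""
    fh, fw = height // stride, width // stride
    return {
        "conv5_features": (fh, fw, 512),
        "recurrent_output": (fh, fw, 256),
        "fc_output": (fh, fw, 512),
        "vertical_coordinates": (fh, fw, 2 * k),
        "text_scores": (fh, fw, 2 * k),
        "side_refinement": (fh, fw, k),
    }

shapes = ctpn_output_shapes(600, 900)
for name, shape in shapes.items():
    print(name, shape)
```

For a 600×900 image this gives a 37×56 grid of positions, each with 20 vertical coordinates, 20 text/non-text scores and 10 side-refinement values.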

VGG Network

CTPN uses the pretrained VGG16 model shown above. The algorithm takes the output of the last convolutional maps, so the output feature size depends on the size of the input image. During training of the CTPN model, the parameters of the first two convolutional layers are fixed and the rest are trained.

3×3 Spatial Window and Recurrent Layer

A small 3×3 spatial window is slid over the outputs of the VGG network to extract useful features. Since text is sequential data, it is beneficial to pass these features through a recurrent neural network. After that, a fully connected layer produces the input to the output layer.

Output Layer

The first output consists of 2k vertical coordinates, where k is the number of anchor boxes. For every anchor box, the output contains the y coordinate of the center of the box and the height of the box. These anchor boxes are fine-scale text proposals with a fixed width of 16 pixels, as shown in the diagram.

A total of 10 anchor boxes are taken whose heights vary from 11 to 273 pixels.
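As a sketch, the 10 anchor heights can be generated by dividing by 0.7 each time, which is the spacing described in the CTPN paper and which takes 11 up to 273. The second function shows the relative vertical coordinates from the paper (center offset divided by the anchor height, and the log of the height ratio). Both function names are hypothetical.

```python
import math

def anchor_heights(k=10, smallest=11.0, ratio=0.7):
    """Heights of the k anchors: each one is the previous divided by ratio."""
    return [round(smallest / ratio ** i) for i in range(k)]

def encode_vertical(center_y, height, anchor_center_y, anchor_height):
    """Relative vertical coordinates of a box with respect to an anchor."""
    v_c = (center_y - anchor_center_y) / anchor_height
    v_h = math.log(height / anchor_height)
    return v_c, v_h

heights = anchor_heights()
print(heights[0], heights[-1], len(heights))  # 11 273 10
print(encode_vertical(100.0, 33.0, 100.0, 33.0))  # (0.0, 0.0)
```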

The second output consists of 2k text/non-text scores: for each of the k anchor boxes, the output layer predicts one score for text and one for non-text. During training, an anchor is labeled positive or negative on the basis of its IoU overlap with the ground-truth box.

The third output consists of k side-refinement values. In CTPN, the width of a fine-scale text proposal is fixed to 16 pixels, which can be problematic when proposals at the two ends of a text line are discarded due to a low score. So the output layer also predicts a side-refinement value along the x-axis for each anchor.

Now you should have a feel for the CTPN network. In the next blog, we will run the CTPN algorithm using its GitHub code. Hope you enjoy reading.


Referenced Research Paper: Detecting Text in Natural Image with Connectionist Text Proposal Network

Optical Character Recognition Pipeline: Text Detection and Segmentation

One of the most important modules in the optical character recognition pipeline is text detection and segmentation, which is also called text localization. In the previous blog, we saw various techniques to pre-process the input image, which can help in improving our OCR accuracy. In this blog, we will learn how to localize text in an image, so that we can crop it out and feed it to our text recognition module to predict the text in it.

What is text detection and segmentation?

It is the process of localizing all occurrences of text in the image as meaningful units such as characters, words, and text lines, and then making a segment for each of these units.

Character-based detection first detects individual characters and then groups them into words. One way to do this is to locate characters by classifying Extremal Regions (MSER) and then group the detected characters using an exhaustive search method.

Word-based detection usually works in a similar fashion as object detection. You can use Faster R-CNN and YOLO algorithms to perform this.

Text-line based detection detects text lines and then breaks them into individual words.

There are basically two types of text images that are fed to the text recognition module as inputs: scanned documents and natural scene text such as street signs, storefront text, etc.

Scanned Documents

Scanned documents generally contain hundreds or thousands of words. We can apply deep neural networks like Faster R-CNN and YOLO to localize the words present in the documents. But sometimes these may not localize all the text present in the images, because these algorithms are generally trained to detect a smaller number of objects per image. In that case, we need to apply some post-processing after the deep nets to recover the remaining text.

Another method that can be used for scanned documents is Maximally Stable Extremal Regions (MSER), which is available in OpenCV.

MSER is a method used for blob detection in images. Using this method, we can get the coordinates of the text regions and then generate bounding boxes around each word in the image. From these boxes, we can crop the input images required by our text recognition module.

Natural Scenes

Natural scenes contain fewer words but bring other problems such as distortions, occlusions, directional blur, cluttered backgrounds, etc. To overcome these problems, we need deep learning algorithms that are designed for natural scene text and are robust to the distortions above. There are some robust open-source algorithms available, such as EAST, CTPN, TextBoxes++, and PixelLink. These algorithms can also be used for localizing text in scanned documents, but then you need to do some post-processing to detect all the text present in the image, as I have mentioned earlier.


Till now, we have seen what text segmentation is and the different algorithms that localize text in an image. In the next blog, we will dive deep into these algorithms and figure out how to implement them in our OCR pipeline.

Next Blog: Optical Character Recognition Pipeline: Text Detection and Segmentation Part-II

Hope you enjoy reading.

If you have any doubts or suggestions, please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline: Text Detection and Segmentation Part-II

In the last blog, we saw what text detection is and the different types of algorithms that perform it. In this blog, we will learn more about these text detection algorithms.

Efficient and Accurate Scene Text Detector (EAST)

It is a deep learning text detection method with two stages: a fully convolutional network (FCN) and a non-max suppression (NMS) merging stage. The FCN is a U-shaped network that directly produces text regions at either word level or text-line level. Here is the diagram of the FCN used in the algorithm.

The U-shaped FCN takes features from different layers of PVANet and merges them to produce the outputs. The yellow boxes are the PVANet layers and the green boxes are the merging layers for the features extracted from PVANet. The reason behind this merging branch is to produce outputs for both small and large word regions: low-level features help in finding small word regions and high-level features help in finding large word regions. The network outputs a score map, which gives the confidence that a pixel belongs to text, together with geometries in one of two forms: RBOX (5 values: the distances from the pixel to the top, right, bottom and left edges of the box, plus a rotation angle) or QUAD (8 values: the coordinate offsets from the pixel to the 4 corner vertices of a quadrangle).
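To make the RBOX output more concrete, here is a minimal sketch that turns one prediction into four corner points. It assumes the encoding from the EAST paper, where each text pixel predicts its distances to the top, right, bottom and left edges of the box plus a rotation angle. The function name rbox_to_corners and the rotation direction are assumptions of this sketch; real implementations may use a different sign convention.

```python
import math


def rbox_to_corners(px, py, d_top, d_right, d_bottom, d_left, angle):
    """Return the 4 corners of a rotated box predicted at pixel (px, py).

    The box is first built axis-aligned from the four edge distances and
    then rotated by `angle` radians around the pixel. The rotation
    direction is an assumption of this sketch; implementations differ.
    """
    # Corners relative to the pixel: top-left, top-right, bottom-right, bottom-left.
    rel = [(-d_left, -d_top), (d_right, -d_top), (d_right, d_bottom), (-d_left, d_bottom)]
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [(px + dx * cos_a - dy * sin_a, py + dx * sin_a + dy * cos_a) for dx, dy in rel]


# Made-up prediction at pixel (50, 20) with no rotation.
corners = rbox_to_corners(50, 20, d_top=5, d_right=30, d_bottom=5, d_left=10, angle=0.0)
print(corners)
```

With a zero angle the result is simply the axis-aligned box around the pixel, which is a handy sanity check before trusting the rotated case.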

In the second stage, NMS merging, geometries whose score passes a threshold are kept, and overlapping geometries are then merged or suppressed so that only the most accurate geometry remains for each text region.
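The idea behind this stage can be sketched with plain greedy NMS. Note the simplifications: EAST itself uses a locality-aware variant that works on rotated geometries, while the code below handles only axis-aligned boxes given as (x1, y1, x2, y2). The function names and both threshold values are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh=0.8, iou_thresh=0.2):
    """Return indices of the boxes kept after thresholding and greedy NMS."""
    # Step 1: drop low-confidence boxes, then visit the rest best-first.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Step 2: keep a box only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


# Made-up detections: two near-duplicates, one separate box, one weak box.
boxes = [(10, 10, 110, 40), (12, 12, 112, 42), (200, 10, 300, 40), (400, 10, 500, 40)]
scores = [0.95, 0.90, 0.85, 0.30]
kept = nms(boxes, scores)
print(kept)
```

Here the second box is suppressed because it overlaps the first one heavily, and the last box is dropped by the score threshold before NMS even runs.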

To implement it in our OCR pipeline, we can use its GitHub repository. To get it working, use the following steps:

  1. Clone the repository into your directory: “git clone https://github.com/argman/EAST.git”
  2. Download its pretrained model and put it inside the EAST directory.
  3. Before testing it, you need to compile lanms (the locality-aware NMS module).
  4. To test the model, go to your EAST directory and run the test command given in the repository README from a terminal.

You can also train this model on your own dataset, either from scratch or starting from the pre-trained model mentioned above. To train it, you need to provide the dataset path, and the dataset should consist of training images, each with a corresponding text file that holds the coordinates of the text present in the image.
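As an illustration of what such a text file might contain, the sketch below parses one annotation line. It assumes the ICDAR-style layout of eight comma-separated corner coordinates followed by the transcription; check the repository for the exact format it expects. The function name parse_annotation_line and the sample line are made up for this example.

```python
def parse_annotation_line(line):
    """Parse 'x1,y1,x2,y2,x3,y3,x4,y4,text' into (corners, text)."""
    # Some annotation files start with a byte-order mark; strip it if present.
    parts = line.strip().lstrip("\ufeff").split(",")
    coords = [int(v) for v in parts[:8]]
    # Pair up the flat list into four (x, y) corner points.
    corners = list(zip(coords[0::2], coords[1::2]))
    # The transcription may itself contain commas, so re-join the rest.
    text = ",".join(parts[8:])
    return corners, text


# Made-up annotation line for a single word.
corners, text = parse_annotation_line("10,20,110,20,110,50,10,50,HELLO")
print(corners, text)
```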

Connectionist Text Proposal Network (CTPN)

CTPN is a deep learning method that accurately predicts text lines in a natural image. It is an end-to-end trainable model consisting of both CNN and RNN layers. In general, the length of a text line varies a lot. To solve this problem, the authors of the paper treat a text line as a sequence of fine-scale text proposals, where each proposal has a fixed width of 16 pixels and a varying height. Let’s see the image below.

In the above figure, each vertical rectangular box is a fine-scale text proposal. The model’s architecture is shown in the figure below:

The input image is sent to a VGG-16 model, and the features output by its conv_5 layer (the layer just before the fully connected layers) are taken. A sliding window of size 3x3 is moved over these features, and the windows in each row are fed sequentially to an RNN that consists of a 256D bi-directional LSTM. This LSTM layer is connected to a 512D fully connected layer, which then produces the outputs.

Now let’s see how this algorithm generates its output.

  • This algorithm uses anchor boxes to detect text of different heights. If we use k anchor boxes, the output at each location consists of three main parts.
  • The first is 2k vertical coordinates: for each anchor box, its y coordinate (the center position of the box) and its height.
  • The second is 2k text/non-text scores.
  • The third is k side-refinement offsets.
  • The authors use 10 anchor boxes with heights varying from 11 to 273 pixels. The horizontal location and the 16-pixel width of each anchor are fixed, so only the vertical position and height are predicted.
  • On the basis of the text/non-text scores, sequential text proposals are merged to form text lines. The side-refinement offsets are used to refine the two end points of a text line.
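The numbers in the list above can be tied together with a small sketch. It makes several assumptions that are not spelled out in this blog: the anchor heights follow the CTPN paper’s rule of dividing by 0.7 at each step, proposals are chained when they are less than 50 pixels apart with a vertical overlap above 0.7, and the overlap is measured against the smaller height. All function names are hypothetical, and the merging is a simplified single-row version of the real text-line construction.

```python
def ctpn_anchor_heights(k=10, min_height=11.0, ratio=0.7):
    """k anchor heights, each one the previous height divided by `ratio`."""
    return [min_height / ratio ** i for i in range(k)]


def outputs_per_location(k):
    """Number of values predicted at each sliding-window position."""
    return {"vertical_coords": 2 * k, "text_scores": 2 * k, "side_refinement": k}


def vertical_overlap(a, b):
    """Overlap of two proposals (x, y_center, height, score), relative to the smaller height."""
    top = max(a[1] - a[2] / 2, b[1] - b[2] / 2)
    bottom = min(a[1] + a[2] / 2, b[1] + b[2] / 2)
    return max(0.0, bottom - top) / min(a[2], b[2])


def merge_proposals(proposals, score_thresh=0.7, max_gap=50, min_overlap=0.7, width=16):
    """Chain left-to-right proposals into text lines, returned as (x_start, x_end)."""
    # Keep confident proposals only, sorted by their x position.
    kept = sorted(p for p in proposals if p[3] > score_thresh)
    lines = []
    for p in kept:
        if lines:
            last = lines[-1][-1]
            close = p[0] - last[0] < max_gap
            if close and vertical_overlap(last, p) > min_overlap:
                lines[-1].append(p)
                continue
        lines.append([p])
    return [(line[0][0], line[-1][0] + width) for line in lines]


heights = [round(h) for h in ctpn_anchor_heights()]
sizes = outputs_per_location(10)

# Made-up proposals: (x, y_center, height, text_score).
proposals = [
    (0, 20, 30, 0.9), (16, 21, 30, 0.95), (32, 20, 28, 0.8),
    (48, 20, 30, 0.2),                      # low score, dropped
    (200, 20, 30, 0.9), (216, 20, 30, 0.9),  # far away, second line segment
]
text_lines = merge_proposals(proposals)
print(heights, sizes, text_lines)
```

The side-refinement offsets are left out here; in the full method they would adjust the left and right ends of each merged line.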

To implement it in our OCR pipeline, we can use its GitHub repository. To get it working, use the following steps:

  • Clone the repository into your directory: “git clone https://github.com/eragonruan/text-detection-ctpn.git”
  • Go to the “text-detection-ctpn-banjin-dev” directory
  • Run the setup commands from the repository README one by one
  • Download the pretrained checkpoint from Google Drive
  • Extract it and put checkpoints_mlt/ in text-detection-ctpn/
  • Now put your images in data/demo; the output will be written to data/res
  • Now run the demo command from the README to check the outputs

You can also train this model on your own data; just follow the steps provided in the GitHub repository.

A Single Shot Oriented Scene Text Detector (TextBoxes++)

It is a fast, end-to-end trainable scene text detector that can even detect oriented text in the image. It does not require any post-processing except non-maximum suppression. The basic idea is taken from the object detection algorithm SSD (single shot detector). SSD aims to detect general objects in an image, but it does not perform well when applied to text. TextBoxes++ was introduced to improve on this for text datasets. Let’s see the model’s architecture:

The first 13 layers are from the VGG-16 model. The 2 fully connected layers of VGG-16 are then converted into convolution layers, which are followed by 8 more convolution layers. Finally, 6 Text-Box layers are connected to 6 different intermediate convolution layers of the model. These 6 Text-Box layers are the output layers, and at test time non-maximum suppression is applied to their combined results to keep the best predictions.

Text-Box layers are the key component of TextBoxes++. They are also convolutional layers, and they predict both the presence of text and the bounding box coordinates. The predictions include both oriented bounding boxes and minimum horizontal boxes. Text-Box layers are designed to tackle the problem of variable-length words.

TextBoxes++ has a GitHub repository, which also includes an implementation of CRNN (convolutional recurrent neural network) to recognize the text detected by TextBoxes++. To implement it, you can follow the directions in the repository. Here are some results of TextBoxes++.


That’s enough for text detection. In the next blog, we will learn about text recognition. Hope you enjoy reading.

Next Blog: Optical Character Recognition Pipeline: Text Recognition

If you have any doubts or suggestions, please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.