Tag Archives: CTPN

Implementation of Connectionist Text Proposal Network (CTPN)

In the previous blog we have learnt about CTPN algorithm and its architecture in detail. In this blog we will implement this algorithm using its GitHub repository to localize text in an image. We will use Linux operating system to do this.

Clone the Repository

Open a terminal window and clone the CTPN GitHub Repo using following command:

Build the Required Library

Non max suppression (NMS) and bounding box (bbox) utilities are written in cython. We need to generate .so file for these so that required files can be loaded into the library. We first need to change current directory to “/text-detection-ctpn/utils/bbox” using following commands:

Now run the following commands to build the library.

These commands will generate nms.so and bbox.so in the current directory.

Test the model

Now we can test the CTPN model. To test the model we first need to download the checkpoints. These checkpoints are already provided in the GitHub repository to test the model. You can download the checkpoints from google drive. Now use following steps:

  1. Unzip the downloaded checkpoints.
  2. Place the unzipped folder “checkpoints_mlt” in directory ” /text-detection-ctpn”.
  3. Put your testing images in /data/demo/ folder and your outputs will be generated in /data/res folder.
  4. Your folder structure will look like follows.

Now run the following command from terminal to test your input images. Change your directory to ” “/text-detection-ctpn” first.

Your output must have been generated on data/res folder. Some of the input and results are shown below.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: Detecting Text in Natural Image with Connectionist Text Proposal Network

Referenced GitHub Code: text-detection-ctpn

Connectionist Text Proposal Network (CTPN)

Nowadays thousands of organizations worldwide rely on optical character recognition (OCR) systems to extract machine-readable text from printed paper documents. These OCR systems are widely used in various applications such as ID cards reading, automatic data entry from documents, number plate recognition from vehicles, etc.

Text localization is an important aspect of building such OCR systems. In this blog, we will learn a deep learning algorithm to localize text in an image.

Introduction

CTPN algorithm refers to the connectionist text proposal network. This name is given to the algorithm because it detects text lines in a sequence of fine text proposals. If you are thinking about what are these fine text proposal, don’t worry, we will discuss about text proposals in detail later in this blog. This CTPN algorithm is an end to end trainable deep learning model. This algorithm is also really helpful in localizing extremely ambiguous text.

There are many problems associated with text localization in natural scene images. Some of them are a Highly cluttered background, large variance in the text pattern, occlusions in image, distortion, and orientation of the text.

To overcome these challenges researchers are working for many years. There are two basic approaches. One is the conventional approach and the other is modern deep learning approaches which also include the CTPN algorithm.

The conventional approaches consist of a multi-stage pipeline. These algorithms basically follow bottom-up approaches. They start with low-level character detection and then follow multi-stages such as non-text component filtering, then text line construction and verification. These approaches heavily rely on every stage in their pipeline. But in deep learning, we can cut off these multi-stages into end-to-end trainable models.

Researchers also tried to use object detection algorithms like faster R-CNN to detect text in an image. But these object detection algorithms are difficult to apply in scene text detection due to the requirement of more accurate localization.

CTPN Algorithm

Now we will look into the CTPN algorithm in detail. First, we will see all the stages in the following CTPN network architecture and then see them in detail.

  1. Firstly input image is passed through a pretrained VGG16 model (trained with ImageNet dataset).
  2. Features output from the last convolutional maps of the VGG16 model is taken.
  3. These outputs are passed through a 3×3 spatial window.
  4. Then outputs after a 3×3 spatial window are passed through a 256-D bi-directional Recurrent Neural Network (RNN).
  5. The recurrent output is then fed to a 512-D fully connected layer.
  6. Now comes the output layer which consists of 3 different outputs, 2k vertical coordinates, 2k text/non-text scores and k side refinement values.

VGG Network

CTPN uses a pretrained VGG16 model shown above. The algorithm takes the output from the last convolutional maps. And the output feature size depends on the size of the input images. Also during the training of the CTPN model, the parameters of the first two convolutional maps are fixed and rest are trained.

3×3 Spatial Window and Recurrent Layer

A single small 3×3 spatial window is passed through outputs from the VGG network to extract useful features. Since textual data is also considered as sequential data, it is beneficial to use a recurrent neural network. After that, a fully connected layer is used to produce the next output layer.

Output Layer

The first output consists of 2k vertical coordinates, where k is the number of anchor boxes. Every anchor box output contains its y coordinate for the center of the box and height of the box. These anchor boxes are fine-scale text proposals whose width is 16 pixels shown in the diagram.

A total of 10 anchor boxes are taken whose heights vary from 11 to 273 pixels.

The second outputs are 2k text/non-text scores. For each anchor box, the output layer also contains text/non-text scores. It includes one output for classification between foreground and background and another output is for the positive or negative anchor. The positive or negative anchor is being decided on the basis of the IOU overlap with the Ground Truth box.

The third outputs are k side-refinements. In CTPN we fix the width of fine-scale text proposal to 16 pixels but this can be problematic in some cases where some side text proposals are discarded due to low score. So in the output layer, it also predicts side refinement values for the x-axis.

Now, you might have got some feeling about CTPN network. In the next blog, we will implement a CTPN algorithm from the GitHub code. Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: Detecting Text in Natural Image with Connectionist Text Proposal Network

Optical Character Recognition Pipeline: Text Detection and Segmentation Part-II

In the last blog, we have seen what is text detection and different types of algorithms to perform it, In this blog, we will learn more about text detection algorithms.

Efficient and Accurate Scene Text Detector(EAST)

It is a deep learning text detection method which has two stages one is fully convolutional network(FCN) and second is non-max suppression(NMS) merging stage. In FCN it uses U-shape network which directly produces text regions either word level or text line level. Here is the diagram of FCN used in the algorithm.

U-shape FCN uses features from different layers of PVANet and then merge them to produce the outputs. The yellow boxes are different layers of PVANet and green boxes are merging layers of feature extracted from PVANet. The reason behind this merging branch is to produce outputs for both small word regions and large word regions. Low-level features will help in finding small word regions and high-level features will help in finding large word regions. This network will output geometries either in the form of RBOX(containing 5 values of which 4 are top and left coordinate, width, height and one is rotation angle) or QUAD( 4 coordinates of a rectangle) with one score map to tell about the confidence level prediction of text in it.

In second, NMS merging stage, it uses thresholding to exclude out overlapping geometries and produce the most accurate geometries for the text regions.

To implement it in our OCR pipeline, we can use it’s GitHub Repository. To make it workable use the following steps:

  1. Clone the repository into your directory: ” git clone https://github.com/argman/EAST.git”
  2. Download its pretrained model and put inside EAST directory.
  3. Before testing it you need to compile the lanms.
  4. To test this model, go to your EAST directory and then run following command from terminal:

You can also train this model with your dataset either from scratch or use pre-trained model provided earlier. To train this model you need to provide dataset path and dataset should consist of training images with corresponding text file which will have coordinates of text present in the image.

Connectionist Text Proposal Network(CTPN)

CTPN is a deep learning method that accurately predicts text lines in a natural image. It is an end to end trainable model consists of both CNN and RNN layers. In general, the length of a text line varies frequently. To solve this problem authors of this paper have considered text lines as a sequence of fine-scale text proposals, where each proposal are having a fixed width of 16 pixels with varying height. Let’s see the below image.

In the above figure, each vertical rectangular box is a fine text proposal. To go through model’s architecture see below figure:

The input image is being sent to VGG-16 model. Features output from conv_5 layer(the layer just before fully connected layers) of VGG-16 model is taken. A sliding window of size 3X3 is moved over VGG-16 output features and then fed sequentially to RNN network which consists of 256D bi-directional LSTM. This LSTM layer is connected to 512D fully connected layer which will next produce the outputs.

Now see the generation of output using this algorithm.

  • This algorithm uses anchor boxes to detect the text of different height. Let say we use k anchor boxes then output will consist of three main parts.
  • One is 2k vertical coordinates where each anchor box have its y coordinate (center position of box) and height of anchor box.
  • Second 2k text/non-text scores and,
  • third is k side refinement offset.
  • Here they have used 10 anchor boxes of varying height between 11 to 273 pixels. For this they have fixed the horizontal location and predicted only the vertical heights.
  • On the basis of text/non-text scores, sequential text proposal are merged and text-lines are formed. Side refinement offsets are used to refine the two end points of a text line.

To implement it in our OCR pipeline, we can use it’s GitHub Repository. To make it workable use the following steps:

  • Clone the repository into your directory: ” git clone https://github.com/eragonruan/text-detection-ctpn.git”
  • Go to “text-detection-ctpn-banjin-dev” directory
  • Run following command one by one:
  • Download pretrained checkpoint from google drive
  • Extract it and put checkpoints_mlt/ in text-detection-ctpn/
  • Now put your text file in data/demo and output will be in data/res
  • Now run the following command to check the outputs

You can also train this model using your own data, just follow the steps provide in GitHub Repository.

A Single Shot Oriented Scene Text Detector(TextBoxes++)

It is an end-to-end trainable fast scene text detector which can even detect oriented text present in the image. It does not require any post processing except non-maximum suppression. The basic idea is taken from the object detection algorithm SSD(single shot detector). SSD aims to predict general objects in an image but when it comes for text detection it fails. To improve this on text dataset TextBoxes++ have been introduced. Let’s see the model’s architecture:

First 13 layers are from VGG16 model. Then 2 fully connected layers of VGG-16 are converted into convolution layers which are followed by 8 convolution layers. Finally, 6 Text-Box layers are connected to 6 different intermediate convolution layers of the model. These 6 Text-Box layers are output layer and at test time non-max separation is applied to merge the result of these 6 to predict the best ones.

Text-Box layers are the key component of TextBoxes++. These are also convolutional layer which predicts both presences of text and bounding box coordinates. It includes both oriented bounding boxes and minimum horizontal boxes. Text-Box layers are designed to tackle the problem of variable length words.

You can find it’s GitHub Repository here. In GitHub they have also implemented CRNN(convolution recurrent neural network) to recognize text detected by the TextBoxes++. To implement it, you can follow their GitHub directions. Here are some results of TextBoxes++.

Source

That’s enough for text detection, in the next blog, we will learn about text recognition. Hope you enjoy reading.

Next Blog: Optical Character Recognition Pipeline: Text Recognition

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.