Nowadays, thousands of organizations worldwide rely on optical character recognition (OCR) systems to extract machine-readable text from printed documents. These OCR systems are widely used in applications such as reading ID cards, automating data entry from documents, and recognizing vehicle number plates.
Text localization is an important part of building such OCR systems. In this blog, we will learn about a deep learning algorithm that localizes text in an image.
Introduction
CTPN stands for Connectionist Text Proposal Network. The algorithm gets its name from the way it detects text lines: as a sequence of fine-scale text proposals. If you are wondering what these fine-scale text proposals are, don't worry; we will discuss them in detail later in this blog. CTPN is an end-to-end trainable deep learning model, and it is also very effective at localizing highly ambiguous text.
There are many challenges associated with text localization in natural scene images. Some of them are highly cluttered backgrounds, large variance in text patterns, occlusions in the image, distortion, and arbitrary orientation of the text.
Researchers have been working on these challenges for many years, and two basic approaches have emerged. One is the conventional approach; the other is the family of modern deep learning approaches, which includes the CTPN algorithm.
The conventional approaches consist of a multi-stage pipeline and generally follow a bottom-up strategy. They start with low-level character detection and then pass through multiple stages such as non-text component filtering, text line construction, and verification. These approaches depend heavily on the success of every stage in the pipeline. With deep learning, we can collapse these multiple stages into a single end-to-end trainable model.
Researchers have also tried to apply generic object detection algorithms such as Faster R-CNN to detect text in an image. However, these algorithms are difficult to apply directly to scene text detection, because text demands much finer localization accuracy than general object detection.
CTPN Algorithm
Now we will look at the CTPN algorithm in detail. First, we will walk through all the stages of the CTPN network architecture, and then examine each one in detail.
- First, the input image is passed through a pretrained VGG16 model (trained on the ImageNet dataset).
- The feature maps output by the last convolutional layers of the VGG16 model are taken.
- These feature maps are passed through a 3×3 spatial sliding window.
- The outputs of the 3×3 spatial window are then passed through a 256-D bi-directional recurrent neural network (RNN).
- The recurrent output is then fed to a 512-D fully connected layer.
- Finally, the output layer produces 3 different outputs: 2k vertical coordinates, 2k text/non-text scores, and k side-refinement values, where k is the number of anchors.
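To make this pipeline concrete, here is a minimal sketch in PyTorch (the framework is my assumption; the paper itself uses a Caffe implementation). Layer sizes follow the description above, with k = 10 anchors per position as in the paper; this is an illustration, not the reference implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

# Minimal sketch of the CTPN pipeline described above (PyTorch assumed).
class CTPN(nn.Module):
    def __init__(self, k=10):
        super().__init__()
        vgg = models.vgg16(pretrained=True)
        self.backbone = vgg.features[:30]                # VGG16 up to conv5_3
        self.window = nn.Conv2d(512, 512, 3, padding=1)  # 3x3 spatial window
        self.rnn = nn.LSTM(512, 128, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, 512)                    # 512-D fully connected layer
        self.vertical = nn.Linear(512, 2 * k)            # 2k vertical coordinates
        self.score = nn.Linear(512, 2 * k)               # 2k text/non-text scores
        self.side = nn.Linear(512, k)                    # k side-refinement values

    def forward(self, x):                                # x: (N, 3, H, W)
        f = self.window(self.backbone(x))                # (N, 512, H/16, W/16)
        n, c, h, w = f.shape
        # Each row of the feature map becomes one sequence along the width.
        seq = f.permute(0, 2, 3, 1).reshape(n * h, w, c)
        out, _ = self.rnn(seq)                           # (N*H, W, 256)
        out = torch.relu(self.fc(out))                   # (N*H, W, 512)
        return self.vertical(out), self.score(out), self.side(out)
```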
VGG Network
CTPN uses the pretrained VGG16 model shown above and takes the output of its last convolutional layers as the feature map. The size of this feature map depends on the size of the input image, so inputs of arbitrary size can be handled. During training of the CTPN model, the parameters of the first two convolutional layers are kept fixed and the rest are trained.
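A small sketch of how this freezing might look in PyTorch (again an assumed framework; the layer indices refer to torchvision's VGG16 layout):

```python
from torchvision import models

# Take VGG16 pretrained on ImageNet and keep the conv backbone up to conv5_3.
vgg16 = models.vgg16(pretrained=True)
backbone = vgg16.features[:30]

# Freeze the first two convolutional layers (indices 0 and 2 in torchvision's
# layout are conv1_1 and conv1_2); everything else remains trainable.
for idx in (0, 2):
    for p in backbone[idx].parameters():
        p.requires_grad = False
```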
3×3 Spatial Window and Recurrent Layer
A small 3×3 spatial window is slid over the feature maps output by the VGG network to extract features at each position. Since text can naturally be treated as sequential data, it is beneficial to feed these features into a recurrent neural network. After that, a fully connected layer is used to produce the final output layer.
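To make the sequence interpretation concrete, here is a standalone sketch; the 37×56 size is illustrative, roughly what conv5 produces for a 600×900 input:

```python
import torch
import torch.nn as nn

# Each row of the conv5 feature map is treated as a left-to-right sequence
# along the image width, so text context flows horizontally through the RNN.
features = torch.randn(1, 512, 37, 56)        # (N, C, H, W) after conv5
seq = features.permute(0, 2, 3, 1).reshape(1 * 37, 56, 512)
rnn = nn.LSTM(512, 128, bidirectional=True, batch_first=True)
out, _ = rnn(seq)                             # (37, 56, 256): the 256-D output
fc = nn.Linear(256, 512)                      # the 512-D fully connected layer
hidden = torch.relu(fc(out))                  # fed to the output layer
```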
Output Layer
The first output consists of 2k vertical coordinates, where k is the number of anchor boxes. For every anchor box, the network predicts the y-coordinate of the box center and the height of the box. These anchor boxes are fine-scale text proposals with a fixed width of 16 pixels, as shown in the diagram.
A total of k = 10 anchor boxes are used per position, with heights varying from 11 to 273 pixels.
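Per the paper, the vertical outputs are relative offsets rather than absolute values: v_c = (c_y − c_y^a)/h^a and v_h = log(h/h^a), where c_y^a and h^a are the anchor's center y and height. A small decoding sketch (variable names are illustrative):

```python
import numpy as np

# The 10 anchor heights grow by dividing by 0.7 at each step, starting at
# 11 px (per the paper); exact rounded values vary between implementations.
heights = [11]
for _ in range(9):
    heights.append(int(round(heights[-1] / 0.7)))

# Decode the predicted relative vertical coordinates back to absolute values,
# following v_c = (cy - cy_a) / h_a and v_h = log(h / h_a) from the paper.
def decode_vertical(v_c, v_h, cy_anchor, h_anchor):
    cy = v_c * h_anchor + cy_anchor   # absolute center y of the proposal
    h = np.exp(v_h) * h_anchor        # absolute height of the proposal
    return cy, h
```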
The second output consists of 2k text/non-text scores. For each anchor box, the output layer predicts two scores, classifying the anchor as text (a positive anchor) or non-text (a negative anchor). Whether an anchor counts as positive or negative during training is decided on the basis of its IoU overlap with the ground-truth box.
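The paper labels an anchor positive if its IoU with a ground-truth box exceeds 0.7 (or if it has the highest IoU with that box) and negative if the IoU is below 0.5. A minimal IoU sketch:

```python
# Minimal IoU computation used to label anchors during training.
# Boxes are (x1, y1, x2, y2); thresholds follow the paper: >0.7 positive,
# <0.5 negative (the anchor with the highest overlap is also positive).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)
```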
The third output consists of k side-refinement values. In CTPN, the width of each fine-scale text proposal is fixed at 16 pixels, which can be problematic in cases where side proposals are discarded due to low scores and the ends of a text line are cut off. To compensate, the output layer also predicts a side-refinement offset along the x-axis for each anchor, which refines the left and right sides of the detected text line.
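In the paper, this offset is parameterized as o = (x_side − c_x^a)/w^a with the fixed anchor width w^a = 16, so decoding is a one-liner (names illustrative):

```python
# Decode a side-refinement offset into the refined x coordinate of the
# left or right side of the text line, per o = (x_side - cx_a) / w_a.
def decode_side(o, cx_anchor, w_anchor=16.0):
    return o * w_anchor + cx_anchor
```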
By now, you should have a good feel for the CTPN network. In the next blog, we will implement the CTPN algorithm from the GitHub code. Hope you enjoyed reading.
If you have any doubts or suggestions, please feel free to ask, and I will do my best to help or improve myself. Good-bye until next time.
Referenced Research Paper: Detecting Text in Natural Image with Connectionist Text Proposal Network