Before deep learning was introduced to the field of text detection, most text segmentation approaches struggled in challenging scenarios. Conventional approaches rely on manually designed features, whereas deep learning methods learn effective features from training data. Conventional approaches are also usually multi-stage, which ends up slightly lowering overall performance. In this blog, we will learn about EAST, a deep learning-based algorithm that detects text with a single neural network and eliminates the multi-stage pipeline.
Introduction
The EAST algorithm uses a single neural network to predict word-level or line-level text. It can detect text in arbitrary orientations using quadrilateral shapes. In 2017, this algorithm outperformed the state-of-the-art methods. It consists of a fully convolutional network followed by a non-max suppression (NMS) merging stage. The fully convolutional network localizes text in the image, and the NMS stage merges the many imprecise detected text boxes into a single bounding box for every text region (word or line of text).
EAST Network Architecture
The EAST architecture was designed with different sizes of word regions in mind. Detecting large word regions requires features from the later stages of a neural network, while detecting small word regions requires low-level features from the initial stages. To build this network, the authors combined three branches into a single neural network.
1. Feature Extractor Stem
This branch of the network extracts features from different layers of the network. The stem can be a convolutional network pretrained on the ImageNet dataset. The authors of EAST experimented with both PVANet and VGG16. In this blog, we will look at the EAST architecture with the VGG16 network only. Let’s see the architecture of the VGG16 model.
For the stem of the architecture, the outputs of the VGG16 model after the pool2, pool3, pool4, and pool5 layers are taken.
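To make the four stem outputs concrete, here is a small sketch that computes their shapes for a given input size. The strides and channel counts follow the standard VGG16 layout (each pooling layer halves the resolution); the function name and the 512×512 input are only illustrative choices, not something taken from the EAST code.

```python
# Downsampling factor and channel count after each VGG16 pooling layer.
# These follow from the standard VGG16 layout (each 2x2 max-pool halves H and W).
VGG16_POOL_OUTPUTS = {
    "pool2": (4, 128),
    "pool3": (8, 256),
    "pool4": (16, 512),
    "pool5": (32, 512),
}


def stem_feature_shapes(height, width):
    """Return {layer: (channels, h, w)} for an input image of the given size."""
    if height % 32 or width % 32:
        # Keeping sizes divisible by 32 avoids rounding in the pooling layers.
        raise ValueError("height and width should be multiples of 32")
    return {
        name: (channels, height // stride, width // stride)
        for name, (stride, channels) in VGG16_POOL_OUTPUTS.items()
    }


shapes = stem_feature_shapes(512, 512)
for name, shape in shapes.items():
    print(name, shape)
```

The later the layer, the smaller the feature map and the larger the region of the image each position sees, which is why the later layers suit large word regions.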
2. Feature Merging Branch
This branch of the EAST network merges the feature outputs from different layers of the VGG16 network. The input image is passed through the VGG16 model and the outputs of four different layers are taken. Merging all of these feature maps at once would be computationally expensive. That’s why EAST uses a U-Net-style architecture to merge the feature maps gradually (see EAST architecture figure). First, the outputs after the pool5 layer are upsampled to double their size using an unpooling layer. The size of these features now equals that of the outputs from the pool4 layer, and the two are concatenated. Then Conv 1×1 and Conv 3×3 layers are applied to fuse the information and produce the output of this merging stage.
Similarly, the outputs from the other layers of the VGG16 model are concatenated, and finally a Conv 3×3 layer is applied to produce the final feature map before the output layer.
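One merging step can be sketched in plain Python on tiny toy feature maps. This is only an illustration under some assumptions: upsampling is done by simple value repetition as a stand-in for the network's upsampling layer, the 1×1 convolution uses fixed averaging weights instead of learned ones, and the 3×3 fusion convolution is left out for brevity. All function names are hypothetical.

```python
def unpool2x(fmap):
    """Double height and width of a (C, H, W) nested list by repeating values."""
    return [
        [[v for v in row for _ in range(2)] for row in channel for _ in range(2)]
        for channel in fmap
    ]


def concat_channels(a, b):
    """Concatenate two (C, H, W) feature maps along the channel axis."""
    if (len(a[0]), len(a[0][0])) != (len(b[0]), len(b[0][0])):
        raise ValueError("spatial sizes must match before concatenation")
    return a + b


def conv1x1(fmap, weights):
    """Apply a 1x1 convolution without bias; weights has shape (C_out, C_in)."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(wk * fmap[k][y][x] for k, wk in enumerate(out_w)) for x in range(w)]
         for y in range(h)]
        for out_w in weights
    ]


def shape(fmap):
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


def constant_map(channels, h, w, value):
    return [[[value] * w for _ in range(h)] for _ in range(channels)]


# Toy stand-ins for the pool5 (coarse) and pool4 (finer) outputs.
pool5 = constant_map(4, 2, 2, 1.0)
pool4 = constant_map(4, 4, 4, 2.0)

upsampled = unpool2x(pool5)                  # now the same size as pool4
merged = concat_channels(upsampled, pool4)   # channels add up: 4 + 4 = 8
weights = [[1.0 / 8] * 8 for _ in range(2)]  # 8 -> 2 channels, plain averaging
reduced = conv1x1(merged, weights)           # fewer channels, same H and W

print(shape(upsampled), shape(merged), shape(reduced))
```

The same step would then be repeated with the pool3 and pool2 outputs, each time doubling the resolution before concatenating.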
3. Output Layer
The output layer consists of a score map and a geometry map. The score map gives the probability of text at each location, while the geometry map defines the boundary of the text box. The geometry map can be either a rotated box (RBOX) or a quadrangle (QUAD). In the paper, a rotated box is encoded per pixel as the four distances from that pixel to the top, right, bottom, and left sides of the box, plus a rotation angle. A quadrangle is encoded as the coordinate offsets from the pixel to its four corner vertices.
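To see how a score map and a rotated-box geometry map turn into actual boxes, here is a minimal decoding sketch. It assumes the per-pixel encoding used in the EAST paper (four distances to the box sides plus an angle) and an output map at 1/4 of the input resolution. The threshold value, the angle sign convention, and the function names are assumptions made for illustration.

```python
import math


def decode_rbox(x, y, top, right, bottom, left, angle):
    """Return the four corners of a rotated box, starting at the top-left.

    (x, y) is the pixel position in image coordinates, the four distances
    are measured from that pixel to the box sides, and angle is in radians.
    """
    # Corners in a box-aligned frame centred on the pixel.
    local = [(-left, -top), (right, -top), (right, bottom), (-left, bottom)]
    c, s = math.cos(angle), math.sin(angle)
    # Rotate each corner about the pixel, then shift to image coordinates.
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in local]


def decode_score_map(score_map, geometry_map, threshold=0.8, stride=4):
    """Keep pixels whose score passes the threshold and decode their boxes."""
    boxes = []
    for row, scores in enumerate(score_map):
        for col, score in enumerate(scores):
            if score < threshold:
                continue
            top, right, bottom, left, angle = geometry_map[row][col]
            corners = decode_rbox(col * stride, row * stride,
                                  top, right, bottom, left, angle)
            boxes.append((corners, score))
    return boxes


# One row with two pixels; only the second one is confident enough.
score_map = [[0.1, 0.9]]
geometry_map = [[(0, 0, 0, 0, 0.0), (2, 6, 2, 2, 0.0)]]
boxes = decode_score_map(score_map, geometry_map)
print(boxes)
```

Every confident pixel inside a text region produces its own box, so a single word yields many overlapping candidates. These are what the NMS stage later merges.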
Note: For more details on the Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.
Loss Function
The loss function used in the EAST algorithm combines a score map loss (L_s) and a geometry loss (L_g):
L = L_s + λ · L_g
As you can see in the above formula, the two losses are combined with a weight λ, which controls the relative importance of the geometry loss against the score map loss. In the EAST paper, the authors set it to 1.
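The weighted combination can be sketched as follows. The two component losses here are written from my reading of the paper (a class-balanced cross-entropy for the score map, and an IoU loss plus an angle loss for rotated boxes), so treat their exact form, the per-pixel averaging, the `lambda_theta=10.0` default, and all function names as assumptions rather than the reference implementation.

```python
import math


def balanced_cross_entropy(pred, target, eps=1e-7):
    """Class-balanced cross-entropy over flattened score maps (lists of floats)."""
    n = len(target)
    beta = 1.0 - sum(target) / n  # fraction of non-text pixels
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += (-beta * t * math.log(p)
                  - (1.0 - beta) * (1.0 - t) * math.log(1.0 - p))
    return total / n  # averaging over pixels is an assumption


def rbox_geometry_loss(pred, target, lambda_theta=10.0, eps=1e-7):
    """IoU loss on the side distances plus an angle loss, for one pixel.

    pred and target are (top, right, bottom, left, angle) tuples.
    """
    pt, pr, pb, pl, pa = pred
    tt, tr, tb, tl, ta = target
    area_pred = (pt + pb) * (pr + pl)
    area_true = (tt + tb) * (tr + tl)
    inter = (min(pt, tt) + min(pb, tb)) * (min(pr, tr) + min(pl, tl))
    union = area_pred + area_true - inter
    iou_loss = -math.log(max(inter, eps) / max(union, eps))
    angle_loss = 1.0 - math.cos(pa - ta)
    return iou_loss + lambda_theta * angle_loss


def east_loss(score_loss, geometry_loss, lam=1.0):
    """Combine the two terms as L = L_s + lam * L_g."""
    return score_loss + lam * geometry_loss


score_loss = balanced_cross_entropy([0.9, 0.1], [1.0, 0.0])
geometry_loss = rbox_geometry_loss((2, 6, 2, 2, 0.0), (2, 6, 2, 2, 0.0))
total = east_loss(score_loss, geometry_loss)
print(score_loss, geometry_loss, total)
```

With a perfect geometry prediction the geometry term vanishes, so the total here comes only from the score map term.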
Non-max Suppression Merging Stage
The geometries predicted by the fully convolutional network are first filtered by a threshold. The remaining geometries are then suppressed using locality-aware NMS. Naive NMS runs in O(n²). To bring this down to O(n) in the best case, the authors adopted a method that works row by row: each geometry is compared with the last merged one and, if the two overlap enough, merged into it. This makes the algorithm fast in most cases, but the worst-case time complexity is still O(n²).
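The row-by-row idea can be sketched like this. For simplicity the sketch uses axis-aligned boxes `(x1, y1, x2, y2, score)` instead of rotated ones, and the overlap threshold of 0.3 is an assumed value. Boxes are expected in row-major order, the order in which they come out of the score map. When two neighbouring boxes overlap enough they are averaged with their scores as weights, and standard NMS is run on whatever remains.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def weighted_merge(a, b):
    """Average the coordinates weighted by score; the scores are summed."""
    score = a[4] + b[4]
    coords = [(a[i] * a[4] + b[i] * b[4]) / score for i in range(4)]
    return (*coords, score)


def standard_nms(boxes, threshold):
    """Greedy NMS: keep the best box, drop others that overlap it too much."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= threshold for k in kept):
            kept.append(box)
    return kept


def locality_aware_nms(boxes, threshold=0.3):
    """Merge neighbouring boxes in one pass, then run standard NMS."""
    merged = []
    last = None
    for box in boxes:  # boxes must be in row-major order
        if last is not None and iou(box, last) > threshold:
            last = weighted_merge(box, last)
        else:
            if last is not None:
                merged.append(last)
            last = box
    if last is not None:
        merged.append(last)
    return standard_nms(merged, threshold)


candidates = [
    (0, 0, 10, 10, 0.9),    # two overlapping boxes from the same word
    (1, 0, 11, 10, 0.9),
    (50, 50, 60, 60, 0.8),  # a separate text region
]
result = locality_aware_nms(candidates)
print(result)
```

The single pass over neighbours is what gives the O(n) best case. The final standard NMS on the (usually much smaller) merged list is where the O(n²) worst case comes from.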
This was all about the Efficient and Accurate Scene Text (EAST) detector. In the next blog, we will implement this algorithm using its GitHub repository. Hope you enjoyed reading.
If you have any doubts or suggestions, please feel free to ask and I will do my best to help or improve myself. Goodbye until next time.
Referenced Research Paper: EAST: An Efficient and Accurate Scene Text Detector