In the previous blogs, we discussed different pre-processing techniques such as noise removal, skew correction, etc. The main objective of these pre-processing steps was to make the image suitable for the next pipeline components such as text detection and recognition. Now, in this blog, let’s understand the text detection step in detail.
Text Detection
Text detection simply means finding the regions in the image where text may be present. For instance, see the below image where green-colored bounding boxes are drawn around the detected text.
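Just to make the output format concrete, here is a minimal sketch of how such boxes are usually drawn with OpenCV. The image path and the box coordinates are placeholders standing in for the output of some detector, not results from an actual method.

```python
import cv2

# Load an input image (the file name here is just an example)
image = cv2.imread("sample.jpg")

# Placeholder boxes (x, y, width, height) standing in for a detector's output
detected_boxes = [(50, 30, 120, 40), (60, 90, 200, 45)]

# Draw a green rectangle around each detected text region
for (x, y, w, h) in detected_boxes:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detections.jpg", image)
```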
While performing text detection, you may encounter two types of cases:
- Images with Structured text: This refers to images that have a clean/uniform background with a regular font. Text is mostly dense, with a proper row structure and uniform text color. For instance, see the below image.
- Images with Unstructured text: This refers to images with sparse text on a complex background. The text can have different colors, sizes, fonts, and orientations, and can be present anywhere in the image. Performing text detection on these images is known as scene text detection. For instance, see the below image.
Now, if I ask which of the above two cases looks more challenging, the answer would obviously be scene text detection, due to the various complexities discussed above. And that’s why this is an active research topic in computer vision.
Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.
While performing text detection, you have 3 options. You can do
- Character-by-Character detection
- Word-by-Word detection
- Line-by-Line detection
All three are shown below.
Nowadays, we mostly prefer doing word or line detection. This is because character detection is generally slower and somewhat more challenging than the other two.
Broadly, text detection methods can be classified into 2 categories:
- Conventional methods
- Deep-learning based methods
Conventional methods rely on manually designed features. For instance, Stroke Width Transform (SWT) and Maximally Stable Extremal Regions (MSER) based methods generally extract character candidates via edge detection or extremal region extraction. In deep-learning-based methods, on the other hand, features are learned from the training data. These are generally better than the conventional ones in terms of both accuracy and adaptability in challenging scenarios.
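To get a feel for the conventional approach, here is a minimal sketch of extracting character candidates with OpenCV’s built-in MSER detector. The image path is an assumption, and note that these raw candidates would still need filtering and grouping before they form word-level detections.

```python
import cv2

# Read the image and convert to grayscale (MSER works on a single-channel image)
image = cv2.imread("sample.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Create an MSER detector and extract extremal regions (character candidates)
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)

# Draw a bounding box around each candidate region
for (x, y, w, h) in bboxes:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 1)

cv2.imwrite("mser_candidates.jpg", image)
```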
Further, the deep-learning-based methods can be classified into:
- Multi-step methods
- Simplified-pipeline methods
To understand these, take a look at the below image where the pipeline of several state-of-the-art text detection methods is shown. The first 3 methods (a, b, c) fall into the multi-step category (each box denotes 1 step), while the last 2 (d, e) are the ones with a simplified pipeline.
In this series, we will be mainly focusing on the methods with the simplified pipeline. By the way, the last 2 methods (d, e) shown above are known as Connectionist Text Proposal Network (CTPN) and Efficient and Accurate Scene Text Detector (EAST) respectively. Both of these are very well-known text detection methods!
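As a small preview before the next blog, here is a sketch of running the commonly distributed pre-trained EAST model through OpenCV’s dnn module. The model file name and layer names below refer to that publicly available frozen TensorFlow graph (an assumption on my part, not something specific to this series), and the decoding of the network outputs into boxes is deliberately left out.

```python
import cv2

# Load the pre-trained EAST model (the commonly shared frozen TensorFlow graph;
# the file name here is an assumption)
net = cv2.dnn.readNet("frozen_east_text_detection.pb")

image = cv2.imread("scene.jpg")

# EAST expects the input width and height to be multiples of 32
blob = cv2.dnn.blobFromImage(image, 1.0, (320, 320),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)

# The two output layers give text/no-text scores and box geometry
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])

print(scores.shape, geometry.shape)  # e.g. (1, 1, 80, 80) and (1, 5, 80, 80)
# Decoding the geometry into rotated boxes and applying non-maximum
# suppression is left for the detailed EAST discussion.
```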
In the next blog, let’s discuss the EAST algorithm in detail. Till then, have a great time. Hope you enjoy reading.
If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.