One of the most important modules in an optical character recognition (OCR) pipeline is text detection and segmentation, also called text localization. In the previous blog, we saw various techniques for pre-processing the input image, which can help improve OCR accuracy. In this blog, we will learn how to localize text in an image so that we can crop it out and feed it to our text recognition module, which predicts the text it contains.
What is text detection and segmentation?
It is the process of localizing every occurrence of text in the image and dividing it into meaningful units such as characters, words, and text lines, then producing a segment for each unit. Detection methods can be grouped by which of these units they target.
Character-based detection first detects individual characters and then groups them into words. One way to do this is to locate characters by classifying extremal regions (for example, MSER) and then group the detected characters with an exhaustive search method.
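To make the grouping step concrete, here is a minimal sketch. It is not the exhaustive search mentioned above but a much simpler greedy stand-in: it assumes the character boxes all lie on one text line, are given as (x, y, w, h) tuples, and that two characters belong to the same word when the horizontal gap between them is small relative to the median character height. The function names and the `max_gap_ratio` parameter are hypothetical, chosen only for illustration.

```python
def merge_boxes(boxes):
    """Return the smallest (x, y, w, h) box that encloses all given boxes."""
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[0] + b[2] for b in boxes)
    y2 = max(b[1] + b[3] for b in boxes)
    return (x1, y1, x2 - x1, y2 - y1)


def group_chars_into_words(char_boxes, max_gap_ratio=0.5):
    """Greedily group character boxes on a single text line into word boxes.

    Assumed heuristic: a character joins the current word when the gap to
    the word's right edge is at most max_gap_ratio * median character height.
    """
    if not char_boxes:
        return []

    # Process characters left to right.
    boxes = sorted(char_boxes, key=lambda b: b[0])
    heights = sorted(b[3] for b in boxes)
    max_gap = max_gap_ratio * heights[len(heights) // 2]

    words = [[boxes[0]]]
    right_edge = boxes[0][0] + boxes[0][2]
    for box in boxes[1:]:
        gap = box[0] - right_edge
        if gap <= max_gap:
            words[-1].append(box)                 # same word
            right_edge = max(right_edge, box[0] + box[2])
        else:
            words.append([box])                   # gap too wide: new word
            right_edge = box[0] + box[2]

    return [merge_boxes(word) for word in words]
```

In practice the threshold would need tuning per font and resolution, and multi-line images would first have to be split into lines.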
Word-based detection usually works much like object detection, with each word treated as an object. You can use object detectors such as Faster R-CNN or YOLO to perform this.
Text-line-based detection detects text lines and then breaks them into individual words.
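The blog does not say how a detected line is broken into words, so the following is only one possible sketch: a vertical projection profile. It assumes the line has already been cropped and binarized into rows of 0/1 values (1 meaning ink), and that a run of at least `min_gap` blank columns separates two words. The function name and the `min_gap` parameter are hypothetical.

```python
def split_line_into_words(line, min_gap=3):
    """Split a binarized text line into word column ranges.

    line: list of rows, each a list of 0/1 values (1 = ink).
    Returns a list of (start_col, end_col) pairs, end_col exclusive.
    """
    if not line:
        return []

    width = len(line[0])
    # A column "has ink" if any row contains a 1 in that column.
    has_ink = [any(row[c] for row in line) for c in range(width)]

    words = []
    start = None  # first column of the word being built, if any
    gap = 0       # number of blank columns seen since the last ink column
    for c, ink in enumerate(has_ink):
        if ink:
            if start is None:
                start = c
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The word ended at the last ink column, c - gap.
                words.append((start, c - gap + 1))
                start = None
                gap = 0

    if start is not None:
        # The line ended while a word was still open.
        words.append((start, width - gap))
    return words
```

Gaps narrower than `min_gap` are treated as spacing between characters, so the value has to be larger than the typical inter-character gap and smaller than the typical inter-word gap.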
There are basically two types of text images that are fed to the text recognition module as input. One is scanned documents, and the other is natural scene text such as street signs and storefront text.
Scanned Documents
Scanned documents generally contain hundreds or thousands of words. We can apply deep neural networks such as Faster R-CNN and YOLO to localize the words in a document. But sometimes these may fail to localize all the text in the image, because these algorithms are generally trained to detect only a small number of objects per image. In that case, we need to apply some post-processing after the deep nets to pick up the remaining text.
Another method we can use for scanned documents is Maximally Stable Extremal Regions (MSER), which is available in OpenCV.
MSER is a method for blob detection in images. Using it, we can get the coordinates of the text regions and then generate bounding boxes around each word in the image. Cropping the image along these boxes gives us the input images required by our text recognition module.
Natural Scenes
Natural scenes contain fewer words but bring other problems such as distortions, occlusions, directional blur, and cluttered backgrounds. To overcome these problems we need deep learning algorithms that are designed specifically for natural scene text and are robust to the above distortions. There are some robust open source algorithms available, such as EAST, CTPN, TextBoxes++, and PixelLink. These algorithms can also be used for localizing text in scanned documents, but then you need to do some post-processing to detect all the text in the image, as I mentioned earlier.
So far we have seen what text segmentation is and the different algorithms for localizing text in an image. In the next blog, we will dive deep into these algorithms and figure out how we can implement them in our OCR pipeline.
Next Blog: Optical Character Recognition Pipeline: Text Detection and Segmentation Part-II
Hope you enjoyed reading.
If you have any doubts or suggestions, please feel free to ask and I will do my best to help or improve myself. Goodbye until next time.