
CTC – Problem Statement

In the previous blog, we had an overview of the text recognition step. There we discussed that, in order to avoid character segmentation, two major techniques have been adopted: one is CTC-based and the other is attention-based. So, in this blog, let's first discuss the intuition behind the CTC algorithm, such as why we even need it and where it is used. Then, in the next blog, we will discuss the algorithm in detail. Here, we will understand it using the text recognition case. Let's get started.

As we have already discussed, in text recognition we are given a segmented image and our task is to recognize what text is present in that segment. Thus, for the text recognition problem, the input is an image while the output is text, as shown below.

So, in order to solve the text recognition problem, we need to develop a model that takes the image as input and outputs the recognized text. If you have ever taken any deep learning class, you must know that Convolutional Neural Networks (CNNs) are good at handling image data, while for sequence data such as text, Recurrent Neural Networks (RNNs) are preferred.

So, for the text recognition problem, an obvious choice would be to use a combination of a Convolutional Neural Network and a Recurrent Neural Network. Now, let's discuss how to combine a CNN and an RNN for the text recognition task. Below is one such architecture, taken from the famous CRNN paper.

In this architecture, the input image is first fed through a number of convolutional layers to extract feature maps. These feature maps are then divided into a sequence of feature vectors, shown in blue. These are obtained by dividing the feature maps into columns of single-pixel width. Now, a question might come to your mind: why are we dividing the feature maps by columns? The answer lies in the concept of the receptive field. The receptive field is defined as the region in the input image that a particular feature (here, a feature vector) is looking at. For instance, for the above input image, the receptive field of each feature vector corresponds to a rectangular region in the input image, as shown below.

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Each of these rectangular regions is ordered from left to right. Thus, each feature vector can be considered the image descriptor of its rectangular region. These feature vectors are then fed to a bi-directional LSTM. The LSTM output is passed through a softmax activation, so at each time step we get a probability distribution over the character set. To obtain the per-timestep prediction, we can, for instance, take the character with the maximum probability at each time step.
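To make this concrete, below is a minimal sketch of a CRNN-style model in Keras. The input shape, layer sizes, and character-set size are illustrative choices rather than the exact configuration from the CRNN paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_crnn(img_height=32, img_width=128, num_classes=37):
    # CNN backbone: extracts feature maps from a grayscale input image
    inputs = layers.Input(shape=(img_height, img_width, 1))
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)              # 16 x 64
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)              # 8 x 32
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 1))(x)              # 4 x 32 (height halved, width kept)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 1))(x)              # 2 x 32

    # Map-to-sequence: each single-pixel-wide column becomes one feature vector
    x = layers.Permute((2, 1, 3))(x)                # (width, height, channels)
    x = layers.Reshape((32, 2 * 256))(x)            # 32 timesteps, one per column

    # Bi-directional LSTM over the sequence of column features
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)

    # Per-timestep probability distribution over the character set (+1 for the CTC blank)
    outputs = layers.Dense(num_classes + 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_crnn()
model.summary()
```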

But as you might have noticed in the above image, these feature vectors sometimes may not contain a complete character. For instance, see the below image where the 2 feature vectors marked by red color contain some part of the character “S”.

Thus, in the LSTM output, we may get repeated characters as shown below by the red box. We call these per-frame or per-timestep predictions.

Now, here comes the problem. As we have already discussed, for text recognition the training data consists of the images and the corresponding text, as shown below.

Training Data for text recognition

Thus, we only know the final output; we don’t know the per-timestep predictions. Now, in order to train this network, we either need to know the per-timestep output for each input image, or we need to develop a mechanism to convert the per-timestep output to the final output (or vice-versa).

So, the problem is: how do we align the final output with the per-timestep predictions in order to train this network?

Approach 1

One thing we can do is devise a rule like “one character corresponds to a fixed number of time steps”. For instance, for the above image, if we have 10 timesteps, then we can expand “State” as “S, S, T, T, A, A, T, T, E, E” (repeating each character twice) and then train the network. But such a rule is easily violated by different fonts, writing styles, etc.

Approach 2

Another approach is to manually annotate the data for each time step, as shown below, and then train the network using this annotated data. The problem with this approach is that it will be very time-consuming for a reasonably sized dataset.

Annotation for each timestep

Clearly, both of the above naïve approaches have some serious downsides. So, isn’t there a more efficient way to solve this? This is where CTC comes into the picture.

Connectionist Temporal Classification (CTC)

CTC was introduced in 2006 and is used for training deep networks where alignment is a problem. With CTC, we need not worry about the alignments or the per-timestep predictions. CTC takes care of all the possible alignments internally, so we can train the network using only the final outputs.
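As a rough illustration, the snippet below shows how a CTC loss could be attached to the per-timestep softmax output of a network like the one sketched earlier, assuming TensorFlow/Keras. The shapes, dummy labels, and lengths are purely illustrative; the point is that only the final label sequences and their lengths are supplied, with no per-timestep alignment.

```python
import tensorflow as tf

# Dummy per-timestep softmax output: 4 samples, 32 timesteps, 37 characters + 1 CTC blank
batch_size, timesteps, num_classes = 4, 32, 38
y_pred = tf.nn.softmax(tf.random.uniform((batch_size, timesteps, num_classes)))

# Dummy final labels (character indices), padded to a fixed width of 10
y_true = tf.random.uniform((batch_size, 10), maxval=num_classes - 1, dtype=tf.int32)

input_length = tf.fill((batch_size, 1), timesteps)  # timesteps produced by the network
label_length = tf.fill((batch_size, 1), 5)          # true length of each label, e.g. len("State")

# CTC sums over all valid alignments internally, so no per-timestep annotation is needed
loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
print(loss.shape)  # one loss value per sample
```

So, in the next blog, let's discuss how the CTC algorithm actually works. Till then, have a great time. Hope you enjoy reading.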

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Text Recognition Datasets

In the previous blog, we built our own text recognition system from scratch using the very famous CNN+RNN+CTC based approach. As you might remember, we got pretty decent results. One thing we can do to further fine-tune our model is more training, but for that, we need more training data. So, in this blog, let’s discuss some of the open-source text recognition datasets available and how to create synthetic data for text recognition. Let’s get started.

Open Source Datasets

Below are some of the open source text recognition datasets available.

  • The ICDAR datasets: ICDAR stands for International Conference on Document Analysis and Recognition. It is held every 2 years and has produced a series of scene text datasets that have shaped the research community, for instance the ICDAR-2013 and ICDAR-2015 datasets.
  • MJSynth Dataset: This synthetic word dataset is provided by the Visual Geometry Group, University of Oxford. It consists of 9 million synthetically generated images covering 90k English words and includes training, validation, and test splits.
  • IIIT 5K-word dataset: This is one of the largest and most challenging recognition datasets available. It contains 5000 cropped word images from scene text and born-digital images. A lexicon of more than 0.5 million dictionary words is also provided with this dataset.
  • The Street View House Numbers (SVHN) Dataset: This dataset contains cropped images of house numbers in natural scenes, collected from Google Street View images. It is usually used for digit recognition. You can also use the MNIST handwritten digits dataset.

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Synthetic Data

Similar to text detection, the text recognition task is also not very rich when it comes to data. Thus, in order to further train or fine-tune the model, synthetic data can help. So, let’s discuss how to create synthetic data containing different fonts using Python. Here, we will use the famous PIL library. Let’s first import the libraries that will be used.
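A possible set of imports for this script (assuming Pillow is installed) is:

```python
import os
import random
import string

from PIL import Image, ImageDraw, ImageFont
```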

Then, we will create a list of characters that will be used in creating the dataset. This can be easily done using the string module as shown below.
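For instance, one possible character list is all English letters plus digits:

```python
# All English letters (upper and lower case) plus digits
char_list = list(string.ascii_letters + string.digits)
```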

Similarly, create a list of fonts that you want to use. Here, I have used 10 different types of fonts as shown below.
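The original post used 10 fonts; for illustration, here is a short list of placeholder .ttf paths that you would replace with the fonts available on your system:

```python
# Paths to the .ttf font files to be used (placeholders, replace with your own fonts)
fonts = [
    "fonts/Arial.ttf",
    "fonts/Times_New_Roman.ttf",
    "fonts/Courier_New.ttf",
    "fonts/Verdana.ttf",
    "fonts/Georgia.ttf",
]
```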

Now, we will generate images corresponding to each font. For each font and for each character in the char list, we will generate a word. For this, we first choose a random word size as shown below.
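For example (the length range is an arbitrary choice):

```python
# Pick a random word length; the range is arbitrary
word_size = random.randint(1, 9)
```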

Then, we will create a word of length word_size starting with the current character, as shown below.
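One possible way is to append randomly chosen characters after the current one:

```python
# Word starts with the current character, followed by word_size - 1 random characters
word = char + "".join(random.choice(char_list) for _ in range(word_size - 1))
```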

Now, we need to draw that word onto an image. For that, we will first create a font object of the given size. Here, I’ve used a font size of 14.
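Using PIL's ImageFont, that could look like this (font_path here stands for the current font file in the loop):

```python
# Create a PIL font object of size 14 for the current font file
font = ImageFont.truetype(font_path, 14)
```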

Now, we will create a new image of size (110, 20) with a white background (255, 255, 255). Then we will create a drawing context and draw the text at (5, 0) in black (0, 0, 0), as shown below.
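With PIL, a possible snippet for this step is:

```python
# White background image of width 110 and height 20
img = Image.new("RGB", (110, 20), color=(255, 255, 255))

# Drawing context, then draw the word at (5, 0) in black using the chosen font
draw = ImageDraw.Draw(img)
draw.text((5, 0), word, fill=(0, 0, 0), font=font)
```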

Finally, save the image and the corresponding text file as shown below.
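For example (the file naming scheme and output locations are illustrative):

```python
# img_name is assumed to be a unique file name such as "0.jpg"
# Save the image and append "image_name<TAB>word" to a labels file
img.save(os.path.join("images", img_name))
with open("labels.txt", "a") as f:
    f.write("{}\t{}\n".format(img_name, word))
```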

Below is the full code
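A possible version, putting the snippets above together, is shown here; the fonts, image size, font size, and output paths are illustrative and can be changed as needed.

```python
import os
import random
import string

from PIL import Image, ImageDraw, ImageFont

# Character set: English letters and digits
char_list = list(string.ascii_letters + string.digits)

# Placeholder font paths; replace with the .ttf files available on your system
fonts = ["fonts/Arial.ttf", "fonts/Times_New_Roman.ttf", "fonts/Courier_New.ttf"]

os.makedirs("images", exist_ok=True)
counter = 0

with open("labels.txt", "w") as labels_file:
    for font_path in fonts:
        font = ImageFont.truetype(font_path, 14)          # font object of size 14
        for char in char_list:
            # Random word length and a word starting with the current character
            word_size = random.randint(1, 9)
            word = char + "".join(random.choice(char_list) for _ in range(word_size - 1))

            # White 110x20 image, draw the word at (5, 0) in black
            img = Image.new("RGB", (110, 20), color=(255, 255, 255))
            draw = ImageDraw.Draw(img)
            draw.text((5, 0), word, fill=(0, 0, 0), font=font)

            # Save the image and record the ground-truth text
            img_name = "{}.jpg".format(counter)
            img.save(os.path.join("images", img_name))
            labels_file.write("{}\t{}\n".format(img_name, word))
            counter += 1
```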

Some of the generated images are shown below.

To make it more realistic and challenging, you can add some geometric transformations (such as rotation, skew, etc.), add some noise, or even change the background color.

Now, using any of the above datasets, we can further fine-tune our recognition model. That’s all for this blog. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Implementation of EAST

In the previous blog, we discussed the theory behind the EAST algorithm. If you remember, we stated that this algorithm is both accurate and efficient. So, in this blog, let’s find out. For this, we will first run the EAST algorithm using its GitHub repository, and then we will analyze the results. So, let’s get started. Here, I’m using a Linux system.

Clone the Repository

First, search “EAST Github” in the browser. You will find several EAST implementations, but in this blog, we will use the one provided by argman. So, open it and clone the repository. To clone the repository, you can either use git or download it as a zip file. To install git, you can run the following command.
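On a Debian/Ubuntu system, this would typically be:

```
sudo apt-get install git
```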

Once you have installed git, clone the repository using the following command.
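Assuming the argman repository mentioned above, the command would be:

```
git clone https://github.com/argman/EAST.git
```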

This will clone the repository into your system as shown below.

Compile lanms

As you might remember, in the previous blog we discussed that the EAST algorithm uses a Locality-Aware NMS (lanms) instead of the standard NMS. Now, you need to compile lanms. Why? Because this GitHub implementation contains the lanms code written in C++ (see the lanms folder). So, in order to make it work with Python, we need to generate an adaptor.so file. This can be done as follows.

First, we need to install the g++ compiler in order to compile the adaptor.cpp file. This can be done using the following command.
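On Debian/Ubuntu systems, the usual way is to install the build-essential package, which includes g++:

```
sudo apt-get install build-essential
```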

This contains the essential tools for building most other packages from source (e.g. C/C++ compiler, libc, and make).

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Next, open the __init__.py file present inside the lanms folder and comment out the if condition as shown below.

Again open the terminal and change the directory to the lanms folder. After this, run the make command as shown below. This will generate the required adaptor.so file in the lanms folder.
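Assuming the repository was cloned into a folder named EAST, the steps would be roughly:

```
cd EAST/lanms
make
```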

Test the model

Now, to test the model, we either need to train it first or find pre-trained weights, if available. Luckily, pre-trained weights are available; you can download them from here. These are trained on the ICDAR-2013 and ICDAR-2015 datasets.

After downloading the pre-trained weights, extract them and place them inside the EAST folder. Now, to test the model, open the terminal and change the directory to the EAST folder. Also, activate the virtual environment, if any. Then run the evaluation command with the arguments described below.

For the arguments, first, we need to specify the test images path via the “test_data_path” argument. Second, we need to specify the path of the recently downloaded checkpoints via the “checkpoint_path” argument. And lastly, we need to specify the output directory path via the “output_dir” argument, as shown below. The output directory will be created automatically if it is not present.
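With placeholder paths, the command looks something like this (the script and argument names follow the repository's eval.py; double-check them against the version you cloned):

```
python eval.py --test_data_path=/path/to/test_images/ \
               --checkpoint_path=/path/to/checkpoints/ \
               --output_dir=/path/to/output/
```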

This will run the EAST algorithm on the test images we provided. An output image is shown below.

In the next blog, we will explore different text detection datasets that are available. We will also learn how we can create our own text detection dataset. This will help us with training and fine-tuning our EAST model further. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition: Introduction and its Applications

Hello, and welcome to this series on Optical Character Recognition, also known as OCR for short. Most of you might already be familiar with the term OCR; if not, no worries, we will be discussing everything in detail in this series. So, in this blog, let’s start with an introduction to OCR, followed by some motivation for why you should invest your time in learning it. So, let’s get started.

What is OCR?

Optical character recognition is a method of converting the text present in images or scanned documents to a machine-readable format that can later be edited, searched, and used for further processing.

The term machine-readable format means the text in electronic form or simply the text that you can select, edit, process, etc. Let’s take an example to understand what this actually means.

Suppose we are given the below image. Clearly, as we can see, there is some text present in the image. But for a computer, this is nothing but an array of pixel values.

The computer doesn’t know whether the image contains text, a car, a bus, etc. We can’t select, edit, or do any further processing on the text. Thus, this is not machine-encoded text.

So, what the OCR system will do is digitize the printed text, that is, take this image as input and output a text file containing all the text present in the image. Now, you can do anything you want with the text.

So, now you know roughly what OCR is. In the next section, let’s try to understand why you should invest your time in learning it.

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Applications

Let’s take a few real-life examples where OCR has made our life easy. You might already have encountered these in your day-to-day life but might not have thought about how they work.

Automatic Data Entry

This is one of the most prominent applications of OCR. Earlier, people used to manually enter the details from business documents, invoices, passports, receipts, etc. But now, with the help of OCR, most of these tasks have been automated. Also, instead of managing colossal piles of paper documents, everything is now archived digitally.

For instance, in banks, instead of manually entering the cheque details, the cheque is first scanned and then OCR extracts all the useful information such as the account number, amount, etc., leading to faster processing. Similarly, at airports, your passport information is extracted from the machine-readable zone (MRZ), again speeding up processing.

Vehicle Number Plate Recognition

Almost everyone has seen or heard about this application. OCR is used to recognize the vehicle registration plate, which can then be used for vehicle tracking, toll collection, etc. This was invented in Britain in 1976 but became popular only after the 1990s.

Self-driving cars

Most of you might be wondering how and where OCR is used in self-driving cars. The answer is recognizing traffic signs. An autonomous car uses OCR to recognize traffic signs and act accordingly. Without this, a self-driving car would pose a risk to both pedestrians and other vehicles on the road.

Book Scanning

OCR is widely used in digitizing scanned documents. For instance, you might have heard about Project Gutenberg that tries to digitize and archive cultural works. Most of these items are available free of cost. Similarly, Google Books scans books, converts them to text using OCR, and stores them in its digital database.

For Visually Impaired persons

Here, we can use OCR to extract the text and then use text-to-speech to read the extracted text aloud. This approach was first used around 1976.

Your Personal Translator

Suppose you are traveling in a country whose language you don’t speak. You find a signboard that you are not able to understand. Obviously, you can ask someone, but OCR can also help you out in this situation. Just click a photo of that signboard, run OCR (constraint: the language must be known), extract the text, and then use Google Translate or any other API to translate it to your native language. Isn’t this cool!

These are just a few of many OCR applications. From these applications, we can see that, because of OCR, much of this work has now been automated, which helps save time, money, manpower, etc. Hope these applications have motivated you enough to learn OCR. From the next blog, we will start discussing how OCR works. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline: Text Detection

In the previous blogs, we discussed different pre-processing techniques such as noise removal, skew correction, etc. The main objective of this pre-processing step was to make the image suitable for the next pipeline components, such as text detection and recognition. Now, in this blog, let’s understand the text detection step in detail.

Text Detection

Text detection simply means finding the regions in the image where the text can be present. For instance, see the below image where green colored bounding boxes are drawn around the detected text.

While performing text detection, you may encounter two types of cases

  • Images with Structured text: This refers to images that have a clean/uniform background and regular fonts. The text is mostly dense, with a proper row structure and uniform text color. For instance, see the below image.
  • Images with Unstructured text: This refers to images with sparse text on a complex background. The text can have different colors, sizes, fonts, and orientations and can be present anywhere in the image. Performing text detection on these images is known as scene text detection. For instance, see the below image.

Now, if I ask which of the above two cases looks more challenging, the answer would obviously be scene text detection, due to the various complexities discussed above. And that’s why it is an active research topic in computer vision.

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

While performing text detection, you have 3 options. You can do

  • Character-by-Character detection
  • Word-by-Word detection
  • Line-by-Line detection

All three are shown below.

Nowadays, we mostly prefer doing word or line detection. This is because character detection is generally slower and somewhat more challenging compared to the other two.

Broadly, text detection methods can be classified into 2 categories:

  • Conventional methods
  • Deep-learning based methods

Conventional methods rely on manually designed features. For instance, Stroke Width Transform (SWT) and Maximally Stable Extremal Regions (MSER) based methods generally extract character candidates via edge detection or extremal region extraction; a small MSER sketch is given below. In deep learning based methods, on the other hand, the features are learned from the training data. These are generally better than the conventional ones, in terms of both accuracy and adaptability in challenging scenarios.
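Here is a rough sketch of extracting character candidates with OpenCV's MSER; the image path is a placeholder, and the filtering and grouping of candidates into words is omitted.

```python
import cv2

# Load the image and convert it to grayscale (path is a placeholder)
img = cv2.imread("scene_text.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect maximally stable extremal regions as character candidates
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)

# Draw a bounding box around each candidate region
for (x, y, w, h) in bboxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)

cv2.imwrite("mser_candidates.jpg", img)
```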

Further, the deep learning based methods can be classified into

  • Multi-step methods
  • Simplified pipeline

To understand these, take a look at the below image where the pipeline of several state-of-the-art text detection methods is shown. The first 3 methods (a,b,c) fall into the multi-step category (each box denotes 1 step) while the last 2 (d,e) are the ones with a simplified pipeline.

In this series, we will be mainly focusing on the methods with a simplified pipeline. By the way, the last 2 methods (d, e) shown above are known as the Connectionist Text Proposal Network (CTPN) and the Efficient and Accurate Scene Text Detector (EAST), respectively. Both of these are very famous text detection methods!

In the next blog, let’s discuss the EAST algorithm in detail. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline: Text Recognition

In the previous blogs, we covered the OCR text detection step. Now, it’s time to move on to the OCR’s next pipeline component, which is Text Recognition. So, let’s get started.

Text Recognition

As you might remember, in the text detection step, we segmented out the text regions. Now, it’s time to recognize what text is present in those segments. This is known as Text Recognition. For instance, see the below image where we have segments on the left and the recognized text on the right. This is what we want, i.e. recognize the text present in the segments.

So, what we will do is pass each segment one-by-one to our text recognition model, which will output the recognized text. In general, the text recognition step outputs a text file that contains each segment’s bounding box coordinates along with the recognized text. For instance, see the below image (right), which contains 3 columns, i.e. the segment name, the coordinates, and the recognized text.

Now, you may ask: why coordinates? This will become clear when we discuss Restructuring (the next step).

Similar to text detection, text recognition has also been a long-standing research topic in computer vision. Traditional text recognition methods generally consist of 3 main steps:

  • Image pre-processing
  • Character segmentation
  • Character recognition

That is, they mainly work at the character level. But when we deal with images having complex backgrounds, fonts, or other distortions, character segmentation becomes a really challenging task. Thus, to avoid character segmentation, two major techniques are adopted:

  • Connectionist Temporal Classification (CTC) based
  • Attention-based

In the next blog, let’s understand in detail what CTC is and how it is used in text recognition. Then we will move on to the attention-based algorithms. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline

In the previous blog, we discussed what OCR is, along with some real-life applications. But we didn’t get into the details of how OCR works. So, in this blog, let’s understand the general pipeline used by most OCR systems. Let’s get started.

OCR Pipeline

The general OCR pipeline is shown below.

OCR Pipeline

As you might have noticed, this is quite similar to how we humans recognize text. For instance, given an image containing text, we first try to locate the text and then recognize it. This is done so fast by our eye and brain combo that we hardly even notice it.

Now, let’s try to understand each pipeline component in detail, although it’s pretty clear from their names. Let’s take the following image as an example and see what happens at each step.

Test image for OCR

Image Pre-processing

If you have ever done any computer vision task, then you must know how important this pre-processing step is. It simply means making the image more suitable for further tasks. For instance, the input image may be corrupted with noise, or may be skewed or rotated. In any of these cases, the next pipeline components may give erroneous results and all your hard work goes in vain. Thus, it is always necessary to pre-process the image to remove such deformities.

As an example, I’ve corrupted the below image with some salt-and-pepper noise and also added some rotation. If this image is passed as is, it will give erroneous results in the further steps. So, before passing it on, we need to correct it for noise and rotation. The corrected image is shown on the right. Don’t worry, we will discuss in detail how this correction is done; a rough sketch of the idea is given after the image.

OCR image pre-processing
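As a preview, denoising and skew correction can be sketched with OpenCV as below. The file names are placeholders, the threshold assumes dark text on a light background, and the angle convention of minAreaRect varies across OpenCV versions, so treat this as a starting point rather than a finished routine.

```python
import cv2
import numpy as np

# Read the corrupted input image in grayscale (path is a placeholder)
img = cv2.imread("noisy_rotated.jpg", cv2.IMREAD_GRAYSCALE)

# Median filtering works well for salt-and-pepper noise
denoised = cv2.medianBlur(img, 3)

# Estimate the skew angle from the minimum-area rectangle around the text pixels
# (assumes dark text on a light background)
coords = np.column_stack(np.where(denoised < 128)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
if angle < -45:
    angle = -(90 + angle)
else:
    angle = -angle

# Rotate the image about its centre to undo the skew
h, w = denoised.shape
M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
corrected = cv2.warpAffine(denoised, M, (w, h),
                           flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

cv2.imwrite("corrected.jpg", corrected)
```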

Text Detection

Text detection, as is clear from the name, simply means finding the regions in the image where text can be present. This is clearly illustrated below. See how the green colored bounding boxes are drawn around the detected text regions.

Text detection

Text detection has been an active research topic in computer vision. Most of the text detection methods developed so far can be divided into conventional (e.g. MSER) and deep-learning based (e.g. EAST, CTPN, etc.). Don’t worry if you have never heard about these. We will be covering everything in detail in this series.

Text Recognition

In the previous step, we segmented out the text regions. Now, we will recognize what text is present in those segments. This is known as Text Recognition. So, what we will do is pass each segment one-by-one to our text recognition model, which will output the recognized text. Also, we keep track of each segment’s bounding box coordinates. This will be helpful when we do the restructuring.

In general, this step outputs a text file that contains each segment’s bounding box coordinates along with the recognized text. See the below image (right), which contains 3 columns, i.e. the segment name, the coordinates, and the recognized text.

Text recognition OCR

Similar to text detection, this has also been an active research topic in computer vision. Several approaches have been developed for text recognition. In this series, we will be focusing mainly on the deep-learning based approaches, which can be further divided into CTC-based and attention-based. Again, don’t worry if you haven’t heard about these terms. We will be discussing these in detail in this series.

Restructuring

In the last step, we got the recognized text along with its position in the input image. Now, it’s time to restructure it. Restructuring simply means placing the text (according to the coordinates) in the same layout it had in the input image: we simply iterate over the bounding box coordinates and place the recognized text accordingly. Take a look at the below image and compare the structure of the restructured and the original image; both look almost the same. A bare-bones sketch of the idea follows the image.

Restructuring OCR
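For instance, something along these lines, assuming each segment comes with its top-left corner and its recognized text (a real implementation would need a more robust line-grouping rule):

```python
def restructure(segments, line_tol=10):
    """segments: list of (x, y, text), where (x, y) is the top-left corner of the box."""
    segments = sorted(segments, key=lambda s: s[1])          # top-to-bottom
    lines, current, last_y = [], [], None
    for x, y, text in segments:
        if last_y is not None and y - last_y > line_tol:     # start a new text line
            lines.append(current)
            current = []
        current.append((x, text))
        last_y = y
    if current:
        lines.append(current)
    # Order the words in each line left-to-right and join everything back together
    return "\n".join(" ".join(t for _, t in sorted(line)) for line in lines)


print(restructure([(5, 8, "Name:"), (60, 10, "John"), (5, 40, "Age:"), (60, 42, "27")]))
# Name: John
# Age: 27
```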

Most of you might be wondering why we need to do this, or what the use of restructuring is. So, let’s take a simple example to understand this. Suppose we want to extract the name from the below image.

To do this, we can simply tell the computer to extract the words following the word “Name:”. This can be easily done using regex or any NLP technique, as sketched below. But what if you haven’t restructured the text? In that case, this becomes cumbersome: it would involve iterating over the coordinates, first finding the coordinates of the word “Name:”, then finding the coordinates of the next word lying on the same line, and then extracting the corresponding word. And if the name contains 2 or 3 words, this would take even more effort.
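On restructured text, a one-line regex does the job; the sample string below is made up:

```python
import re

# Hypothetical restructured output of the OCR pipeline
restructured = "Name: John Doe\nDate of Birth: 01-01-1990"

match = re.search(r"Name:\s*(.+)", restructured)
if match:
    print(match.group(1))   # -> John Doe
```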

Hope you understood this; if not, no worries, it will become clearer when we discuss this in detail later.

So, this completes the OCR pipeline. Now, you can do anything with the extracted text. You can search, edit, translate it, or even convert it to speech. From the next blog, we will start discussing each of these pipeline components in detail. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.