Category Archives: Computer Vision

Multi-Class Classification

In the previous blog, we discussed the binary classification problem where each image can contain only one class out of two classes. So, in this blog, we will extend this to the multi-class classification problem. In multi-class problem, we classify each image into one of three or more classes. So, let’s get started.

Here, we will use the CIFAR-10 dataset, developed by the Canadian Institute for Advanced Research (CIFAR). The CIFAR-10 dataset consists of 60000 (32×32) color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The classes are completely mutually exclusive. Below are the classes in the dataset, as well as 10 random images from each class.

Source: CIFAR-10

1. Load the Data

CIFAR-10 dataset can be downloaded by using any of the two methods:

  • Using Keras builtin datasets
  • From the official website

Method-1

Downloading using the Keras builtin datasets is pretty straightforward and simple. It’s already transformed into the shape appropriate for the CNN input. No headache, just write one line of code and you are done.

Method-2

The data can also be downloaded from the official website. But the only thing is that it is not in the standard format that can be inputted directly to the model. Let’s see how the dataset is arranged.

The dataset is broken into 5 files so as to prevent your machine from running out of memory. Each file contains a dictionary of data and the corresponding labels. Data is a 10000×3072 array where 10000 is the number of images and 3072 are the pixel values in row-major order. So, the first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. You need to convert it into a (32,32) color image.

Steps:

  • First, unpickle all the train and test files
  • Then convert the image format to (width x height x num_channel)

Then append all the unpickled train files into one array.

Split the data into train and validation

Because the training data contains images in the random order thus simple splitting will be sufficient. Another way is to take some % of images from each of the 5 train files to constitute a validation set.

To make sure that this splitting leads to the uniform proportion of examples for each class, we can plot the counts of each class in the validation dataset. Below is the bar plot. Looks like all the classes are uniformly distributed in the validation set.

Model Architecture

Since the images contain a diverse amount of information, we will be needing a bigger network. Bigger the network more will be the chances of overfitting, So, to prevent this we may need to apply some regularization techniques.

Data Augmentation

Fit the model using the fit_generator

Let’s visualize the training events using the History() callback.

That seems pretty nice. You can play with the architecture, optimizers and other hyperparameters to obtain even more accuracy. Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Multi-Label Classification

In the previous blogs, we discussed binary and multi-class classification problems. Both of these are almost similar. The basic assumption underlying these two problems is that each image can contain only one class. For instance, for the dogs vs cats classification, it was assumed that the image can contain either cat or dog but not both. So, in this blog, we will discuss the case where more than one classes can be present in a single image. This type of classification is known as Multi-label classification. Below picture explains this concept beautifully.

Source: cse-iitk

Some of the most common techniques for solving multi-label classification problems are

  • Problem Transformation
  • Adapted Algorithm
  • Ensemble approaches

Here, we will only discuss only Binary Relevance, a method that falls under the Problem Transformation category. If you are curious about other methods, you can read this amazing review paper.

In binary relevance, we try to break the problem into a number of binary classification problems. So, now for each class available, we will ask if it is present in the image or not. As we already know that the binary classification uses ‘sigmoid‘ as the last layer activation function and ‘binary_crossentropy‘ as the loss function. So, here we will also use the same. Rest all things are the same.

Now, let’s take a dataset and see how to implement multi-label classification.

Problem Definition

Here, we will take the most common Movie Genre classification based on the poster images problem. Because a movie can belong to more than one genre, for instance, comedy, romance, etc. and hence is a multi-label classification problem.

Dataset

You can download the original dataset from here. This contains two files.

  • Movie_Poster_Dataset.zip – The poster images
  • Movie_Poster_Metadata.zip – Metadata of each poster image like ID, genres, box office, etc.

To prepare the dataset, we need images and corresponding genre information. For this, we need to extract the genre information from the Movie_Poster_Metadata.zip file corresponding to each poster image. Let’s see how to do this.

Note: This dataset contains some missing items. For instance, check the “1982” folder in the Movie_Poster_Dataset.zip and Movie_Poster_Metadata.zip. The number of poster images and the corresponding genre information is missing for some movies. So, we need to perform EDA and remove these files.

Steps to perform EDA:

  1. First, we will extract the movie name and corresponding genre information from the Movie_Poster_Metadata.zip file and create a Pandas dataframe using these.
  2. Then we will loop over the poster images in the Movie_Poster_Dataset.zip file and check if it is present in the dataframe created above. If the poster is not present, we will remove that movie from the dataframe.

These two steps will ensure that we are only left with movies that have poster images and genre information. Below is the code for this.

Because the encoding of some files is different, that’s why 2 for loops. Below are the steps performed in the code.

  • First, open the metadata file
  • Read line by line
  • Extract the information corresponding to the ‘Genre’ and ‘imdbID’
  • Append them into the list and create a dataframe

Now for the second step, we first append all the poster images filenames in the list.

Then check if the name is present in the dataframe or not. If not, we will remove the rows from the dataframe or create a new dataframe.

Be sure that we have no duplicates in the dataframe.

So, finally, we are ready with our cleaned dataset with 8052 images containing overall 25 classes. The dataframe is shown below.

Format 1

One can also convert this dataframe into the common format as shown below

Format 2

This can be done using the following code.

In this post, we will be using Format 1. You can use any. Here, we will be using the Keras flow_from_dataframe method. For this, we need to place all the images under one directory. Currently, all the images are in separate folders such as 1980, 1981, etc. Below is the code that places all the poster images in a single folder ‘original_train‘.

Model Architecture

Since this is a sparse multilabel classification problem, accuracy is not a good metric for this. The reason for this is shown below.

if the predicted output was [0, 0, 0, 0, 0, 1] and the correct output was [0, 0, 0, 0, 0, 0], my accuracy would still be 5/6.

So, you can use other metrics like precision, recall, f1 score, hamming loss, top_k_categorical_accuracy, etc.

Here, I’ve used both to show how accuracy instantly reaches 90+ from the starting epoch and thus is not a correct metric.

flow_from_dataframe()

Here, I split the data into training and validation sets using the validation_split argument of ImageDataGenerator. You can read more about the ImageDataGenerator here.

Below are some of the poster images all resized into (400,300,3).

You can also check which labels are assigned to which class using the following code.

This prints a dictionary containing class names as keys and labels as values.

Let’s start training…

See how accuracy is reaching 90+ within few epochs. As stated earlier this is not a good evaluation metric for multi-label classification. On the other hand, top_k_categorical_accuracy is showing us the true picture.

Clearly, we are doing a pretty decent job. Considering the fact that training data is small and the complexity of the problem is large(25 classes). Moreover, some classes like comedy, etc dominate the training data. Play with the model architecture and other hyperparameters and check how the accuracy varies.

Prediction time

For each image, let’s predict the top three predicted classes. Below is the code for this.

The actual label for this can be found out as

You can see that our model is doing a decent job considering the complexity of the problem

Let’s try another example “tt0465602.jpg“. For this the predicted labels are

By looking at the poster most of us will predict the labels as predicted by our algorithm. Actually, these are pretty close to the true labels that are [Action, Comedy, Crime].

That’s all for multi-label classification problem. Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Binary Classification

In this blog, we will learn how to perform binary classification using Convolution Neural Networks. Here, we will be using the classic dogs vs cats dataset, where we have to classify an image as belonging to one of these two classes. So, let’s get started.

Downloading the Dataset

This dataset was made available as a part of the Kaggle competition in 2013. You can download it from here. This dataset contains 25,000 labeled images of dogs and cats in the train folder and 12,500 unlabeled images in the test folder. The size of the images in the dataset is not the same. Some samples from the dataset are shown below

Since the competition is now closed, we can’t submit the test predictions to the kaggle. Thus, to know how good we are doing, we will make the test data from 25,000 labeled images.

Preparing the Data

Here, we will split the train folder into 20,000 for training, and 2500 each for validation and testing. For this, we will create 3 folders corresponding to each train, validation and test set. In these folders, we will create 2 sub-folders as cats and dogs. You can do this manually but here we will be using the Python os module. The code for creating folders and sub-folders is shown below

The above code will create folders and sub-folders in the original path specified above. Now, we will put the images in these folders. The below code places 20,000 images in train, 2500 each in the validation and test folder created above.

Now, let’s display a sample image from say “train_cats_dir”. This is done using the following code.

Data Pre-processing

The data must be processed in an appropriate form before feeding in the neural network. This includes changing the data into numpy arrays, normalizing the values between 0 and 1 or any other suitable range, etc. This can be easily done using the keras ImageDataGenerator class. This is shown in the code below

Here, we will use the flow_from_directory method to generate batches of data.

Build Model

Since this is a binary classification problem, we use the sigmoid activation function in the last layer. The model architecture we will use is shown below.

For the compilation step, we will use the Adam optimizer with the binary crossentropy loss.

Callbacks

To have some control over the training, one must use callbacks. Here, we will be using ModelCheckpoint callback which save the model weights whenever the validation accuracy improves.

Fit Model

Visualize Training

Let’s visualize how the loss and accuracy vary during the training process. This is done using the History() object as shown below

Clearly, our model starts overfitting after 8 epoch. We know that to prevent overfitting, we can

  • Perform Data Augmentation
  • Use Dropout
  • Reduce the capacity of the network
  • Use Regularization etc.

So, let’s use Data Augmentation and Dropout and see how our model performs.

Data Augmentation

In this, we produce more examples from the existing examples by various operations such as rotating, translating, flipping, etc. Fortunately, in Keras, all these transformations can be performed using ImageDataGenerator class. Below is the code for this

Dropout

In this, we randomly turn off some neurons (setting to zero) during training. This can be easily implemented using the Keras Dropout layer. Here, I’ve just added a single Dropout layer before the Dense layer, with a dropout rate of 50%. Rest all the model is same as above.

Just change the train_datagen and add Dropout layer. Then, train the model the same as we did above. Let’s visualize the training process

Clearly, by looking at the plots, we are no longer overfitting. Just by adding a Dropout layer and augmentation, we have increased the accuracy from 87 to 95%. You can further improve the accuracy by using a pre-trained model and other regularization methods.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline: Text Detection and Segmentation

One of the most important module in optical character recognition pipeline is the text detection and segmentation which is also called as text localization. In the previous blog, we have seen various techniques to pre-process the input image which can help in improving our OCR accuracy. In this blog, we will learn how to localize text in an image, so that we can crop them out and then feed to our text recognition module to predict text in it.

What is text detection and segmentation?

It is the process of localizing all occurrence of text present in the image into meaningful units such as characters, words, and text lines. Then make segments of each of these units.

Character-based detection first detects individual characters and then group them into words. One way to do this is to locate characters by classifying Extremal Regions(MSER) and then groups the detected characters by an exhaustive search method.

Word-based detection usually works in a similar fashion as object detection. You can use Faster R-CNN and YOLO algorithms to perform this.

Text-line based detection detects text lines and then break it into individual words.

There are basically two types of text images that are fed to the text recognition module as inputs. One is scanned documents and others are natural scene text like street signs, storefront texts, etc.

Scanned Documents

Scanned documents generally have hundreds or thousands of words in it. We can apply deep neural networks like faster R-CNN and YOLO to localize words present in the documents. But sometimes these may not be able to localize all text present in the images because these algorithms are generally trained to detect less number of objects in the image. In that case, we need to apply some post-processing after deep nets to recognize remaining texts.

Another OpenCV method which we can be used for scanned documents is Maximally Stable Extremal Regions(MSER) using OpenCV.

MSER is a method that is used for blob detection in images. Using this method we can get the coordinates of the text regions and then we can generate the bounding boxes around each word in the image. Through which we can get the required input images to our text recognition module.

Natural Scenes

Natural scenes contain a lesser number of words in it but consist of other problems like distortions, occlusions, directional blur, cluttered background, etc. To overcome these problems we need to develop some deep learning algorithm that is mainly focused on natural scene texts ignoring above distortions. There are some robust open source algorithms available like EAST, CTPN, TextBoxes++, PixelLink and etc. These algorithms can also be used for localizing texts in the scanned documents but then you need to do some post processing to detect all text present in the image as I have mentioned earlier.

Source

Till now we have seen what is text segmentation and different algorithms to localize texts in an image. In the next blog, we will deep dive into these algorithms and figure out how we can implement it in our OCR pipeline.

Next Blog: Optical Character Recognition Pipeline: Text Detection and Segmentation Part-II

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline: Text Detection and Segmentation Part-II

In the last blog, we have seen what is text detection and different types of algorithms to perform it, In this blog, we will learn more about text detection algorithms.

Efficient and Accurate Scene Text Detector(EAST)

It is a deep learning text detection method which has two stages one is fully convolutional network(FCN) and second is non-max suppression(NMS) merging stage. In FCN it uses U-shape network which directly produces text regions either word level or text line level. Here is the diagram of FCN used in the algorithm.

U-shape FCN uses features from different layers of PVANet and then merge them to produce the outputs. The yellow boxes are different layers of PVANet and green boxes are merging layers of feature extracted from PVANet. The reason behind this merging branch is to produce outputs for both small word regions and large word regions. Low-level features will help in finding small word regions and high-level features will help in finding large word regions. This network will output geometries either in the form of RBOX(containing 5 values of which 4 are top and left coordinate, width, height and one is rotation angle) or QUAD( 4 coordinates of a rectangle) with one score map to tell about the confidence level prediction of text in it.

In second, NMS merging stage, it uses thresholding to exclude out overlapping geometries and produce the most accurate geometries for the text regions.

To implement it in our OCR pipeline, we can use it’s GitHub Repository. To make it workable use the following steps:

  1. Clone the repository into your directory: ” git clone https://github.com/argman/EAST.git”
  2. Download its pretrained model and put inside EAST directory.
  3. Before testing it you need to compile the lanms.
  4. To test this model, go to your EAST directory and then run following command from terminal:

You can also train this model with your dataset either from scratch or use pre-trained model provided earlier. To train this model you need to provide dataset path and dataset should consist of training images with corresponding text file which will have coordinates of text present in the image.

Connectionist Text Proposal Network(CTPN)

CTPN is a deep learning method that accurately predicts text lines in a natural image. It is an end to end trainable model consists of both CNN and RNN layers. In general, the length of a text line varies frequently. To solve this problem authors of this paper have considered text lines as a sequence of fine-scale text proposals, where each proposal are having a fixed width of 16 pixels with varying height. Let’s see the below image.

In the above figure, each vertical rectangular box is a fine text proposal. To go through model’s architecture see below figure:

The input image is being sent to VGG-16 model. Features output from conv_5 layer(the layer just before fully connected layers) of VGG-16 model is taken. A sliding window of size 3X3 is moved over VGG-16 output features and then fed sequentially to RNN network which consists of 256D bi-directional LSTM. This LSTM layer is connected to 512D fully connected layer which will next produce the outputs.

Now see the generation of output using this algorithm.

  • This algorithm uses anchor boxes to detect the text of different height. Let say we use k anchor boxes then output will consist of three main parts.
  • One is 2k vertical coordinates where each anchor box have its y coordinate (center position of box) and height of anchor box.
  • Second 2k text/non-text scores and,
  • third is k side refinement offset.
  • Here they have used 10 anchor boxes of varying height between 11 to 273 pixels. For this they have fixed the horizontal location and predicted only the vertical heights.
  • On the basis of text/non-text scores, sequential text proposal are merged and text-lines are formed. Side refinement offsets are used to refine the two end points of a text line.

To implement it in our OCR pipeline, we can use it’s GitHub Repository. To make it workable use the following steps:

  • Clone the repository into your directory: ” git clone https://github.com/eragonruan/text-detection-ctpn.git”
  • Go to “text-detection-ctpn-banjin-dev” directory
  • Run following command one by one:
  • Download pretrained checkpoint from google drive
  • Extract it and put checkpoints_mlt/ in text-detection-ctpn/
  • Now put your text file in data/demo and output will be in data/res
  • Now run the following command to check the outputs

You can also train this model using your own data, just follow the steps provide in GitHub Repository.

A Single Shot Oriented Scene Text Detector(TextBoxes++)

It is an end-to-end trainable fast scene text detector which can even detect oriented text present in the image. It does not require any post processing except non-maximum suppression. The basic idea is taken from the object detection algorithm SSD(single shot detector). SSD aims to predict general objects in an image but when it comes for text detection it fails. To improve this on text dataset TextBoxes++ have been introduced. Let’s see the model’s architecture:

First 13 layers are from VGG16 model. Then 2 fully connected layers of VGG-16 are converted into convolution layers which are followed by 8 convolution layers. Finally, 6 Text-Box layers are connected to 6 different intermediate convolution layers of the model. These 6 Text-Box layers are output layer and at test time non-max separation is applied to merge the result of these 6 to predict the best ones.

Text-Box layers are the key component of TextBoxes++. These are also convolutional layer which predicts both presences of text and bounding box coordinates. It includes both oriented bounding boxes and minimum horizontal boxes. Text-Box layers are designed to tackle the problem of variable length words.

You can find it’s GitHub Repository here. In GitHub they have also implemented CRNN(convolution recurrent neural network) to recognize text detected by the TextBoxes++. To implement it, you can follow their GitHub directions. Here are some results of TextBoxes++.

Source

That’s enough for text detection, in the next blog, we will learn about text recognition. Hope you enjoy reading.

Next Blog: Optical Character Recognition Pipeline: Text Recognition

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline: Text Recognition

In the previous blogs, we covered the OCR text detection step. Now, it’s time to move on to the OCR’s next pipeline component, which is Text Recognition. So, let’s get started.

Text Recognition

As you might remember, in the text detection step, we segmented out the text regions. Now, it’s time to recognize what text is present in those segments. This is known as Text Recognition. For instance, see the below image where we have segments on the left and the recognized text on the right. This is what we want, i.e. recognize the text present in the segments.

So, what we will do is, pass each segment one-by-one to our text recognition model that will output the recognized text. In general, the Text Recognition step outputs a text file that contains each segment’s bounding box coordinates along with the recognized text. For instance, see the below image(right) that contains 3 columns i.e. the segment name, coordinates, and the recognized text.

Now, you may ask Why coordinates? This will become clear when we will discuss Restructuring (the next step).

Similar to text detection, text recognition has also been a long-standing research topic in computer vision. Traditional text recognition methods generally consist of 3 main steps

  • Image pre-processing
  • character segmentation
  • character recognition

That is they mainly work at a character level. But when we deal with images having a complex background, font, or other distortions, character segmentation becomes a really challenging task. Thus, to avoid character segmentation, two major techniques are adopted

  • Connectionist Temporal Classification (CTC) based
  • Attention-based

In the next blog, let’s understand in detail, what is CTC and how it is used in Text Recognition. Then we will move to the attention-based algorithms. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Creating a CRNN model to recognize text in an image (Part-1)

In the earlier blogs, we learned various stages of optical character recognition pipeline. In this blog, we will create a convolutional recurrent neural network with CTC (Connectionist Temporal Classification) loss to implement our recognition model.

We will use the following steps to create our text recognition model.

  • Collecting Dataset
  • Preprocessing Data
  • Creating Network Architecture
  • Defining Loss function
  • Training model
  • Decoding outputs from prediction

Dataset

In this blog, we will use data provided by Visual Geometry Group. This is a huge dataset total of 10 GB images. Here I have used only 135000 images for the training set and 15000 images for validation dataset. This data contains text image segments which look like images shown below:

To download the dataset either you can directly download from this link or use the following commands to download the data and unzip.

Preprocessing

Now we are having our dataset, to make it acceptable for our model we need to use some preprocessing. We need to preprocess both the input image and output labels. To preprocess our input image we will use followings:

  • Read the image and convert into a gray-scale image
  • Make each image of size (128,32) by using padding
  • Expand image dimension as (128,32,1) to make it compatible with the input shape of architecture
  • Normalize the image pixel values by dividing it with 255.

To preprocess the output labels use the followings:

  • Read the text from the name of the image as the image name contains text written inside the image.
  • Encode each character of a word into some numerical value by creating a function( as ‘a’:0, ‘b’:1 …….. ‘z’:26 etc ). Let say we are having the word ‘abab’ then our encoded label would be [0,1,0,1]
  • Compute the maximum length from words and pad every output label to make it of the same size as the maximum length. This is done to make it compatible with the output shape of our RNN architecture.

In preprocessing step we also need to create two other lists: one is label length and other is input length to our RNN. These two lists are important for our CTC loss( we will see later ). Label length is the length of each output text label and input length is the same for each input to the LSTM layer which is 31 in our architecture.

Note: For more details on the Optical Character Recognition , please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Following is the code for our preprocessing step:

Now you might have got some feeling about the training and validation data generation for our recognition model. In the next blog, we will use this data to train and test our neural network.

Next Blog: Creating a CRNN model to recognize text in an image (Part-2)

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Creating a CRNN model to recognize text in an image (Part-2)

In the previous blog, we have seen how to create training and validation dataset for our recognition model( Download and preprocess ). In this blog, we will create our model architecture and train it with the preprocessed data.

You can find full code here.

Model = CNN + RNN + CTC loss

Our model consists of three parts:

  1. The convolutional neural network to extract features from the image
  2. Recurrent neural network to predict sequential output per time-step
  3. CTC loss function which is transcription layer used to predict output for each time step.

Model Architecture

Here is the model architecture that we used:

This network architecture is inspired by this paper. Let’s see the steps that we used to create the architecture:

  1. Input shape for our architecture having an input image of height 32 and width 128.
  2. Here we used seven convolution layers of which 6 are having kernel size (3,3) and the last one is of size (2.2). And the number of filters is increased from 64 to 512 layer by layer.
  3. Two max-pooling layers are added with size (2,2) and then two max-pooling layers of size (2,1) are added to extract features with a larger width to predict long texts.
  4. Also, we used batch normalization layers after fifth and sixth convolution layers which accelerates the training process.
  5. Then we used a lambda function to squeeze the output from conv layer and make it compatible with LSTM layer.
  6. Then used two Bidirectional LSTM layers each of which has 128 units. This RNN layer gives the output of size (batch_size, 31, 63). Where 63 is the total number of output classes including blank character.

Let’s see the code for this architecture:

Loss Function

Now we have prepared model architecture, the next thing is to choose a loss function. In this text recognition problem, we will use the CTC loss function.

CTC loss is very helpful in text recognition problems. It helps us to prevent annotating each time step and help us to get rid of the problem where a single character can span multiple time step which needs further processing if we do not use CTC. If you want to know more about CTC( Connectionist Temporal Classification ) please follow this blog.

Note: For more details on the Optical Character Recognition , please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

A CTC loss function requires four arguments to compute the loss, predicted outputs, ground truth labels, input sequence length to LSTM and ground truth label length. To get this we need to create a custom loss function and then pass it to the model. To make it compatible with our model, we will create a model which takes these four inputs and outputs the loss. This model will be used for training and for testing we will use the model that we have created earlier “act_model”. Let’s see the code:

Compile and Train the Model

To train the model we will use Adam optimizer. Also, we can use Keras callbacks functionality to save the weights of the best model on the basis of validation loss.

In model.compile(), you can see that I have only taken y_pred and neglected y_true. This is because I have already taken labels as input to the model earlier.

Now train your model on 135000 training images and 15000 validation images.

Test the model

Our model is now trained with 135000 images. Now its time to test the model. We can not use our training model because it also requires labels as input and at test time we can not have labels. So to test the model we will use ” act_model ” that we have created earlier which takes only one input: test images.

As our model predicts the probability for each class at each time step, we need to use some transcription function to convert it into actual texts. Here we will use the CTC decoder to get the output text. Let’s see the code:

Here are some results from the trained model:

Pretty good Yeah! Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline: Generating Dataset

The first step to create any deep learning model is to generate the dataset. In continuation of our optical character recognition pipeline, in this blog, we will see how we can get our training and test data.

In our OCR pipeline first, we need to get data for both segmentation and recognition(text). For the segmentation part, data will consist of images and corresponding files containing coordinates for words present in the image. Let’s see an example.

For recognition part data will consist of images and their corresponding text files. Here segmented images will contain a single word.

Image and Text

Open Source Dataset:

There are some open source dataset available for our pipeline. For the segmentation part here are some useful open source datasets.

Now let’s see some of the open source dataset for text recognition(images and their corresponding texts)

Synthetic Data:

In some cases, training your OCR model with synthetic data can also be useful. You can create your own synthetic data using some python script. You can also add some geometric transformation to simulate the real world distortion into the data. For an example here is a script to generate synthetic data for text recognition:

In the above code, I have generated English words images and corresponding text files using different font types with a font size of 14. Segmented images will look like below:

Five Segmented Images generated from above code

Annotation Tools and Manual Data:

Another way to create segmentation text dataset is by using annotation tools. In this case, you need to collect images manually or you can get images from the internet, then you need to manually annotate text in the images (Bounding Boxes). Annotation tools like labelimg can work in this case.

That’s all to generate the dataset. In the next blog, we will see image preprocessing steps to apply to these datasets. Hope you enjoy reading.

Next Blog: Optical Character Recognition Pipeline: Image Preprocessing

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline

In the previous blog, we discussed what is OCR with some real-life applications. But we didn’t get into the detail of how the OCR works. So, in this blog, let’s understand the general pipeline used by most OCR systems. Let’s get started.

OCR Pipeline

The general OCR pipeline is shown below.

OCR Pipeline

As you might have noticed, this is almost similar to how we humans recognize the text. For instance, given an image containing text, we first try to locate the text and then recognize it. This is done so fast by our eye and brain combo that we hardly even notice it.

Now, let’s try to understand each pipeline component in detail, although, it’s pretty clear from their names. Let’s take the following image as an example and see what happens at each component.

Test image for OCR

Image Pre-processing

If you have ever done any computer vision task, then you must know how important this pre-processing step is. This simply means making the image more suitable for further tasks. For instance, the input image may be corrupted with noise or is skewed or rotated. In any of these cases, the next pipeline components may give erroneous results and all your hard work goes in vain. Thus, it is always necessary to pre-process the image to remove such deformities.

As an example, I’ve corrupted the below image with some salt and pepper noise and also added some rotation. If this image is passed as it is, this will give erroneous results in further steps. So, before passing we need to correct it for noise and rotation. This corrected image is shown on the right. Don’t worry, we will discuss in detail how this correction is done.

OCR image pre-processing

Text Detection

Text detection, as clear from the name, simply means finding the regions in the image where text can be present. This is clearly illustrated below. See how the green color bounding boxes are drawn around the detected text regions.

Text detection

Text Detection has been an active research topic in computer vision. Most of the text detection methods developed so far can be divided into conventional (e.g. MSER) and deep-learning based (e.g. EAST, CTPN, etc.). Don’t worry, if you have never heard about these. We will be covering everything in detail in this series.

Text Recognition

In the previous step, we segmented out the text regions. Now, we will recognize what text is present in those segments. This is known as Text Recognition. So, what we will do is, pass each segment one-by-one to our text recognition model that will output the recognized text. Also, we keep a track of each segment bounding box coordinates. This will be helpful while we do restructuring.

In general, this step outputs a text file that contains each segment’s bounding box coordinates along with the recognized text. See the below image(right) that contains 3 columns i.e. the segment name, coordinates, and the recognized text.

Text recognition OCR

Similar to text detection, this has also been an active research topic in computer vision. Several approaches have been developed for text recognition. In this series, we will be focussing mainly on the deep-learning based approaches which can be further divided into CTC-based and Attention-based. Again, don’t worry if you haven’t heard about these terms. We will be discussing these in detail in this series.

Restructuring

In the last step, we got the recognized text along with its position in the input image. Now, it’s time to restructure it. Restructuring simply means placing the text (according to the coordinates) similar to how it was in the input image. Simply iterate over each bounding box coordinate and put the recognized text. Take a look at the below image. Compare the structure of both the restructured and the original image. Both look almost similar.

Restructuring OCR

Most of you might be wondering why do we need to do this or what’s the use of restructuring. So, let’s take a simple example to understand this. Suppose we want to extract the name from the below image.

To do this, we can simply tell the computer to extract the words following the word “Name:”. This can be easily done using Regex or any NLP technique. But what if you haven’t restructured the text. In that case, this would become cumbersome as it would involve iterating over the coordinates, first finding the word “Name:” coordinates then finding the next word coordinates that lie in the same line, and then extract the corresponding word. And if the name contains 2 or 3 words, this would take even more effort.

Hope you understand this, but if not, no worries, this will become more clear when we will discuss this in detail later.

So, this completes the OCR pipeline. Now, you can do anything with the extracted text. You can search, edit, translate it, or even convert it to speech. From the next blog, we will start discussing each of these pipeline components in detail. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.