Tag Archives: text recognition

Text Recognition Datasets

Open Source Datasets

Below are some of the open source text recognition datasets available.

The ICDAR datasets: ICDAR stands for International Conference for Document Analysis and Recognition. This is held every 2 years. They brought about a series of scene text datasets that have shaped the research community. For instance, ICDAR-2013 and ICDAR-2015 datasets.
MJSynth Dataset: This synthetic word dataset is provided by the Visual Geometry Group, University of Oxford. This dataset consists of synthetically generated 9 million images covering 90k English words and includes the training, validation, and test splits used in our work.
IIIT 5K-word dataset: This is one of the most challenging and largest recognition datasets available. The dataset contains 5000 cropped word images from Scene Texts and born-digital images. They also provide a lexicon of more than 0.5 million dictionary words with this dataset.
The Street View House Numbers (SVHN) Dataset: This dataset contains cropped images of house numbers in natural scenes collected from Google View images. This dataset is usually used in digit recognition. You can also use MNIST handwritten dataset.

Note: For more details on the Optical Character Recognition , please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Synthetic Data

Similar to text detection, when it comes to data, the text recognition task is also not so rich. Thus, in order to further train or fine-tune the model, synthetic data can help. So, let’s discuss how to create synthetic data containing different fonts using Python. Here, we will use the famous PIL library. Let’s first import the libraries that will be used.

import random
import string
import PIL
from PIL import ImageFont
from PIL import Image
from PIL import ImageDraw
from tqdm import tqdm

import random

import string

import PIL

from PIL import ImageFont

from PIL import Image

from PIL import ImageDraw

from tqdm import tqdm

Then, we will create a list of characters that will be used in creating the dataset. This can be easily done using the string library as shown below.

# create a list of characters to be used in creating dataset
char_list = []
for char in string.ascii_letters:
    char_list.append(char)

# create a list of characters to be used in creating dataset

char_list = []

for char in string.ascii_letters:

char_list.append(char)

Similarly, create a list of fonts that you want to use. Here, I have used 10 different types of fonts as shown below.

# create font list
font_lst = ['arial', 'arialbd', 'times', 'timesbd', 'timesi','ariblk', 'arialbd', 'arialbi', 'ariali', 'timesbi']

1 2	# create font list font_lst = ['arial', 'arialbd', 'times', 'timesbd', 'timesi','ariblk', 'arialbd', 'arialbi', 'ariali', 'timesbi']

Now, we will generate images corresponding to each font. Here, for each font, for each character in the char list, we will generate words. For this, first we choose a random word size as shown below.

word_size = random.randrange(0,10)

1	word_size = random.randrange(0,10)

Then, we will create a word of length word_size and starting with the current character as shown below.

# create word starting with the current character
char_list_copy = char_list.copy()
char_list_copy.remove(char_list[i])
new_word = char_list[i]
for _ in range(word_size):
    new_word +=random.choice(char_list_copy)

# create word starting with the current character

char_list_copy = char_list.copy()

char_list_copy.remove(char_list[i])

new_word = char_list[i]

for _ in range(word_size):

new_word +=random.choice(char_list_copy)

Now, we need to draw that word on to the image. For that, first we will create a font object for a font of the given size. Here, I’ve used a font size of 14.

font = ImageFont.truetype(fonts+".ttf",14)

1	font = ImageFont.truetype(fonts+".ttf",14)

Now, we will create a new image of size (110,20) with white color (255,255,255). Then we will create a drawing context and draw the text at (5,0) with black color(0,0,0) as shown below.

img=Image.new("RGBA", (110,20),(255,255,255))
draw = ImageDraw.Draw(img)
draw.text((5, 0),new_word,(0,0,0),font=font)

img=Image.new("RGBA", (110,20),(255,255,255))

draw = ImageDraw.Draw(img)

draw.text((5, 0),new_word,(0,0,0),font=font)

Finally, save the image and the corresponding text file as shown below.

img.save('english/'+new_word+".png")

# Save the word in the text file
with open('english/'+new_word+'.txt', 'w', encoding = 'utf8') as txt_file:
    txt_file.write(new_word)

img.save('english/'+new_word+".png")

# Save the word in the text file

with open('english/'+new_word+'.txt', 'w', encoding = 'utf8') as txt_file:

txt_file.write(new_word)

Below is the full code

import random
import string
import PIL
from PIL import ImageFont
from PIL import Image
from PIL import ImageDraw
from tqdm import tqdm

# create a list of characters to be used in creating dataset
char_list = []
for char in string.ascii_letters:
    char_list.append(char)
    
# create font list
font_lst = ['arial', 'arialbd', 'times', 'timesbd', 'timesi','ariblk', 'arialbd', 'arialbi', 'ariali', 'timesbi']  

# generate images for each fonts
for fonts in font_lst:
    for i in tqdm(range(1)):
        for i in range(len(char_list)):
            # Choose a random word size
            word_size = random.randrange(0,10)
            # create word starting with the current character
            char_list_copy = char_list.copy()
            char_list_copy.remove(char_list[i])
            new_word = char_list[i]
            for _ in range(word_size):
                new_word +=random.choice(char_list_copy)
            # Draw the word on the image
            font = ImageFont.truetype(fonts+".ttf",14)
            img=Image.new("RGBA", (110,20),(255,255,255))
            draw = ImageDraw.Draw(img)
            draw.text((5, 0),new_word,(0,0,0),font=font)
            # Save the image and the corresponding text file            
            img.save('english/'+new_word+".png")

            with open('english/'+new_word+'.txt', 'w', encoding = 'utf8') as txt_file:
                txt_file.write(new_word)

import random

import string

import PIL

from PIL import ImageFont

from PIL import Image

from PIL import ImageDraw

from tqdm import tqdm

# create a list of characters to be used in creating dataset

char_list = []

for char in string.ascii_letters:

char_list.append(char)

# create font list

font_lst = ['arial', 'arialbd', 'times', 'timesbd', 'timesi','ariblk', 'arialbd', 'arialbi', 'ariali', 'timesbi']

# generate images for each fonts

for fonts in font_lst:

for i in tqdm(range(1)):

for i in range(len(char_list)):

# Choose a random word size

word_size = random.randrange(0,10)

# create word starting with the current character

char_list_copy = char_list.copy()

char_list_copy.remove(char_list[i])

new_word = char_list[i]

for _ in range(word_size):

new_word +=random.choice(char_list_copy)

# Draw the word on the image

font = ImageFont.truetype(fonts+".ttf",14)

img=Image.new("RGBA", (110,20),(255,255,255))

draw = ImageDraw.Draw(img)

draw.text((5, 0),new_word,(0,0,0),font=font)

# Save the image and the corresponding text file

img.save('english/'+new_word+".png")

with open('english/'+new_word+'.txt', 'w', encoding = 'utf8') as txt_file:

txt_file.write(new_word)

Below are some of the generated images shown.

To make it more realistic and challenging, you can add some geometric transformations (such as rotation, skewness, etc), or add some noise or even change the background color.

Now, using any above datasets, we can further fine-tune our recognition model. That’s all for this blog. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline: Text Recognition

Text Recognition

As you might remember, in the text detection step, we segmented out the text regions. Now, it’s time to recognize what text is present in those segments. This is known as Text Recognition. For instance, see the below image where we have segments on the left and the recognized text on the right. This is what we want, i.e. recognize the text present in the segments.

So, what we will do is, pass each segment one-by-one to our text recognition model that will output the recognized text. In general, the Text Recognition step outputs a text file that contains each segment’s bounding box coordinates along with the recognized text. For instance, see the below image(right) that contains 3 columns i.e. the segment name, coordinates, and the recognized text.

Now, you may ask Why coordinates? This will become clear when we will discuss Restructuring (the next step).

Similar to text detection, text recognition has also been a long-standing research topic in computer vision. Traditional text recognition methods generally consist of 3 main steps

Image pre-processing
character segmentation
character recognition

That is they mainly work at a character level. But when we deal with images having a complex background, font, or other distortions, character segmentation becomes a really challenging task. Thus, to avoid character segmentation, two major techniques are adopted

Connectionist Temporal Classification (CTC) based
Attention-based

In the next blog, let’s understand in detail, what is CTC and how it is used in Text Recognition. Then we will move to the attention-based algorithms. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Creating a CRNN model to recognize text in an image (Part-2)

43 Replies

In the previous blog, we have seen how to create training and validation dataset for our recognition model( Download and preprocess ). In this blog, we will create our model architecture and train it with the preprocessed data.

You can find full code here.

Model = CNN + RNN + CTC loss

Our model consists of three parts:

The convolutional neural network to extract features from the image
Recurrent neural network to predict sequential output per time-step
CTC loss function which is transcription layer used to predict output for each time step.

Model Architecture

Here is the model architecture that we used:

This network architecture is inspired by this paper. Let’s see the steps that we used to create the architecture:

Input shape for our architecture having an input image of height 32 and width 128.
Here we used seven convolution layers of which 6 are having kernel size (3,3) and the last one is of size (2.2). And the number of filters is increased from 64 to 512 layer by layer.
Two max-pooling layers are added with size (2,2) and then two max-pooling layers of size (2,1) are added to extract features with a larger width to predict long texts.
Also, we used batch normalization layers after fifth and sixth convolution layers which accelerates the training process.
Then we used a lambda function to squeeze the output from conv layer and make it compatible with LSTM layer.
Then used two Bidirectional LSTM layers each of which has 128 units. This RNN layer gives the output of size (batch_size, 31, 63). Where 63 is the total number of output classes including blank character.

Let’s see the code for this architecture:

# input with shape of height=32 and width=128 
inputs = Input(shape=(32,128,1))

# convolution layer with kernel size (3,3)
conv_1 = Conv2D(64, (3,3), activation = 'relu', padding='same')(inputs)
# poolig layer with kernel size (2,2)
pool_1 = MaxPool2D(pool_size=(2, 2), strides=2)(conv_1)

conv_2 = Conv2D(128, (3,3), activation = 'relu', padding='same')(pool_1)
pool_2 = MaxPool2D(pool_size=(2, 2), strides=2)(conv_2)

conv_3 = Conv2D(256, (3,3), activation = 'relu', padding='same')(pool_2)

conv_4 = Conv2D(256, (3,3), activation = 'relu', padding='same')(conv_3)
# poolig layer with kernel size (2,1)
pool_4 = MaxPool2D(pool_size=(2, 1))(conv_4)

conv_5 = Conv2D(512, (3,3), activation = 'relu', padding='same')(pool_4)
# Batch normalization layer
batch_norm_5 = BatchNormalization()(conv_5)

conv_6 = Conv2D(512, (3,3), activation = 'relu', padding='same')(batch_norm_5)
batch_norm_6 = BatchNormalization()(conv_6)
pool_6 = MaxPool2D(pool_size=(2, 1))(batch_norm_6)

conv_7 = Conv2D(512, (2,2), activation = 'relu')(pool_6)

squeezed = Lambda(lambda x: K.squeeze(x, 1))(conv_7)

# bidirectional LSTM layers with units=128
blstm_1 = Bidirectional(LSTM(128, return_sequences=True, dropout = 0.2))(squeezed)
blstm_2 = Bidirectional(LSTM(128, return_sequences=True, dropout = 0.2))(blstm_1)

outputs = Dense(len(char_list)+1, activation = 'softmax')(blstm_2)

act_model = Model(inputs, outputs)

# input with shape of height=32 and width=128

inputs = Input(shape=(32,128,1))

# convolution layer with kernel size (3,3)

conv_1 = Conv2D(64, (3,3), activation = 'relu', padding='same')(inputs)

# poolig layer with kernel size (2,2)

pool_1 = MaxPool2D(pool_size=(2, 2), strides=2)(conv_1)

conv_2 = Conv2D(128, (3,3), activation = 'relu', padding='same')(pool_1)

pool_2 = MaxPool2D(pool_size=(2, 2), strides=2)(conv_2)

conv_3 = Conv2D(256, (3,3), activation = 'relu', padding='same')(pool_2)

conv_4 = Conv2D(256, (3,3), activation = 'relu', padding='same')(conv_3)

# poolig layer with kernel size (2,1)

pool_4 = MaxPool2D(pool_size=(2, 1))(conv_4)

conv_5 = Conv2D(512, (3,3), activation = 'relu', padding='same')(pool_4)

# Batch normalization layer

batch_norm_5 = BatchNormalization()(conv_5)

conv_6 = Conv2D(512, (3,3), activation = 'relu', padding='same')(batch_norm_5)

batch_norm_6 = BatchNormalization()(conv_6)

pool_6 = MaxPool2D(pool_size=(2, 1))(batch_norm_6)

conv_7 = Conv2D(512, (2,2), activation = 'relu')(pool_6)

squeezed = Lambda(lambda x: K.squeeze(x, 1))(conv_7)

# bidirectional LSTM layers with units=128

blstm_1 = Bidirectional(LSTM(128, return_sequences=True, dropout = 0.2))(squeezed)

blstm_2 = Bidirectional(LSTM(128, return_sequences=True, dropout = 0.2))(blstm_1)

outputs = Dense(len(char_list)+1, activation = 'softmax')(blstm_2)

act_model = Model(inputs, outputs)

Loss Function

Now we have prepared model architecture, the next thing is to choose a loss function. In this text recognition problem, we will use the CTC loss function.

CTC loss is very helpful in text recognition problems. It helps us to prevent annotating each time step and help us to get rid of the problem where a single character can span multiple time step which needs further processing if we do not use CTC. If you want to know more about CTC( Connectionist Temporal Classification ) please follow this blog.

Note: For more details on the Optical Character Recognition , please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

A CTC loss function requires four arguments to compute the loss, predicted outputs, ground truth labels, input sequence length to LSTM and ground truth label length. To get this we need to create a custom loss function and then pass it to the model. To make it compatible with our model, we will create a model which takes these four inputs and outputs the loss. This model will be used for training and for testing we will use the model that we have created earlier “act_model”. Let’s see the code:

labels = Input(name='the_labels', shape=[max_label_len], dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')


def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args

    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)


loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')([outputs, labels, input_length, label_length])
model = Model(inputs=[inputs, labels, input_length, label_length], outputs=loss_out)

labels = Input(name='the_labels', shape=[max_label_len], dtype='float32')

input_length = Input(name='input_length', shape=[1], dtype='int64')

label_length = Input(name='label_length', shape=[1], dtype='int64')

def ctc_lambda_func(args):

y_pred, labels, input_length, label_length = args

return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')([outputs, labels, input_length, label_length])

model = Model(inputs=[inputs, labels, input_length, label_length], outputs=loss_out)

Compile and Train the Model

To train the model we will use Adam optimizer. Also, we can use Keras callbacks functionality to save the weights of the best model on the basis of validation loss.

model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer = 'adam')

filepath="best_model.hdf5"
checkpoint = ModelCheckpoint(filepath=filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')
callbacks_list = [checkpoint]

model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer = 'adam')

filepath="best_model.hdf5"

checkpoint = ModelCheckpoint(filepath=filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')

callbacks_list = [checkpoint]

In model.compile(), you can see that I have only taken y_pred and neglected y_true. This is because I have already taken labels as input to the model earlier.

Now train your model on 135000 training images and 15000 validation images.

training_img = np.array(training_img)
train_input_length = np.array(train_input_length)
train_label_length = np.array(train_label_length)

valid_img = np.array(valid_img)
valid_input_length = np.array(valid_input_length)
valid_label_length = np.array(valid_label_length)

model.fit(x=[training_img, train_padded_txt, train_input_length, train_label_length], y=np.zeros(135000), batch_size=256, epochs = 100, validation_data = ([valid_img, valid_padded_txt, valid_input_length, valid_label_length], [np.zeros(15000)]), verbose = 1, callbacks = callbacks_list)

training_img = np.array(training_img)

train_input_length = np.array(train_input_length)

train_label_length = np.array(train_label_length)

valid_img = np.array(valid_img)

valid_input_length = np.array(valid_input_length)

valid_label_length = np.array(valid_label_length)

model.fit(x=[training_img, train_padded_txt, train_input_length, train_label_length], y=np.zeros(135000), batch_size=256, epochs = 100, validation_data = ([valid_img, valid_padded_txt, valid_input_length, valid_label_length], [np.zeros(15000)]), verbose = 1, callbacks = callbacks_list)

Test the model

Our model is now trained with 135000 images. Now its time to test the model. We can not use our training model because it also requires labels as input and at test time we can not have labels. So to test the model we will use ” act_model ” that we have created earlier which takes only one input: test images.

As our model predicts the probability for each class at each time step, we need to use some transcription function to convert it into actual texts. Here we will use the CTC decoder to get the output text. Let’s see the code:

# load the saved best model weights
act_model.load_weights('best_model_without_thresold.hdf5')

# predict outputs on validation images
prediction = act_model.predict(valid_img)

# use CTC decoder
out = K.get_value(K.ctc_decode(prediction, input_length=np.ones(prediction.shape[0])*prediction.shape[1],
                         greedy=True)[0][0])

# see the results
i = 0
for x in out:
    print(valid_orig_txt[i])
    for p in x:  
        if int(p) != -1:
            print(char_list[int(p)], end = '')       
    print('\n')
    i+=1

# load the saved best model weights

act_model.load_weights('best_model_without_thresold.hdf5')

# predict outputs on validation images

prediction = act_model.predict(valid_img)

# use CTC decoder

out = K.get_value(K.ctc_decode(prediction, input_length=np.ones(prediction.shape[0])*prediction.shape[1],

greedy=True)[0][0])

# see the results

i = 0

for x in out:

print(valid_orig_txt[i])

for p in x:

if int(p) != -1:

print(char_list[int(p)], end = '')

print('\n')

i+=1

Here are some results from the trained model:

Pretty good Yeah! Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Optical Character Recognition Pipeline

1 Reply

In the previous blog, we discussed what is OCR with some real-life applications. But we didn’t get into the detail of how the OCR works. So, in this blog, let’s understand the general pipeline used by most OCR systems. Let’s get started.

OCR Pipeline

The general OCR pipeline is shown below.

As you might have noticed, this is almost similar to how we humans recognize the text. For instance, given an image containing text, we first try to locate the text and then recognize it. This is done so fast by our eye and brain combo that we hardly even notice it.

Now, let’s try to understand each pipeline component in detail, although, it’s pretty clear from their names. Let’s take the following image as an example and see what happens at each component.

Image Pre-processing

If you have ever done any computer vision task, then you must know how important this pre-processing step is. This simply means making the image more suitable for further tasks. For instance, the input image may be corrupted with noise or is skewed or rotated. In any of these cases, the next pipeline components may give erroneous results and all your hard work goes in vain. Thus, it is always necessary to pre-process the image to remove such deformities.

As an example, I’ve corrupted the below image with some salt and pepper noise and also added some rotation. If this image is passed as it is, this will give erroneous results in further steps. So, before passing we need to correct it for noise and rotation. This corrected image is shown on the right. Don’t worry, we will discuss in detail how this correction is done.

Text Detection

Text detection, as clear from the name, simply means finding the regions in the image where text can be present. This is clearly illustrated below. See how the green color bounding boxes are drawn around the detected text regions.

Text Detection has been an active research topic in computer vision. Most of the text detection methods developed so far can be divided into conventional (e.g. MSER) and deep-learning based (e.g. EAST, CTPN, etc.). Don’t worry, if you have never heard about these. We will be covering everything in detail in this series.

Text Recognition

In the previous step, we segmented out the text regions. Now, we will recognize what text is present in those segments. This is known as Text Recognition. So, what we will do is, pass each segment one-by-one to our text recognition model that will output the recognized text. Also, we keep a track of each segment bounding box coordinates. This will be helpful while we do restructuring.

In general, this step outputs a text file that contains each segment’s bounding box coordinates along with the recognized text. See the below image(right) that contains 3 columns i.e. the segment name, coordinates, and the recognized text.

Similar to text detection, this has also been an active research topic in computer vision. Several approaches have been developed for text recognition. In this series, we will be focussing mainly on the deep-learning based approaches which can be further divided into CTC-based and Attention-based. Again, don’t worry if you haven’t heard about these terms. We will be discussing these in detail in this series.

Restructuring

In the last step, we got the recognized text along with its position in the input image. Now, it’s time to restructure it. Restructuring simply means placing the text (according to the coordinates) similar to how it was in the input image. Simply iterate over each bounding box coordinate and put the recognized text. Take a look at the below image. Compare the structure of both the restructured and the original image. Both look almost similar.

Most of you might be wondering why do we need to do this or what’s the use of restructuring. So, let’s take a simple example to understand this. Suppose we want to extract the name from the below image.

To do this, we can simply tell the computer to extract the words following the word “Name:”. This can be easily done using Regex or any NLP technique. But what if you haven’t restructured the text. In that case, this would become cumbersome as it would involve iterating over the coordinates, first finding the word “Name:” coordinates then finding the next word coordinates that lie in the same line, and then extract the corresponding word. And if the name contains 2 or 3 words, this would take even more effort.

Hope you understand this, but if not, no worries, this will become more clear when we will discuss this in detail later.

So, this completes the OCR pipeline. Now, you can do anything with the extracted text. You can search, edit, translate it, or even convert it to speech. From the next blog, we will start discussing each of these pipeline components in detail. Till then, have a great time. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

TheAILearner

Mastering Artificial Intelligence

Tag Archives: text recognition

Text Recognition Datasets

Open Source Datasets

Synthetic Data

Optical Character Recognition Pipeline: Text Recognition

Text Recognition

Creating a CRNN model to recognize text in an image (Part-2)

Model = CNN + RNN + CTC loss

Model Architecture

Loss Function

Compile and Train the Model

Test the model

Optical Character Recognition Pipeline

OCR Pipeline

Image Pre-processing

Text Detection

Text Recognition

Restructuring