Creating a CRNN model to recognize text in an image (Part-1)

In the earlier blogs, we learned the various stages of the optical character recognition pipeline. In this blog, we will create a convolutional recurrent neural network (CRNN) with CTC (Connectionist Temporal Classification) loss to implement our recognition model.

We will use the following steps to create our text recognition model.

  • Collecting Dataset
  • Preprocessing Data
  • Creating Network Architecture
  • Defining Loss function
  • Training model
  • Decoding outputs from prediction

Dataset

In this blog, we will use the data provided by the Visual Geometry Group. This is a huge dataset, about 10 GB of images in total. Here I have used only 135,000 images for the training set and 15,000 images for the validation set. The data contains text image segments that look like the images shown below:

To download the dataset, you can either download it directly from this link or use the following commands to download and unzip the data.

Preprocessing

Now that we have our dataset, we need to apply some preprocessing to make it acceptable to our model. We need to preprocess both the input images and the output labels. To preprocess an input image, we will do the following:

  • Read the image and convert it into a gray-scale image
  • Make each image of size (128,32) by using padding
  • Expand the image dimensions to (128,32,1) to make it compatible with the input shape of the architecture
  • Normalize the image pixel values by dividing them by 255

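The image steps above can be sketched as follows (a minimal illustration in NumPy; the function name is my own, the image is assumed to already be a gray-scale array, e.g. from `cv2.imread(path, cv2.IMREAD_GRAYSCALE)`, and oversized images would need an extra resize):

```python
import numpy as np

def preprocess_image(img):
    """Pad a gray-scale word image (H x W uint8 array) to width 128 and
    height 32, add a channel axis, and scale pixel values to [0, 1]."""
    h, w = img.shape
    canvas = np.full((32, 128), 255, dtype=np.uint8)  # white background
    canvas[:h, :w] = img[:32, :128]                   # paste top-left, pad the rest
    img = np.expand_dims(canvas, axis=-1)             # -> (32, 128, 1)
    return img.astype(np.float32) / 255.0             # normalize by 255
```

For example, a 20x60 crop comes out as a (32, 128, 1) float array with white padding on the right and bottom.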
To preprocess the output labels, do the following:

  • Read the text from the name of the image, as the image name contains the text written inside the image.
  • Encode each character of a word into a numerical value by creating a mapping ( e.g. ‘a’:0, ‘b’:1, …, ‘z’:25 ). Say we have the word ‘abab’; then our encoded label would be [0,1,0,1].
  • Compute the maximum length over all words and pad every output label to that maximum length. This is done to make it compatible with the output shape of our RNN architecture.

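The encoding described above can be sketched like this (the function name, the lower-case-only alphabet, and the -1 padding value are my assumptions for illustration):

```python
import string

# Hypothetical character set: lower-case letters only, so 'a' -> 0 ... 'z' -> 25
CHAR_LIST = string.ascii_lowercase

def encode_label(text, max_len):
    """Map each character of a word to its index in CHAR_LIST, then pad
    the result to max_len (the longest word) with a filler value of -1."""
    encoded = [CHAR_LIST.index(ch) for ch in text]
    return encoded + [-1] * (max_len - len(encoded))
```

For example, `encode_label('abab', 6)` returns `[0, 1, 0, 1, -1, -1]`.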
In the preprocessing step we also need to create two other lists: one holding the label lengths and the other the input lengths to our RNN. These two lists are important for our CTC loss ( we will see why later ). The label length is the length of each output text label, while the input length is the same for every input to the LSTM layer, which is 31 in our architecture.

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Following is the code for our preprocessing step:
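The original code listing did not survive in this copy, so below is a minimal self-contained sketch of the whole step (the alphabet, the maximum label length, and all helper names are my assumptions; it pads each image to 128x32, encodes the label, and builds the two length lists the CTC loss needs):

```python
import numpy as np

CHAR_LIST = 'abcdefghijklmnopqrstuvwxyz'   # assumed alphabet; extend for digits etc.
MAX_LABEL_LEN = 16                         # assumed longest word in the dataset
RNN_STEPS = 31                             # time-steps produced by the LSTM layer

def encode(text):
    """Character -> index, padded to MAX_LABEL_LEN with a filler value of -1."""
    y = [CHAR_LIST.index(c) for c in text.lower()]
    return y + [-1] * (MAX_LABEL_LEN - len(y))

def build_arrays(samples):
    """samples: list of (gray-scale uint8 image, ground-truth text) pairs.
    Returns the four arrays needed to train with CTC loss."""
    images, labels, label_length, input_length = [], [], [], []
    for img, text in samples:
        canvas = np.full((32, 128), 255, dtype=np.uint8)   # pad to 128x32 with white
        h, w = img.shape
        canvas[:h, :w] = img[:32, :128]
        images.append(np.expand_dims(canvas, -1) / 255.0)  # (32,128,1), values in [0,1]
        labels.append(encode(text))
        label_length.append(len(text))                     # true length of each label
        input_length.append(RNN_STEPS)                     # same for every image
    return (np.array(images), np.array(labels),
            np.array(label_length), np.array(input_length))
```

A quick usage example: `build_arrays([(img, 'abab')])` yields an image batch of shape (1, 32, 128, 1), the encoded label starting [0, 1, 0, 1], a label length of 4, and an input length of 31.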

Now you should have some feeling for how the training and validation data for our recognition model are generated. In the next blog, we will use this data to train and test our neural network.

Next Blog: Creating a CRNN model to recognize text in an image (Part-2)

Hope you enjoy reading.

If you have any doubts/suggestions, please feel free to ask, and I will do my best to help or improve myself. Good-bye until next time.

10 thoughts on “Creating a CRNN model to recognize text in an image (Part-1)”

  1. Deepthi

    I got this following error.

    InvalidArgumentError: Not enough time for target transition sequence (required: 37, available: 31). You can turn this error into a warning by using the flag ignore_longer_outputs_than_inputs
    [[{{node ctc_3/CTCLoss}}]]

    Each time I run model.fit, the required number changes. How and what do I change in the code? I have implemented the same code as in your post, changing only minor things according to my requirements.

  2. lakshmi

    Hi, can you please help me with how to set up the dataset? I have downloaded the total dataset, but when I try to execute with some 10% of the data it gives the error “ValueError: Error when checking input: expected input_1 to have 4 dimensions, but got array with shape (0, 1)”. Could you please help?

    1. Alejandro Soumah

      That is because you are having a problem locating the dataset. Check that your dataset is unzipped and in the location that it says.

  3. Kent Chen

    Is it possible to convert the Keras model to TensorRT (I want to run it on an NVIDIA Jetson Nano)? I tried to convert to ONNX, but failed to convert ONNX to TensorRT.

