Text Recognition Datasets

In the previous blog, we built our own text recognition system from scratch using the well-known CNN+RNN+CTC approach. As you might remember, we got pretty decent results. One way to improve the model further is to train it more, and for that we need more training data. So, in this blog, let’s discuss some of the open-source text recognition datasets available and how to create synthetic data for text recognition. Let’s get started.

Open Source Datasets

Below are some of the open source text recognition datasets available.

The ICDAR datasets: ICDAR stands for the International Conference on Document Analysis and Recognition, which is held every two years. Its competitions have produced a series of scene text datasets that have shaped the research community, for instance the ICDAR-2013 and ICDAR-2015 datasets.
MJSynth Dataset: This synthetic word dataset is provided by the Visual Geometry Group, University of Oxford. It consists of 9 million synthetically generated images covering 90k English words and includes the training, validation, and test splits used in the authors' work.
IIIT 5K-word dataset: This is one of the more challenging recognition datasets available. It contains 5000 cropped word images from scene text and born-digital images, and it comes with a lexicon of more than 0.5 million dictionary words.
The Street View House Numbers (SVHN) Dataset: This dataset contains cropped images of house numbers in natural scenes collected from Google Street View images. It is usually used for digit recognition. For handwritten digits, you can also use the MNIST dataset.
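
As a small illustration of working with one of these datasets: MJSynth image files are, as far as I know, named like 115_Lube_45484.jpg, with the ground-truth word sitting between the two underscores. The sketch below assumes that naming pattern (please check it against your own download), and label_from_mjsynth_path is just a hypothetical helper name.

```python
import os

def label_from_mjsynth_path(path):
    """Return the word encoded in an MJSynth-style file name.

    Assumes names of the form <index>_<word>_<lexicon id>.jpg.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected file name: {path}")
    return parts[1]

# Example path following the assumed pattern
print(label_from_mjsynth_path("./2425/1/115_Lube_45484.jpg"))
```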

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Synthetic Data

As with text detection, labeled data for text recognition is relatively scarce. Thus, in order to further train or fine-tune the model, synthetic data can help. So, let’s discuss how to create synthetic data containing different fonts using Python. Here, we will use the popular PIL library (installed today as the Pillow package). Let’s first import the libraries that will be used.

import random
import string
import PIL
from PIL import ImageFont
from PIL import Image
from PIL import ImageDraw
from tqdm import tqdm

Then, we will create a list of characters that will be used in creating the dataset. This can be easily done using the string library as shown below.

# create a list of characters to be used in creating dataset
char_list = []
for char in string.ascii_letters:
    char_list.append(char)

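Similarly, create a list of fonts that you want to use. Here, I have used the list shown below. It has 10 entries, but note that 'arialbd' appears twice, so it covers 9 distinct fonts. Each name must match a .ttf font file that PIL can find on your system.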

# create font list
font_lst = ['arial', 'arialbd', 'times', 'timesbd', 'timesi','ariblk', 'arialbd', 'arialbi', 'ariali', 'timesbi']

Now, we will generate images corresponding to each font. Here, for each font and for each character in the char list, we will generate one word that starts with that character. For this, we first choose a random number of additional characters, from 0 to 9, as shown below.

word_size = random.randrange(0,10)

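Then, we will create a word that starts with the current character, followed by word_size random characters, so the word has between 1 and 10 characters. The current character is removed from the pool first, so it appears only at the start of the word.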

# create word starting with the current character
char_list_copy = char_list.copy()
char_list_copy.remove(char_list[i])
new_word = char_list[i]
for _ in range(word_size):
    new_word +=random.choice(char_list_copy)
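
The same steps can be wrapped in a small function, which makes the length range easy to check. This is only a sketch: make_word and its rng parameter are names I made up for illustration, and the seeded random.Random is there just to make the run repeatable.

```python
import random
import string

def make_word(first_char, alphabet, rng=random):
    """Build a random word that starts with first_char.

    0 to 9 extra characters are drawn from the alphabet with
    first_char removed, so the word has 1 to 10 characters and
    first_char appears only at the start.
    """
    extra = rng.randrange(0, 10)  # 0..9, the upper bound is excluded
    pool = [c for c in alphabet if c != first_char]
    return first_char + "".join(rng.choice(pool) for _ in range(extra))

rng = random.Random(0)  # seeded only so that the output is repeatable
words = [make_word(c, string.ascii_letters, rng) for c in "abc"]
print(words)
```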

Now, we need to draw that word onto an image. For that, we first create a font object for the current font at the given size. Here, I’ve used a font size of 14.

font = ImageFont.truetype(fonts+".ttf",14)
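
The font names used here have to exist as .ttf files on your machine, otherwise ImageFont.truetype raises an OSError. A minimal sketch of a fallback is given below, assuming you would rather continue with Pillow's built-in default font than stop. load_font is a hypothetical helper name, and the default font may not match the requested size, so treat it only as a way to test the rest of the pipeline.

```python
from PIL import ImageFont

def load_font(name, size=14):
    """Try to load <name>.ttf; fall back to Pillow's built-in font."""
    try:
        return ImageFont.truetype(name + ".ttf", size)
    except OSError:
        # Raised when the font file cannot be found or opened
        print(f"font '{name}' not found, using the default font")
        return ImageFont.load_default()

font = load_font("arial")
```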

Now, we will create a new image of size (110,20) with white color (255,255,255). Then we will create a drawing context and draw the text at (5,0) in black (0,0,0) as shown below.

img=Image.new("RGBA", (110,20),(255,255,255))
draw = ImageDraw.Draw(img)
draw.text((5, 0),new_word,(0,0,0),font=font)
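
One thing to watch with a fixed (110,20) canvas is that a long word in a wide font may not fit and could get clipped. A possible alternative, sketched below under my own assumptions, is to draw on a larger canvas and crop to the text plus some padding. render_word, padding and canvas are hypothetical names, and the default font is used only so that the example runs without any .ttf file. If your model expects a fixed input size, you would still resize or pad the result afterwards.

```python
from PIL import Image, ImageChops, ImageDraw, ImageFont

def render_word(word, font, padding=5, canvas=(400, 60)):
    """Draw word on a large white canvas, then crop to the ink plus padding."""
    img = Image.new("RGB", canvas, (255, 255, 255))
    draw = ImageDraw.Draw(img)
    draw.text((padding, padding), word, (0, 0, 0), font=font)
    # After inversion the background is 0, so getbbox() finds the text
    bbox = ImageChops.invert(img).getbbox()
    if bbox is None:  # nothing was drawn, return the canvas unchanged
        return img
    left, top, right, bottom = bbox
    box = (max(left - padding, 0), max(top - padding, 0),
           min(right + padding, canvas[0]), min(bottom + padding, canvas[1]))
    return img.crop(box)

sample = render_word("WWWWWWWWWW", ImageFont.load_default())
print(sample.size)
```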

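Finally, save the image and the corresponding text file as shown below. The english folder must exist before these lines run, since neither img.save nor open will create it.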

img.save('english/'+new_word+".png")

# Save the word in the text file
with open('english/'+new_word+'.txt', 'w', encoding = 'utf8') as txt_file:
    txt_file.write(new_word)
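
Because the word itself is used as the file name, two samples with the same word overwrite each other, and on a case-insensitive file system names like Abc.png and abc.png would also clash. One way around this, shown as a sketch with the hypothetical helper save_sample, is to name the files by a running index and keep the word only inside the text file.

```python
import os
import tempfile
from PIL import Image

def save_sample(img, word, out_dir, index):
    """Save image and label under an index-based name to avoid collisions."""
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.join(out_dir, f"{index:06d}")
    img.save(stem + ".png")
    with open(stem + ".txt", "w", encoding="utf8") as txt_file:
        txt_file.write(word)
    return stem

# Demo in a temporary folder with two words that differ only in case
out_dir = tempfile.mkdtemp()
blank = Image.new("RGB", (110, 20), (255, 255, 255))
save_sample(blank, "Abc", out_dir, 0)
save_sample(blank, "abc", out_dir, 1)
print(sorted(os.listdir(out_dir)))
# ['000000.png', '000000.txt', '000001.png', '000001.txt']
```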

Below is the full code.

import os
import random
import string
from PIL import ImageFont
from PIL import Image
from PIL import ImageDraw
from tqdm import tqdm

# create a list of characters to be used in creating dataset
char_list = list(string.ascii_letters)

# create font list (each name must match a .ttf file that PIL can find)
font_lst = ['arial', 'arialbd', 'times', 'timesbd', 'timesi','ariblk', 'arialbd', 'arialbi', 'ariali', 'timesbi']

# make sure the output folder exists before saving anything
os.makedirs('english', exist_ok=True)

# generate images for each font (tqdm shows progress over the fonts)
for fonts in tqdm(font_lst):
    # load the font once per font instead of once per word
    font = ImageFont.truetype(fonts+".ttf",14)
    for i in range(len(char_list)):
        # Choose a random number of extra characters (0 to 9)
        word_size = random.randrange(0,10)
        # create word starting with the current character
        char_list_copy = char_list.copy()
        char_list_copy.remove(char_list[i])
        new_word = char_list[i]
        for _ in range(word_size):
            new_word +=random.choice(char_list_copy)
        # Draw the word on the image
        img=Image.new("RGBA", (110,20),(255,255,255))
        draw = ImageDraw.Draw(img)
        draw.text((5, 0),new_word,(0,0,0),font=font)
        # Save the image and the corresponding text file
        img.save('english/'+new_word+".png")

        with open('english/'+new_word+'.txt', 'w', encoding = 'utf8') as txt_file:
            txt_file.write(new_word)

Below are some of the generated images.

To make the data more realistic and challenging, you can add geometric transformations (such as rotation or skew), add noise, or even change the background color.
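
Here is a rough sketch of what such augmentations could look like with PIL alone. It is not part of the generator above: augment, max_angle and noise_fraction are names and values I picked for illustration, and it assumes dark text on a white background, as in our generated images.

```python
import random
from PIL import Image, ImageOps

def augment(img, rng=random, max_angle=5.0, noise_fraction=0.02):
    """Return a copy with a tinted background, a small rotation and noise."""
    # Map white to a random light background colour, keep the text black
    background = tuple(rng.randint(200, 255) for _ in range(3))
    out = ImageOps.colorize(img.convert("L"), black=(0, 0, 0), white=background)
    # Rotate by a small random angle; expand=True keeps the corners visible
    angle = rng.uniform(-max_angle, max_angle)
    out = out.rotate(angle, resample=Image.BICUBIC, expand=True,
                     fillcolor=background)
    # Salt-and-pepper noise on a small fraction of the pixels
    pixels = out.load()
    width, height = out.size
    for _ in range(int(width * height * noise_fraction)):
        x, y = rng.randrange(width), rng.randrange(height)
        value = rng.choice((0, 255))
        pixels[x, y] = (value, value, value)
    return out

clean = Image.new("RGB", (110, 20), (255, 255, 255))
noisy = augment(clean, random.Random(0))
print(noisy.mode, noisy.size)
```

Since expand=True changes the image size slightly, you would resize the result back to your model's input size before training.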

Now, using any of the above datasets, we can further fine-tune our recognition model. That’s all for this blog. Hope you enjoyed reading.

If you have any questions or suggestions, please feel free to ask and I will do my best to help or improve. Goodbye until next time.
