Text Recognition Datasets

In the previous blog, we build our own Text Recognition system from scratch using the very famous CNN+RNN+CTC based approach. As you might remember, we got pretty decent results. In order to further fine-tune our model, one thing we can do is more training. But for that, we need more training data. So, in this blog, let’s discuss some of the open-source text recognition datasets available and how to create synthetic data for text recognition. Let’s get started.

Open Source Datasets

Below are some of the open source text recognition datasets available.

  • The ICDAR datasets: ICDAR stands for International Conference for Document Analysis and Recognition. This is held every 2 years. They brought about a series of scene text datasets that have shaped the research community. For instance, ICDAR-2013 and ICDAR-2015 datasets.
  • MJSynth Dataset: This synthetic word dataset is provided by the Visual Geometry Group, University of Oxford. This dataset consists of synthetically generated 9 million images covering 90k English words and includes the training, validation, and test splits used in our work.
  • IIIT 5K-word dataset: This is one of the most challenging and largest recognition datasets available. The dataset contains 5000 cropped word images from Scene Texts and born-digital images. They also provide a lexicon of more than 0.5 million dictionary words with this dataset.
  • The Street View House Numbers (SVHN) Dataset: This dataset contains cropped images of house numbers in natural scenes collected from Google View images. This dataset is usually used in digit recognition. You can also use MNIST handwritten dataset.

Note: For more details on the Optical Character Recognition , please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Synthetic Data

Similar to text detection, when it comes to data, the text recognition task is also not so rich. Thus, in order to further train or fine-tune the model, synthetic data can help. So, let’s discuss how to create synthetic data containing different fonts using Python. Here, we will use the famous PIL library. Let’s first import the libraries that will be used.

Then, we will create a list of characters that will be used in creating the dataset. This can be easily done using the string library as shown below.

Similarly, create a list of fonts that you want to use. Here, I have used 10 different types of fonts as shown below.

Now, we will generate images corresponding to each font. Here, for each font, for each character in the char list, we will generate words. For this, first we choose a random word size as shown below.

Then, we will create a word of length word_size and starting with the current character as shown below.

Now, we need to draw that word on to the image. For that, first we will create a font object for a font of the given size. Here, I’ve used a font size of 14.

Now, we will create a new image of size (110,20) with white color (255,255,255). Then we will create a drawing context and draw the text at (5,0) with black color(0,0,0) as shown below.

Finally, save the image and the corresponding text file as shown below.

Below is the full code

Below are some of the generated images shown.

To make it more realistic and challenging, you can add some geometric transformations (such as rotation, skewness, etc), or add some noise or even change the background color.

Now, using any above datasets, we can further fine-tune our recognition model. That’s all for this blog. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Leave a Reply