
Optical Character Recognition Pipeline: Generating Dataset

The first step in building any deep learning model is to prepare the dataset. Continuing our optical character recognition pipeline series, in this blog we will see how to get our training and test data.

In our OCR pipeline, we first need data for both segmentation and recognition (text). For the segmentation part, the data consists of images and corresponding files containing the coordinates of the words present in each image. Let’s see an example.
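To make the idea of a coordinates file concrete, here is a minimal sketch of reading one. The layout is an assumption made for illustration only (one word per line, written as `x_min,y_min,x_max,y_max,word`); real datasets each define their own format, and the names `WordBox` and `parse_annotation` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """Axis-aligned bounding box of one word, in pixel coordinates."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int
    text: str


def parse_annotation(content: str) -> list:
    """Parse lines of the assumed form 'x_min,y_min,x_max,y_max,word'."""
    boxes = []
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split only on the first four commas so a comma inside the word survives.
        x_min, y_min, x_max, y_max, word = line.split(",", 4)
        boxes.append(WordBox(int(x_min), int(y_min), int(x_max), int(y_max), word))
    return boxes
```

Each parsed box can then be drawn over its image to check that the annotations line up with the words.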

For the recognition part, the data consists of images and their corresponding text files. Here, each segmented image contains a single word.

Image and Text
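As a rough sketch of how such image and text pairs might be collected for training, the snippet below matches each image with the text file that shares its file name. The naming convention (`word_1.png` next to `word_1.txt`) and the helper name `load_pairs` are assumptions, not something fixed by the pipeline.

```python
from pathlib import Path


def load_pairs(data_dir, image_suffix=".png"):
    """Return (image_path, label) tuples for images that have a matching .txt file."""
    pairs = []
    for img_path in sorted(Path(data_dir).glob(f"*{image_suffix}")):
        txt_path = img_path.with_suffix(".txt")
        if txt_path.exists():
            label = txt_path.read_text(encoding="utf-8").strip()
            pairs.append((img_path, label))
    return pairs
```

Images without a label file are skipped here, so the result can be fed directly to a training loop.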

Open Source Dataset:

There are some open source datasets available for our pipeline. For the segmentation part, here are some useful open source datasets.

Now let’s see some of the open source datasets for text recognition (images and their corresponding texts).

Synthetic Data:

In some cases, training your OCR model with synthetic data can also be useful. You can create your own synthetic data using a Python script. You can also add geometric transformations to simulate real-world distortion in the data. For example, here is a script to generate synthetic data for text recognition:

In the above code, I have generated images of English words and their corresponding text files using different font types with a font size of 14. The segmented images will look like below:

Five segmented images generated from the above code
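A generator along the lines described in this section can be sketched as follows. This is a minimal illustration and not necessarily the script used for the images above. It assumes the Pillow library is installed, and the function names, the output naming scheme (`word_00000.png` / `word_00000.txt`), and the small random rotation used as the geometric distortion are all choices made for this sketch. If no font files are passed, it falls back to Pillow's built-in default font, in which case the font size only affects the canvas size.

```python
import random
from pathlib import Path

from PIL import Image, ImageDraw, ImageFont, ImageOps


def render_word(word, font, font_size=14, pad=4):
    """Render one word as black text on a white background, tightly cropped."""
    # Draw white text on a black canvas that is comfortably larger than the word.
    canvas = Image.new("L", (2 * font_size * len(word) + 4 * pad, 4 * font_size), 0)
    ImageDraw.Draw(canvas).text((pad, pad), word, fill=255, font=font)
    bbox = canvas.getbbox()  # bounding box of the non-black pixels
    if bbox is None:
        raise ValueError(f"nothing was drawn for {word!r}")
    left, top, right, bottom = bbox
    crop = canvas.crop((max(left - pad, 0), max(top - pad, 0),
                        min(right + pad, canvas.width),
                        min(bottom + pad, canvas.height)))
    return ImageOps.invert(crop)  # flip to black text on white


def generate_dataset(words, out_dir, font_paths=(), font_size=14,
                     max_angle=3.0, seed=0):
    """Write one image and one text file per word into out_dir."""
    rng = random.Random(seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # font_paths are assumed to point to TrueType font files on your machine.
    fonts = [ImageFont.truetype(str(p), font_size) for p in font_paths]
    if not fonts:
        fonts = [ImageFont.load_default()]
    for i, word in enumerate(words):
        img = render_word(word, rng.choice(fonts), font_size)
        # Simple geometric distortion: a small random rotation on a white fill.
        angle = rng.uniform(-max_angle, max_angle)
        img = img.rotate(angle, resample=Image.BICUBIC, expand=True, fillcolor=255)
        stem = f"word_{i:05d}"
        img.save(out / f"{stem}.png")
        (out / f"{stem}.txt").write_text(word, encoding="utf-8")
    return out
```

For example, `generate_dataset(["optical", "character"], "synthetic_words")` would write two image files and two text files into a `synthetic_words` folder. Passing several paths in `font_paths` picks a font at random for each word.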

Annotation Tools and Manual Data:

Another way to create a text segmentation dataset is by using annotation tools. In this case, you collect images manually or get them from the internet, and then manually annotate the text in the images with bounding boxes. Annotation tools like labelImg can work in this case.
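labelImg can save annotations as Pascal VOC XML files, one per image. As a small sketch of how the bounding boxes might be read back for training, the helper below (the name `read_voc_boxes` is hypothetical) uses only the Python standard library. What ends up in the `name` field depends on the label you typed while annotating, for example a generic class such as `text`.

```python
import xml.etree.ElementTree as ET


def read_voc_boxes(xml_source):
    """Return (label, x_min, y_min, x_max, y_max) tuples from a Pascal VOC XML file.

    xml_source may be a file path or an open file object.
    """
    root = ET.parse(xml_source).getroot()
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        if bndbox is None:
            continue  # skip objects without a bounding box
        coords = [int(float(bndbox.findtext(tag)))
                  for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.findtext("name"), *coords))
    return boxes
```

Each tuple can then be used to crop the word out of the page image or to build the coordinate files needed for the segmentation part.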

That’s all for generating the dataset. In the next blog, we will see the image preprocessing steps to apply to these datasets. Hope you enjoyed reading.

Next Blog: Optical Character Recognition Pipeline: Image Preprocessing

If you have any doubts or suggestions, please feel free to ask and I will do my best to help or improve myself. Goodbye until next time.