Category Archives: Computer Vision

Weight Pruning in Neural Networks

Weight pruning is a technique used to reduce the size of a neural network by removing certain weights, typically those with small magnitudes, without significantly affecting the model’s performance. The idea is to identify and eliminate connections in the network that contribute little to the overall computation. This reduces the model’s memory footprint and computational requirements, particularly at inference time.

Initial Model: Let’s consider a simple fully connected neural network with one hidden layer. The architecture might look like this:
Input layer (features) -> Hidden layer -> Output layer (predictions)

Training: The network is trained on a dataset to learn the mapping from inputs to outputs. During training, weights are adjusted through optimization algorithms like gradient descent to minimize the loss function.

Pruning: After training, weight pruning involves identifying and removing certain weights. A common criterion is to set a threshold: weights whose absolute values fall below it are pruned, i.e., set to zero. For example, suppose we have a weight matrix connecting the input layer to the hidden layer and we choose a pruning threshold of 0.2. Every weight smaller than 0.2 in absolute value is then zeroed out, which removes the corresponding connections from the network.
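
As a rough illustration, here is a small NumPy sketch of this magnitude-based pruning with a threshold of 0.2; the weight values are made up purely for demonstration:

    import numpy as np

    # Hypothetical weight matrix connecting a 4-unit input layer to a 3-unit hidden layer.
    W = np.array([[ 0.45, -0.12,  0.30],
                  [-0.05,  0.80, -0.15],
                  [ 0.22, -0.03,  0.55],
                  [ 0.10,  0.60, -0.25]])

    threshold = 0.2
    mask = np.abs(W) >= threshold     # keep only weights with |w| >= 0.2
    W_pruned = W * mask               # pruned weights are set to zero

    print(W_pruned)
    print("Fraction of weights pruned:", 1.0 - mask.mean())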

Fine-tuning: Optionally, the pruned model can undergo fine-tuning to recover any loss in accuracy caused by pruning. During fine-tuning, the remaining weights may be adjusted to compensate for the pruned connections.

Weight pruning is an effective method for model compression, reducing the number of parameters in a neural network and making it more efficient for deployment in resource-constrained environments.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Style Generative Adversarial Network (StyleGAN)

A generative adversarial network (GAN) generates synthetic images that are indistinguishable from authentic images. A GAN consists of a generator network and a discriminator network. The generator tries to generate new images from a noise vector, and the discriminator tries to distinguish these generated images from images in the original dataset. While training the GAN, the generator tries to fool the discriminator, and the discriminator keeps improving at differentiating real images from fake ones. Training continues until the discriminator is fooled about half the time, which means the generator is producing data close to the original data distribution.

Since the introduction of generative adversarial networks in 2014, there have been many improvements to the architecture: deep convolutional GAN, semi-supervised GAN, conditional GAN, CycleGAN and many more. These variants mainly focus on improving the discriminator, while the generator continues to operate as a black box.

The style generative adversarial network (StyleGAN) proposes an alternative generator architecture that can control specific features of the output image, such as pose, identity, hair and freckles (when trained on a face dataset), without compromising image quality.

Baseline Architecture

The baseline architecture for StyleGAN is taken from another recently introduced GAN variant: Progressive GAN. In Progressive GAN, both the generator and the discriminator grow progressively: training starts at a low resolution, and layers are added to the model so that it can capture increasingly fine details. Images start at 4×4 and are grown up to a resolution of 1024×1024. This progressively growing architecture speeds up and stabilizes training, which helps in generating such high-quality images.

StyleGAN Architecture

Progressive GAN was able to generate high-quality images, but controlling specific features of the generated image was difficult with its architecture. To control the features of the output image, some changes were made to Progressive GAN’s generator architecture, and StyleGAN was created. Here is the architecture of the StyleGAN generator.

Along with the generator’s architecture, the above figure also contrasts a traditional generator network with a style-based generator network. To develop StyleGAN’s generator network, several modifications were made to Progressive GAN. We will discuss these modifications one by one.

1. Removal of Traditional Input Layer

In traditional generator networks, a latent vector is provided through an input layer. This latent vector must follow the probability density of the training data, which can lead to some degree of entanglement. For example, if the training data contains more images of one type than of other variations, the generator tends to produce images with features related to that over-represented type. So instead of a traditional input layer, the synthesis network (generator network) starts with a learned 4 × 4 × 512 constant tensor.

2. Mapping Network and AdaIN

The mapping network embeds the input latent code into an intermediate latent space, which can then be used as a style and incorporated at each block of the synthesis network. As you can see in the generator architecture above, the latent code is fed through 8 fully connected layers to produce an intermediate latent vector w in the space W.

This intermediate latent vector w is passed through a learned affine transformation “A” (shown in the architecture) and specialized into styles y = (ys, yb), which are incorporated into each block of the generator network. To do this, the feature maps (xi) of each block are first normalized separately and then scaled and biased using the corresponding styles. This operation is known as adaptive instance normalization (AdaIN).

This AdaIN operation is applied in every block of the generator network and helps decide which features appear in the output image.
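
To make the operation concrete, here is a minimal NumPy sketch of AdaIN; in the real network the style scale and bias (ys, yb) come from the learned transformation “A”, but here they are random placeholders:

    import numpy as np

    def adain(x, y_s, y_b, eps=1e-8):
        """Adaptive instance normalization.
        x   : feature maps of shape (batch, height, width, channels)
        y_s : per-channel style scale, shape (batch, channels)
        y_b : per-channel style bias,  shape (batch, channels)
        """
        # Normalize each feature map of each sample separately.
        mean = x.mean(axis=(1, 2), keepdims=True)
        std = x.std(axis=(1, 2), keepdims=True) + eps
        x_norm = (x - mean) / std
        # Scale and bias with the style derived from the intermediate latent vector.
        return y_s[:, None, None, :] * x_norm + y_b[:, None, None, :]

    # Example: a batch of 2 feature maps of size 8x8 with 512 channels.
    x = np.random.randn(2, 8, 8, 512)
    y_s, y_b = np.random.randn(2, 512), np.random.randn(2, 512)
    out = adain(x, y_s, y_b)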

3. Bilinear Upsampling

The generator network grows progressively. Upsampling in a generator is usually done with transposed convolution layers, but StyleGAN uses bilinear upsampling instead of transposed convolutions.

4. Noise Layers

As you can see in the StyleGAN architecture, noise layers are added after each block of the generator (synthesis) network. The noise is single-channel, uncorrelated Gaussian noise that is broadcast to the shape of each block’s feature maps using learned per-channel scaling factors “B”. By adding this noise, StyleGAN can introduce stochastic variation into the output.

There are many stochastic features in a human face, such as hair, stubble, freckles and skin pores. In a traditional generator, there is only a single noise vector to account for these stochastic variations, which is not very effective. Adding noise at each block of the synthesis network ensures that the noise affects only the stochastic aspects of the face.
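
A rough NumPy sketch of this noise injection is shown below; the per-channel scaling factors stand in for the learned layer “B” and are random here purely for illustration:

    import numpy as np

    def add_noise(x, per_channel_scale):
        """Add scaled single-channel Gaussian noise to feature maps.
        x                 : feature maps, shape (batch, height, width, channels)
        per_channel_scale : learned scaling factors (layer "B"), shape (channels,)
        """
        b, h, w, c = x.shape
        noise = np.random.randn(b, h, w, 1)                          # uncorrelated Gaussian noise
        return x + noise * per_channel_scale.reshape(1, 1, 1, c)     # broadcast to all channels

    x = np.random.randn(2, 8, 8, 512)
    scale = 0.1 * np.random.randn(512)    # placeholder for the learned weights of "B"
    out = add_noise(x, scale)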

5. Style Mixing

This is basically a regularization technique. During training, some images are generated using two latent codes: two latent codes z1 and z2 are mapped to styles w1 and w2 by the mapping network. A crossover point is selected in the synthesis network, style w1 is applied up to that point and style w2 is applied after it, and the network is trained in this way.

In the synthesis network, styles are added at each block, so the network may come to assume that adjacent styles are correlated. Style mixing prevents the network from making this assumption.


These were the basic changes made to the baseline architecture to create StyleGAN. Other aspects, such as the discriminator architecture, mini-batch sizes, Adam hyperparameters and the exponential moving average of the generator weights, are kept the same as in the baseline architecture.

Summary

StyleGAN has proven to be promising at producing high-quality, realistic images while also giving control over particular features of the generated image. Traditional generators clearly lag far behind this improved generator network. Concepts like the mapping network and AdaIN can be very helpful in other GAN architectures and research work.

Referenced Research Papers:
  1. A Style-Based Generator Architecture for Generative Adversarial Networks
  2. Progressive Growing of GANs for Improved Quality, Stability, and Variation

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Implementation of CycleGAN for Image-to-image Translation

CycleGAN is a variant of the generative adversarial network that was introduced to perform image translation from domain X to domain Y without using a paired set of training examples. In the previous blog, I have already described CycleGAN in detail. In this blog, we will implement CycleGAN to translate apple images to orange images and vice versa with the help of the Keras library. Here are some recommended blogs that you should refer to before implementing CycleGAN:

  1. Cycle-Consistent Generative Adversarial Networks (CycleGAN)
  2. Image to Image Translation Using Conditional GAN
  3. Implementation of Image-to-image translation using conditional GAN

Load the Dataset And Preprocess

Unlike many other image translation algorithms, CycleGAN does not require a paired dataset. Hence, we will use two sets of images: one consists of apple images and the other of orange images, and the two sets are not paired with each other. Here are some images from the dataset:

You can download the dataset from this link, or run the following command from your terminal.

The dataset consists of four folders: trainA, trainB, testA, and testB. The ‘A’ folders contain apple images and the ‘B’ folders contain orange images. The training set consists of approximately 1000 images of each type, and the test set consists of approximately 200 images of each type.

So, let’s first import all the required libraries:

The dataset is already partially preprocessed, as all images have the same size of (256, 256, 3). The other preprocessing steps we are going to use are normalization and random flipping: every image is normalized to the range -1 to 1 and randomly flipped horizontally. Here is the code:
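
A minimal sketch of these two steps (normalization to [-1, 1] and a random horizontal flip) looks like this:

    import numpy as np

    def preprocess(img):
        """img: uint8 array of shape (256, 256, 3) with pixel values in [0, 255]."""
        img = img.astype(np.float32)
        img = (img / 127.5) - 1.0      # normalize to the range [-1, 1]
        if np.random.rand() < 0.5:     # random horizontal flip
            img = img[:, ::-1, :]
        return img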

Now load the training images from the directory into a list.
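
One possible way to do this, assuming the dataset is unzipped into an “apple2orange” folder with the layout described above, is:

    import glob
    import numpy as np
    from tensorflow.keras.preprocessing.image import load_img, img_to_array

    def load_folder(path):
        images = []
        for fname in sorted(glob.glob(path + "/*.jpg")):
            img = img_to_array(load_img(fname, target_size=(256, 256)))
            images.append(preprocess(img))   # preprocess() from the previous snippet
        return np.array(images)

    train_A = load_folder("apple2orange/trainA")   # apple images
    train_B = load_folder("apple2orange/trainB")   # orange images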

Build the Generator

The network architecture that I have used is very similar to the architecture used in image-to-image translation with conditional GAN. The major difference is the loss function. In CycleGAN two more losses have been introduced. One is cycle consistency loss and the other is identity loss.

Here the generator network is a U-Net architecture: an encoder-decoder model with skip connections between the encoder and the decoder. We will use two generator networks, one translating from apple to orange (G: X -> Y) and the other from orange to apple (F: Y -> X). Each generator consists of an encoder and a decoder. Each encoder block consists of three layers (Conv -> BatchNorm -> LeakyReLU), and each decoder block consists of four layers (Transposed Conv -> BatchNorm -> Dropout -> ReLU). The generator takes an image as input and outputs a generated image, both of size (256, 256, 3). Here is the code:
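
A compact Keras sketch of such a U-Net generator is given below; the number of blocks and the filter counts are representative choices rather than an exact reproduction of the original code, and the two generators are called g_AB and g_BA here:

    from tensorflow.keras import layers, Model

    def encoder_block(x, filters, use_norm=True):
        # Conv -> BatchNorm -> LeakyReLU, downsampling by stride 2
        x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
        if use_norm:
            x = layers.BatchNormalization()(x)
        return layers.LeakyReLU(0.2)(x)

    def decoder_block(x, skip, filters, dropout=False):
        # Transposed Conv -> BatchNorm -> (Dropout) -> ReLU, then skip connection
        x = layers.Conv2DTranspose(filters, 4, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        if dropout:
            x = layers.Dropout(0.5)(x)
        x = layers.ReLU()(x)
        return layers.Concatenate()([x, skip])

    def build_generator():
        inp = layers.Input(shape=(256, 256, 3))
        e1 = encoder_block(inp, 64, use_norm=False)    # 128x128
        e2 = encoder_block(e1, 128)                    # 64x64
        e3 = encoder_block(e2, 256)                    # 32x32
        e4 = encoder_block(e3, 512)                    # 16x16
        e5 = encoder_block(e4, 512)                    # 8x8
        b  = encoder_block(e5, 512)                    # 4x4 bottleneck
        d1 = decoder_block(b,  e5, 512, dropout=True)  # 8x8
        d2 = decoder_block(d1, e4, 512, dropout=True)  # 16x16
        d3 = decoder_block(d2, e3, 256)                # 32x32
        d4 = decoder_block(d3, e2, 128)                # 64x64
        d5 = decoder_block(d4, e1, 64)                 # 128x128
        out = layers.Conv2DTranspose(3, 4, strides=2, padding="same",
                                     activation="tanh")(d5)  # 256x256x3 in [-1, 1]
        return Model(inp, out)

    g_AB = build_generator()   # generator A: apple -> orange
    g_BA = build_generator()   # generator B: orange -> apple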

Build the Discriminator

The discriminator network is a PatchGAN, very similar to the one used in the code for image-to-image translation with conditional GAN. Two discriminators will be used: one discriminates between orange images and the images generated by generator A, and the other discriminates between apple images and the images generated by generator B.

This PatchGAN is nothing but a convolutional network. The difference from a normal convolutional classifier is that instead of producing a single scalar output, it generates an NxN array, where each element of the array corresponds to a patch of the input image. The patch scores are then averaged to classify the whole image as real or fake.
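
A hedged Keras sketch of this PatchGAN discriminator could look like the following; the filter counts and the resulting 16×16 patch output are illustrative choices:

    from tensorflow.keras import layers, Model

    def build_discriminator():
        inp = layers.Input(shape=(256, 256, 3))
        x = inp
        for filters in (64, 128, 256, 512):
            # Conv -> BatchNorm -> LeakyReLU blocks that progressively downsample
            x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
            if filters != 64:                      # no normalization on the first block
                x = layers.BatchNormalization()(x)
            x = layers.LeakyReLU(0.2)(x)
        # Final 1-channel convolution: each output element scores one input patch
        out = layers.Conv2D(1, 4, padding="same")(x)   # output shape (16, 16, 1)
        return Model(inp, out)

    d_A = build_discriminator()   # real vs. fake apple images
    d_B = build_discriminator()   # real vs. fake orange images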

Combined Network

Now we will create a combined network to train the generator model. Here both discriminators will be non-trainable. To train the generator network we will also use cycle consistency loss and identity loss.

Cycle consistency says that if we translate an English sentence to a French sentence and then translate it back to English, we should arrive at the original sentence. To calculate the cycle consistency loss, first pass input image A to generator A and then pass the predicted output to generator B. Now calculate the loss between the image reconstructed by generator B and the original input image A. The same procedure applies when taking image B as the input to generator B.

In the case of identity loss, consider generator A, which takes images from domain A and tries to generate images that look like they belong to domain B. The identity loss ensures that if we instead pass an image from domain B to generator A, it should return an image from domain B, essentially unchanged. Here is the code for the combined model.
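
Below is a sketch of such a combined model, wiring together the adversarial, cycle consistency and identity terms; the loss weights (10 for the cycle terms) and the Adam settings are common defaults rather than values taken from the original post:

    from tensorflow.keras import layers, Model
    from tensorflow.keras.optimizers import Adam

    def build_combined(g_AB, g_BA, d_A, d_B):
        # Freeze the discriminators while training the generators
        d_A.trainable = False
        d_B.trainable = False

        img_A = layers.Input(shape=(256, 256, 3))   # apple
        img_B = layers.Input(shape=(256, 256, 3))   # orange

        fake_B = g_AB(img_A)            # apple -> orange
        fake_A = g_BA(img_B)            # orange -> apple
        rec_A = g_BA(fake_B)            # cycle: apple -> orange -> apple
        rec_B = g_AB(fake_A)            # cycle: orange -> apple -> orange
        id_A = g_BA(img_A)              # identity: apple fed to the orange->apple generator
        id_B = g_AB(img_B)              # identity: orange fed to the apple->orange generator

        valid_A = d_A(fake_A)           # adversarial outputs
        valid_B = d_B(fake_B)

        combined = Model([img_A, img_B],
                         [valid_A, valid_B, rec_A, rec_B, id_A, id_B])
        combined.compile(optimizer=Adam(2e-4, 0.5),
                         loss=["mse", "mse", "mae", "mae", "mae", "mae"],
                         loss_weights=[1, 1, 10, 10, 1, 1])
        return combined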

Loss, Optimizer and Compile the Models

Here we use MSE loss for the discriminator networks and MAE loss for the generator networks. The optimizer used is Adam. The batch size for the network is 1, and the total number of epochs is 200.

Train the Network

  1. Generate an image from generator A using an image from domain A; similarly, generate an image from generator B using an image from domain B.
  2. Train discriminator A on a batch, using images from domain A as real images and images generated by generator B as fake images.
  3. Train discriminator B on a batch, using images from domain B as real images and images generated by generator A as fake images.
  4. Train the generators on a batch using the combined model.
  5. Repeat steps 1 to 4 for every image in the training dataset, and then repeat this whole process for 200 epochs, as sketched below.
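
A sketch of this training loop, reusing the models and arrays from the earlier snippets (and assuming the discriminators are compiled with the MSE loss described above), might look like this:

    import numpy as np
    from tensorflow.keras.optimizers import Adam

    # Compile the discriminators, then build the combined model defined earlier.
    d_A.compile(optimizer=Adam(2e-4, 0.5), loss="mse")
    d_B.compile(optimizer=Adam(2e-4, 0.5), loss="mse")
    combined = build_combined(g_AB, g_BA, d_A, d_B)

    patch = (1, 16, 16, 1)             # PatchGAN output shape for batch size 1
    real = np.ones(patch)
    fake = np.zeros(patch)

    for epoch in range(200):
        for i in range(min(len(train_A), len(train_B))):
            img_A = train_A[i:i + 1]   # one apple image
            img_B = train_B[i:i + 1]   # one orange image

            # 1. Generate translated images with both generators.
            fake_B = g_AB.predict(img_A)   # apple -> orange
            fake_A = g_BA.predict(img_B)   # orange -> apple

            # 2. Train discriminator A on real apples vs. generated apples.
            d_A.train_on_batch(img_A, real)
            d_A.train_on_batch(fake_A, fake)

            # 3. Train discriminator B on real oranges vs. generated oranges.
            d_B.train_on_batch(img_B, real)
            d_B.train_on_batch(fake_B, fake)

            # 4. Train both generators through the combined model.
            combined.train_on_batch([img_A, img_B],
                                    [real, real, img_A, img_B, img_A, img_B])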

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Implementation of Image-to-image translation using conditional GAN

In the previous blog, we learned what image-to-image translation is and discussed how it can be performed using a conditional GAN. A conditional GAN is a type of generative adversarial network in which the discriminator and generator networks are conditioned on some sort of auxiliary information. In image-to-image translation using conditional GAN, we take an image as the auxiliary information, and with the help of this information the generator tries to generate a new image. Say we want to translate the edge image of a shoe into a realistic-looking image of a shoe; here we can condition our GAN on the edge image.

To know more about conditional GAN and its implementation from scratch, you can read these blogs:

  1. Conditional Generative Adversarial Networks (CGAN): Introduction and Implementation
  2. Image to Image Translation Using Conditional GAN

Next, in this blog, we will implement image-to-image translation from scratch using Keras functional API.

Dataset and Preprocessing

To implement an image-to-image translation model using conditional GAN, we need a paired dataset as shown in the below image.

The Center for Machine Perception (CMP) at the Czech Technical University in Prague provides a rich source of paired datasets for image-to-image translation, which we can use for our model. In this blog, we will use the edges-to-shoes dataset provided at this link. It consists of a training set of 49,825 images and a validation set of 200 images. The dataset comes partly preprocessed: each sample contains the edge map and the shoe photo side by side in a single image, as shown below:

These images have a size of (256, 512, 3), where 256 is the height, 512 is the width and the number of channels is 3. To split each sample into an input and an output image, we can simply slice it down the middle. After splitting, we also need to normalize the images: pixel values lie between 0 and 255, and to make training faster and reduce the chances of getting stuck in a local minimum, we normalize them to the range -1 to 1. Here is the code to preprocess the images.
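
A minimal sketch of this preprocessing is shown below; which half of the combined image holds the edge map depends on the dataset version, so the split used here is an assumption:

    import numpy as np

    def preprocess(combined_img):
        """combined_img: array of shape (256, 512, 3) with pixel values in [0, 255]."""
        combined_img = combined_img.astype(np.float32)
        edge_img = combined_img[:, :256, :]   # left half: edge map (assumed layout)
        shoe_img = combined_img[:, 256:, :]   # right half: real shoe photo
        # Normalize both halves to the range [-1, 1]
        edge_img = (edge_img / 127.5) - 1.0
        shoe_img = (shoe_img / 127.5) - 1.0
        return edge_img, shoe_img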

In the preprocessing step we have only used normalization. We could also apply random jittering and random mirroring, as mentioned in the paper. To perform random jittering, upscale the image to 286×286 and then randomly crop it back to 256×256. To perform random mirroring, flip the image horizontally.

Generator Network

The generator network for this conditional GAN architecture is a modified U-Net: an encoder-decoder network with skip connections between the encoder and the decoder. Each encoder block consists of three layers (Conv -> BatchNorm -> LeakyReLU), with downsampling performed by strided convolutions. Each decoder block consists of four layers (Transposed Conv -> BatchNorm -> Dropout -> ReLU); dropout is only applied in the first three decoder blocks. The input shape for the network is (256, 256, 3), and the output is a generated image of the same shape.

Normally, the input to a generator in a generative adversarial network is a noise vector. Here we will use a combination of a noise vector and the edge image as input to the generator: we take a noise vector of size 100, pass it through a dense layer, and reshape it so that it can be concatenated with the image input. Here is the code for the generator network. The model looks a little lengthy, but don’t worry, these are just repeated U-Net blocks for the encoder and decoder.

Discriminator Network

Here the discriminator is a PatchGAN: a convolutional network in which the input image is mapped to an NxN array instead of a single scalar value. For this conditional GAN, the discriminator takes two inputs, the edge image and the shoe image, both of shape (256, 256, 3). The output shape of the network is (30, 30, 1), where each element of the 30×30 output classifies a 70×70 patch of the input image.

Each block in the discriminator consists of 3 layers (Conv -> BatchNorm -> LeakyReLU). I have used a Gaussian blurring layer to reduce the dominance of the discriminator during training. Here is the full code.

Combined Network

Now we will create a combined network to train the generator model. This network takes the noise vector and the edge image as input and generates a new image using the generator network. The output from the generator, together with the edge image, is then fed to the discriminator network to get the output; here the discriminator is non-trainable. Here is the network code.

Training

I have used binary cross-entropy loss for the discriminator network. For the generator network, I have coupled the binary cross-entropy loss with MAE (L1) loss, because for image-to-image translation the generator’s duty is not only to fool the discriminator but also to generate realistic-looking images. I have used the Adam optimizer for both the generator and the discriminator; the only difference is that I have kept a lower learning rate for the discriminator to make it less dominant during training. I have used a batch size of 1. Here are the steps to train this conditional GAN.

  1. Train the discriminator model on real output images with patch labels of value 1.
  2. Train the discriminator model on images generated by the generator with patch labels of value 0.
  3. Train the generator network using the combined model.
  4. Repeat steps 1 to 3 for each image in the training dataset, and then repeat all of this for some number of epochs.

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Cycle-Consistent Generative Adversarial Networks (CycleGAN)

In this blog, we will learn how to perform image-to-image translation using CycleGAN. Image-to-image translation is a type of computer vision problem in which an image is transformed from one domain to another domain, say edges to a photo.

Image-to-image translation generally requires a paired set of images to train a model. We have seen this type of translation using conditional GANs, where a paired set of images is required. Take a look at a paired set of images for translating edges to a photo:

But in many cases, collecting a paired set of training data is quite difficult. Say we want an object transfiguration model that translates an image of a horse into an image of a zebra and vice versa.

For these types of tasks, even the desired output is not well defined, so how can we collect a paired set of images? To solve this problem, the authors proposed an approach called CycleGAN, which transfers an image from the X domain to the Y domain without a paired set of examples.

Cycle Consistent GAN

A CycleGAN captures the special characteristics of one image domain and figures out how these characteristics could be translated to another image domain, all without paired training examples. Let’s look at an unpaired training dataset.

Problem with these translations: in the case of paired training examples, the network is supervised by corresponding label images. But with an unpaired training dataset, we can only supervise at the set level, where the sets are the X domain and the Y domain. To train such a network, we need to find a mapping G: X → Y such that outputs from G(X) are indistinguishable from the Y domain. The number of possible mappings G is infinite, which does not guarantee meaningful input and output image pairs. Sometimes this type of network also suffers from mode collapse, which occurs when all input images map to the same output image.

Cycle consistent: To cope with the problem stated above, the authors of the paper proposed that the translation should be “cycle consistent”. For example, if we translate an English sentence to a French sentence and then translate it back to English, we should arrive at the original sentence. Similarly, for images, if we translate an image from the X domain to the Y domain using a mapping G and then translate G(X) back to X using a mapping F, we should arrive back at the original image.

So CycleGAN consists of two GAN networks, each with a generator and a discriminator. To train the network, it uses two adversarial losses and one cycle consistency loss. Let’s look at its mathematical formulation.

Mathematical Formulation of CycleGAN

Say we have two image domains X and Y. Our model includes two mappings G: X → Y and F: Y → X, along with two adversarial discriminators DX and DY. DX discriminates between F(Y) and images from the X domain; similarly, DY discriminates between G(X) and images from the Y domain. We also use a cycle consistency loss to prevent the learned mappings G and F from contradicting each other.

In figure (a) above, you can see the two different mappings G and F. Figures (b) and (c) show the forward cycle consistency loss ( x → G(x) → F(G(x)) ≈ x ) and the backward cycle consistency loss ( y → F(y) → G(F(y)) ≈ y ) respectively.

Network Architecture

There are separate architectures for the generator and the discriminator networks.

The generator network follows an encoder-decoder architecture with three main parts:

  1. Encoder
  2. Transformer
  3. Decoder

The encoder consists of three convolutional layers. An input image is passed through this encoder network and a feature volume is produced as output. The transformer consists of 6 residual blocks; it takes the feature volume from the encoder as input and gives a transformed feature volume as output. Finally, the decoder, which works as a set of deconvolutional layers, takes the output of the transformer and generates a new image.

The discriminator network is a simple network. It takes an image as input and predicts whether it belongs to the real dataset or is a fake generated image.


This discriminator network is basically a PatchGAN. A PatchGAN is a simple convolutional network; the only difference is that instead of mapping the input image to a single scalar output, it maps the input image to an NxN array. Each element of the NxN output corresponds to a patch in the input image; in CycleGAN, each element maps to a 70×70 patch of the image. Finally, we take the mean of this output and optimize it to classify the image as real or fake. The advantages of using a PatchGAN over a normal GAN discriminator are that it has fewer parameters and that it can work with arbitrarily sized images.

Loss Function

Adversarial losses are applied to both mappings G and F, with discriminators DX and DY. These adversarial losses make sure that the model is trained to generate data indistinguishable from real data for both image domains.

Adversarial losses alone cannot guarantee that the learned functions map an individual input x to the desired output y. Thus we also need the cycle consistency loss, which makes sure that the image translation cycle brings x back to the original image, i.e., x → G(x) → F(G(x)) ≈ x. The full loss can now be written as follows:

L(G, F, DX, DY) = LGAN(G, DY, X, Y) + LGAN(F, DX, Y, X) + λ Lcyc(G, F)

The first two terms in the loss function are the adversarial losses for the two mappings, and the last term is the cycle consistency loss. λ defines the relative importance of the cycle consistency loss; the authors originally set it to 10.

CycleGAN has produced compelling results in many cases but it also has some limitations. That’s all for CycleGAN introduction. In the next blog we will implement this algorithm in keras.

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

Image to Image Translation Using Conditional GAN

The image-to-image translation is a well-known problem in the field of image processing, computer graphics, and computer vision. Some of the problems are converting labels to street scenes, labels to facades, black&white to a color photo, aerial images to maps, day to night and edges to photo. Take a look into these conversions:

Earlier, each of these tasks was performed separately. But with the help of convolutional neural networks (CNNs), the community has been taking big steps in this field. With CNNs, most of the work is automatic, as we train the model in an end-to-end fashion. However, we still need to define a loss function that captures the target we want. Most of us take the loss function lightly, but it is the most important thing to pay attention to when training deep learning models. For instance, if we take Euclidean distance as our loss function for image-to-image translation, it produces blurred images because it minimizes the loss by averaging all plausible outputs. Thus we need a meaningful loss function for each task, and designing one is always painful. This is where the generative adversarial network (GAN) comes in.

GANs learn a loss that tries to classify if the output image is real or fake, while simultaneously training a generative model to minimize this loss. Blurry images will not be tolerated since they look obviously fake. Because GANs learn a loss that adapts to the data, they can be applied to a multitude of tasks that traditionally would require very different kinds of loss functions.

Now, with the help of GANs, we can generate realistic-looking images. But in image-to-image translation, we do not just want to generate a realistic-looking image; the output image should also be a translation of the input image. To perform this type of task we need a conditional GAN, so you should understand it before moving forward (to know about conditional GAN in detail, you can follow this blog).

In image-to-image translation with conditional GAN, the generator is provided with both an input image and a noise vector. The generator then produces an image that is translated from the input image and is indistinguishable from the original data (the discriminator is fooled). To train this model we need paired training examples, as shown below:

Network Architecture

Here the network architecture consists of two models, generator and discriminator. First, take a look into the generator model.

Generally, a generator network in a GAN architecture takes a noise vector as input and generates an image as output. But here the input consists of both a noise vector and an image, so the network takes an image as input and produces an image as output. For these types of problems, an encoder-decoder model is generally used.

In an encoder-decoder network, the input is first down-sampled to a bottleneck layer and then up-sampled to generate an image again. In our image-to-image translation problem, the input and output differ in surface appearance but share the same underlying structure. So, to enrich this encoder-decoder network, low-level information is shared between the input and the output by adding skip connections, which forms a U-Net architecture as shown in the figure above.

Here the discriminator model is a PatchGAN. A PatchGAN is nothing but a convolutional network; the only difference is that instead of mapping an input image to a single scalar value, it maps it to an NxN array, where each individual element of the array corresponds to a patch of the input image. Finally, the patch scores are averaged to decide whether the full input image is real or fake.

Reason for using PatchGAN: the generator model is trained using the discriminator loss and also an L1 loss. It is well known that L1 losses produce blurry images: they fail to capture high frequencies in images, although in many cases they capture low frequencies well. The task of the discriminator is therefore to capture only the high frequencies. By restricting the model’s attention to local image patches, the PatchGAN clearly helps in capturing high frequencies in the image.

Loss Function

Generally, the loss function for a conditional GAN can be stated as follows:
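
L_cGAN(G, D) = E_{x,y}[ log D(x, y) ] + E_{x,z}[ log(1 - D(x, G(x, z))) ]

where x is the input (edge) image, y is the target image and z is the noise vector, as in the pix2pix paper.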

Here the generator G tries to minimize this loss function whereas the discriminator D tries to maximize it. In the paper, the authors coupled it with an L1 loss so that the generator’s task is not only to fool the discriminator but also to generate images close to the ground truth. So the final loss function would be:
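
G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G),   where   L_L1(G) = E_{x,y,z}[ ||y - G(x, z)||_1 ]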

The paper suggests that this is a really promising approach for many image-to-image translation tasks, but it always requires a paired training dataset, which is sometimes difficult to get. That’s all for this blog; in the next blog we will implement its application (pix2pix) using Keras.

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: Image-to-Image Translation with Conditional Adversarial Networks

Implementation of Efficient and Accurate Scene Text Detector (EAST)

In the previous blog, we discussed the EAST algorithm, its architecture and its usage. In this blog, we will see how to implement EAST using its GitHub repository. We will do this implementation on a Linux system.

Clone the Repository

First, you need to clone the GitHub repository on your system and change your directory to the EAST folder using the following commands.
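
Assuming the widely used argman/EAST TensorFlow implementation (the repository name is an assumption here), the commands are:

    git clone https://github.com/argman/EAST.git
    cd EAST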

Download Pretrained Checkpoints

Now, to test this EAST model, you first need to download the pretrained checkpoints trained on the ICDAR 2013 and ICDAR 2015 datasets. You can download the checkpoints from the following link:

Google Drive Link

Test the Model

After downloading pretrained checkpoints and cloning the GitHub repository, you are ready to test the model using the following command:
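
A typical invocation looks like the following; the directory paths are placeholders that you should replace with your own:

    python eval.py --test_data_path=/path/to/test_images/ \
                   --checkpoint_path=/path/to/east_checkpoints/ \
                   --output_dir=/path/to/output/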

In the above command, you need to specify some directory paths. First, you need to specify your test image dataset path as a “test_data_path” argument. Second, you need to specify your recently downloaded checkpoints path as a “checkpoint_path” argument. And lastly, you need to specify your output directory path as an “output_dir” argument.

Sometimes you may end up with common adaptor and lanms error as shown in the following figure.

To solve these errors, you just need to use the following links or you can just google them.

  1. can not compile lanms
  2. running eval.py; undefined symbol: _Py_ZeroStruct

Running EAST using WEB

We can also run a demo by using the run_demo_server.py file provided by the GitHub repository. We just need to run the following command:
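
The command is roughly the following; depending on the repository version, you may also need to pass the checkpoint path as an argument:

    python run_demo_server.py
    # some versions of the repository also accept a checkpoint path, e.g.:
    # python run_demo_server.py --checkpoint_path /path/to/east_checkpoints/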

As you can see, the demo server runs on the default port number 8769. Now you just need to open your web browser and go to the following URL:

http://localhost:8769/

Then upload an image and click on the submit button. After processing, you will see results something like this.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced GitHub Repository: EAST: An Efficient and Accurate Scene Text Detector

Efficient and Accurate Scene Text Detector (EAST)

Before the introduction of deep learning in the field of text detection, it was difficult for most text segmentation approaches to perform well on challenging scenarios. Conventional approaches use manually designed features, while deep learning methods learn effective features from training data. These conventional approaches are usually multi-staged, which results in somewhat lower overall performance. In this blog, we will learn about a deep learning-based algorithm (EAST) that detects text with a single neural network, eliminating the multi-stage approach.

Introduction

The EAST algorithm uses a single neural network to predict word- or line-level text. It can detect text of arbitrary orientation with quadrilateral shapes. In 2017, this algorithm outperformed state-of-the-art methods. It consists of a fully convolutional network followed by a non-max suppression (NMS) merging stage. The fully convolutional network is used to localize text in the image, and the NMS stage merges many imprecise detected text boxes into a single bounding box for each text region (word or line of text).

EAST Network Architecture

The EAST architecture was created with different sizes of word regions in mind: detecting large word regions requires features from the later stages of the neural network, while detecting small word regions requires low-level features from the initial stages. To achieve this, the authors combined three branches into a single neural network.

EAST

1. Feature Extractor Stem

This branch of the network is used to extract features from different layers of the network. The stem can be a convolutional network pretrained on the ImageNet dataset; the authors of EAST experimented with both PVANet and VGG16. In this blog, we will look at the EAST architecture with the VGG16 network only. Let’s see the architecture of the VGG16 model.

VGG16

For the stem of the architecture, outputs are taken from the VGG16 model after the pool2, pool3, pool4 and pool5 layers.

2. Feature Merging Branch

In this branch, the EAST network merges feature outputs from different layers of the VGG16 network. The input image is passed through the VGG16 model and the outputs from four different layers are taken. Merging these feature maps all at once would be computationally expensive, so EAST merges them gradually in a U-Net-like fashion (see the EAST architecture figure). First, the output after the pool5 layer is upsampled using a deconvolutional layer so that its size matches the output of the pool4 layer, and the two are merged into one layer. Then Conv 1×1 and Conv 3×3 layers are applied to fuse the information and produce the output of this merging stage.

Similarly, outputs from the other layers of the VGG16 model are concatenated, and finally a Conv 3×3 layer is applied to produce the final feature map before the output layer.

3. Output Layer

The output layer consists of a score map and a geometry map. The score map gives the probability of text being present in a region, while the geometry map defines the boundary of the text box. The geometry can be either a rotated box (RBOX) or a quadrangle (QUAD). The rotated box geometry consists of the distances to the four box boundaries together with a rotation angle, while the quadrangle geometry consists of the coordinates of all four corners of the text box.

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Loss Function

The loss function used in the EAST algorithm combines a score map loss and a geometry loss:
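
L = L_s + λ_g L_g

where L_s is the score map loss, L_g is the geometry loss and λ_g weights the geometry term.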

As you can see in the above formula, the two losses are combined with a weight λ_g, which controls the importance given to the geometry loss. In the EAST paper, the authors set it to 1.

Non-max Suppression Merging Stage

The geometries predicted by the fully convolutional network are first filtered by a threshold value. The remaining geometries are then suppressed using a locality-aware NMS. A naive NMS runs in O(n²); to approach O(n), the authors adopted a row-by-row suppression method that iteratively merges each geometry with the last merged one. This makes the algorithm fast in most cases, but the worst-case time complexity is still O(n²).

This was all about the Efficient and Accurate Scene Text algorithm. In the next blog, we will implement this algorithm using its GitHub Repository. Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: EAST: An Efficient and Accurate Scene Text Detector

Implementation of Connectionist Text Proposal Network (CTPN)

In the previous blog, we learned about the CTPN algorithm and its architecture in detail. In this blog, we will implement the algorithm using its GitHub repository to localize text in an image. We will use a Linux operating system for this.

Clone the Repository

Open a terminal window and clone the CTPN GitHub repository using the following command:
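
Assuming the commonly used eragonruan/text-detection-ctpn repository (the repository name "text-detection-ctpn" is referenced at the end of this post; the user name is an assumption), the command is:

    git clone https://github.com/eragonruan/text-detection-ctpn.git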

Build the Required Library

Non-max suppression (NMS) and bounding box (bbox) utilities are written in Cython, so we need to generate .so files for them so that the required code can be loaded as a library. We first need to change the current directory to “text-detection-ctpn/utils/bbox” using the following command:
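
    cd text-detection-ctpn/utils/bbox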

Now run the following commands to build the library.
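
Assuming the repository ships its usual make.sh helper script inside utils/bbox (this may differ between versions of the repository), the build step is:

    chmod +x make.sh
    ./make.sh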

These commands will generate nms.so and bbox.so in the current directory.

Test the model

Now we can test the CTPN model. To test the model, we first need to download the checkpoints, which are provided with the GitHub repository. You can download the checkpoints from Google Drive. Then follow these steps:

  1. Unzip the downloaded checkpoints.
  2. Place the unzipped folder “checkpoints_mlt” in the “text-detection-ctpn” directory.
  3. Put your test images in the /data/demo/ folder; the outputs will be generated in the /data/res folder.
  4. Your folder structure will look as follows.

Now run the following command from the terminal to test your input images; change your directory to “text-detection-ctpn” first.

Your outputs will be generated in the data/res folder. Some of the inputs and results are shown below.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: Detecting Text in Natural Image with Connectionist Text Proposal Network

Referenced GitHub Code: text-detection-ctpn

Connectionist Text Proposal Network (CTPN)

Nowadays, thousands of organizations worldwide rely on optical character recognition (OCR) systems to extract machine-readable text from printed paper documents. These OCR systems are widely used in various applications such as ID card reading, automatic data entry from documents, number plate recognition from vehicles, etc.

Text localization is an important aspect of building such OCR systems. In this blog, we will learn a deep learning algorithm to localize text in an image.

Introduction

The CTPN algorithm refers to the Connectionist Text Proposal Network. This name is given to the algorithm because it detects text lines in a sequence of fine-scale text proposals. If you are wondering what these fine-scale text proposals are, don’t worry, we will discuss them in detail later in this blog. CTPN is an end-to-end trainable deep learning model, and it is also really helpful in localizing extremely ambiguous text.

There are many problems associated with text localization in natural scene images, such as highly cluttered backgrounds, large variance in text patterns, occlusions, distortion and the orientation of the text.

Researchers have been working for many years to overcome these challenges. There are two basic approaches: the conventional approach and modern deep learning approaches, which include the CTPN algorithm.

The conventional approaches consist of a multi-stage pipeline and basically follow a bottom-up approach: they start with low-level character detection and then go through multiple stages such as non-text component filtering, text line construction and verification. These approaches rely heavily on every stage of the pipeline. With deep learning, we can replace these multi-stage pipelines with end-to-end trainable models.

Researchers have also tried to use object detection algorithms like Faster R-CNN to detect text in an image. But these object detection algorithms are difficult to apply to scene text detection because more accurate localization is required.

CTPN Algorithm

Now we will look into the CTPN algorithm in detail. First, we will see all the stages in the following CTPN network architecture and then see them in detail.

  1. First, the input image is passed through a pretrained VGG16 model (trained on the ImageNet dataset).
  2. The feature output from the last convolutional maps of the VGG16 model is taken.
  3. These outputs are passed through a 3×3 spatial window.
  4. The outputs of the 3×3 spatial window are then passed through a 256-D bi-directional Recurrent Neural Network (RNN).
  5. The recurrent output is fed to a 512-D fully connected layer.
  6. Finally comes the output layer, which consists of 3 different outputs: 2k vertical coordinates, 2k text/non-text scores and k side-refinement values.

VGG Network

CTPN uses a pretrained VGG16 model as shown above. The algorithm takes the output from the last convolutional maps, so the output feature size depends on the size of the input image. During training of the CTPN model, the parameters of the first two convolutional maps are kept fixed and the rest are trained.

3×3 Spatial Window and Recurrent Layer

A small 3×3 spatial window is slid over the outputs of the VGG network to extract useful features. Since textual data can be considered sequential data, it is beneficial to use a recurrent neural network here. After that, a fully connected layer is used to produce the output layer.

Output Layer

The first output consists of 2k vertical coordinates, where k is the number of anchor boxes. For every anchor box, the output contains the y-coordinate of the center of the box and the height of the box. These anchor boxes are fine-scale text proposals whose width is 16 pixels, as shown in the diagram.

A total of 10 anchor boxes are taken whose heights vary from 11 to 273 pixels.

The second output consists of 2k text/non-text scores. For each anchor box, the output layer contains text/non-text scores: one output classifies foreground versus background, and the other indicates whether the anchor is positive or negative. Whether an anchor is positive or negative is decided on the basis of its IoU overlap with the ground truth box.

The third output consists of k side-refinement values. In CTPN, the width of a fine-scale text proposal is fixed at 16 pixels, which can be problematic in cases where some side text proposals are discarded due to a low score. To handle this, the output layer also predicts side-refinement offsets along the x-axis.

Now you should have some intuition about the CTPN network. In the next blog, we will implement the CTPN algorithm from the GitHub code. Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: Detecting Text in Natural Image with Connectionist Text Proposal Network