
Connectionist Text Proposal Network (CTPN)

Nowadays, thousands of organizations worldwide rely on optical character recognition (OCR) systems to extract machine-readable text from printed paper documents. These OCR systems are widely used in applications such as ID card reading, automatic data entry from documents, vehicle number plate recognition, etc.

Text localization is an important aspect of building such OCR systems. In this blog, we will learn a deep learning algorithm to localize text in an image.

Introduction

CTPN stands for Connectionist Text Proposal Network. The algorithm gets this name because it detects text lines as a sequence of fine-scale text proposals. If you are wondering what these fine-scale text proposals are, don’t worry, we will discuss them in detail later in this blog. CTPN is an end-to-end trainable deep learning model, and it is also very helpful in localizing extremely ambiguous text.

There are many problems associated with text localization in natural scene images. Some of them are a highly cluttered background, large variance in text patterns, occlusion, distortion, and the orientation of the text.

Researchers have been working for many years to overcome these challenges. There are two basic families of approaches: conventional approaches and modern deep learning approaches, which include the CTPN algorithm.

The conventional approaches consist of a multi-stage pipeline and basically follow a bottom-up strategy: they start with low-level character detection and then go through multiple stages such as non-text component filtering, text line construction, and verification. These approaches rely heavily on every stage of the pipeline. With deep learning, however, we can collapse these multiple stages into a single end-to-end trainable model.

Researchers have also tried to use object detection algorithms like Faster R-CNN to detect text in an image, but these general object detectors are difficult to apply to scene text detection because text requires much more accurate localization.

CTPN Algorithm

Now we will look into the CTPN algorithm in detail. First, we will list all the stages of the CTPN network architecture and then look at each of them in detail.

  1. Firstly, the input image is passed through a pretrained VGG16 model (trained on the ImageNet dataset).
  2. The features from the last convolutional maps of the VGG16 model are taken.
  3. These outputs are passed through a 3×3 spatial window.
  4. The outputs of the 3×3 spatial window are then passed through a 256-D bidirectional Recurrent Neural Network (RNN).
  5. The recurrent output is then fed to a 512-D fully connected layer.
  6. Finally comes the output layer, which consists of 3 different outputs: 2k vertical coordinates, 2k text/non-text scores, and k side-refinement values.

VGG Network

CTPN uses the pretrained VGG16 model shown above. The algorithm takes the output of the last convolutional maps, so the output feature size depends on the size of the input image. During training of the CTPN model, the parameters of the first two convolutional layers are kept fixed and the rest are trained.

3×3 Spatial Window and Recurrent Layer

A small 3×3 spatial window is slid over the outputs of the VGG network to extract useful features. Since textual data is sequential in nature, it is beneficial to use a recurrent neural network. After that, a fully connected layer is used to produce the final output layer.

Output Layer

The first output consists of 2k vertical coordinates, where k is the number of anchor boxes. For every anchor box, the output contains the y-coordinate of the box center and the height of the box. These anchor boxes are the fine-scale text proposals whose width is 16 pixels, as shown in the diagram.

A total of 10 anchor boxes are taken whose heights vary from 11 to 273 pixels.

The second output consists of 2k text/non-text scores. For each anchor box, the output layer contains two scores: one for classifying between foreground and background, and another for whether the anchor is positive or negative. Whether an anchor is positive or negative is decided on the basis of its IoU overlap with the ground-truth box.

The third output consists of k side-refinement values. In CTPN, the width of each fine-scale text proposal is fixed at 16 pixels, which can be problematic in cases where some side text proposals are discarded due to a low score. So the output layer also predicts side-refinement offsets along the x-axis.
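Putting the pieces together, below is a minimal Keras sketch of this architecture. The fixed input size, layer names, and the choice of GRU cells are my assumptions; the 3×3 window, the 256-D bidirectional RNN, k = 10 anchors, and the three output heads follow the description above.

```python
from keras.layers import Input, Conv2D, TimeDistributed, Bidirectional, GRU
from keras.models import Model
from keras.applications import VGG16

k = 10  # anchor boxes per spatial position

# 1-2. Pretrained VGG16 backbone; take the last conv feature map (stride 16)
inputs = Input(shape=(608, 608, 3))                 # fixed size only for clarity
vgg = VGG16(weights='imagenet', include_top=False, input_tensor=inputs)
features = vgg.get_layer('block5_conv3').output     # (38, 38, 512)

# 3. 3x3 spatial window slid over the feature map
x = Conv2D(512, (3, 3), padding='same', activation='relu')(features)

# 4. 256-D bidirectional RNN applied row by row: every row of the feature map
#    is treated as a sequence of column vectors (one per 16-pixel-wide strip)
x = TimeDistributed(Bidirectional(GRU(128, return_sequences=True)))(x)

# 5. 512-D fully connected layer (a 1x1 convolution acts per position)
x = Conv2D(512, (1, 1), activation='relu')(x)

# 6. Output heads: 2k vertical coordinates, 2k text/non-text scores, k side refinements
vertical_coords = Conv2D(2 * k, (1, 1), name='vertical_coords')(x)
text_scores     = Conv2D(2 * k, (1, 1), name='text_scores')(x)
side_refinement = Conv2D(k, (1, 1), name='side_refinement')(x)

model = Model(inputs, [vertical_coords, text_scores, side_refinement])
model.summary()
```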

By now, you should have a good feel for the CTPN network. In the next blog, we will implement the CTPN algorithm from the GitHub code. Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: Detecting Text in Natural Image with Connectionist Text Proposal Network

Creating a Deep Convolutional Generative Adversarial Networks (DCGAN)

In this tutorial, we will learn how to generate images of handwritten digits using a deep convolutional generative adversarial network (DCGAN).

What are GANs?

GANs are one of the most interesting ideas in deep learning today. In a GAN, two networks work adversarially. One is the generator network, which tries to generate new images that look similar to the original image dataset. The other is the discriminator network, which discriminates between real images (images from the dataset) and fake images (images generated by the generator network).

During training, the generator progressively becomes better at generating images that cannot be distinguished from real images, while the discriminator becomes more accurate at telling them apart. Training is complete when the discriminator can no longer distinguish between images generated by the generator and real images.

I would recommend going through this blog to learn more about generative adversarial networks. Now we will implement a deep convolutional generative adversarial network using the MNIST handwritten digits dataset.

Import All Libraries

Initialization

Generator Network

The generator network takes random noise as input and generates meaningful images that look similar to real images. The input is a noise vector of size 100. The output images have a shape of (28, 28, 1), which is the same as the image shape in the MNIST dataset.

In the generator network we use deconvolutional (transposed convolution) layers to upsample the input to the image size. While convolutional layers try to extract useful features, deconvolutional layers try to add interesting features to upsample an image. To know more about deconvolution you can read this blog. I have also added batch normalization layers to improve the quality of the model and stabilize the training process. For this network, I have used cross-entropy loss and the Adam optimizer. Here is the code.
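A minimal sketch of such a generator is shown below (the number of filters, kernel sizes, and the 7×7 starting feature map are my assumptions; the 100-D noise input, deconvolutional upsampling, batch normalization, cross-entropy loss, and Adam optimizer follow the description above).

```python
from keras.layers import Input, Dense, Reshape, Conv2DTranspose, BatchNormalization, LeakyReLU
from keras.models import Model
from keras.optimizers import Adam

def build_generator():
    noise = Input(shape=(100,))
    # project the 100-D noise vector to a small 7x7 feature map
    x = Dense(128 * 7 * 7)(noise)
    x = LeakyReLU(0.2)(x)
    x = Reshape((7, 7, 128))(x)
    # two deconvolutional (transposed convolution) layers upsample 7x7 -> 28x28
    x = Conv2DTranspose(64, (4, 4), strides=2, padding='same')(x)
    x = BatchNormalization()(x)
    x = LeakyReLU(0.2)(x)
    x = Conv2DTranspose(32, (4, 4), strides=2, padding='same')(x)
    x = BatchNormalization()(x)
    x = LeakyReLU(0.2)(x)
    # output image of shape (28, 28, 1), scaled to [-1, 1] to match the inputs
    img = Conv2DTranspose(1, (3, 3), padding='same', activation='tanh')(x)

    model = Model(noise, img)
    model.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5))
    return model

generator = build_generator()
```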

Discriminator Network

The discriminator network discriminates between real and fake images, so it is a binary classification network. This network consists of:

  1. an input layer of shape (28, 28, 1),
  2. three hidden layers with 16, 32 and 64 filters, and
  3. an output layer of shape 1.

I have also used a batch normalization layer after every conv layer to stabilize the network. To downsample, I have used average pooling instead of max pooling. Finally, I compiled the model with cross-entropy loss and the Adam optimizer. Here is the code.
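A minimal sketch of this discriminator, reusing the imports from the generator sketch above (kernel sizes and the LeakyReLU activations are my assumptions; the 16/32/64 filters, batch normalization, average pooling, cross-entropy loss, and Adam optimizer follow the description above):

```python
from keras.layers import Conv2D, AveragePooling2D, Flatten

def build_discriminator():
    img = Input(shape=(28, 28, 1))
    x = img
    # three conv blocks with 16, 32 and 64 filters; batch norm after every conv
    # layer and average pooling (instead of max pooling) for downsampling
    for filters in (16, 32, 64):
        x = Conv2D(filters, (3, 3), padding='same')(x)
        x = BatchNormalization()(x)
        x = LeakyReLU(0.2)(x)
        x = AveragePooling2D()(x)
    x = Flatten()(x)
    # single sigmoid output: real (1) vs fake (0)
    validity = Dense(1, activation='sigmoid')(x)

    model = Model(img, validity)
    model.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5),
                  metrics=['accuracy'])
    return model

discriminator = build_discriminator()
```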

Combined Model

After creating the generator and discriminator networks, we need to create a combined model of both to train the generator. This combined model takes random noise as input, generates images with the generator, and predicts a label with the discriminator. The resulting gradients are used to train the generator network. In this model, we do not train the discriminator network. Here is the code.
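A sketch of such a combined model, reusing the generator and discriminator built above:

```python
def build_combined(generator, discriminator):
    # freeze the discriminator while the generator is being trained through it
    discriminator.trainable = False
    noise = Input(shape=(100,))
    img = generator(noise)               # noise -> generated image
    validity = discriminator(img)        # generated image -> real/fake prediction

    model = Model(noise, validity)
    model.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5))
    return model

combined = build_combined(generator, discriminator)
```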

Training of GAN model:

To train a GAN network we first normalize the inputs between -1 and 1. Then we train this model for a large number of iterations using the following steps.

  1. Take random images from the normalized MNIST dataset, equal in number to half the batch size, and train the discriminator network with label 1 (real images).
  2. Generate samples from the generator network, again equal to half the batch size, and train the discriminator network with them using label 0 (fake images).
  3. Generate random noise of size equal to the full batch size and train the generator network using the combined model.
  4. Repeat steps 1 to 3 for a number of iterations. Here I have used 30000 iterations. A sketch of such a training loop is shown below.
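A sketch of this training loop, assuming a batch size of 64 and the generator, discriminator, and combined models built above:

```python
import numpy as np
from keras.datasets import mnist

(x_train, _), (_, _) = mnist.load_data()
# normalize the images to [-1, 1] and add the channel dimension
x_train = (x_train.astype(np.float32) - 127.5) / 127.5
x_train = np.expand_dims(x_train, axis=-1)

batch_size, half_batch, iterations = 64, 32, 30000

for it in range(iterations):
    # step 1: train the discriminator on real images (label 1)
    idx = np.random.randint(0, x_train.shape[0], half_batch)
    discriminator.train_on_batch(x_train[idx], np.ones((half_batch, 1)))

    # step 2: train the discriminator on generated images (label 0)
    noise = np.random.normal(0, 1, (half_batch, 100))
    fake_imgs = generator.predict(noise)
    discriminator.train_on_batch(fake_imgs, np.zeros((half_batch, 1)))

    # step 3: train the generator through the combined model
    # (it is rewarded when the discriminator outputs "real")
    noise = np.random.normal(0, 1, (batch_size, 100))
    combined.train_on_batch(noise, np.ones((batch_size, 1)))
```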

Generating new images from the trained generator network

Now our model has been trained, so we can discard the discriminator network and use the generator network to generate new images. We take random noise as input and generate images from it. After generating the images, we need to rescale them to display the outputs. Here is the code.
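A sketch of this generation step (the 5×5 grid of samples and the use of matplotlib are my choices):

```python
import matplotlib.pyplot as plt

noise = np.random.normal(0, 1, (25, 100))
gen_imgs = generator.predict(noise)
gen_imgs = 0.5 * gen_imgs + 0.5            # rescale from [-1, 1] back to [0, 1]

fig, axs = plt.subplots(5, 5, figsize=(5, 5))
for i, ax in enumerate(axs.flat):
    ax.imshow(gen_imgs[i, :, :, 0], cmap='gray')
    ax.axis('off')
plt.show()
```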

So, this was the implementation of a DCGAN using the MNIST dataset. In the next blogs, we will learn about other GAN variants.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Conditional Generative Adversarial Networks (CGAN): Introduction and Implementation

Generative adversarial networks (GANs) are trained to generate new images that look similar to the original images. Let’s say we have trained a GAN on the MNIST digit dataset, which consists of the handwritten digits 0-9. If we now generate images from this trained GAN, it will randomly produce images that can be any digit between 0 and 9. But if we want to generate images of only one particular digit, that is difficult. One way would be to find a mapping between the random noise given as input to the generator and the images generated by the network, but with the variations in the random input noise, finding such a mapping is really hard. Here come conditional GANs.

A GAN becomes a conditional GAN if we train both the discriminator and the generator conditioned on some sort of auxiliary information. This information can be class labels, black-and-white images, or data from other modalities. In this blog, we will learn how to generate images from a conditional GAN (cGAN) conditioned on the class label.

After the introduction of conditional GANs in 2014, there has been a wide range of applications developed based on this network. Some of them are:

  1. Image-to-Image Translation: Using cGANs, there have been various implementations of image-to-image translation, such as translating day to night, black and white to color, sketches to color photographs, etc.


  2. Face Aging: Uses conditional GANs to generate face photographs with different ages, from younger to older.


  3. Text to Image: Inspired by the idea of conditional GANs, these models generate images from a text description of the image.


That’s enough for the introduction; now we will implement a conditional GAN to generate handwritten digits conditioned on class labels.

Here we will use the MNIST digits dataset to train this conditional GAN. The dataset consists of images of the digits 0-9 and their corresponding labels. Create a cgan.py file and insert the following code:
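A sketch of these imports, matching the description below (the exact set of layers imported on the first line is my assumption):

```python
from keras.layers import Input, Dense, Reshape, Flatten, Embedding, Concatenate, Conv2DTranspose, Conv2D, AveragePooling2D, BatchNormalization
from keras.models import Model
from keras.optimizers import Adam
from keras.datasets import mnist
import numpy as np
```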

Line 1 imports all the required layers from Keras. Lines 2 and 3 import the Model class and the optimizer respectively. Line 4 imports the MNIST dataset from Keras (if you haven’t used it before, it will download the data first). Line 5 imports the numpy package.

Now that we have imported all the necessary packages, we will create our cGAN architecture. To create this network, we will first create a class and initialize all the necessary variables.
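A sketch of such a class and its init function, following the description below (the particular hyperparameter values are my assumptions):

```python
class GAN():
    def __init__(self):
        # load the MNIST training and test data together with their labels
        (self.x_train, self.y_train), (self.x_test, self.y_test) = mnist.load_data()
        self.img_shape = (28, 28, 1)
        self.latent_dim = 100              # size of the random-noise input
        self.num_classes = 10              # digits 0-9 used as the condition
        self.batch_size = 64
        self.optimizer = Adam(0.0002, 0.5)
        self.generator_model = self.generator()
        self.discriminator_model = self.discriminator()
        self.combined_model = self.combined()
```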

In the above code, line 1 creates a class named GAN. Line 2 defines the init function, which initializes all the required variables. Line 4 loads the data, which consists of both training and test data along with their labels. Lines 5-9 initialize the hyperparameters required for the network. Lines 10-12 call the generator, discriminator, and combined-model functions, which we will define later in this class.

After initializing all the required variables we will next define the generator function of class GAN.

The generator takes two inputs: one is random noise of shape (100,) and the other is the class label of shape (1,), which is an integer between 0-9. This extra class-label input is our condition to the GAN. At test time we will use this class label as a condition to generate images for that specific class only.
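Below is a sketch of this generator method of the GAN class, reconstructed to follow the description that comes next, though not necessarily line for line (layer sizes not mentioned in the text are my assumptions):

```python
    def generator(self):
        # conditional input: the class label, embedded into a dense vector of size 50,
        # passed through a dense layer and reshaped for concatenation
        label = Input(shape=(1,))
        x1 = Embedding(self.num_classes, 50)(label)
        x1 = Dense(7 * 7 * 1)(x1)
        x1 = Reshape((7, 7, 1))(x1)

        # random-noise input, projected and reshaped to a 7x7 feature map
        noise = Input(shape=(self.latent_dim,))
        x2 = Dense(7 * 7 * 128)(noise)
        x2 = Reshape((7, 7, 128))(x2)

        # concatenate both inputs and apply batch normalization
        x = Concatenate()([x2, x1])
        x = BatchNormalization()(x)

        # two upsampling (deconvolutional) layers with batch normalization
        x = Conv2DTranspose(64, (4, 4), strides=2, padding='same', activation='relu')(x)
        x = BatchNormalization()(x)
        x = Conv2DTranspose(32, (4, 4), strides=2, padding='same', activation='relu')(x)

        # output layer with the shape of the real images (28, 28, 1)
        img = Conv2DTranspose(1, (3, 3), padding='same', activation='tanh')(x)

        model = Model([noise, label], img)
        model.compile(loss='binary_crossentropy', optimizer=self.optimizer)
        return model
```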

In the above code, lines 3-6 handle the class-label input. Here we have added an Embedding layer for this conditional input; its weights are trained along with the generator. The embedding layer converts positive integers into dense vectors of fixed size; here we have taken an embedding of size 50. After this embedding layer we add a dense layer and then reshape the result to make it compatible for concatenation with the random noise.

Lines 8-9 create an input layer for the random noise and reshape it. Lines 11 and 12 concatenate both inputs after reshaping and then apply batch normalization. Batch normalization is really helpful in improving the quality of the model and stabilizing the training process.

Lines 13-15 add two upsampling (deconvolutional) layers with batch normalization. Line 16 is the output layer, whose shape equals that of the real images (28, 28, 1). In line 17 we create the generator model, and line 18 compiles it with cross-entropy loss and the Adam optimizer.

The GAN class also contains a discriminator network, which is likewise conditioned on the class labels.
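A sketch of this discriminator method, again following the description below rather than reproducing the original listing line for line:

```python
    def discriminator(self):
        # conditional input: class label embedded and reshaped to the image size
        label = Input(shape=(1,))
        x1 = Embedding(self.num_classes, 50)(label)
        x1 = Dense(28 * 28 * 1)(x1)
        x1 = Reshape((28, 28, 1))(x1)

        # image input (either a real or a generated image)
        img = Input(shape=self.img_shape)
        x = Concatenate()([img, x1])

        # conv -> batch norm -> average pooling blocks with 16, 32 and 64 filters
        for filters in (16, 32, 64):
            x = Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
            x = BatchNormalization()(x)
            x = AveragePooling2D()(x)

        x = Flatten()(x)
        validity = Dense(1, activation='sigmoid')(x)    # real vs fake

        model = Model([img, label], validity)
        model.compile(loss='binary_crossentropy', optimizer=self.optimizer,
                      metrics=['accuracy'])
        return model
```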

In the above code, lines 3-6 convert the class-label input to an embedding, just as we saw in the generator network, except that it is reshaped to (28, 28, 1) instead of (7, 7, 1). Line 8 defines the second input layer, which is an image (either real or fake). Then in line 10 we concatenate both inputs to make them compatible with our discriminator network.

Lines 11-19 are basically a combination of conv layer -> batch norm layer -> average pooling layer. The convolution layers have 16, 32 and 64 filters. Here we have used average pooling layers instead of max pooling, as it is generally recommended not to use max pooling layers in GAN architectures.

Finally, in lines 20-21, we flatten the output of the previous layer and add a fully connected layer of size 1, which is treated as the output layer of our discriminator. This model discriminates between real and fake images. In lines 22-23, we create the discriminator model, which takes two inputs and produces one output, and compile it with cross-entropy loss and the Adam optimizer.

That was our discriminator model. Now we will create a combined model, consisting of both the discriminator and the generator, in order to train the generator network.
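A sketch of this combined-model method:

```python
    def combined(self):
        # freeze the discriminator: only the generator is trained through this model
        self.discriminator_model.trainable = False

        noise = Input(shape=(self.latent_dim,))
        label = Input(shape=(1,))
        img = self.generator_model([noise, label])
        validity = self.discriminator_model([img, label])

        model = Model([noise, label], validity)
        model.compile(loss='binary_crossentropy', optimizer=self.optimizer)
        return model
```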

In the above code, we created a combined model which takes two inputs: random noise of shape (100,) and the class label of shape (1,). The generator model takes these two inputs and generates a new image, which is then fed to the discriminator model to predict the output. Here, only the generator is trained and the discriminator is made non-trainable.

Next, we will train the whole GAN using these networks.
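A sketch of such a training method, following the description below (the batch size and iteration count are my assumptions):

```python
    def train(self, iterations=20000, batch_size=64):
        # normalize the images to [-1, 1] and reshape them to (28, 28, 1)
        x_train = (self.x_train.astype(np.float32) - 127.5) / 127.5
        x_train = x_train.reshape(-1, 28, 28, 1)
        y_train = self.y_train.reshape(-1, 1)
        half_batch = batch_size // 2

        for it in range(iterations):
            # real images conditioned on their real class labels -> label 1
            idx = np.random.randint(0, x_train.shape[0], half_batch)
            self.discriminator_model.train_on_batch([x_train[idx], y_train[idx]],
                                                    np.ones((half_batch, 1)))

            # fake images from random noise and random class labels -> label 0
            random_labels = np.random.randint(0, self.num_classes, (half_batch, 1))
            noise = np.random.normal(0, 1, (half_batch, self.latent_dim))
            fake_imgs = self.generator_model.predict([noise, random_labels])
            self.discriminator_model.train_on_batch([fake_imgs, random_labels],
                                                    np.zeros((half_batch, 1)))

            # train the generator through the combined model
            noise = np.random.normal(0, 1, (batch_size, self.latent_dim))
            random_labels = np.random.randint(0, self.num_classes, (batch_size, 1))
            self.combined_model.train_on_batch([noise, random_labels],
                                               np.ones((batch_size, 1)))
```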

In the above code, in lines 3-4 we first normalize the input images to the range -1 to 1 and then reshape them to (28, 28, 1). In lines 9-11 we randomly select real images and their corresponding labels, equal in number to half the batch size. In line 13, we train the discriminator network on these real images conditioned on their real class labels.

Then in line 15 we select random labels between 0-9, half the batch size in number, as input to the generator, because during training we do not have class labels for the random noise fed to the generator. In lines 16-17 we take random noise of shape (half_batch_size, 100) and generate images from the generator network, which serve as fake input images for the discriminator. In line 19 we train the discriminator network with these fake generated images, conditioned on the random class labels.

Finally, in lines 21-22, we train our generator network using the combined model. Here we take random noise and random class labels as input to the combined model.

We train this network for a number of iterations until the generator learns to fool the discriminator network. Finally, after training, we can discard the discriminator and use the generator network to generate new images conditioned on class labels.
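As an illustration, a test along these lines might look like the following sketch (generating one digit per class label; the plotting code is my choice):

```python
import matplotlib.pyplot as plt

gan = GAN()
gan.train()

# generate one image for each digit 0-9 from the trained, conditioned generator
noise = np.random.normal(0, 1, (10, gan.latent_dim))
labels = np.arange(10).reshape(-1, 1)
gen_imgs = gan.generator_model.predict([noise, labels])
gen_imgs = 0.5 * gen_imgs + 0.5                     # rescale to [0, 1] for display

fig, axs = plt.subplots(1, 10, figsize=(10, 1.5))
for digit, ax in enumerate(axs):
    ax.imshow(gen_imgs[digit, :, :, 0], cmap='gray')
    ax.set_title(str(digit))
    ax.axis('off')
plt.show()
```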

The above code is used to test our trained cGAN. Here are the outputs generated from the network.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.


Implementation of GANs to Generate Handwritten Digits

In the previous blog, we studied GANs. Now, in this blog, we will implement a GAN to generate images of MNIST digits.

In generative adversarial networks, the generator and the discriminator are trained simultaneously, and either network can overpower the other if training is not balanced. If the discriminator is trained too much, it will easily separate fake from real images and the generator will not be able to learn to generate real-looking images. And if the generator is trained too heavily, the discriminator will not be able to classify between real and fake images. We can address this problem by properly setting the learning rates of both networks.

When we train the discriminator we do not train the generator, and when we train the generator we do not train the discriminator. This allows the generator to train properly. Now, let’s look at the code for each part of the GAN network.

Discriminator Network:

We are using the MNIST digits dataset, in which each image has a shape of (28, 28, 1). Since the image size is small, we can use an MLP network for the discriminator instead of convolutional layers. To do this, we first need to reshape the input into a single vector of size 784. Then I have applied three dense layers with 512, 256 and 128 hidden units respectively.
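A sketch of this MLP discriminator (the LeakyReLU activations, sigmoid output, and Adam learning rate are my assumptions; the 784-vector input and the 512/256/128 dense layers follow the description above):

```python
from keras.layers import Input, Dense, Reshape, Flatten, BatchNormalization, LeakyReLU
from keras.models import Model
from keras.optimizers import Adam

def build_discriminator():
    img = Input(shape=(28, 28, 1))
    x = Flatten()(img)                       # reshape the image into a 784-vector
    # three dense layers with 512, 256 and 128 hidden units
    for units in (512, 256, 128):
        x = Dense(units)(x)
        x = LeakyReLU(0.2)(x)
    validity = Dense(1, activation='sigmoid')(x)   # real vs fake

    model = Model(img, validity)
    model.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5),
                  metrics=['accuracy'])
    return model

discriminator = build_discriminator()
```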

Generator Network:

To create the generator network, we first take a random noise vector of size 100 as input. Then I have used three hidden layers with 256, 512 and 1024 units. The output of the generator network is then reshaped to (28, 28, 1). I have used batch normalization in each hidden layer; batch normalization improves the quality of the trained model and also stabilizes the training process.
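A sketch of this generator, reusing the imports from the discriminator sketch above (the tanh output is my assumption; the 100-D noise input, the 256/512/1024 hidden layers, batch normalization, and the (28, 28, 1) output follow the description above):

```python
def build_generator():
    noise = Input(shape=(100,))
    x = noise
    # three hidden layers with 256, 512 and 1024 units, each with batch normalization
    for units in (256, 512, 1024):
        x = Dense(units)(x)
        x = LeakyReLU(0.2)(x)
        x = BatchNormalization()(x)
    # 784 outputs reshaped to the MNIST image shape, scaled to [-1, 1]
    x = Dense(28 * 28 * 1, activation='tanh')(x)
    img = Reshape((28, 28, 1))(x)

    model = Model(noise, img)
    model.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5))
    return model

generator = build_generator()
```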

Combined Model:

To train the generator, we need to create a combined model in which the discriminator model is not trained. In the combined model, random noise is given as input to the generator network and the output image is then passed through the discriminator network to get the label. Here I have flagged the discriminator model as non-trainable.
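A sketch of this combined model:

```python
def build_combined(generator, discriminator):
    discriminator.trainable = False          # the discriminator is not trained here
    noise = Input(shape=(100,))
    validity = discriminator(generator(noise))
    model = Model(noise, validity)
    model.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5))
    return model

combined = build_combined(generator, discriminator)
```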

Training the GAN network:

Training a GAN network requires careful hyper-parameter tuning. If the model is not trained carefully, it will not converge to produce good results. We will use the following steps to train this GAN network (a sketch of the loop is shown after the list):

  1. Firstly, we will normalize the input dataset (MNIST images).
  2. Train the discriminator with real images (from the MNIST dataset).
  3. Sample the same number of noise vectors and predict their outputs from the generator network (the generator is not trained here).
  4. Train the discriminator network with the images generated in the previous step.
  5. Take new random noise samples and train the generator with the combined model, without training the discriminator.
  6. Repeat steps 2-5 for some number of iterations. I have trained it for 30000 iterations.
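A sketch of this training loop, assuming a batch size of 64 and the generator, discriminator, and combined models sketched above:

```python
import numpy as np
from keras.datasets import mnist

(x_train, _), (_, _) = mnist.load_data()
x_train = (x_train.astype(np.float32) - 127.5) / 127.5    # step 1: normalize to [-1, 1]
x_train = x_train.reshape(-1, 28, 28, 1)

batch_size, half_batch = 64, 32
for it in range(30000):
    # step 2: train the discriminator on real MNIST images
    idx = np.random.randint(0, x_train.shape[0], half_batch)
    discriminator.train_on_batch(x_train[idx], np.ones((half_batch, 1)))

    # steps 3-4: sample noise, generate images, and train the discriminator on them
    noise = np.random.normal(0, 1, (half_batch, 100))
    fake_imgs = generator.predict(noise)
    discriminator.train_on_batch(fake_imgs, np.zeros((half_batch, 1)))

    # step 5: train the generator through the combined model
    noise = np.random.normal(0, 1, (batch_size, 100))
    combined.train_on_batch(noise, np.ones((batch_size, 1)))
```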

Take a look into the generated images from this GAN network.

Here is the full code.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

An Introduction to Generative Adversarial Networks (GANs)

Generative Adversarial Networks (GANs), as the name suggests, are deep learning models used to generate new data from a given dataset using an adversarial process. GANs were first introduced by Ian Goodfellow at NIPS 2014, and the idea has been described as the most interesting one in machine learning in the last 10 years. Generative models carry great promise because they can mimic any data distribution. They can be used to generate images, audio waveforms containing speech, music, etc.

Generative Adversarial Network Algorithm:

To create a GAN, we train two networks simultaneously in an adversarial manner: the generator and the discriminator. The adversarial part is that while the generator tries to generate data similar to the original data distribution, the discriminator tries to discriminate between data generated by the generator and the original data. The generator tries to fool the discriminator by improving itself, and the discriminator tries to differentiate between original and fake data. This training continues until the discriminator is fooled about half the time and the generator is able to generate data similar to the original data distribution.

Let’s consider an example of generating new images using a GAN. The first network, the discriminator, is D(X), where X is an image (either real or fake). The second network, the generator, is G(Z), where Z is random noise. To train these networks, D is first fed real images and trained to produce values close to 1 (real), and then fed fake images (generated by the generator) and trained to produce values close to 0 (fake). The generator, in turn, is trained using the loss the discriminator produces on the generated images.

We train D to maximize the probability of assigning the correct label to both training examples and samples from G. We simultaneously train G to minimize log(1 − D(G(z))). Let’s take a look at the algorithm provided in the GAN paper.
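For reference, this two-player minimax game can be written as the value function from the paper:

min_G max_D V(D, G) = E_{x ~ p_data(x)}[ log D(x) ] + E_{z ~ p_z(z)}[ log(1 − D(G(z))) ]

where p_data is the distribution of the real data and p_z is the prior on the input noise z.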

We train this network for a number of iterations, until the generator produces images close to those in the training dataset.

Generative Adversarial Networks (GANs) Vs Variational Autoencoders (VAEs)

There are other generative models, such as variational autoencoders (VAEs), that can do a similar job to GANs. A VAE maps the input to a low-dimensional latent space and then creates a probability distribution from which new outputs are generated using a decoder function (to know more about VAEs you can follow this blog).

VAE Model

Vanilla GANs, in contrast, do not map the input to a latent space; instead they use random noise to generate new data. GANs are usually difficult to train but generate finer, more detailed images, while VAEs are easier to train but produce blurrier images.

This was a brief introduction to generative adversarial networks. In the following posts, we will implement different GAN architectures, train GAN networks, and learn more about GAN improvements and variants (CycleGAN, InfoGAN, BigGAN, etc.).

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Implementing semi-supervised Learning using GANs

Semi-supervised learning aims to make use of a large amount of unlabelled data to boost the performance of a model that has only a small amount of labeled data. These types of models can be very useful when collecting labeled data is cumbersome and expensive. Several semi-supervised deep learning models have performed quite well on standard benchmarks. In this blog, we will learn how GANs can help in semi-supervised learning.

If you are new to GANs, you should first read this blog: An Introduction to Generative Adversarial Networks. Generally in GANs, we train two networks adversarially, the generator and the discriminator. After training the GAN, we discard the discriminator and only use the generator network to generate new data. In the semi-supervised model, however, after training the network we discard the generator model and use the discriminator model, but here the discriminator is designed differently.

In a semi-supervised GAN (SGAN), the discriminator is trained not only to discriminate between real and fake data but also to predict the label of the input image. Let’s take the MNIST dataset as an example. The MNIST dataset contains handwritten digits from 0-9, a total of 10 classes. In a semi-supervised GAN for MNIST digits, the discriminator is trained both to tell real from fake images and to predict these 10 classes.

So in an SGAN, the discriminator is trained with these three types of data:

  1. Fake images generated by the generator network.
  2. Real images from the dataset without any labels (a large amount of unlabeled data).
  3. Real images from the dataset with labels (a small amount of labeled data).

The generator in an SGAN is trained in the same way as in a vanilla GAN. This type of training allows the model to learn useful features from the unlabeled data and use those features to train a supervised discriminator that predicts the label of the input image.

Implementing Semi-Supervised GAN

Now we will implement a semi-supervised GAN using the MNIST digits dataset. If you want to implement a simple GAN first, you can follow this blog: Implementation of GANs to Generate Handwritten Digits.

The MNIST digits dataset consists of 60000 training images, of which we will use only 1000 as labeled images and the rest as unlabeled images. We will select 1000 labeled images at random, 100 images for each class. Let’s see the code for this:
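A sketch of this data preparation (the normalization to [-1, 1] is my assumption, consistent with the earlier GAN posts):

```python
import numpy as np
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = (x_train.astype(np.float32) - 127.5) / 127.5    # normalize to [-1, 1]
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = (x_test.astype(np.float32) - 127.5) / 127.5
x_test = x_test.reshape(-1, 28, 28, 1)

# pick 100 random labeled examples per class -> 1000 labeled images in total
labeled_idx = []
for digit in range(10):
    idx = np.where(y_train == digit)[0]
    labeled_idx.append(np.random.choice(idx, 100, replace=False))
labeled_idx = np.concatenate(labeled_idx)

labeled_x, labeled_y = x_train[labeled_idx], y_train[labeled_idx]
unlabeled_x = np.delete(x_train, labeled_idx, axis=0)      # the rest, used without labels
```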

Discriminator in SGAN

For this semi-supervised GAN model, we will create two discriminator models that share the weights of every layer but have different output layers. One model is a binary classifier (discriminating between real and fake images) and the other is a multi-class classifier (predicting the label of the input image). Let’s see the code for this:
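A sketch of these two discriminators built on a shared body (the 512/256/128 dense body mirrors the simple GAN post and is my assumption; only the two output layers differ):

```python
from keras.layers import Input, Dense, Flatten, Reshape, LeakyReLU
from keras.models import Model
from keras.optimizers import Adam

def build_discriminators(num_classes=10):
    img = Input(shape=(28, 28, 1))
    x = Flatten()(img)
    # shared body: every layer here is common to both discriminators
    for units in (512, 256, 128):
        x = Dense(units)(x)
        x = LeakyReLU(0.2)(x)

    # two different output layers on top of the same shared body
    binary_out = Dense(1, activation='sigmoid')(x)            # real vs fake
    class_out = Dense(num_classes, activation='softmax')(x)   # digits 0-9

    d_binary = Model(img, binary_out)
    d_binary.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5),
                     metrics=['accuracy'])

    d_class = Model(img, class_out)
    d_class.compile(loss='categorical_crossentropy', optimizer=Adam(0.0002, 0.5),
                    metrics=['accuracy'])
    return d_binary, d_class

d_binary, d_class = build_discriminators()
```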

Generator in SGAN

The generator in this SGAN is a simple multi-layer neural network with three hidden layers of 512, 256 and 128 units. The output layer has the shape of the original image, (28, 28, 1). The input to the generator is a random noise vector of size 100. Here is the code.
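A sketch of this generator, together with the combined model used to train it through the binary discriminator (the combined model itself is described in the training steps below):

```python
def build_generator():
    noise = Input(shape=(100,))
    x = noise
    # three hidden layers with 512, 256 and 128 units
    for units in (512, 256, 128):
        x = Dense(units)(x)
        x = LeakyReLU(0.2)(x)
    x = Dense(28 * 28 * 1, activation='tanh')(x)
    img = Reshape((28, 28, 1))(x)
    return Model(noise, img)

generator = build_generator()

# combined model: trains the generator through the frozen binary discriminator
d_binary.trainable = False
noise_in = Input(shape=(100,))
combined = Model(noise_in, d_binary(generator(noise_in)))
combined.compile(loss='binary_crossentropy', optimizer=Adam(0.0002, 0.5))
```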

Training the model

Training this model will consist of the following steps:

  1. Sample both labeled and unlabeled data from the MNIST dataset; also normalize the data and convert the labels to categorical form.
  2. Train the multi-class discriminator model with labeled real images (taking a batch from the images).
  3. Train the binary-class discriminator model with unlabeled real images (taking a batch from the images).
  4. Sample noise vectors of size 100 and train the binary-class discriminator model with fake images generated by the generator network.
  5. Sample noise vectors of size 100 and train the combined model, which trains the generator network.
  6. Repeat steps 2-5 for some number of iterations. I have trained it for 10000 iterations. A sketch of this loop is shown below.
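A sketch of this training loop, using the data, discriminators, generator, and combined model sketched above (the batch size is my assumption):

```python
from keras.utils import to_categorical

labeled_y_cat = to_categorical(labeled_y, 10)      # step 1: categorical labels
batch_size, half_batch = 64, 32

for it in range(10000):
    # step 2: multi-class discriminator on labeled real images
    idx = np.random.randint(0, labeled_x.shape[0], half_batch)
    d_class.train_on_batch(labeled_x[idx], labeled_y_cat[idx])

    # step 3: binary discriminator on unlabeled real images (label 1)
    idx = np.random.randint(0, unlabeled_x.shape[0], half_batch)
    d_binary.train_on_batch(unlabeled_x[idx], np.ones((half_batch, 1)))

    # step 4: binary discriminator on generated images (label 0)
    noise = np.random.normal(0, 1, (half_batch, 100))
    fake_imgs = generator.predict(noise)
    d_binary.train_on_batch(fake_imgs, np.zeros((half_batch, 1)))

    # step 5: train the generator through the combined model
    noise = np.random.normal(0, 1, (batch_size, 100))
    combined.train_on_batch(noise, np.ones((batch_size, 1)))
```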

In the above training steps, you can see that we train the multi-class discriminator and the binary-class discriminator in different steps, but they actually share the weights of the same network except for the output layer (as I mentioned earlier).

Also, the binary-class discriminator is trained twice in every iteration: once with real images taken from the dataset and once with fake images generated by the generator network. The multi-class discriminator is trained only once per iteration, with real labeled images only, because multi-class labels are not available for the generated images.

I have also tested the SGAN model on the 10000-image test set provided by MNIST after every 1000 iterations. Here is the result.

As you can see, I have trained this SGAN model with only 1000 labeled images and it gives an accuracy of about 94.8%, which is quite nice.

Give me the full code!

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Contour Tracing

In the previous blogs, we discussed various image segmentation methods that partition the image into sub-regions. Now, the next task is to represent and describe these regions in a form suitable for further image processing tasks such as pattern classification or recognition. One can represent these regions either in terms of their boundary (external features) or in terms of the pixels comprising the regions (internal features). So, in this blog, we will discuss one such representation, known as contours.

A contour, in simple terms, is a curve joining all the continuous points along the boundary that have some similar property, such as intensity. Once the contours are extracted, we can use them for shape analysis and various object detection and recognition tasks. So, let’s discuss different contour tracing (i.e. detecting the boundary of a region) algorithms. Some of the most common algorithms are described below.

Square Tracing algorithm

This was one of the first approaches to extracting contours and is quite simple. Suppose the background is black (0’s) and the object is white (1’s). Start iterating over the binary or segmented image row by row, from left to right, until you hit an object pixel. From there, whenever you are on a white pixel (i.e. 1), turn left; otherwise, turn right, and step forward. Here, left and right are relative to the direction in which you entered that pixel. The stopping condition is entering the starting pixel a second time in the same direction you entered it initially. This works best with 4-connected objects, as it only checks left and right and misses diagonal directions. A small sketch of this procedure is shown below.
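A small sketch of square tracing on a binary NumPy array (0 = background, 1 = object); the direction bookkeeping is my own:

```python
import numpy as np

def square_trace(img):
    """Square tracing on a binary image (0 = background, 1 = object)."""
    # starting pixel: first object pixel found scanning row by row, left to right
    start = tuple(np.argwhere(img == 1)[0])
    # directions: up, right, down, left as (row, col) steps
    dirs = [(-1, 0), (0, 1), (1, 0), (0, -1)]
    contour = [start]
    p, d = start, 1            # the start pixel was entered while moving right
    start_dir = d
    d = (d - 1) % 4            # object pixel -> turn left and step forward
    p = (p[0] + dirs[d][0], p[1] + dirs[d][1])
    # stop when the start pixel is entered again in the same direction as initially
    while not (p == start and d == start_dir):
        inside = (0 <= p[0] < img.shape[0] and 0 <= p[1] < img.shape[1]
                  and img[p] == 1)
        if inside:
            contour.append(p)
            d = (d - 1) % 4    # white pixel: turn left
        else:
            d = (d + 1) % 4    # black pixel: turn right
        p = (p[0] + dirs[d][0], p[1] + dirs[d][1])
    return contour

# tiny example: a 5x5 image with a 3x3 white square in the middle
img = np.zeros((5, 5), dtype=np.uint8)
img[1:4, 1:4] = 1
print(square_trace(img))
```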

Moore Boundary Tracing algorithm

Start iterating row by row, from left to right, until an object pixel is found. Then traverse the 8-neighbourhood of that object pixel in the clockwise direction, starting from the background pixel visited just before it, until the next object pixel is found, and repeat from there. The stopping criterion is the same as above. This removes the limitations of the previous method.

Radial Sweep

This is similar to the Moore algorithm. After performing the first step of the Moore algorithm, draw a line segment connecting the two object pixels found. Rotate this line segment in the clockwise direction until an object pixel is found in the 8-neighbourhood. Again draw the line segment and rotate. The stopping criterion is encountering the starting pixel a second time, with the same next pixel. For a demonstration, please refer to this.

These are some of the algorithms for contour tracing. In the next blog, we will discuss Suzuki’s algorithm, the one that OpenCV uses for finding and drawing contours. Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

References: Wikipedia, Imageprocessingplace

Integral images

In this blog, we will discuss the concept of integral images (or summed-area tables, in general), which let us efficiently compute statistics like the mean, standard deviation, etc. in any rectangular window. The summed-area table was introduced in 1984 by Frank Crow, but it became popular due to its use in template matching and object detection (Source). So, let’s first discuss what an integral image is, then why it is efficient, and how to compute statistics from it.

The integral image is obtained by summing, for each pixel, all the pixels before it (naively, you can think of this as similar to a cumulative distribution function, where a particular value is obtained by summing all the values before it). Let’s take an example to understand this.

Suppose we have a 5×5 binary image as shown below. The integral image is shown on the right.

All the pixels in the integral image are obtained by summing all the previous pixels. Previous here means all the pixels above and to the left of that pixel (inclusive of that pixel). For instance, the 3 (blue circle) is obtained by adding that pixel with the above and left pixels in the input image i.e. 1+0+0+1+0+0+0+1 = 3.

Finding the sum of pixels

Once the integral image is obtained, the sum of pixels in any rectangular region can be obtained in constant time (O(1) time complexity) by the following expression:

Sum = Bottom right + top left – top right – bottom left

For instance, the sum of all the pixels in the rectangular window can be obtained easily from the integral image using the above expression as shown below.

Here, top right (denoted by B) is 2, not 3. Be careful as we are finding the integral sum up to that point. For the ease of visualization, we can take a 4×4 window in the integral image and then perform the sum. For boundary pixels, pad with 0’s.

Now the mean can be calculated easily by dividing the sum by the total number of pixels in that window. The standard deviation for any window can be obtained by the following formula, which comes from simply expanding the variance formula (see Wikipedia): std = sqrt((S2 − S1²/n) / n).

Here, S1 is the sum of the rectangular region in the input image, S2 is the sum of the squares of that region in the input image, and n is the number of pixels in that region. Both S1 and S2 can be found easily using integral images. Now, let’s discuss how to implement this using OpenCV-Python. First, let’s look at the builtin functions provided by OpenCV to calculate the integral image.
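A sketch of calling this function (the filename is hypothetical):

```python
import cv2

img = cv2.imread('image.jpg', 0)                 # hypothetical filename, read as grayscale
integral = cv2.integral(img, sdepth=cv2.CV_64F)
print(img.shape, integral.shape)                 # integral is (H+1) x (W+1)
```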

Here, src is the input image and sdepth is an optional argument denoting the depth of the integral image (must be CV_32S, CV_32F, or CV_64F). This returns an integral image of size (W+1)x(H+1), i.e. one more than the input image in each dimension. The first row and column of the integral image are all 0’s to deal with the boundary pixels, as explained above. All remaining pixels are obtained by summing all the previous pixels.

OpenCV also provides a function that returns the integral image of both the input image and its square. This can be done by the following function.
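A sketch of calling it:

```python
# returns the integral image of the input and the integral image of its square
sum_img, sqsum_img = cv2.integral2(img, sdepth=cv2.CV_64F, sqdepth=cv2.CV_64F)
```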

Here, sqdepth is the depth of the integral of the squared image (must be of type CV_32F, or CV_64F). This returns 2 arrays representing the integral of the input image and its square.

Calculate Standard deviation

Let’s verify that the standard deviation calculated by the above formula yields correct results. For this, we will also calculate the standard deviation using the builtin cv2.meanStdDev() function and then compare the two results. Below is the code for this.
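A sketch of this comparison (the filename, window location, and window size are my choices):

```python
import cv2
import numpy as np

img = cv2.imread('image.jpg', 0)                   # hypothetical filename
x, y, w, h = 30, 40, 60, 50                        # window top-left corner and size

sum_img, sqsum_img = cv2.integral2(img)            # integrals of the image and its square

def window_sum(ii, x, y, w, h):
    # bottom right + top left - top right - bottom left (integral is (H+1)x(W+1))
    return ii[y + h, x + w] + ii[y, x] - ii[y, x + w] - ii[y + h, x]

n = w * h
s1 = window_sum(sum_img, x, y, w, h)
s2 = window_sum(sqsum_img, x, y, w, h)
mean = s1 / n
std = np.sqrt(s2 / n - mean ** 2)                  # sqrt((S2 - S1^2/n) / n)

# compare with OpenCV's builtin computed directly on the same window
m, s = cv2.meanStdDev(img[y:y + h, x:x + w])
print(mean, std)
print(m[0][0], s[0][0])                            # should match the values above
```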

Thus, calculating the integral image is a simple operation that lets us calculate the image statistics super-fast. Later we will learn how this can be very useful in template matching, face detection, etc. Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Image Pyramids

An image pyramid is a way of representing an image at multiple resolutions. The idea behind this is that features which go undetected at one resolution can be easily detected at some other resolution. For instance, if the region of interest is large, a low-resolution image or coarse view is sufficient, while small objects are better examined at high resolution. If both large and small objects are present in an image, analyzing the image at several resolutions can prove beneficial. This is the main concept behind image pyramids. The name “pyramid” comes from the fact that if you place the high-resolution image at the bottom and stack subsequent lower-resolution images on top, the appearance resembles that of a pyramid.

Thus, constructing an image pyramid is equivalent to repeatedly smoothing and subsampling (halving the width and height of) an image. This is illustrated in the image below.

Source: Wikipedia

Why blurring? Because it reduces the aliasing or ringing effects that may arise if we downsample directly. The pyramid is named after the type of blurring applied: if we apply a mean filter, it is known as a mean pyramid; a Gaussian filter gives a Gaussian pyramid; and if we don’t apply any filtering, it is known as a subsampling pyramid, etc. For subsampling, we can use any interpolation algorithm such as nearest neighbor, bilinear, bicubic, etc. In this blog, we will discuss only two kinds of image pyramids:

  • Gaussian Pyramid
  • Laplacian Pyramid

The Gaussian pyramid involves repeatedly Gaussian-blurring and downsampling an image until some stopping criterion is met. For instance, one stopping criterion can be a minimum image size. OpenCV provides a builtin function that performs the blurring and downsampling, as shown below.
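A sketch of calling this function (the filename is hypothetical):

```python
import cv2

img = cv2.imread('image.jpg')        # hypothetical filename
lower = cv2.pyrDown(img)             # Gaussian blur + downsample
print(img.shape, lower.shape)        # width and height are roughly halved
```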

Here, src is the source image and the rest are optional arguments, which include the output size (dstsize) and the border type. By default, the size of the output image is computed as Size((src.cols+1)/2, (src.rows+1)/2), i.e. the width and height are halved, so the area is reduced to one-fourth at each step.

This function first convolves the input image with a 5×5 Gaussian kernel and then downsamples the image by rejecting the even rows and columns. Below is an example of how to use the above function.
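A sketch of building a small Gaussian pyramid with it:

```python
# build a Gaussian pyramid with 3 levels below the original image
gaussian_pyramid = [img]
for _ in range(3):
    gaussian_pyramid.append(cv2.pyrDown(gaussian_pyramid[-1]))

for level, g in enumerate(gaussian_pyramid):
    cv2.imshow('Gaussian level {}'.format(level), g)
cv2.waitKey(0)
cv2.destroyAllWindows()
```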

Now, let’s discuss the Laplacian pyramid. Since the Laplacian is a high-pass filter, at each level of this pyramid we get an edge image as output. As we have already discussed in the edge detection blog, the Laplacian can be approximated using the difference of Gaussians. So, here we take advantage of this fact and obtain the Laplacian pyramid by subtracting Gaussian pyramid levels. Thus, the Laplacian at a level is obtained by subtracting the expanded version of the upper Gaussian level from the Gaussian level itself. This is illustrated in the figure below.

OpenCV also provides a function to go back down the image pyramid, i.e. to expand a particular level, as shown in the figure above.
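A sketch of calling it, reusing the images from above:

```python
higher = cv2.pyrUp(lower)            # upsample and smooth with the 5x5 Gaussian kernel
print(lower.shape, higher.shape)     # width and height are doubled
```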

This upsamples the input image by injecting zero rows and columns at the even positions and then convolves the result with the 5×5 Gaussian kernel multiplied by 4. By default, the output image size is computed as Size(src.cols*2, src.rows*2). Let’s take an example to illustrate the Laplacian pyramid.

Steps:

  • First, load the image.
  • Then construct the Gaussian pyramid with 3 levels.
  • For the Laplacian pyramid, the topmost level remains the same as in the Gaussian pyramid. The remaining levels are constructed from top to bottom by subtracting the expanded version of the upper Gaussian level from the Gaussian level below it, as shown in the sketch after this list.
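A sketch of these steps (the filename and the use of cv2.subtract for saturated subtraction are my choices):

```python
import cv2

img = cv2.imread('image.jpg')                      # hypothetical filename

# Gaussian pyramid: the original image plus 3 downsampled levels
gp = [img]
for _ in range(3):
    gp.append(cv2.pyrDown(gp[-1]))

# Laplacian pyramid: the topmost (smallest) level is the Gaussian one; the rest are
# obtained by subtracting the expanded upper level from the Gaussian level below it
lp = [gp[-1]]
for i in range(len(gp) - 1, 0, -1):
    expanded = cv2.pyrUp(gp[i], dstsize=(gp[i - 1].shape[1], gp[i - 1].shape[0]))
    lp.append(cv2.subtract(gp[i - 1], expanded))

for level, l in enumerate(lp):
    cv2.imshow('Laplacian level {}'.format(level), l)
cv2.waitKey(0)
cv2.destroyAllWindows()
```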

The Laplacian pyramid is mainly used for image compression. Image pyramids can also be used for image blending and for image enhancement which we will discuss in the next blog. Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Image Blending using Image Pyramids

In the previous blog, we discussed image pyramids and how to construct a Laplacian pyramid from the Gaussian one. In this blog, we will discuss how image pyramids can be used for image blending. This produces more visually appealing results compared to the blending methods we have discussed so far. Below are the steps for image blending using image pyramids.

Steps:

  1. Load the two images and the mask.
  2. Find the Gaussian pyramids of the two images and of the mask.
  3. From the Gaussian pyramids, calculate the Laplacian pyramids of the two images, as explained in the previous blog.
  4. Now, blend each level of the Laplacian pyramids according to the mask image at the corresponding Gaussian level.
  5. From this blended Laplacian pyramid, reconstruct the blended image. This is done by expanding each level and adding it to the level below it, as shown in the figure below. Here LS0, LS1, LS2, and LS3 are the levels of the blended Laplacian pyramid obtained in step 4.

Now, let’s implement the above steps using OpenCV-Python. Suppose we want to blend the two images corresponding to the mask as shown below.

Mask Image

So, we will clip the jet from the second image and blend it into the first image. Below is the code for the steps explained above.
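A sketch of these steps (the filenames and the number of pyramid levels are my assumptions; the two images and the mask are assumed to have the same size):

```python
import cv2
import numpy as np

levels = 4
img1 = cv2.imread('image1.jpg').astype(np.float32)        # hypothetical filenames
img2 = cv2.imread('image2.jpg').astype(np.float32)
mask = cv2.imread('mask.jpg').astype(np.float32) / 255.0  # 1 where img2 is kept

def gaussian_pyramid(img, levels):
    gp = [img]
    for _ in range(levels):
        gp.append(cv2.pyrDown(gp[-1]))
    return gp

def laplacian_pyramid(gp):
    lp = [gp[-1]]                                          # topmost level stays as-is
    for i in range(len(gp) - 1, 0, -1):
        up = cv2.pyrUp(gp[i], dstsize=(gp[i - 1].shape[1], gp[i - 1].shape[0]))
        lp.append(gp[i - 1] - up)                          # difference of Gaussians
    return lp                                              # ordered top -> bottom

gp1 = gaussian_pyramid(img1, levels)
gp2 = gaussian_pyramid(img2, levels)
gpm = gaussian_pyramid(mask, levels)[::-1]                 # match top -> bottom order

lp1 = laplacian_pyramid(gp1)
lp2 = laplacian_pyramid(gp2)

# blend each Laplacian level according to the Gaussian level of the mask
blended = [l2 * gm + l1 * (1.0 - gm) for l1, l2, gm in zip(lp1, lp2, gpm)]

# reconstruct: expand each level and add it to the level below it
result = blended[0]
for i in range(1, len(blended)):
    result = cv2.pyrUp(result, dstsize=(blended[i].shape[1], blended[i].shape[0]))
    result = result + blended[i]

cv2.imwrite('blended.jpg', np.clip(result, 0, 255).astype(np.uint8))
```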

The blended output is shown below

Still, there is some amount of white haze around the jet. Later, we will discuss gradient-domain blending methods, which improve the result even more. Now, compare this image with a simple copy-and-paste operation and see the difference.

You can do a side-by-side blending also. In the next blog, we will discuss how to perform image enhancement and image compression using the Laplacian pyramids. Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.