Author Archives: kang & atul

How to write rotated text using OpenCV-Python?

In this blog, we will discuss how to write the rotated text using OpenCV-Python. OpenCV as such do not provide any function for doing this. The cv2.putText() function only draws the text at the desired location and doesn’t account for the rotation. So, let’s discuss an alternative way to do this. This approach produces almost the expected results.

Approach

Instead of directly writing the rotated text, we can first write the text at the desired location and then rotate it. Below are the steps summarized to do this.

Create a zeros image of the desired shape
Draw the text using cv2.putText()
Rotate at the desired angle using cv2.warpAffine(). To know more, refer to this blog.

Below is the code for doing this using OpenCV-Python

import cv2
import numpy as np

# Create a zeros image
img = np.zeros((400,400), dtype=np.uint8)

# Specify the text location and rotation angle
text_location = (100,200)
angle = 30

# Draw the text using cv2.putText()
font = cv2.FONT_HERSHEY_SIMPLEX
cv2.putText(img, 'TheAILearner', text_location, font, 1, 255, 2)

# Rotate the image using cv2.warpAffine()
M = cv2.getRotationMatrix2D(text_location, angle, 1)
out = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))

# Display the results
cv2.imshow('img',out)
cv2.waitKey(0)

import cv2

import numpy as np

# Create a zeros image

img = np.zeros((400,400), dtype=np.uint8)

# Specify the text location and rotation angle

text_location = (100,200)

angle = 30

# Draw the text using cv2.putText()

font = cv2.FONT_HERSHEY_SIMPLEX

cv2.putText(img, 'TheAILearner', text_location, font, 1, 255, 2)

# Rotate the image using cv2.warpAffine()

M = cv2.getRotationMatrix2D(text_location, angle, 1)

out = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))

# Display the results

cv2.imshow('img',out)

cv2.waitKey(0)

Below is the output image for the 30-degree counterclockwise rotation. Here, the left image represents the original image while the right one is the rotated image.

Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Understanding Geometric Transformation: Rotation using OpenCV-Python

Leave a reply

In the previous blog, we discussed image translation. In this blog, we will discuss another type of transformation known as rotation. So, let’s get started. (Here, we will use a Left-hand coordinate system commonly used in image processing).

Suppose we have a point P(x,y) at an angle alpha and distance r from the origin as shown below. Now we rotate the point P about the origin by an angle theta in the clockwise direction. The rotated coordinates can be obtained as shown below.

So, we just need to create the transformation matrix (M) and then we can rotate any point as shown above. That’s the basic idea behind rotation. Now, let’s take the case with an adjustable center of rotation O(x₀, y₀).

Note: The above expression is for clockwise rotation. For anti-clockwise minor changes in the sign will occur. You can easily derive that.

Numpy

For the numpy implementation, you can refer to the previous blog. You just need to change the transformation matrix and rest everything is the same. Below is the code for this using numpy. For an explanation, you can refer to the previous blog.

import numpy as np
import cv2

# Read an image
img = cv2.imread('D:/downloads/opencv_logo.PNG')
rows,cols,_ = img.shape

# Create the transformation matrix
angle = np.radians(90)
x0, y0 = ((cols-1)/2.0,(rows-1)/2.0)
M = np.array([[np.cos(angle), -np.sin(angle), x0*(1-np.cos(angle))+ y0*np.sin(angle)],
              [np.sin(angle), np.cos(angle), y0*(1-np.cos(angle))- x0*np.sin(angle)]])
# get the coordinates in the form of (0,0),(0,1)...
# the shape is (2, rows*cols)
orig_coord = np.indices((cols, rows)).reshape(2,-1)
# stack the rows of 1 to form [x,y,1]
orig_coord_f = np.vstack((orig_coord, np.ones(rows*cols)))
transform_coord = np.dot(M, orig_coord_f)
# Change into int type
transform_coord = transform_coord.astype(np.int)
# Keep only the coordinates that fall within the image boundary.
indices = np.all((transform_coord[1]<rows, transform_coord[0]<cols, transform_coord[1]>=0, transform_coord[0]>=0), axis=0)
# Create a zeros image and project the points
img1 = np.zeros_like(img)
img1[transform_coord[1][indices], transform_coord[0][indices]] = img[orig_coord[1][indices], orig_coord[0][indices]]
# Display the image
out = cv2.hconcat([img,img1])
cv2.imshow('a2',out)
cv2.waitKey(0)

import numpy as np

import cv2

# Read an image

img = cv2.imread('D:/downloads/opencv_logo.PNG')

rows,cols,_ = img.shape

# Create the transformation matrix

angle = np.radians(90)

x0, y0 = ((cols-1)/2.0,(rows-1)/2.0)

M = np.array([[np.cos(angle), -np.sin(angle), x0*(1-np.cos(angle))+ y0*np.sin(angle)],

[np.sin(angle), np.cos(angle), y0*(1-np.cos(angle))- x0*np.sin(angle)]])

# get the coordinates in the form of (0,0),(0,1)...

# the shape is (2, rows*cols)

orig_coord = np.indices((cols, rows)).reshape(2,-1)

# stack the rows of 1 to form [x,y,1]

orig_coord_f = np.vstack((orig_coord, np.ones(rows*cols)))

transform_coord = np.dot(M, orig_coord_f)

# Change into int type

transform_coord = transform_coord.astype(np.int)

# Keep only the coordinates that fall within the image boundary.

indices = np.all((transform_coord[1]<rows, transform_coord[0]<cols, transform_coord[1]>=0, transform_coord[0]>=0), axis=0)

# Create a zeros image and project the points

img1 = np.zeros_like(img)

img1[transform_coord[1][indices], transform_coord[0][indices]] = img[orig_coord[1][indices], orig_coord[0][indices]]

# Display the image

out = cv2.hconcat([img,img1])

cv2.imshow('a2',out)

cv2.waitKey(0)

Below is the output image for the 90-degree clockwise rotation. Here, the left image represents the original image while the right one is the rotated image.

While rotating an image, you may encounter an aliasing effect or holes in the output image as shown below for 45-degree rotation. This can be easily tackled using interpolation.

OpenCV

Now, let’s discuss how to rotate images using OpenCV-Python. In order to obtain the transformation matrix (M), OpenCV provide a function cv2.getRotationMatrix2D() which takes center, angle and scale as arguments and outputs the transformation matrix. The syntax of this function is given below.

transform_matrix = cv2.getRotationMatrix2D(center, angle, scale)

#center: Center of the rotation in the source image.
#angle: Rotation angle in degrees. Positive values mean counter-clockwise rotation (the coordinate origin is assumed to be the top-left corner).
#scale:	Isotropic scale factor.

transform_matrix = cv2.getRotationMatrix2D(center, angle, scale)

#center: Center of the rotation in the source image.

#angle: Rotation angle in degrees. Positive values mean counter-clockwise rotation (the coordinate origin is assumed to be the top-left corner).

#scale: Isotropic scale factor.

Once the transformation matrix (M) is calculated, pass it to the cv2.warpAffine() function that applies an affine transformation to an image. The syntax of this function is given below.

dst = cv.warpAffine(src, M, dsize[, dst[, flags[, borderMode[, borderValue]]]] )
 
# src: input image
# M: Transformation matrix
# dsize: size of the output image
# flags: interpolation method to be used

dst = cv.warpAffine(src, M, dsize[, dst[, flags[, borderMode[, borderValue]]]] )

# src: input image

# M: Transformation matrix

# dsize: size of the output image

# flags: interpolation method to be used

Below is an example where the image is rotated by 90 degrees counterclockwise with respect to the center without any scaling.

import numpy as np
import cv2

# Read an image
img = cv2.imread('D:/downloads/opencv_logo.PNG')
rows,cols,_ = img.shape

# Create the transformation matrix
M = cv2.getRotationMatrix2D(((cols-1)/2.0,(rows-1)/2.0),90,1)

# Pass it to warpAffine function
dst = cv2.warpAffine(img,M,(cols,rows))

# Display the concatenated image
out = cv2.hconcat([img, dst])
cv2.imshow('img',out)
cv2.waitKey(0)

import numpy as np

import cv2

# Read an image

img = cv2.imread('D:/downloads/opencv_logo.PNG')

rows,cols,_ = img.shape

# Create the transformation matrix

M = cv2.getRotationMatrix2D(((cols-1)/2.0,(rows-1)/2.0),90,1)

# Pass it to warpAffine function

dst = cv2.warpAffine(img,M,(cols,rows))

# Display the concatenated image

out = cv2.hconcat([img, dst])

cv2.imshow('img',out)

cv2.waitKey(0)

Below is the output. Here, left image represents the original image while the right one is the rotated image.

Compare the outputs of both implementations. That’s all for image rotation. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Understanding Geometric Transformation: Translation using OpenCV-Python

Leave a reply

In this blog, we will discuss image translation one of the most basic geometric transformations, that is performed on images. So, let’s get started.

Translation is simply the shifting of object location. Suppose we have a point P(x,y) which is translated by (t_x, t_y), then the coordinates after translation denoted by P'(x’,y’) are given by

So, we just need to create the transformation matrix (M) and then we can translate any point as shown above. That’s the basic idea behind translation. So, let’s first discuss how to do image translation using numpy for better understanding, and then we will see a more sophisticated implementation using OpenCV.

Numpy

First, let’s create the transformation matrix (M). This can be easily done using numpy as shown below. Here, the image is translated by (100, 50)

M = np.float32([[1,0,100],[0,1,50]])

1	M = np.float32([[1,0,100],[0,1,50]])

Next, let’s convert the image coordinates to the form [x,y,1]. This can be done as

# get the coordinates in the form of (0,0),(0,1)...
# the shape is (2, rows*cols)
orig_coord = np.indices((cols, rows)).reshape(2,-1)
# stack the rows of 1 to form [x,y,1]
orig_coord_f = np.vstack((orig_coord, np.ones(rows*cols)))

# get the coordinates in the form of (0,0),(0,1)...

# the shape is (2, rows*cols)

orig_coord = np.indices((cols, rows)).reshape(2,-1)

# stack the rows of 1 to form [x,y,1]

orig_coord_f = np.vstack((orig_coord, np.ones(rows*cols)))

Now apply the transformation by multiplying the transformation matrix with coordinates.

transform_coord = np.dot(M, orig_coord_f)
# Change into int type
transform_coord = transform_coord.astype(np.int)

transform_coord = np.dot(M, orig_coord_f)

# Change into int type

transform_coord = transform_coord.astype(np.int)

Keep only the coordinates that fall within the image boundary.

indices = np.all((transform_coord[1]<rows, transform_coord[0]<cols, transform_coord[1]>=0, transform_coord[0]>=0), axis=0)

1	indices = np.all((transform_coord[1]<rows, transform_coord[0]<cols, transform_coord[1]>=0, transform_coord[0]>=0), axis=0)

Now, create a zeros image similar to the original image and project all the points onto the new image.

img1 = np.zeros_like(img)
img1[transform_coord[1][indices], transform_coord[0][indices]] = img[orig_coord[1][indices], orig_coord[0][indices]]

1 2	img1 = np.zeros_like(img) img1[transform_coord[1][indices], transform_coord[0][indices]] = img[orig_coord[1][indices], orig_coord[0][indices]]

Display the final image.

out = cv2.hconcat([img,img1])
cv2.imshow('a1',out)
cv2.waitKey(0)

out = cv2.hconcat([img,img1])

cv2.imshow('a1',out)

cv2.waitKey(0)

The full code can be found below

import numpy as np
import cv2

# Read an image
img = cv2.imread('D:/downloads/opencv_logo.PNG')
rows,cols,_ = img.shape

# Create the transformation matrix
M = np.float32([[1,0,100],[0,1,50]])
# get the coordinates in the form of (0,0),(0,1)...
# the shape is (2, rows*cols)
orig_coord = np.indices((cols, rows)).reshape(2,-1)
# stack the rows of 1 to form [x,y,1]
orig_coord_f = np.vstack((orig_coord, np.ones(rows*cols)))
transform_coord = np.dot(M, orig_coord_f)
# Change into int type
transform_coord = transform_coord.astype(np.int)
# Keep only the coordinates that fall within the image boundary.
indices = np.all((transform_coord[1]<rows, transform_coord[0]<cols, transform_coord[1]>=0, transform_coord[0]>=0), axis=0)
# Create a zeros image and project the points
img1 = np.zeros_like(img)
img1[transform_coord[1][indices], transform_coord[0][indices]] = img[orig_coord[1][indices], orig_coord[0][indices]]
# Display the image
out = cv2.hconcat([img,img1])
cv2.imshow('a2',out)
cv2.waitKey(0)

import numpy as np

import cv2

# Read an image

img = cv2.imread('D:/downloads/opencv_logo.PNG')

rows,cols,_ = img.shape

# Create the transformation matrix

M = np.float32([[1,0,100],[0,1,50]])

# get the coordinates in the form of (0,0),(0,1)...

# the shape is (2, rows*cols)

orig_coord = np.indices((cols, rows)).reshape(2,-1)

# stack the rows of 1 to form [x,y,1]

orig_coord_f = np.vstack((orig_coord, np.ones(rows*cols)))

transform_coord = np.dot(M, orig_coord_f)

# Change into int type

transform_coord = transform_coord.astype(np.int)

# Keep only the coordinates that fall within the image boundary.

indices = np.all((transform_coord[1]<rows, transform_coord[0]<cols, transform_coord[1]>=0, transform_coord[0]>=0), axis=0)

# Create a zeros image and project the points

img1 = np.zeros_like(img)

img1[transform_coord[1][indices], transform_coord[0][indices]] = img[orig_coord[1][indices], orig_coord[0][indices]]

# Display the image

out = cv2.hconcat([img,img1])

cv2.imshow('a2',out)

cv2.waitKey(0)

Below is the output. Here, left image represents the original image while the right one is the translated image.

OpenCV-Python

Now, let’s discuss how to translate images using OpenCV-Python.

OpenCV provides a function cv2.warpAffine() that applies an affine transformation to an image. You just need to provide the transformation matrix (M). The basic syntax for the function is given below.

dst = cv.warpAffine(src, M, dsize[, dst[, flags[, borderMode[, borderValue]]]]	)

# src: input image
# M: Transformation matrix
# dsize: size of the output image
# flags: interpolation method to be used

dst = cv.warpAffine(src, M, dsize[, dst[, flags[, borderMode[, borderValue]]]] )

# src: input image

# M: Transformation matrix

# dsize: size of the output image

# flags: interpolation method to be used

Below is a sample code where the image is translated by (100, 50).

import numpy as np
import cv2

# Read an image
img = cv2.imread('D:/downloads/opencv_logo.PNG')
rows,cols,_ = img.shape

# Create the transformation matrix
M = np.float32([[1,0,100],[0,1,50]])

# Pass it to warpAffine function
dst = cv2.warpAffine(img,M,(cols,rows))

# Display the concatenated image
out = cv2.hconcat([img, dst])
cv2.imshow('img',out)
cv2.waitKey(0)

import numpy as np

import cv2

# Read an image

img = cv2.imread('D:/downloads/opencv_logo.PNG')

rows,cols,_ = img.shape

# Create the transformation matrix

M = np.float32([[1,0,100],[0,1,50]])

# Pass it to warpAffine function

dst = cv2.warpAffine(img,M,(cols,rows))

# Display the concatenated image

out = cv2.hconcat([img, dst])

cv2.imshow('img',out)

cv2.waitKey(0)

Below is the output. Here, left image represents the original image while the right one is the translated image.

Compare the outputs of both implementations. That’s all for image translation. In the next blog, we will discuss another geometric transformation known as rotation in detail. Hope you enjoy reading.

If you have any doubts/suggestions please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Image Moments

3 Replies

In this blog, we will discuss how to find different features of contours such as area, centroid, orientation, etc. With the help of these features/statistics, we can do some sort of recognition. So, in this blog, we will refer to a very old fundamental work in computer vision known as Image moments that helps us to calculate these statistics. So, let’s first discuss what are image moments and how to calculate them.

In simple terms, image moments are a set of statistical parameters to measure the distribution of where the pixels are and their intensities. Mathematically, the image moment M_ij of order (i,j) for a greyscale image with pixel intensities I(x,y) is calculated as

Here, x, y refers to the row and column index and I(x,y) refers to the intensity at that location (x,y). Now, let’s discuss how simple image properties are calculated from image moments.

Area:

For a binary image, the zeroth order moment corresponds to the area. Let’s discuss how?

Using the above formulae, the zeroth order moment (M₀₀) is given by

For a binary image, this corresponds to counting all the non-zero pixels and that is equivalent to the area. For greyscale image, this corresponds to the sum of pixel intensity values.

Centroid:

Centroid simply is the arithmetic mean position of all the points. In terms of image moments, centroid is given by the relation

This is simple to understand. For instance, for a binary image M₁₀ corresponds to the sum of all non-zero pixels (x-coordinate) and M₀₀ is the total number of non-zero pixels and that is what the centroid is.

Let’s take a simple example to understand how to calculate image moments for a given image.

Below are the area and centroid calculation for the above image

OpenCV-Python

OpenCV provides a function cv2.moments() that outputs a dictionary containing all the moment values up to 3^rd order.

output = cv2.moments(input[,binaryImage])

# input: image(single channel) or array of 2D points. Should be either np.int32 or np.float32.
# binaryImage: Only used if input is image. If True all the non-zero pixels are treated as 1's.

output = cv2.moments(input[,binaryImage])

# input: image(single channel) or array of 2D points. Should be either np.int32 or np.float32.

# binaryImage: Only used if input is image. If True all the non-zero pixels are treated as 1's.

Below is the sample code that shows how to use cv2.moments().

import cv2
# read the image
img = cv2.imread('star.jpg',0)
# Binarize the image
ret,thresh = cv2.threshold(img,127,255,0)
# Find the contours
contours,hierarchy = cv2.findContours(thresh, 1, 2)
cnt = contours[0]
# Calculate the moments
M = cv2.moments(cnt)

import cv2

# read the image

img = cv2.imread('star.jpg',0)

# Binarize the image

ret,thresh = cv2.threshold(img,127,255,0)

# Find the contours

contours,hierarchy = cv2.findContours(thresh, 1, 2)

cnt = contours[0]

# Calculate the moments

M = cv2.moments(cnt)

From this moments dictionary, we can easily extract the useful features such as area, centroid etc. as shown below.

# Calculate area
area = M['m00']
# Calculate centroid
cx = int(M['m10']/M['m00'])
cy = int(M['m01']/M['m00'])

# Calculate area

area = M['m00']

# Calculate centroid

cx = int(M['m10']/M['m00'])

cy = int(M['m01']/M['m00'])

That’s all about image moments. Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Removing Text highlighter using Colorspace OpenCV-Python

Leave a reply

Have you ever thought why a number of color models are available in OpenCV? Obviously, they might have some pros and cons. So, in this blog, we will discuss one such application of color models where we will learn to remove the highlighted area from the text.

Use Case:

This pre-processing step (removing text highlighter) can be quite useful before feeding the image to an OCR system. Otherwise, the OCR system will output erroneous results.

Problem Overview

Suppose we are given an image as shown on the left and we want to pre-process it to remove the highlighter from the text as shown by the right image below.

Approach:

Since we know that there are some color models available (such as HSV) where it is easy to represent the color as compared to the RGB model. So, we will convert the image from RGB to that colorspace and then remove the color information. For instance, in the HSV color model, H and S tell us about the chromaticity (color information) of the light while V carries the greyscale information. So in HSV, if we remove the H and S channel and only keep the V channel we can obtain the desired results.

Steps:

Read the highlighted text image
Convert from BGR to HSV colorspace using cv2.cvtColor()
Extract the V channel

Code:

import cv2
# read the image
img = cv2.imread('D:/downloads/highlighted_text.JPG')
# Convert from BGR to HSV
img_hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
# Extract the V channel
out_img = img_hsv[:,:,2]
# Display the image
cv2.imshow('output_image', out_img)
cv2.waitKey(0)

import cv2

# read the image

img = cv2.imread('D:/downloads/highlighted_text.JPG')

# Convert from BGR to HSV

img_hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Extract the V channel

out_img = img_hsv[:,:,2]

# Display the image

cv2.imshow('output_image', out_img)

cv2.waitKey(0)

So, you saw that just by changing the colorspace and extracting channels we obtained satisfactory results. We can further improve the results by applying other operations such as thresholding or morphological operations etc. Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

GAN to Generate Images of Climate Change

Leave a reply

Generative adversarial networks (GANs) are deep learning models that are used to generate images similar to real images. Images generated by GANs are both realistic and personalized. But to generate images of high quality, the network requires a huge amount of data. Their usability limits in case of a low quantity of data. In this blog, we will discuss how we can use simulated data to generate images of climate change using GANs in case we are having a scarcity of training data.

Introduction

Recently researchers with the Montreal Institute for Learning Algorithms (MILA) used generative adversarial networks to generate images of the world after the flood. They tried to show how the world would change if some calamity like a flood occurs. They hope people would work to avert future weather conditions if they can see these changes. Researchers used simulated data in combination with real images to train multimodel unsupervised image-to-image translation with some modification to architecture.

Data Collection

Real Dataset

Researchers have collected 2000 real images of flooded and non-flooded scenes taken in various weather conditions, seasons, time and viewpoints. These images were taken from publicly available datasets Mapilary and Flickr. They used this dataset and trained CycleGAN but the generated images were not sufficiently realistic. To cop up with this problem they used simulated data.

Simulated Dataset

To generate simulated dataset researchers used the Unity 3D game engine. They created different types of building in combination with urban and rural environments. As a starting point, they generated 1000 unique pairs of images with flooded and non-flooded domains.

Domain Adaption Technique

While using simulated data, authors have seen domain gap between training dataset made up of simulated data and testing data made up of real images. To bridge this gap they used domain adaption technique inspired by unsupervised semantic segmentation. This technique is being implemented by using an adversarial classifier within MUNIT architecture.

Network Architecture

Researchers have tried different image-to-image translation GANs like CycleGAN, InstaGAN, and MUNIT. CycleGAN and InstaGAN were not able to generate as realistic water texture as MUNIT was able to. Finally, they used MUNIT architecture with some modifications.

MUNIT architecture relies on two generators and two discriminators to disentangle the style and content of the images. Such that during the generation of the image only style changes and content remains the same. To make MUNIT architecture more compatible with climate change use case, researchers have made the following changes to the architecture:

Restriction of Cycle COnsistency Loss: In image-to-image translation GANs, cycle consistency loss is used to make sure that translation is cycle consistent. Let say, If we translate from English to French and then translate back to English sentence, we should arrive at the original sentence. In this architecture, researchers have restricted the network’s cycle consistency loss such that this loss is only computed on those regions that are not likely to be flooded. To do this they have used the binary masks of the areas.
Introduction of semantic consistency loss: This loss confirms that the semantic segmentation structure for the generated image is the same as the source image except for the areas where changes occurred like the road to the flooded area.

This approach uses both real and simulated data to perform image-to-image translation to show the effects of climate change. This approach clearly shows that simulated data helps in generating more realistic images. Researchers are still working on to improve the results of this model. They are also working to create an interactive website.

“Authors aim to develop an interactive website that, given a user-entered address, will query the Google Street View API (Anguelov et al., 2010) to get an image of the location and alter it to display a plausible image of its climate future based on the predictions of climate models. We hope this tool will help communicate effectively on climate change related risks.“

Referenced Research Paper: Using Simulated data to generate images of climate change

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Information Maximizing Generative Adversarial Network (InfoGAN): Introduction and Implementation

Leave a reply

InfoGAN is an extension to the generative adversarial networks. Generative adversarial networks are trained to generate new images that look similar to the original images. But they do not provide any control over the generation of the new images. Let’s say you have trained a GAN network to generate new faces that look similar to the given dataset. But there you will not have any control over these faces such as the colour of the eyes, hairstyles, etc. But with the help of InfoGAN, we can achieve these results because InfoGAN is able to learn the disentangled representation.

Introduction

A generative adversarial network consist of two networks – a generator and a discriminator. Both of these networks are trained in an adversarial manner. While the generator tries to generate images similar to original images, discriminator tries to differentiate between images generated by the generator and original images. Training continues until discriminator is fooled half the time by generator and generator is able to generate images similar to original images.

Control Variables

In a general GAN, a random input noise vector is given as input to the generator network which does not provide any information to the generator network i.e. in which manner outputs should be generated. While InfoGAN uses latent code along with noise vector to generate images accordingly. Input to the generator of the InfoGAN can be given in two parts:

Continuous noise vector, z.
Latent codes which can be both discrete and continuous, c.

Let say we have trained our InfoGAN on MNIST handwritten digit datasets. Here discrete latent codes (0-9) can be used to generate specific digits between 0-9. While continuous latent codes can be used to generate digits with varying thickness and orientation.

Mutual Information

InfoGAN stands for information maximizing GAN. To maximize information, InfoGAN uses mutual information. In information theory, the mutual information between X and Y, I(X; Y ), measures the “amount of information” learned from knowledge of random variable Y about the other random variable X. In InfoGAN there should be high mutual information between latent code c and generated images.

To maximize this mutual information, the InfoGAN model requires an extra network named as an auxiliary model. This auxiliary model shares all the weights from the discriminator network except the output layer. As the discriminator network has an output layer which predicts the given input image is real or fake, the auxiliary network predicts the latent codes.

So the InfoGAN will consist of three networks – Generator, Discriminator, and auxiliary network. Both the discriminator and auxiliary networks are used to improve the generator network. Here, the generation of real looking images by generator network is regularized by the discriminator network and maximization of mutual information is regularized by the auxiliary network.

Implementation

In this blog, we will implement InfoGAN using MNIST handwritten digit dataset. To maximize the information we will only use discrete codes to generate particular digits. In addition to this, you can also use two continuous variables to define the rotation and thickness of the generated digits.

Imports and Initialization

(x_train, y_train), (x_test, y_test) = mnist.load_data()
batch_size = 16
half_batch_size = 8
latent_dim = 100 + 10
iterations = 60000
optimizer = Adam(0.0002, 0.5)

(x_train, y_train), (x_test, y_test) = mnist.load_data()

batch_size = 16

half_batch_size = 8

latent_dim = 100 + 10

iterations = 60000

optimizer = Adam(0.0002, 0.5)

Generator Network

def generator():

    input_gen = Input(shape = (latent_dim,))
    dense1 = Reshape((7,7,16))(Dense(7*7*16)(input_gen))

    batch_norm_1 = BatchNormalization()(dense1)
    trans_1 = Conv2DTranspose(128, 3, padding='same', activation=LeakyReLU(alpha=0.2), strides=(2, 2))(batch_norm_1)
    batch_norm_2 = BatchNormalization()(trans_1)
    trans_2 = Conv2DTranspose(128, 3, padding='same', activation=LeakyReLU(alpha=0.2), strides=(2, 2))(batch_norm_2)
    output = Conv2D(1, (28,28), activation='tanh', padding='same')(trans_2)
    gen_model = Model(input_gen, output)
    gen_model.compile(loss='binary_crossentropy', optimizer=optimizer)
    print(gen_model.summary())

    return gen_model

def generator():

input_gen = Input(shape = (latent_dim,))

dense1 = Reshape((7,7,16))(Dense(7*7*16)(input_gen))

batch_norm_1 = BatchNormalization()(dense1)

trans_1 = Conv2DTranspose(128, 3, padding='same', activation=LeakyReLU(alpha=0.2), strides=(2, 2))(batch_norm_1)

batch_norm_2 = BatchNormalization()(trans_1)

trans_2 = Conv2DTranspose(128, 3, padding='same', activation=LeakyReLU(alpha=0.2), strides=(2, 2))(batch_norm_2)

output = Conv2D(1, (28,28), activation='tanh', padding='same')(trans_2)

gen_model = Model(input_gen, output)

gen_model.compile(loss='binary_crossentropy', optimizer=optimizer)

print(gen_model.summary())

return gen_model

Input to the generator network consists of shape (110, 1), where 100 is the noise vector size and 10 is the latent code size. Here latent codes are one-hot encoded discrete number between 0-9. I have used deconvolutional layers to upsample and finally produce the shape of (28,28,1). Batch normalization is used to improve the quality of the trained network and for stabilization.

Discriminator and Auxiliary Network

def discriminator():

    input_disc = Input(shape = (28, 28, 1))

    conv_1 = Conv2D(16, 3, padding = 'same', activation = LeakyReLU(alpha=0.2))(input_disc)
    batch_norm1 = BatchNormalization()(conv_1)
    pool_1 = AveragePooling2D(strides = (2,2))(batch_norm1)
    conv_2 = Conv2D(32, 3, padding = 'same', activation = LeakyReLU(alpha=0.2))(pool_1)
    batch_norm2 = BatchNormalization()(conv_2)
    pool_2 = AveragePooling2D(strides = (2,2))(batch_norm2)
    conv_3 = Conv2D(64, 3, padding = 'same', activation = LeakyReLU(alpha=0.2))(pool_2)
    batch_norm3 = BatchNormalization()(conv_3)
    pool_3 = AveragePooling2D(strides = (2,2))(conv_3)
    flatten_1 = Flatten()(pool_3)
    output = Dense(1, activation = 'sigmoid')(flatten_1)
    q_output_catgorical = Dense(10, activation = 'softmax')(flatten_1)
    
    disc_model = Model(input_disc, output)
    disc_model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    
    q_model = Model(input_disc, q_output_catgorical)
    q_model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    
    print(disc_model.summary())
    print(q_model.summary())

    return disc_model, q_model

def discriminator():

input_disc = Input(shape = (28, 28, 1))

conv_1 = Conv2D(16, 3, padding = 'same', activation = LeakyReLU(alpha=0.2))(input_disc)

batch_norm1 = BatchNormalization()(conv_1)

pool_1 = AveragePooling2D(strides = (2,2))(batch_norm1)

conv_2 = Conv2D(32, 3, padding = 'same', activation = LeakyReLU(alpha=0.2))(pool_1)

batch_norm2 = BatchNormalization()(conv_2)

pool_2 = AveragePooling2D(strides = (2,2))(batch_norm2)

conv_3 = Conv2D(64, 3, padding = 'same', activation = LeakyReLU(alpha=0.2))(pool_2)

batch_norm3 = BatchNormalization()(conv_3)

pool_3 = AveragePooling2D(strides = (2,2))(conv_3)

flatten_1 = Flatten()(pool_3)

output = Dense(1, activation = 'sigmoid')(flatten_1)

q_output_catgorical = Dense(10, activation = 'softmax')(flatten_1)

disc_model = Model(input_disc, output)

disc_model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])

q_model = Model(input_disc, q_output_catgorical)

q_model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

print(disc_model.summary())

print(q_model.summary())

return disc_model, q_model

As I have already told that auxiliary network shares all the weights of the discriminator network except the output layer there is no need to create two separate functions for this. Networks take images of shape (28, 28, 1) as input. convolutional, batch normalization and pooling layers are used to create the network. The output shape of the discriminator network is 1 as it only predicts the input image is real or fake. But the output shape of the auxiliary network is 10 as it predicts latent code.

Combined Model

def combined():

    inputs = Input(shape = (latent_dim,)) 
    gen_img = generator_model(inputs)
    
    discriminator_model.trainable = False
    
    disc_outs = discriminator_model(gen_img)
    q_outs = auxiliary_model(gen_img)
    
    comb_model = Model(inputs, [disc_outs, q_outs])
    comb_model.compile(loss=['binary_crossentropy', 'categorical_crossentropy'], optimizer=optimizer, metrics=['accuracy'])
    print(comb_model.summary())

    return comb_model

def combined():

inputs = Input(shape = (latent_dim,))

gen_img = generator_model(inputs)

discriminator_model.trainable = False

disc_outs = discriminator_model(gen_img)

q_outs = auxiliary_model(gen_img)

comb_model = Model(inputs, [disc_outs, q_outs])

comb_model.compile(loss=['binary_crossentropy', 'categorical_crossentropy'], optimizer=optimizer, metrics=['accuracy'])

print(comb_model.summary())

return comb_model

A combined model is created to train the generator network. Here we do discriminator network as non-trainable as discriminator network is trained separately. The combined model takes random noise and latent code as input. This input is fed to the generator network and the generated image is fed to both discriminator and auxiliary network.

Training InfoGAN

Training a GAN model is always a difficult task. A careful hyperparameter tuning is always required. We will use the following steps to train the InfoGAN model.

Normalize the input images from the MNIST dataset.
Train the discriminator model using real images from the MNIST dataset.
Train the discriminator model using real images and corresponding labels.
Train the discriminator model using fake images generated from the generator network.
Train the auxiliary network using fake images generated from the generator and random latent codes.
Train the generator network using a combined model without training the discriminator.
Repeat the steps from 2-6 for some iterations. I have trained it for 60000 iterations.

generator_model = generator() 
discriminator_model, auxiliary_model = discriminator()
combined_model = combined()

def train():

    train_data = (x_train.astype(np.float32) - 127.5) / 127.5
    train_data = np.expand_dims(train_data, -1)
    train_data_y = y_train

    for i in range(iterations):

        batch_indx = np.random.randint(0, train_data.shape[0], size = (half_batch_size))
        batch_x = train_data[batch_indx]
        batch_y = to_categorical(train_data_y[batch_indx], 10)

        real_loss = discriminator_model.train_on_batch(batch_x, np.ones((half_batch_size,1)))

        q_real_loss = auxiliary_model.train_on_batch(batch_x, batch_y)

        random_y = to_categorical(np.random.randint(0,10,half_batch_size), 10)
        input_noise = np.random.normal(0, 1, size=(half_batch_size, 100))
        
        gen_outs = generator_model.predict(np.hstack((input_noise, random_y)))

        fake_loss = discriminator_model.train_on_batch(gen_outs, np.zeros((half_batch_size,1)))
        q_fake_loss = auxiliary_model.train_on_batch(gen_outs, random_y)

        noise = np.random.normal(0, 1, size=(batch_size, 100))
        latent_code = to_categorical(np.random.randint(0,10,batch_size), 10)

        full_batch_input_noise = np.hstack((noise, latent_code))

        gan_loss = combined_model.train_on_batch(full_batch_input_noise, [np.ones((batch_size,1)), latent_code])
        
        if i%5000 == 0:
            print(i, fake_loss, real_loss, gan_loss, q_real_loss, q_fake_loss)

generator_model = generator()

discriminator_model, auxiliary_model = discriminator()

combined_model = combined()

def train():

train_data = (x_train.astype(np.float32) - 127.5) / 127.5

train_data = np.expand_dims(train_data, -1)

train_data_y = y_train

for i in range(iterations):

batch_indx = np.random.randint(0, train_data.shape[0], size = (half_batch_size))

batch_x = train_data[batch_indx]

batch_y = to_categorical(train_data_y[batch_indx], 10)

real_loss = discriminator_model.train_on_batch(batch_x, np.ones((half_batch_size,1)))

q_real_loss = auxiliary_model.train_on_batch(batch_x, batch_y)

random_y = to_categorical(np.random.randint(0,10,half_batch_size), 10)

input_noise = np.random.normal(0, 1, size=(half_batch_size, 100))

gen_outs = generator_model.predict(np.hstack((input_noise, random_y)))

fake_loss = discriminator_model.train_on_batch(gen_outs, np.zeros((half_batch_size,1)))

q_fake_loss = auxiliary_model.train_on_batch(gen_outs, random_y)

noise = np.random.normal(0, 1, size=(batch_size, 100))

latent_code = to_categorical(np.random.randint(0,10,batch_size), 10)

full_batch_input_noise = np.hstack((noise, latent_code))

gan_loss = combined_model.train_on_batch(full_batch_input_noise, [np.ones((batch_size,1)), latent_code])

if i%5000 == 0:

print(i, fake_loss, real_loss, gan_loss, q_real_loss, q_fake_loss)

Generation

Now we will generate images from the trained gan model. The generator will be provided with random noise and one hot encoded input digit between 0-9 whichever digit we want to generate.

# generating new images from trained network
import matplotlib.pyplot as plt

r, c = 10, 5

gen_imgs = []

for indx in range(10):
    
    noise = np.random.normal(0, 1, (5, 100))
    categorical_code = to_categorical([indx]*5, 10)
    
    input_noise = np.hstack((noise, categorical_code))
    outs = generator_model.predict(input_noise)
    gen_imgs.extend(outs)
    
gen_imgs = np.array(gen_imgs)
gen_imgs = 0.5 * gen_imgs + 0.5
fig, axs = plt.subplots(r, c)
cnt = 0
for i in range(r):
    for j in range(c):
        axs[i,j].imshow(gen_imgs[cnt, :,:,0], cmap='gray')
        axs[i,j].axis('off')
        cnt += 1
        
plt.show()
fig.savefig("mnist.png")
plt.close()

# generating new images from trained network

import matplotlib.pyplot as plt

r, c = 10, 5

gen_imgs = []

for indx in range(10):

noise = np.random.normal(0, 1, (5, 100))

categorical_code = to_categorical([indx]*5, 10)

input_noise = np.hstack((noise, categorical_code))

outs = generator_model.predict(input_noise)

gen_imgs.extend(outs)

gen_imgs = np.array(gen_imgs)

gen_imgs = 0.5 * gen_imgs + 0.5

fig, axs = plt.subplots(r, c)

cnt = 0

for i in range(r):

for j in range(c):

axs[i,j].imshow(gen_imgs[cnt, :,:,0], cmap='gray')

axs[i,j].axis('off')

cnt += 1

plt.show()

fig.savefig("mnist.png")

plt.close()

Here are the generated results from the model:

Referenced Research Paper: InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Finding Convex Hull OpenCV Python

1 Reply

In the previous blog, we discussed how to perform simple shape detection using contour approximation. In this blog, we will discuss how to find the convex hull of a given shape/curve. So, let’s first discuss what is a convex hull?

What is a Convex Hull?

Any region/shape is said to be convex if the line joining any two points (selected from the region) is contained entirely in that region. Another way of saying this is, for a shape to be convex, all of its interior angles must be less than 180 degrees or all the vertices should open towards the center. Let’s understand this with the help of the image below.

Now, for a given shape or set of points, we can have many convex curves/boundaries. The smallest or the tight-fitting convex boundary is known as a convex hull.

Now, the next question that comes to our mind is how to find the convex hull for a given shape or set of points? There are so many algorithms for finding the convex hull. Some of the most common algorithms with their associated time complexities are shown below. Here, n is the no. of input points and h is the number of points on the hull.

Gift wrapping, a.k.a. Jarvis march — O(nh)
Graham scan — O(nlogn)
Chan’s algorithm — O(nlogh)
Sklansky (1982) — O(nlogn) ( OpenCV uses this algorithm)

OpenCV provides a builtin function for finding the convex hull of a point set as shown below

hull = cv2.convexHull(points [,clockwise [,returnPoints]])

1	hull = cv2.convexHull(points [,clockwise [,returnPoints]])

points: any contour or Input 2D point set whose convex hull we want to find.
clockwise: If it is True, the output convex hull is oriented clockwise. Otherwise, counter-clockwise.
returnPoints: If True (default) then returns the coordinates of the hull points. Otherwise, returns the indices of contour points corresponding to the hull points. Thus to find the actual hull coordinates in the second(False) case, we need to do contour[indices].

Now, let’s take an example and understand how to find the convex hull for a given image using OpenCV-Python.

Steps:

Load the image
Convert it to greyscale
Threshold the image
Find the contours
For each contour, find the convex hull and draw it.

import cv2
# Load the image
img1 = cv2.imread('D:/downloads/cars.JPG')
# Convert it to greyscale
img = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
# Threshold the image
ret, thresh = cv2.threshold(img,50,255,0)
# Find the contours
contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
# For each contour, find the convex hull and draw it
# on the original image.
for i in range(len(contours)):
    hull = cv2.convexHull(contours[i])
    cv2.drawContours(img1, [hull], -1, (255, 0, 0), 2)
# Display the final convex hull image
cv2.imshow('ConvexHull', img1)
cv2.waitKey(0)

import cv2

# Load the image

img1 = cv2.imread('D:/downloads/cars.JPG')

# Convert it to greyscale

img = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)

# Threshold the image

ret, thresh = cv2.threshold(img,50,255,0)

# Find the contours

contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

# For each contour, find the convex hull and draw it

# on the original image.

for i in range(len(contours)):

hull = cv2.convexHull(contours[i])

cv2.drawContours(img1, [hull], -1, (255, 0, 0), 2)

# Display the final convex hull image

cv2.imshow('ConvexHull', img1)

cv2.waitKey(0)

Below is the output of the above code.

Applications:

Collision detection or avoidance.
Face Swap
Shape analysis and many more.

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Simple Shape Detection using Contour approximation

Leave a reply

In the previous blog, we learned how to find and draw contours using OpenCV. In this blog, we will discuss how to detect simple geometric shapes by approximating the contours. So, let’s first discuss what is meant by contour approximation.

This means approximating a contour shape to another shape with less number of vertices so that the distance between both the shapes is less or equal to the specified precision. The below figure shows the curve approximation for different precisions (epsilon). See how the shape is approximated to a rectangle with epsilon =10% in the below image.

Contour approximation for different epsilon — **Source: OpenCV**

This is widely used in robotics for pattern classification and scene analysis. OpenCV provides a builtin function that approximates the polygonal curves with the specified precision. Its implementation is based on the Douglas-Peucker algorithm.

approxCurve = cv2.approxPolyDP(curve, epsilon, closed)

1	approxCurve = cv2.approxPolyDP(curve, epsilon, closed)

“curve“: contour/polygon we want to approximate.
“epsilon“: This is the maximum distance between the original curve and its approximation.
“closed“: If true, the approximated curve is closed otherwise, not.

This function returns the approximated contour with the same type as that of the input curve. Now, let’s detect simple shapes using this concept. Let’s take the below image to perform shape detection.

Steps

Load the image and convert to greyscale
Apply thresholding and find contours
For each contour
- First, approximate its shape using cv2.approxPolyDP()
- if len(shape) == 3; shape is Triangle
- else if len(shape) == 4; shape is Rectangle
- else if len(shape) == 5; shape is Pentagon
- else if 6< len(shape) <15; shape is Ellipse
- else; shape is circle

Code

import cv2
import numpy as np

# Load the image
img = cv2.imread("D:/downloads/cont_new.png")
# Convert to greyscale
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Convert to binary image by thresholding
_, threshold = cv2.threshold(img_gray, 245, 255, cv2.THRESH_BINARY_INV)
# Find the contours
contours, _ = cv2.findContours(threshold, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
# For each contour approximate the curve and
# detect the shapes.
for cnt in contours:
    epsilon = 0.01*cv2.arcLength(cnt, True)
    approx = cv2.approxPolyDP(cnt, epsilon, True)
    cv2.drawContours(img, [approx], 0, (0), 3)
    # Position for writing text
    x,y = approx[0][0]

    if len(approx) == 3:
        cv2.putText(img, "Triangle", (x, y), cv2.FONT_HERSHEY_COMPLEX, 1, 0,2)
    elif len(approx) == 4:
        cv2.putText(img, "Rectangle", (x, y), cv2.FONT_HERSHEY_COMPLEX, 1, 0,2)
    elif len(approx) == 5:
        cv2.putText(img, "Pentagon", (x, y), cv2.FONT_HERSHEY_COMPLEX, 1, 0,2)
    elif 6 < len(approx) < 15:
        cv2.putText(img, "Ellipse", (x, y), cv2.FONT_HERSHEY_COMPLEX, 1, 0,2)
    else:
        cv2.putText(img, "Circle", (x, y), cv2.FONT_HERSHEY_COMPLEX, 1, 0,2)
cv2.imshow("final", img)
cv2.waitKey(0)

import cv2

import numpy as np

# Load the image

img = cv2.imread("D:/downloads/cont_new.png")

# Convert to greyscale

img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Convert to binary image by thresholding

_, threshold = cv2.threshold(img_gray, 245, 255, cv2.THRESH_BINARY_INV)

# Find the contours

contours, _ = cv2.findContours(threshold, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

# For each contour approximate the curve and

# detect the shapes.

for cnt in contours:

epsilon = 0.01*cv2.arcLength(cnt, True)

approx = cv2.approxPolyDP(cnt, epsilon, True)

cv2.drawContours(img, [approx], 0, (0), 3)

# Position for writing text

x,y = approx[0][0]

if len(approx) == 3:

cv2.putText(img, "Triangle", (x, y), cv2.FONT_HERSHEY_COMPLEX, 1, 0,2)

elif len(approx) == 4:

cv2.putText(img, "Rectangle", (x, y), cv2.FONT_HERSHEY_COMPLEX, 1, 0,2)

elif len(approx) == 5:

cv2.putText(img, "Pentagon", (x, y), cv2.FONT_HERSHEY_COMPLEX, 1, 0,2)

elif 6 < len(approx) < 15:

cv2.putText(img, "Ellipse", (x, y), cv2.FONT_HERSHEY_COMPLEX, 1, 0,2)

else:

cv2.putText(img, "Circle", (x, y), cv2.FONT_HERSHEY_COMPLEX, 1, 0,2)

cv2.imshow("final", img)

cv2.waitKey(0)

Below is the final result.

Contour approximation for shape detection

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

An Introduction To The Progressive Growing of GANs

Leave a reply

Generative adversarial networks are famous for generating images. But generating images with high resolution was quite difficult until the introduction of a new training methodology known as the progressive growing of GANs. Progressive growing GANs architecture was proposed by NVIDIA in the paper published in 2017 titled as ” PROGRESSIVE GROWING OF GANS FOR IMPROVED QUALITY, STABILITY, AND VARIATION“. This architecture starts with low-resolution images such as 4×4 size and then add up the layers progressively to generate images of high resolution such as 1024×1024.

Traditional GANs were facing some real issues to generate images of high quality. Here I am listing some of the major problems:

Discriminator will be easily able to differentiate b/w real and fake if generated images are large.
Generating such high-quality images also requires large GPU memory due to higher computational cost.
Due to the high memory requirement if we take less batch size it will also make an unstable GAN model.
It was also difficult to produce both large and fine detailed images.

In contrast to these difficulties, progressive growing of GANs removed some obstacles coming in between creating high-quality images. Some of them are:

It reduces training time.
The model becomes more stable since we can train it with a mini-batch of efficient size.

**1024×1024 images generated from Progressive GAN architecture**

Generally, a generative adversarial network consists of two network generator and discriminator. The generator takes a latent vector as input and produces a generated image. And discriminator discriminates between these generated images with original as real vs fake. Training of this model will proceed until images generated from the generator are not able to fool the discriminator half the time. Similarly, progressive GAN architecture consists of both generator and discriminator networks where both networks are a mirror image of each other.

The Network Architecture

Both generator and discriminator starts with very small image 4×4. Original images are transformed into 4×4 size to train the model. Since these images are quite small training will be fast. Once we have fed enough 4×4 images to the discriminator network, we will progressively add up new layers for an 8×8 size image and similarly 16×16 until we reach image resolution to 1024×1024. Nearest neighbor interpolation and average pooling were used for doubling and halving the size of the image. The transition from a 4×4 network to an 8×8 network was done smoothly by fading in new layers.

1. Fading in new Layer

We will see this fading by using an example of the transition from 16×16 to 32×32 resolution images.

In this example, the current resolution of the image is 16×16. Firstly the model will be trained on 16×16 images. The original images are transformed into 16×16 size for training. After training it on a sufficient number of images we will progressively add new layers. In generator nearest neighbor filtering is used to upsample the image while in discriminator average pooling is applied to downsample the image size.

Now to increase the network progressively a residual block for 32×32 resolution is added. During training, this new block is not directly added but faded in. This block consists of two convolution layers and 1 upsampling layer in the generator network. While two convolution layer and one average pooling layer in discriminator network. The new block is multiplied with α and the previous(16×16) is multiplied with (1-α). This α value linearly increases from 0 to 1. Even after fading in of new layers all previous layers in the model will remain trainable.

Similarly, if you want to produce an image of higher resolution more layers will be added progressively. 1×1 convolution layer is added to the last layer in the generator to convert into an RGB image. Similarly, a 1×1 convolution layer is added at top of the discriminator network to get from the RGB image (or generated image).

During the training of progressive GAN, the network starts from 4×4 size and adds up layer progressively to reach the size of 1024×1024. Leaky relu is used for training the model. To train the model it took 4 days on 8 Tesla V100 GPUs.

2. Minibatch Standard Deviation

Generative adversarial networks has a tendency to capture only little variation from training data. Sometimes all input noise vectors generate similar looking images. This problem is also known as ‘mode collapse’. To add a little variation to generated images, authors of the progressive gans have used minibatch standard deviation.

Here standard deviation of each feature in the activation map is calculated and then averaged over the minibatch. Through this new activation, maps are created. These new activation maps are added at the end of the discriminator network.

3. Equalized Learning Rate

In this progressive GAN architecture, authors have not initialized weights carefully. But they are scaling weights dynamically at run time. Here wˆi = w_i/c, where wi are the weights and c is the per-layer normalization constant. In general, due to modern initializer, some parameters have larger dynamic range which causes them to converge later than some other parameters. This can cause both low and high learning rates at the same time. But the equalized learning rate ensures the learning rate the same for all weight parameters.

4. Pixel-wise Feature Vector Normalization in Generator

Generally in the generative adversarial network, batch normalization is used after the convolutional layer. But here in progressive GAN, feature vector in each pixel is normalized to unit length after the convolution layer. Also, this normalization is done only in the generator network, not in discriminator network. This technique prevents the escalation of signal magnitude effectively.

This new architecture with some interesting idea of minibatch standard deviation, equalized learning rate, fading in a new layer, and pixel-wise normalization has shown very promising results. With the help of progressively growing of GAN, the model is able to generate a high-quality image. Also, the training is quite stable. This GAN is able to generate high-resolution images with photo-realistic synthetic images.

Referenced Research Paper: Progressive Growing of GANs for Improved Quality, Stability, and Variation

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.