
An Introduction To The Progressive Growing of GANs

Generative adversarial networks are famous for generating images, but generating high-resolution images was quite difficult until the introduction of a new training methodology known as the progressive growing of GANs. The progressive growing GAN architecture was proposed by NVIDIA in a 2017 paper titled "Progressive Growing of GANs for Improved Quality, Stability, and Variation". Training starts with low-resolution images, such as 4×4, and then progressively adds layers to generate images of higher resolution, up to 1024×1024.

Traditional GANs faced some real issues when generating high-quality images. Here are some of the major problems:

  1. If the generated images are large, the discriminator can easily differentiate between real and fake.
  2. Generating such high-resolution images also requires large GPU memory due to the higher computational cost.
  3. Because of this high memory requirement, we have to use a smaller batch size, which makes the GAN model unstable.
  4. It was also difficult to produce images that are both large and finely detailed.

The progressive growing of GANs removes some of the obstacles that stood in the way of creating high-quality images. Some of its advantages are:

  1. It reduces training time.
  2. The model becomes more stable since we can train it with a mini-batch of reasonable size.

Generally, a generative adversarial network consists of two networks: a generator and a discriminator. The generator takes a latent vector as input and produces a generated image, and the discriminator discriminates between these generated images and the original ones as real vs. fake. Training proceeds until the images produced by the generator fool the discriminator about half the time. Similarly, the progressive GAN architecture consists of both generator and discriminator networks, where the two networks are mirror images of each other.

The Network Architecture

Both the generator and discriminator start with a very small image size of 4×4. The original images are downscaled to 4×4 to train the model, and since these images are quite small, training is fast. Once we have fed enough 4×4 images to the discriminator network, we progressively add new layers for 8×8 images, then 16×16, and so on until we reach a resolution of 1024×1024. Nearest-neighbor interpolation and average pooling are used for doubling and halving the image size, respectively. The transition from a 4×4 network to an 8×8 network is done smoothly by fading in the new layers.

1. Fading in New Layers

We will illustrate this fading using the example of a transition from 16×16 to 32×32 resolution images.

In this example, the current resolution of the image is 16×16. First, the model is trained on 16×16 images; the original images are downscaled to 16×16 for training. After training on a sufficient number of images, we progressively add new layers. In the generator, nearest-neighbor filtering is used to upsample the image, while in the discriminator, average pooling is applied to downsample it.

Now, to grow the network progressively, a block for 32×32 resolution is added. During training, this new block is not inserted abruptly but faded in. In the generator, the block consists of two convolution layers and one upsampling layer, while in the discriminator it consists of two convolution layers and one average-pooling layer. The output of the new block is multiplied by α and the output of the previous (16×16) path by (1−α), where α increases linearly from 0 to 1. Even after the new layers have been faded in, all previous layers in the model remain trainable.
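
To make the idea concrete, here is a minimal sketch of the fade-in blend in TensorFlow. The tensor names (`old_rgb` for the upsampled output of the existing 16×16 path, `new_rgb` for the output of the new 32×32 block) and the way α is supplied are illustrative assumptions, not the paper's reference code.

```python
import tensorflow as tf

def fade_in(old_rgb, new_rgb, alpha):
    """Blend the upsampled output of the old-resolution path with the output
    of the newly added higher-resolution block. alpha ramps linearly 0 -> 1."""
    return (1.0 - alpha) * old_rgb + alpha * new_rgb
```

During training, α is increased a little after every mini-batch until the new block is fully active.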

Similarly, if you want to produce images of higher resolution, more layers are added progressively. A 1×1 convolution layer is added after the last layer of the generator to convert the feature maps into an RGB image, and a 1×1 convolution layer is added at the top of the discriminator network to convert the RGB (or generated) image back into feature maps.

During training, the progressive GAN starts from a 4×4 size and adds layers progressively until it reaches 1024×1024. Leaky ReLU activations are used in the model. Training took 4 days on 8 Tesla V100 GPUs.


2. Minibatch Standard Deviation

Generative adversarial networks have a tendency to capture only a small part of the variation present in the training data; sometimes all input noise vectors generate similar-looking images. This problem is known as 'mode collapse'. To add more variation to the generated images, the authors of progressive GANs use minibatch standard deviation.

Here, the standard deviation of each feature in the activation maps is first computed across the minibatch and then averaged into a single statistic. This statistic is replicated into an extra feature map that is appended near the end of the discriminator network.
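
A rough sketch of such a minibatch standard deviation layer is shown below, assuming NHWC tensors in TensorFlow; this single-group variant is a simplification of the paper's layer.

```python
import tensorflow as tf

def minibatch_stddev(x, eps=1e-8):
    """x: activations of shape (batch, height, width, channels).
    Appends one extra feature map holding the batch-wide standard deviation
    statistic, so the discriminator can 'see' how varied the batch is."""
    mean = tf.reduce_mean(x, axis=0, keepdims=True)
    # Standard deviation of every feature over the mini-batch
    std = tf.sqrt(tf.reduce_mean(tf.square(x - mean), axis=0, keepdims=True) + eps)
    # Average all these standard deviations into a single scalar
    avg_std = tf.reduce_mean(std)
    # Replicate the scalar into one constant feature map and append it
    extra_map = tf.ones_like(x[:, :, :, :1]) * avg_std
    return tf.concat([x, extra_map], axis=-1)
```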

3. Equalized Learning Rate

In the progressive GAN architecture, the authors do not rely on careful weight initialization; instead, they scale the weights dynamically at run time as ŵi = wi/c, where wi are the weights and c is the per-layer normalization constant from He's initializer. In general, with modern initializers some parameters have a larger dynamic range, which causes them to converge later than others; this effectively creates both low and high learning rates at the same time. The equalized learning rate ensures that the learning speed is the same for all weight parameters.
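
Below is a sketch of a dense layer with runtime weight scaling, following the commonly used interpretation of this trick (initialize from N(0, 1), then multiply by the He constant c = sqrt(2 / fan_in) on every forward pass); the class name and exact API are my own, not the authors' code.

```python
import numpy as np
import tensorflow as tf

class EqualizedDense(tf.keras.layers.Layer):
    """Dense layer with equalized learning rate: weights start from N(0, 1)
    and are rescaled by the per-layer He constant on every call."""
    def __init__(self, units):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        fan_in = int(input_shape[-1])
        self.c = np.sqrt(2.0 / fan_in)   # per-layer normalization constant (He)
        self.w = self.add_weight(shape=(fan_in, self.units),
                                 initializer=tf.random_normal_initializer(0.0, 1.0),
                                 trainable=True)
        self.b = self.add_weight(shape=(self.units,),
                                 initializer="zeros", trainable=True)

    def call(self, x):
        # Scaling at runtime (instead of at initialization) keeps the effective
        # update size comparable for all weights under Adam-style optimizers.
        return tf.matmul(x, self.w * self.c) + self.b
```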

4. Pixel-wise Feature Vector Normalization in Generator

Generally, in generative adversarial networks, batch normalization is used after the convolutional layers. In progressive GAN, however, the feature vector at each pixel is normalized to unit length after the convolution layers. This normalization is done only in the generator network, not in the discriminator. The technique effectively prevents the escalation of signal magnitudes.
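
Pixel-wise feature vector normalization is simple to express; here is a sketch assuming NHWC tensors:

```python
import tensorflow as tf

def pixel_norm(x, eps=1e-8):
    """Normalize the feature vector at every pixel to (approximately) unit length.
    x: (batch, height, width, channels); the mean is taken over the channel axis."""
    return x / tf.sqrt(tf.reduce_mean(tf.square(x), axis=-1, keepdims=True) + eps)
```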

This new architecture, with ideas such as minibatch standard deviation, equalized learning rate, fading in new layers, and pixel-wise normalization, has shown very promising results. With the help of progressive growing, the model is able to generate high-resolution, photo-realistic images, and training is quite stable.

Referenced Research Paper: Progressive Growing of GANs for Improved Quality, Stability, and Variation

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Style Generative Adversarial Network (StyleGAN)

A generative adversarial network (GAN) generates synthetic images that are indistinguishable from authentic images. A GAN consists of a generator network and a discriminator network. The generator tries to generate new images from a noise vector, and the discriminator tries to distinguish these generated images from the original dataset. While training the GAN model, the generator tries to fool the discriminator, and the discriminator tries to get better at differentiating between real and fake images. Training continues until the discriminator is fooled about half the time, meaning the generator is producing data close to the original data distribution.

Since the introduction of generative adversarial networks in 2014, there have been many improvements in their architecture: deep convolutional GAN, semi-supervised GAN, conditional GAN, CycleGAN and many more. These GAN variants mainly focus on improving the discriminator architecture, while the generator continues to operate as a black box.

The style-based generative adversarial network proposes an alternative generator architecture that can control specific features of the output image, such as pose, identity, hair, and freckles (when trained on a face dataset), without compromising image quality.

Baseline Architecture

The baseline architecture for StyleGAN is taken from another recently introduced GAN variant: Progressive GAN. In Progressive GAN, both the generator and the discriminator grow progressively: starting from a low resolution, layers that capture increasingly fine details are added to the model. Images start at 4×4 and grow up to 1024×1024. This progressively growing architecture speeds up and stabilizes training, which helps in generating such high-quality images.

StyleGAN Architecture

Progressive GAN is able to generate high-quality images, but controlling specific features of the generated image is difficult with its architecture. To control the features of the output image, some changes were made to Progressive GAN's generator architecture, and StyleGAN was created. Here is the architecture of the StyleGAN generator.

Along with the generator's architecture, the figure above also contrasts a traditional generator network with the style-based generator network. To build StyleGAN's generator, several modifications were made to Progressive GAN. We will discuss these modifications one by one.

1. Removal of Traditional Input Layer

In traditional generator networks, the latent vector is provided through an input layer. This latent vector must follow the probability density of the training data, which can lead to some degree of entanglement. For example, if the training data contains more images of one type than of other variations, the generator is more likely to produce images with features related to that over-represented type. So instead of a traditional input layer, the synthesis network (generator network) starts with a learned 4×4×512 constant tensor.

2. Mapping Network and AdaIN

The mapping network embeds the input latent code into an intermediate latent space, which is used as styles incorporated at each block of the synthesis network. As you can see in the generator architecture above, the latent code is fed through 8 fully connected layers to produce the intermediate latent space W.

This intermediate latent vector is passed through a learned affine transformation "A" (shown in the architecture) and specialized into styles y = (ys, yb) that are incorporated into each block of the generator network. To do so, the feature maps (xi) of each block are first normalized separately and then scaled and biased using the corresponding styles. This operation is known as adaptive instance normalization (AdaIN).

This AdaIN operation is added to each block of the generator network and helps decide which features are expressed in the output image.
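
Here is a small sketch of the AdaIN operation, assuming NHWC tensors and that the style scale and bias (ys, yb) have already been produced by the learned affine transform; the names are illustrative.

```python
import tensorflow as tf

def adain(x, y_scale, y_bias, eps=1e-8):
    """Adaptive instance normalization.
    x:       feature maps, shape (batch, height, width, channels)
    y_scale: style scale ys, shape (batch, channels)
    y_bias:  style bias  yb, shape (batch, channels)"""
    # Per-sample, per-channel mean and variance over the spatial dimensions
    mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
    x_norm = (x - mean) / tf.sqrt(var + eps)
    # Scale and shift each normalized feature map with the style
    return y_scale[:, None, None, :] * x_norm + y_bias[:, None, None, :]
```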

3. Bilinear Upsampling

The generator network grows progressively. Usually, upsampling in a generator network is done with transposed convolutions, but StyleGAN uses bilinear upsampling to upsample the image instead of transposed convolution layers.

4. Noise Layers

As you can see in the StyleGAN architecture, noise layers are added after each block of the generator network (synthesis network). This noise consists of uncorrelated Gaussian noise that is first broadcast, using a learned scaling "B", to the shape of the feature maps of each convolutional block. By adding this noise, StyleGAN can introduce stochastic variation into the output.

There are many stochastic features in a human face, such as hair, stubble, freckles, or skin pores. In a traditional generator there is only a single source of noise, the input vector, to produce these stochastic variations, which is not very effective. Adding noise at each block of the synthesis network ensures that the noise only affects the stochastic aspects of the face.
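
A sketch of such a noise layer is shown below: a single-channel Gaussian noise map is scaled by learned per-channel factors (playing the role of the "B" block) and added to the features. The layer name and zero initialization are assumptions.

```python
import tensorflow as tf

class NoiseInjection(tf.keras.layers.Layer):
    """Adds per-pixel Gaussian noise, scaled per channel by learned weights."""
    def build(self, input_shape):
        channels = int(input_shape[-1])
        self.scale = self.add_weight(shape=(channels,),
                                     initializer="zeros", trainable=True)

    def call(self, x):
        shape = tf.shape(x)
        # One noise value per spatial position, broadcast across channels
        noise = tf.random.normal([shape[0], shape[1], shape[2], 1])
        return x + noise * self.scale
```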

5. Style Mixing

This is basically a regularization technique. During training, images are generated using two latent codes: two latent codes z1 and z2 are mapped to styles w1 and w2 using the mapping network. A split point in the synthesis network is selected, the w1 style is applied up to that point and the w2 style after it, and the network is trained in this way.

In the synthesis network these styles are added at each block, so the network could assume that adjacent styles are correlated. Style mixing prevents the network from making this assumption.
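
A toy sketch of how the per-block style list could be assembled during style-mixing regularization (the random crossover rule here is an illustrative assumption):

```python
import numpy as np

def mixed_styles(w1, w2, num_blocks):
    """Return one style vector per synthesis block: w1 before a random
    crossover point, w2 after it."""
    crossover = np.random.randint(1, num_blocks)
    return [w1 if i < crossover else w2 for i in range(num_blocks)]
```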


These are the main changes made to the baseline architecture to create StyleGAN. Other things, such as the discriminator architecture, mini-batch sizes, Adam hyperparameters, and the exponential moving average of the generator weights, are the same as in the baseline architecture.

Summary

StyleGAN has proven to be promising at producing high-quality realistic images while also giving control over particular features of the generated images. Traditional generators clearly lag far behind this improved generator network. Concepts like the mapping network and AdaIN can be very helpful in other GAN architectures and research work.

Referenced Research Paper: 1. A Style-Based Generator Architecture for Generative Adversarial Networks 2. Progressive Growing of GANs for Improved Quality, Stability, and Variation

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Cycle-Consistent Generative Adversarial Networks (CycleGAN)

In this blog, we will learn how to perform image-to-image translation using CycleGAN. Image-to-image translation is a type of computer vision problem where an image is transformed from one domain to another domain, say from edges to a photo.

Image-to-image translation generally requires a paired set of images to train a model. We can perform this type of translation using conditional GANs, but in those cases a paired set of images is required. Take a look at a paired set of images for translating edges to a photo:

But in many cases, collecting a paired set of training images is quite difficult. Say we want an object transfiguration model that translates an image of a horse into an image of a zebra and vice versa.

For these types of tasks the desired output is not even well defined, so how can we collect a paired set of images? To solve this problem, the authors proposed an approach called CycleGAN that transfers an image from the X domain to the Y domain without a paired set of examples.

Cycle Consistent GAN

A CycleGAN captures the special characteristics of one image domain and figures out how these characteristics can be translated into another image domain, all without paired training examples. Let's look at some unpaired training data.

Problem with these translations: In the case of paired training examples, the network is supervised with corresponding label images. With an unpaired training dataset, we can only supervise at the set level, where the sets are the X domain and the Y domain. To train such a network we need to find a mapping G: X → Y such that the outputs of G(X) are indistinguishable from the Y domain. The space of possible mappings G is infinite, which does not guarantee meaningful input-output image pairs. Such a network can also suffer from mode collapse, which occurs when all input images map to the same output image.

Cycle consistent: To cope with the problem stated above, the authors of the paper propose that the translation should be "cycle consistent". For example, if we translate an English sentence into a French sentence and then translate it back into English, we should arrive at the original sentence. Similarly, for images, if we translate an image from the X domain to the Y domain using a mapping G and then translate G(X) back to X using a mapping F, we should arrive back at the original image.

So CycleGAN consists of two GAN networks, each with a generator and a discriminator. To train the network, it uses two adversarial losses and one cycle consistency loss. Let's see its mathematical formulation.

Mathematical Formulation of CycleGAN

Say we have two image domains X and Y. Our model includes two mappings, G: X → Y and F: Y → X, along with two adversarial discriminators DX and DY. DX discriminates between F(Y) and real X-domain images, and DY discriminates between G(X) and real Y-domain images. We also have a cycle consistency loss to prevent the learned mappings G and F from contradicting each other.

In figure (a) above, you can see the two mappings G and F. Figures (b) and (c) define the forward cycle consistency loss ( x → G(x) → F(G(x)) ≈ x ) and the backward cycle consistency loss ( y → F(y) → G(F(y)) ≈ y ) respectively.

Network Architecture

There are two different architectures, one for the generator and one for the discriminator network.

The generator network follows an encoder-decoder architecture with three main parts:

  1. Encoder
  2. Transformer
  3. Decoder

The encoder consists of three convolutional layers. An input image is passed through this encoder network and a feature volume is produced as output. The transformer consists of 6 residual blocks; it takes the feature volume generated by the encoder as input and transforms it. Finally, the decoder, which works as a stack of deconvolutional layers, takes the output of the transformer and generates a new image.
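
A compact Keras sketch of this encoder-transformer-decoder generator is given below. The filter counts, strides, and the use of plain ReLU instead of instance normalization are simplifications of my own; the original implementation differs in such details.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=256):
    """Two 3x3 convolutions with a skip connection (one transformer block)."""
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.add([x, y])

def build_generator(img_shape=(128, 128, 3), n_res_blocks=6):
    inp = layers.Input(img_shape)
    # Encoder: three convolutional layers extract a feature volume
    x = layers.Conv2D(64, 7, padding="same", activation="relu")(inp)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(x)
    # Transformer: six residual blocks
    for _ in range(n_res_blocks):
        x = residual_block(x)
    # Decoder: transposed convolutions bring the features back to image size
    x = layers.Conv2DTranspose(128, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2D(3, 7, padding="same", activation="tanh")(x)
    return tf.keras.Model(inp, out)
```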

The discriminator network is a simple network. It takes an image as input and predicts whether it belongs to the real dataset or is a fake generated image.


This discriminator network is basically a PatchGAN. A PatchGAN is a simple convolutional network; the only difference is that instead of mapping the input image to a single scalar output, it maps the input image to an N×N array of outputs. Each element of the N×N output corresponds to a patch of the input image; in CycleGAN, it corresponds to a 70×70 patch. Finally, we take the mean of this output and optimize it to decide whether the image is real or fake. The advantage of a PatchGAN over a normal GAN discriminator is that it has fewer parameters and can work with arbitrarily sized images.
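
A sketch of such a PatchGAN discriminator in Keras is shown below. The layer widths follow the common C64-C128-C256-C512 pattern and normalization layers are omitted for brevity, so treat this as an outline rather than the reference implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_patch_discriminator(img_shape=(256, 256, 3)):
    """Convolutional discriminator whose output is an NxN grid of scores;
    with this layer layout each score sees roughly a 70x70 input patch."""
    inp = layers.Input(img_shape)
    x = inp
    for filters, stride in ((64, 2), (128, 2), (256, 2), (512, 1)):
        x = layers.Conv2D(filters, 4, strides=stride, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    out = layers.Conv2D(1, 4, strides=1, padding="same")(x)  # patch scores, averaged later
    return tf.keras.Model(inp, out)
```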

Loss Function

An adversarial loss is applied to both mappings G and F through the discriminators DX and DY. These adversarial losses make sure that the model learns to generate data indistinguishable from real data in both image domains.

Adversarial losses alone cannot guarantee that the learned functions map an individual input x to the desired output y, so we also need a cycle consistency loss. The cycle consistency loss makes sure that the image translation cycle brings x back to the original image, i.e., x → G(x) → F(G(x)) ≈ x. The full loss can now be written as follows:

L(G, F, DX, DY) = LGAN(G, DY, X, Y) + LGAN(F, DX, Y, X) + λ Lcyc(G, F)

The first two terms in the loss function are the adversarial losses for the two mappings, and the last term is the cycle consistency loss. Here λ controls the relative importance of the cycle consistency loss; the authors set it to 10.
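
Putting the pieces together, here is a hedged sketch of the generator-side objective for one batch, using a least-squares adversarial loss (as the paper does) and L1 for the cycle term; the function boundaries and names are my own.

```python
import tensorflow as tf

mse = tf.keras.losses.MeanSquaredError()   # least-squares adversarial loss
mae = tf.keras.losses.MeanAbsoluteError()  # L1 for cycle consistency

def generator_objective(real_x, real_y, G, F, D_X, D_Y, lam=10.0):
    fake_y = G(real_x)   # X -> Y
    fake_x = F(real_y)   # Y -> X
    d_y_fake = D_Y(fake_y)
    d_x_fake = D_X(fake_x)
    # Adversarial terms: the generators want the discriminators to say "real" (1)
    adv_g = mse(tf.ones_like(d_y_fake), d_y_fake)
    adv_f = mse(tf.ones_like(d_x_fake), d_x_fake)
    # Cycle consistency: F(G(x)) should recover x, and G(F(y)) should recover y
    cycle = mae(real_x, F(fake_y)) + mae(real_y, G(fake_x))
    return adv_g + adv_f + lam * cycle
```

The discriminators are trained in a separate step with the usual real-vs-fake terms.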

CycleGAN has produced compelling results in many cases, but it also has some limitations. That's all for this CycleGAN introduction; in the next blog we will implement the algorithm in Keras.

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks

Image to Image Translation Using Conditional GAN

Image-to-image translation is a well-known problem in the fields of image processing, computer graphics, and computer vision. Some examples are converting labels to street scenes, labels to facades, black-and-white photos to color photos, aerial images to maps, day to night, and edges to photos. Take a look at these conversions:

Earlier, each of these tasks was performed separately, but with the help of convolutional neural networks (CNNs) the community has taken big steps in this field. With CNNs most of the work becomes automatic, as we train the model in an end-to-end fashion. But we still need to define a loss function that captures the target we want. Most of us take the loss function lightly, but it is the most important thing to pay attention to when training deep learning models. For instance, if we take Euclidean distance as our loss function for image-to-image translation, it produces blurred images because it minimizes the loss by averaging all plausible outputs. Thus we need a meaningful loss function for each task, and designing one is always painful. This is where the generative adversarial network (GAN) comes in.

GANs learn a loss that tries to classify if the output image is real or fake, while simultaneously training a generative model to minimize this loss. Blurry images will not be tolerated since they look obviously fake. Because GANs learn a loss that adapts to the data, they can be applied to a multitude of tasks that traditionally would require very different kinds of loss functions.

Now, with the help of GANs, we can generate realistic-looking images. But in image-to-image translation, we do not just want to generate a realistic-looking image; the output image should also be a translation of the input image. To perform this type of task we need a conditional GAN, so you should understand conditional GANs before moving forward (to learn about conditional GANs in detail, you can follow this blog).

In image-to-image translation with a conditional GAN, the generator is provided with both an input image and a noise vector. The generator then produces an image that is translated from the input image and is indistinguishable from the original data (so the discriminator is fooled). To train this model we need paired training examples, as shown below:

Network Architecture

Here the network architecture consists of two models, the generator and the discriminator. First, take a look at the generator model.

Generally, a generator network in a GAN architecture takes a noise vector as input and generates an image as output. Here, however, the input consists of both a noise vector and an image, so the network takes an image as input and produces an image as output. For this type of problem, an encoder-decoder model is generally used.

In an encoder-decoder network, the input is first down-sampled to a bottleneck layer and then up-sampled to generate an image again. In our image-to-image translation problem, the input and output differ in surface appearance but share the same underlying structure. To exploit this, low-level information is shared between the input and output by adding skip connections, which turns the encoder-decoder into a U-Net architecture, as shown in the figure above.
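
A shallow Keras sketch of this U-Net idea is shown below: every encoder feature map is concatenated with the mirrored decoder layer so that low-level structure can bypass the bottleneck. The depth and filter counts are reduced here for brevity, and normalization layers are omitted.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_unet_generator(img_shape=(256, 256, 3)):
    inp = layers.Input(img_shape)
    # Encoder (downsampling)
    d1 = layers.Conv2D(64, 4, strides=2, padding="same")(inp)                        # 128x128
    d2 = layers.Conv2D(128, 4, strides=2, padding="same")(layers.LeakyReLU(0.2)(d1))  # 64x64
    d3 = layers.Conv2D(256, 4, strides=2, padding="same")(layers.LeakyReLU(0.2)(d2))  # 32x32 (bottleneck)
    # Decoder (upsampling) with skip connections
    u1 = layers.Conv2DTranspose(128, 4, strides=2, padding="same", activation="relu")(d3)
    u1 = layers.Concatenate()([u1, d2])
    u2 = layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu")(u1)
    u2 = layers.Concatenate()([u2, d1])
    out = layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="tanh")(u2)
    return tf.keras.Model(inp, out)
```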

Here the discriminator model is a PatchGAN. A PatchGAN is nothing but a convolutional network; the only difference is that instead of mapping an input image to a single scalar, it maps it to an N×N array, where each element of the array corresponds to a patch of the input image. Finally, the outputs are averaged to decide whether the full input image is real or fake.

Reason for using a PatchGAN: The generator model is trained using both the discriminator loss and an L1 loss. It is well known that L1 losses produce blurry images: they fail to capture high frequencies, although in many cases they capture low frequencies well. So the discriminator's task is only to capture the high frequencies. Restricting the discriminator's attention to local image patches using a PatchGAN clearly helps in capturing the high frequencies in the image.

Loss Function

Generally, the loss function for a conditional GAN can be stated as follows:

LcGAN(G, D) = Ex,y[ log D(x, y) ] + Ex,z[ log(1 − D(x, G(x, z))) ]

Here the generator G tries to minimize this loss function while the discriminator D tries to maximize it. In the paper, the authors couple it with an L1 loss so that the generator's task is not only to fool the discriminator but also to generate images close to the ground truth. So the final loss function is:

G* = arg minG maxD LcGAN(G, D) + λ LL1(G)
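
A small sketch of this combined objective from the generator's point of view, assuming the discriminator outputs logits; λ weights the L1 term (the paper uses λ = 100).

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
mae = tf.keras.losses.MeanAbsoluteError()

def pix2pix_generator_loss(disc_fake_logits, generated, target, lam=100.0):
    """cGAN term (fool the discriminator) plus L1 term (stay near the ground truth)."""
    adversarial = bce(tf.ones_like(disc_fake_logits), disc_fake_logits)
    l1 = mae(target, generated)
    return adversarial + lam * l1
```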

The paper suggests that this is a really promising approach for many image-to-image translation tasks, but it always requires a paired training dataset, which is sometimes difficult to get. That's all for this blog; in the next blog we will implement its application (pix2pix) using Keras.

Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: Image-to-Image Translation with Conditional Adversarial Networks

Efficient and Accurate Scene Text Detector (EAST)

Before the introduction of deep learning in the field of text detection, it was difficult for most text segmentation approaches to perform well in challenging scenarios. Conventional approaches use manually designed features, while deep learning methods learn effective features from training data. These conventional approaches are usually multi-staged, which tends to result in lower overall performance. In this blog, we will learn about a deep learning-based algorithm (EAST) that detects text with a single neural network, eliminating the multi-stage pipeline.

Introduction

The EAST algorithm uses a single neural network to predict word- or line-level text. It can detect text of arbitrary orientation with quadrilateral shapes, and in 2017 it outperformed state-of-the-art methods. The algorithm consists of a fully convolutional network followed by a non-max suppression (NMS) merging stage. The fully convolutional network is used to localize text in the image, and the NMS stage merges many imprecise detected text boxes into a single bounding box for every text region (word or line of text).

EAST Network Architecture

The EAST architecture was designed with different sizes of word regions in mind: detecting large word regions requires features from the later stages of the neural network, while detecting small word regions requires low-level features from the initial stages. To achieve this, the authors combined three branches into a single neural network.

EAST

1. Feature Extractor Stem

This branch of the network is used to extract features from different layers of the network. The stem can be a convolutional network pretrained on the ImageNet dataset; the authors of EAST experimented with both PVANet and VGG16. In this blog, we will look at the EAST architecture with the VGG16 network only. Let's see the architecture of the VGG16 model.

VGG16

For the stem of the architecture, the outputs of the VGG16 model after the pool2, pool3, pool4, and pool5 layers are taken.

2. Feature Merging Branch

In this branch, the EAST network merges the feature outputs from different layers of the VGG16 network. The input image is passed through the VGG16 model and the outputs from four different layers are taken. Merging these feature maps directly would be computationally expensive, so EAST uses a U-Net-like structure to merge them gradually (see the EAST architecture figure). First, the output of the pool5 layer is upsampled so that its size matches the output of the pool4 layer, and the two are merged into one layer. Then 1×1 and 3×3 convolutions are applied to fuse the information and produce the output of this merging stage.

Similarly, outputs from the other layers of the VGG16 model are merged, and finally a 3×3 convolution layer is applied to produce the final feature map before the output layer.
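
One merging step could be sketched as follows; the exact upsampling operator and channel widths here are assumptions (the paper describes unpooling followed by 1×1 and 3×3 convolutions), so treat this as an outline rather than the reference implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def merge_stage(deeper, skip, filters):
    """Upsample the deeper feature map, concatenate it with a VGG16 skip
    feature map, then fuse with 1x1 and 3x3 convolutions."""
    x = layers.UpSampling2D(size=2)(deeper)
    x = layers.Concatenate()([x, skip])
    x = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

# Illustrative usage with feature maps taken from VGG16:
# h = pool5
# for skip, filters in zip([pool4, pool3, pool2], [128, 64, 32]):
#     h = merge_stage(h, skip, filters)
# merged = layers.Conv2D(32, 3, padding="same", activation="relu")(h)
```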

3. Output Layer

The output layer consists of a score map and a geometry map. The score map gives the probability of text being present at each location, while the geometry map defines the boundary of the text box. The geometry can be either a rotated box (RBOX) or a quadrangle (QUAD): the rotated box is described by an axis-aligned box (the distances from a location to the box edges) plus a rotation angle, while the quadrangle is described by the coordinates of all four corners of the text box.

Note: For more details on Optical Character Recognition, please refer to the Mastering OCR using Deep Learning and OpenCV-Python course.

Loss Function

The loss function used in the EAST algorithm is a combination of the score map loss and the geometry loss: L = Ls + λg Lg.

As you can see in the formula above, the two losses are combined with a weight λg, which controls the importance of the geometry loss; in the EAST paper, the authors set it to 1.

Non-max Suppression Merging Stage

The geometries predicted by the fully convolutional network are first filtered by a threshold, and the remaining geometries are merged using locality-aware NMS. A naive NMS runs in O(n²); to bring this closer to O(n), the authors adopt a method that merges geometries row by row, iteratively merging the current geometry with the last merged one. This makes the algorithm fast in most cases, although the worst-case time complexity is still O(n²).

This was all about the Efficient and Accurate Scene Text algorithm. In the next blog, we will implement this algorithm using its GitHub Repository. Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: EAST: An Efficient and Accurate Scene Text Detector

Connectionist Text Proposal Network (CTPN)

Nowadays, thousands of organizations worldwide rely on optical character recognition (OCR) systems to extract machine-readable text from printed paper documents. These OCR systems are widely used in applications such as ID card reading, automatic data entry from documents, number plate recognition from vehicles, etc.

Text localization is an important aspect of building such OCR systems. In this blog, we will learn a deep learning algorithm to localize text in an image.

Introduction

The CTPN algorithm refers to the connectionist text proposal network. This name was given to the algorithm because it detects text lines as a sequence of fine-scale text proposals. If you are wondering what these fine-scale text proposals are, don't worry, we will discuss them in detail later in this blog. CTPN is an end-to-end trainable deep learning model, and it is also really helpful for localizing extremely ambiguous text.

There are many problems associated with text localization in natural scene images, such as a highly cluttered background, large variance in text patterns, occlusions, distortion, and orientation of the text.

Researchers have been working for many years to overcome these challenges. There are two basic approaches: the conventional approach and the modern deep learning approach, which includes the CTPN algorithm.

The conventional approaches consist of a multi-stage pipeline and basically follow a bottom-up approach: they start with low-level character detection and then go through multiple stages such as non-text component filtering, text line construction, and verification. These approaches rely heavily on every stage of their pipeline, whereas with deep learning we can collapse these multiple stages into an end-to-end trainable model.

Researchers have also tried to use object detection algorithms like Faster R-CNN to detect text in an image, but these algorithms are difficult to apply to scene text detection because text requires more accurate localization.

CTPN Algorithm

Now we will look into the CTPN algorithm in detail. First, we will list all the stages of the CTPN network architecture (a rough code sketch follows the list) and then look at each of them in detail.

  1. First, the input image is passed through a pretrained VGG16 model (trained on the ImageNet dataset).
  2. The feature output from the last convolutional maps of the VGG16 model is taken.
  3. These outputs are passed through a 3×3 spatial window.
  4. The outputs of the 3×3 spatial window are then passed through a 256-D bi-directional Recurrent Neural Network (RNN).
  5. The recurrent output is then fed to a 512-D fully connected layer.
  6. Finally comes the output layer, which consists of 3 different outputs: 2k vertical coordinates, 2k text/non-text scores, and k side-refinement values.
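
Putting these stages together, here is a rough Keras sketch of the CTPN head on top of VGG16's last conv feature map. Flattening all rows into one long sequence and using a GRU are simplifications of my own (the paper processes each row with a bi-directional LSTM), and it assumes the spatial size of the feature map is known in advance.

```python
import tensorflow as tf
from tensorflow.keras import layers

def ctpn_head(conv_features, k=10):
    """conv_features: (batch, H, W, 512) output of VGG16's last conv block,
    with H and W known in advance. k anchors per spatial position."""
    h, w = conv_features.shape[1], conv_features.shape[2]
    x = layers.Conv2D(512, 3, padding="same", activation="relu")(conv_features)  # 3x3 spatial window
    x = layers.Reshape((h * w, 512))(x)                                  # rows flattened into one sequence (simplification)
    x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)  # 256-D recurrent output
    x = layers.Dense(512, activation="relu")(x)                          # 512-D fully connected layer
    x = layers.Reshape((h, w, 512))(x)
    vertical = layers.Conv2D(2 * k, 1)(x)   # 2k vertical coordinates (y-center, height)
    scores = layers.Conv2D(2 * k, 1)(x)     # 2k text / non-text scores
    side = layers.Conv2D(k, 1)(x)           # k side-refinement offsets
    return vertical, scores, side
```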

VGG Network

CTPN uses the pretrained VGG16 model shown above and takes the output from its last convolutional maps. The size of this output feature map depends on the size of the input image. During training of the CTPN model, the parameters of the first two convolutional blocks are fixed and the rest are trained.

3×3 Spatial Window and Recurrent Layer

A small 3×3 spatial window is slid over the outputs of the VGG network to extract useful features. Since textual data can be treated as sequential data, it is beneficial to use a recurrent neural network here. After that, a fully connected layer is used to produce the output layer.

Output Layer

The first output consists of 2k vertical coordinates, where k is the number of anchor boxes. For each anchor box, the output contains the y coordinate of the center of the box and the height of the box. These anchor boxes are fine-scale text proposals whose width is fixed at 16 pixels, as shown in the diagram.

A total of 10 anchor boxes are taken whose heights vary from 11 to 273 pixels.

The second output consists of 2k text/non-text scores. For each anchor box, the output layer contains two scores classifying it as foreground (text) or background, i.e., whether the anchor is positive or negative. Positive and negative anchors are decided on the basis of their IoU overlap with the ground-truth box.

The third output consists of k side-refinement values. In CTPN the width of each fine-scale text proposal is fixed at 16 pixels, which can be problematic in cases where some side proposals are discarded due to a low score. So the output layer also predicts side-refinement values that adjust the proposals along the x-axis.

Now you should have some intuition about the CTPN network. In the next blog, we will implement the CTPN algorithm from its GitHub code. Hope you enjoy reading.

If you have any doubts/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Referenced Research Paper: Detecting Text in Natural Image with Connectionist Text Proposal Network

DensePose

Recently, Facebook researchers released a paper named "DensePose: Dense Human Pose Estimation in the Wild", which establishes dense correspondences from a 2D RGB image to a 3D surface of the human body, even in the presence of background, occlusions, and scale variations.

DensePose can be understood as a group of problems: object detection, pose estimation, and part and instance segmentation. This task can be applied to problems that require 3D understanding, such as:

  1. Graphics
  2. Augmented Reality
  3. Human-computer interaction
  4. General 3D based object understanding

Until now these tasks have been tackled using depth sensors, which are quite costly. Several other works aiming at dense correspondence use pairs or sets of images. The DensePose method, however, requires only a single RGB image as input and focuses on the most important visual category: humans.

For image-to-surface mapping of humans, recent studies used a two-stage method: first detecting body joints with a CNN and then fitting a deformable surface model such as SMPL. DensePose instead uses an end-to-end supervised learning method, made possible by collecting ground-truth correspondences between RGB images and a parametric surface model.

DensePose is inspired by the DenseReg framework, which focused mainly on faces. DensePose focuses on the full human body, which has its own challenges due to the variation in poses and the high flexibility and complexity of the body. To address these problems, the authors designed a suitable architecture based on Mask R-CNN.

This method consists of three stages:

  1. Manually collecting a ground-truth dataset.
  2. Training CNN-based models on the collected dataset to predict dense correspondences.
  3. In-painting the constructed ground truth with a teacher network for better performance.

1. Collection of Dataset

Until now, no manually collected dataset existed for dense correspondences on real images. The authors introduce the COCO-DensePose dataset, with annotations for 50K humans comprising more than 5 million manually annotated correspondences.

In this task, human annotators are asked to annotate 2D images onto the 3D surface. Annotating directly onto a 3D surface model would be cumbersome and very frustrating for annotators, so the authors adopted a two-stage annotation pipeline and afterwards measured the accuracy of the human annotators.

In the first stage, the annotators delineate visible body parts such as the head, torso, legs, arms, hands, and feet. These parts are designed to be isomorphic to a plane.

To simplify the annotation, the authors divide the full body into 24 parts by flattening out the body as shown below.

In the second stage, the authors use k-means to sample a maximum of 14 points on each part. To further simplify the task, annotators are provided with six pre-rendered views of the same body part and asked to annotate the point in the most suitable view; the surface coordinates of the annotated point are then used to mark it on the remaining views. See the figure below:

Accuracy of Human Annotators

Here, annotator accuracy is measured on synthetic data. To calculate the accuracy, the authors compare the geodesic distance between the true position generated from the synthetic data and the position estimated by the annotators when bringing the synthesized image into correspondence with the 3D surface. The geodesic distance is the length of the shortest path between two vertices on the surface. The authors consider two types of evaluation measures for annotator accuracy.

Pointwise evaluation: In this approach, the geodesic distance is used as a threshold for deciding the ratio of correct points. Varying the threshold gives a curve f(t) whose area under the curve summarizes the correspondence accuracy.

Per-instance evaluation: For this type of evaluation, the authors introduce a geodesic point similarity (GPS) formula.

Using this formula, the GPS is calculated for every person instance in the image. Once the GPS matching score is calculated, average precision and average recall are computed with GPS thresholds ranging between 0.5 and 0.95.

After performing these evaluations, it was found that annotation errors are larger on the back and front of the torso and smaller on the head and hands.

2. Using CNN based architectures on Generated Dataset

After generating the ground-truth dataset, it is time to train a deep neural network to predict the dense correspondences. The authors experimented with both a fully convolutional network and a region-based network (like Mask R-CNN), and found the latter superior. They combined the DenseReg architecture with Mask R-CNN and introduced DensePose-RCNN.

Fully Convolution Network:

In this method, a classification task and a regression task are combined. In the classification task, each pixel is classified as either background or one of the body parts. Since the full body is divided into 24 parts, this is a 25-class classification (24 parts plus one background class).

Here c* is the class with the highest probability in the classification task. After classifying which part a pixel belongs to, regression is performed to find the U, V parameterization of its exact point on the 3D surface model. The regression is divided into 24 separate regression tasks because each of the 24 body parts is treated independently with its own local coordinates.
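
A minimal sketch of such fully convolutional heads is shown below. The actual head design in DensePose-RCNN differs; this only illustrates the 25-way classification plus 24 independent U, V regressions described above, with names of my own choosing.

```python
import tensorflow as tf
from tensorflow.keras import layers

def densepose_heads(features, num_parts=24):
    """features: a shared feature map. Returns part classification logits and
    per-part U, V coordinate regressions."""
    part_logits = layers.Conv2D(num_parts + 1, 1)(features)              # 24 parts + background
    u_map = layers.Conv2D(num_parts, 1, activation="sigmoid")(features)  # U coordinate per part
    v_map = layers.Conv2D(num_parts, 1, activation="sigmoid")(features)  # V coordinate per part
    return part_logits, u_map, v_map
```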

Region Based Network:

After experimenting with the fully convolutional network, the authors found it not as fruitful as a region-based network. For the region-based network, they used exactly the same architecture as Mask R-CNN up to the ROIAlign layer and then used a fully convolutional network for regression and classification, the same as in DenseReg. This architecture can run at 25 fps for 320×240 images and at 5 fps for 800×1100 images.

3. Ground-truth Interpolation using In-painting Network

During annotation, the annotators labeled only a sparse set of around 100-150 pixels per training sample. This does not hamper training, because pixels that are not annotated are simply excluded from the per-pixel loss. However, the authors found improved performance when the annotations of the remaining pixels were interpolated from the annotated ones, as shown below:

To do this, the authors trained a teacher network. They only interpolate points on humans and ignore the background, to reduce background errors.

The things to note from this paper are the large-scale dataset of ground-truth image-to-surface correspondences and the combination of Mask R-CNN with the DenseReg architecture to predict surface correspondences. This paper can pave the way for further research and development in this field and can be a boon to the 3D modelling field.

Referenced Research paper: DensePose: Dense Human Pose Estimation in the Wild

Github : DensePose

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

EAT-NAS: Elastic Architecture Transfer for Neural Architecture Search

Recently, Jiemin Fang et al. published a paper that introduces a method to accelerate neural architecture search, named "elastic architecture transfer for accelerating large-scale neural architecture search". In this blog we will learn what neural architecture search is, what its limitations are, and how this paper overcomes those limitations.

Neural Architecture Search

Neural architecture search, as its name suggests, is a method to automatically search for the best network architecture for a given problem. If you have worked with neural networks, you may have encountered the problem of selecting the best hyperparameters for the network, i.e., which optimizer to select, what learning rate to use, how many layers to add, and so on. To solve this problem, different methods have emerged, such as evolutionary search, reinforcement learning, and gradient-based optimization.

A neural architecture search method is not fully automated, as it relies on a human-designed architecture as its starting point. These methods consist of three components:

  1. Search space: a well-designed space in which the method searches for the best parameters.
  2. Search method: which method to use, such as reinforcement learning, evolutionary search, etc.
  3. Evaluation strategy: which metric is used to find the best architecture model.

Problem: Any neural architecture search method requires a large amount of computation. Even with recent advancements in this field, it still requires many GPU-days to find the best architecture.

EAT-NAS

To reduce the computational cost, current studies first search for architectures on small datasets and then apply them directly to large datasets. But applying architectures searched on a small dataset directly to a large dataset does not guarantee good performance on the large dataset. To solve this problem, the authors of EAT-NAS introduce an elastic architecture transfer method to accelerate neural architecture search.

How EAT-NAS works:

In this method, architectures are first searched on a small dataset, and the best one is selected as the basic architecture for the large dataset. This basic architecture is then transferred elastically to the large dataset to accelerate the search process there, as shown in the figure below.

The authors search for the architecture on the CIFAR-10 dataset and then elastically transfer it to the large ImageNet dataset. Let's see the whole EAT-NAS process.

  1. Framework: First, search for a top-performing architecture on CIFAR-10; here it is MobileNetV2.
  2. Search space: A well-designed search space is required, consisting of five elements (convolution operation, kernel size, skip connection, width factor, and depth factor).
  3. Population quality: The selection of top-performing models depends on their quality, which is measured by the mean and variance of the models' accuracies.
  4. Architecture scale search: The method also searches the width factor, denoting the expansion ratio of the number of filters, and the depth factor, denoting the number of layers per block (in the selected MobileNetV2-style backbone).
  5. Offspring architecture generator: After the basic architecture is transferred to the large dataset, a generator takes it as the initial seed, and a transformation function is applied to generate the best architecture for the large dataset.

In the figure above, the upper architecture is the basic architecture searched on CIFAR-10, which is elastically transferred to search for the architecture for the ImageNet dataset (the lower one). It takes 22 hours on 4 GPUs to search for the basic architecture on CIFAR-10 and 4 days on 8 GPUs to transfer it to ImageNet, which is quite low compared to other neural architecture search methods. The authors also achieved 73.8% accuracy on the ImageNet dataset, which surpasses the accuracy of architectures searched from scratch on ImageNet.

Referenced Research Paper: EAT-NAS

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Single Image Super-Resolution Using a Generative Adversarial Network

In recent years, neural networks have produced various breakthroughs in different areas. One of the promising results can be seen in super-resolving an image at large upscaling factors, as shown below.

Isn’t it difficult to produce a high resolution image from a low resolution image?

In the paper Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network, the authors use a generative adversarial network for super-resolution and are able to produce photo-realistic natural images for 4x upscaling factors.

In the paper, the authors use a generative adversarial network (GAN) to produce single-image super-resolution from a low-resolution image. In this blog we will see the following:

  1. Architecture of GAN used in the paper.
  2. Loss function used for this problem.

Adversarial Network Architecture used in paper:

The paper uses one discriminator and one generator model. Here, the generator is fed with LR (low-resolution) images and tries to generate images that the discriminator finds difficult to distinguish from real HR (high-resolution) images.


Generator network: The input LR image is first passed through a 9×9 convolution with 64 filters and ParametricReLU. Then B residual blocks are applied, each having a 3×3 convolution with 64 filters followed by batch normalization and ParametricReLU. Finally, two sub-pixel convolution layers are applied to upsample the image by 4x.
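
The two building blocks of this generator can be sketched in Keras as follows. PReLU with shared spatial axes stands in for ParametricReLU and the sub-pixel step uses depth-to-space; hyperparameters follow the description above but this is not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def sr_residual_block(x, filters=64):
    """3x3 conv -> BN -> PReLU -> 3x3 conv -> BN, plus a skip connection."""
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.PReLU(shared_axes=[1, 2])(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.add([x, y])

def subpixel_upsample(x, filters=256):
    """Sub-pixel convolution: conv then depth-to-space doubles the resolution;
    applying it twice gives the 4x upscaling."""
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(x)
    return layers.PReLU(shared_axes=[1, 2])(x)
```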

Discriminator network: There is also a discriminator that discriminates real HR images from generated SR images. It contains eight convolutional layers with an increasing number of 3×3 filter kernels, increasing by a factor of 2 from 64 to 512 kernels. Strided convolutions are used to reduce the image resolution each time the number of features is doubled. The resulting 512 feature maps are followed by two dense layers and a final sigmoid activation function to obtain a probability of the image being real or fake.

Loss function: In the paper, the authors define a perceptual loss function that consists of a content loss and an adversarial loss.

The adversarial loss trains the generator to produce natural-looking images that are difficult for the discriminator to distinguish from real images. In addition, the authors use a content loss motivated by perceptual similarity.

For the content loss, mean squared error is the most widely used loss function, but it often results in perceptually unsatisfying content because it over-smooths details. To resolve this problem, the authors use a loss function that is closer to perceptual similarity: a VGG loss defined on the ReLU activation layers of a pre-trained 19-layer VGG network.
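
A sketch of such a VGG content loss using Keras' pre-trained VGG19 is shown below; the choice of block5_conv4 corresponds to a deep feature layer of the kind discussed in the paper, but treat the exact layer and preprocessing as assumptions.

```python
import tensorflow as tf

vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
feature_extractor = tf.keras.Model(vgg.input, vgg.get_layer("block5_conv4").output)
feature_extractor.trainable = False

def vgg_content_loss(hr_image, sr_image):
    """Mean squared error between VGG feature maps of the real HR image and the
    generated SR image (images expected in [0, 255] RGB)."""
    hr_feat = feature_extractor(tf.keras.applications.vgg19.preprocess_input(hr_image))
    sr_feat = feature_extractor(tf.keras.applications.vgg19.preprocess_input(sr_image))
    return tf.reduce_mean(tf.square(hr_feat - sr_feat))
```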

They performed experiments on Set5, Set14 and BSD100, tested on BSD300, and achieved promising results. To evaluate the results obtained by SRGAN, the authors also collected mean opinion scores from 26 raters and found that the results look much closer to the original images.

Referenced Research Paper: Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

GitHub: Super Resolution Examples

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.

Dimensionality Reduction for Data Visualization using Autoencoders

In the previous blog, I explained the concept behind autoencoders and their applications. In this blog we will learn one of the interesting practical applications of autoencoders.

Autoencoders are neural networks that are trained to reconstruct their original input. But merely reconstructing the original input would be useless; the main purpose is to learn interesting features using autoencoders. In this blog we will see how autoencoders can be used to learn interesting features for visualizing high-dimensional data.

Say you have a 10-dimensional vector; it is difficult to visualize directly, so you need to convert it into a 2-D or 3-D representation for visualization. There are well-known algorithms like principal component analysis that are used for dimensionality reduction. Interestingly, if you implement an autoencoder that only uses linear activation functions with mean squared error as its loss function, it ends up learning essentially the same projection as principal component analysis.

Here we will visualize 3-dimensional data in 2 dimensions using a simple autoencoder implemented in Keras.

3-dimensional data

The autoencoder model architecture for generating the 2-D representation is as follows:

  1. An input layer with 3 nodes.
  2. One hidden dense layer with 2 nodes and linear activation.
  3. One output dense layer with 3 nodes and linear activation.
  4. The loss function is MSE and the optimizer is Adam.

The following code will generate a compressed representation of input data.
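
The original snippet is not reproduced in this archive, so here is a minimal Keras sketch matching the architecture listed above; the placeholder data and variable names are illustrative.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# 3 -> 2 -> 3 linear autoencoder
inputs = keras.Input(shape=(3,))
encoded = layers.Dense(2, activation="linear")(inputs)    # 2-D bottleneck
decoded = layers.Dense(3, activation="linear")(encoded)

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# x_train: your 3-dimensional data with shape (num_samples, 3)
x_train = np.random.rand(1000, 3)                          # placeholder data
autoencoder.fit(x_train, x_train, epochs=50, batch_size=32, verbose=0)

# The encoder alone gives the compressed 2-D representation used for plotting
encoder = keras.Model(inputs, encoded)
compressed = encoder.predict(x_train)
print(compressed.shape)    # (1000, 2)
```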

Here is the generated 2-D representation of input 3-D data.

Compressed Representation

In a similar way, you can visualize higher-dimensional data as 2-dimensional or 3-dimensional vectors.

Hope you enjoy reading.

If you have any doubt/suggestion please feel free to ask and I will do my best to help or improve myself. Good-bye until next time.