Facebook researchers recently released a paper titled “DensePose: Dense Human Pose Estimation in the Wild”, which establishes dense correspondences from a 2D RGB image to a 3D surface representation of the human body, even in the presence of background, occlusions and scale variations.

DensePose can be understood as a combination of several problems: object detection, pose estimation, and part and instance segmentation. It can be applied to problems that require 3D understanding, such as:

  1. Graphics
  2. Augmented Reality
  3. Human-computer interaction
  4. General 3D based object understanding

Until now, these tasks have typically been tackled with depth sensors, which are costly. Other work that aims at dense correspondence uses pairs or sets of images. The DensePose method, in contrast, requires only a single RGB image as input and focuses on the most important visual category, humans.

For image-to-surface mapping of humans, recent studies use a two-stage method: first detect the body joints with a CNN, then fit a deformable surface model such as SMPL. DensePose instead uses end-to-end supervised learning, made possible by collecting ground-truth correspondences between RGB images and a parametric surface model.

DensePose is inspired by the DenseReg framework, which focused mainly on faces. DensePose targets the full human body, which brings its own challenges because of the variation in poses and the high flexibility and complexity of the body. To address these problems, the authors designed a suitable architecture based on Mask-RCNN.

This method consists of three stages:

  1. Manually collecting ground-truth datasets.
  2. Using CNN based models on collected datasets to predict dense correspondences.
  3. In-painting the constructed ground truths with a teacher network for better performance.

1. Collection of Dataset

Until now, no manually collected dataset existed for dense correspondence on real images. The authors introduce the COCO-DensePose dataset, which annotates 50K humans with more than 5 million manually annotated correspondences.

In this task, human annotators map points in a 2D image to the 3D surface. Annotating directly on a 3D surface model would be cumbersome and frustrating for the annotators, so the authors adopted a two-stage annotation pipeline and afterwards measured the accuracy of the human annotators.

In the first stage, annotators delineate the visible body parts such as the head, torso, legs, arms, hands and feet. These parts are designed to be isomorphic to a plane.

To simplify the annotation, the authors divide the full body into 24 parts by flattening out the body, as shown below.

In the second stage, the authors use k-means to sample a maximum of 14 points on each part. To simplify the task further, annotators are given six pre-rendered views of the same body part and asked to annotate each point in the most suitable view. The surface coordinates of the annotated point are then used to mark it on the remaining views. See figure below:

Accuracy of Human Annotators

Here, annotator accuracy is measured on synthetic data. The annotators are asked to bring a synthesized image into correspondence with the 3D surface, and the authors measure the geodesic distance between the true position, known from the synthetic data, and the position the annotators estimated. The geodesic distance between two vertices is the length of the shortest path between them along the surface. The authors consider two types of evaluation measures for correspondence accuracy.
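Before turning to the two measures, here is a minimal sketch of the geodesic distance defined above. It assumes the surface is given as a list of vertices and mesh edges and approximates the geodesic by the shortest path along edges using Dijkstra's algorithm. The function name and the toy mesh are mine, not from the paper, and a real body mesh would need a denser graph or an exact surface-geodesic method.

```python
import heapq
import math


def geodesic_distance(vertices, edges, source, target):
    """Shortest-path length along mesh edges between two vertex indices."""
    # Adjacency list weighted by the Euclidean length of each edge.
    adjacency = {i: [] for i in range(len(vertices))}
    for a, b in edges:
        length = math.dist(vertices[a], vertices[b])
        adjacency[a].append((b, length))
        adjacency[b].append((a, length))

    # Standard Dijkstra with a min-heap of (distance so far, vertex).
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale heap entry
        for neighbour, length in adjacency[node]:
            candidate = dist + length
            if candidate < best.get(neighbour, math.inf):
                best[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return math.inf  # target not reachable from source


# Toy surface (hypothetical): the outline of a unit square, no diagonal edge.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
corner_to_corner = geodesic_distance(vertices, edges, 0, 2)
```

On this toy mesh the path between opposite corners has to follow two edges, so the geodesic distance is longer than the straight-line distance between the same two points.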

Pointwise Evaluation: In this approach, a point counts as correct if its geodesic distance to the ground truth is below a threshold t. Varying the threshold yields a curve f(t) of the ratio of correct points, and the area under this curve summarizes the correspondence accuracy.
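The pointwise measure can be sketched in a few lines of NumPy. This is only an illustration: the function names, the error values and the threshold grid (read here as centimetres) are assumptions of mine, and the area is normalised by the threshold range so that it lies between 0 and 1.

```python
import numpy as np


def correct_ratio_curve(errors, thresholds):
    """f(t): share of points whose geodesic error is at most t, per threshold."""
    errors = np.asarray(errors, dtype=float)
    return np.array([(errors <= t).mean() for t in thresholds])


def area_under_curve(thresholds, ratios):
    """Trapezoid-rule area under f(t), divided by the threshold range."""
    thresholds = np.asarray(thresholds, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    widths = np.diff(thresholds)
    heights = (ratios[:-1] + ratios[1:]) / 2.0
    return float((widths * heights).sum() / (thresholds[-1] - thresholds[0]))


# Hypothetical geodesic errors for four annotated points.
errors = [0.0, 5.0, 10.0, 40.0]
thresholds = np.linspace(0.0, 30.0, 4)  # 0, 10, 20, 30
ratios = correct_ratio_curve(errors, thresholds)
auc = area_under_curve(thresholds, ratios)
```

A point with a very large error (40 here) never becomes correct inside the threshold range, which caps the curve below 1 and pulls the area down.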

Per-instance Evaluation: For this type of evaluation, the authors introduce a geodesic point similarity (GPS) formula.

With the above formula, GPS is calculated for every person instance in the image. Once this GPS matching score is available, average precision and average recall are computed with GPS thresholds ranging from 0.5 to 0.95.
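As a rough sketch of the per-instance score: as far as I recall from the paper, GPS for one person is the mean over that person's annotated points of exp(-g² / (2κ²)), where g is the geodesic distance between the estimated and the true surface point and κ = 0.255 is a normalising constant. Treat both the exact form and the constant as assumptions to verify against the paper. The function names are mine, and the threshold sweep below is a simplified stand-in, not the full COCO-style average precision and recall computation.

```python
import numpy as np

KAPPA = 0.255  # assumed normalising constant; check against the paper


def geodesic_point_similarity(geodesic_errors, kappa=KAPPA):
    """Mean of exp(-g^2 / (2 * kappa^2)) over one person's annotated points."""
    errors = np.asarray(geodesic_errors, dtype=float)
    if errors.size == 0:
        return 0.0  # no annotated points: treated as no match (a choice made here)
    return float(np.mean(np.exp(-(errors ** 2) / (2.0 * kappa ** 2))))


def fraction_matched(gps_scores, thresholds):
    """For each threshold, the share of person instances whose GPS reaches it."""
    scores = np.asarray(gps_scores, dtype=float)
    return np.array([(scores >= t).mean() for t in thresholds])


# Ten GPS thresholds from 0.5 to 0.95, as in the text above.
thresholds = np.linspace(0.5, 0.95, 10)

perfect = geodesic_point_similarity([0.0, 0.0, 0.0])
half = geodesic_point_similarity([KAPPA * np.sqrt(2.0 * np.log(2.0))])

# Hypothetical GPS scores for three person instances.
matched = fraction_matched([perfect, 0.72, 0.3], thresholds)
```

With this form, a person whose points all land exactly on the ground truth scores 1, and a single point whose error is κ·sqrt(2 ln 2) (about 0.30 in the units of κ) scores 0.5.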

These evaluations show that annotation errors are larger on the back and front of the torso and smaller on the head and hands.

2. Using CNN based architectures on Generated Dataset

With the ground-truth dataset in place, it is time to train a deep neural network to predict the dense correspondences. The authors experimented with both a fully convolutional network and a region-based network (like Mask-RCNN), and found the latter superior. They combined the DenseReg architecture with Mask-RCNN and introduced DensePose-RCNN.

Fully Convolutional Network:

This method combines a classification task and a regression task. In the classification task, each pixel is classified as either background or one of several body parts. Since the full body is divided into 24 parts, this is a 25-class classification (24 parts plus one background class).

Here c* is the class with the highest probability in the classification task. Once a pixel has been assigned to a class, regression predicts the U, V parameterization of its exact point on the 3D surface model. The regression is divided into 24 separate regression tasks because each of the 24 body parts is treated independently with its own local coordinates.
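The classify-then-regress step can be sketched as follows. This is a minimal NumPy illustration, assuming the network has already produced a score map for the 25 classes and one (U, V) map for each of the 24 parts. The array layout, the function name and the convention that class 0 is background are my assumptions, not the paper's.

```python
import numpy as np

NUM_PARTS = 24  # body parts; class 0 is assumed to be background


def decode_dense_pose(class_scores, uv_maps):
    """Pick c* per pixel, then read U, V from that part's own regressor.

    class_scores: array of shape (25, H, W), one score map per class.
    uv_maps:      array of shape (24, 2, H, W), one (U, V) map per body part.
    Returns the part index map (H, W) and the selected U, V values (2, H, W).
    """
    part_index = class_scores.argmax(axis=0)  # c* for every pixel
    height, width = part_index.shape
    uv = np.zeros((2, height, width), dtype=float)  # background stays at 0
    for part in range(1, NUM_PARTS + 1):
        mask = part_index == part
        # Only the regressor of the winning part contributes at these pixels.
        uv[:, mask] = uv_maps[part - 1][:, mask]
    return part_index, uv


# Hypothetical 2x2 example: one pixel is background, three belong to parts.
class_scores = np.zeros((25, 2, 2))
class_scores[0, 0, 0] = 1.0   # background
class_scores[3, 0, 1] = 1.0   # part 3
class_scores[24, 1, 0] = 1.0  # part 24
class_scores[1, 1, 1] = 1.0   # part 1

uv_maps = np.zeros((24, 2, 2, 2))
for p in range(NUM_PARTS):
    uv_maps[p, 0] = (p + 1) / 100.0  # U values of part p + 1
    uv_maps[p, 1] = (p + 1) / 50.0   # V values of part p + 1

part_index, uv = decode_dense_pose(class_scores, uv_maps)
```

Keeping one regressor per part means U and V only have to be meaningful inside that part's local coordinate chart, which is the independence the paragraph above describes.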

Region Based Network:

After experimenting with the fully convolutional network, the authors found that it was not as effective as the region-based network. The region-based network uses exactly the same architecture as Mask-RCNN up to ROIAlign, followed by a fully convolutional network for regression and classification, as in DenseReg. This architecture runs at 25 fps on 320×240 images and at 5 fps on 800×1100 images.

3. Ground-truth Interpolation using In-painting Network

During annotation, only a sparse set of around 100-150 pixels is annotated in each training sample. This does not hamper training, because pixels without annotations are left out when the per-pixel loss is computed. However, the authors found improved performance when they interpolated annotations for the other pixels from the annotated ones, as shown below:

To do this, the authors trained a teacher network. They kept the interpolated points only on humans and ignored the background, to reduce background errors.
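The idea of leaving unannotated pixels out of the per-pixel loss, mentioned at the start of this section, can be sketched with a boolean mask. The absolute-error loss and the array shapes below are assumptions made for illustration. The paper's actual regression loss may differ.

```python
import numpy as np


def masked_l1_loss(predicted_uv, target_uv, annotated_mask):
    """Mean absolute U, V error over annotated pixels only.

    predicted_uv, target_uv: arrays of shape (2, H, W).
    annotated_mask:          boolean array of shape (H, W), True where a
                             ground-truth correspondence exists.
    """
    predicted_uv = np.asarray(predicted_uv, dtype=float)
    target_uv = np.asarray(target_uv, dtype=float)
    mask = np.asarray(annotated_mask, dtype=bool)
    if not mask.any():
        return 0.0  # nothing annotated, so nothing to penalise
    per_pixel = np.abs(predicted_uv - target_uv).sum(axis=0)  # shape (H, W)
    return float(per_pixel[mask].mean())


# Hypothetical 2x2 sample where only the top-left pixel is annotated.
predicted = np.full((2, 2, 2), 0.5)
target = np.zeros((2, 2, 2))
target[:, 0, 0] = [0.4, 0.7]
mask = np.zeros((2, 2), dtype=bool)
mask[0, 0] = True

loss = masked_l1_loss(predicted, target, mask)
```

The three unannotated pixels have large raw errors here, but they do not affect the loss. Densifying the supervision with a teacher network amounts to turning more entries of the mask to True, with interpolated targets in place of manual ones.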

The key takeaways from this paper are a large-scale dataset of ground-truth image-to-surface correspondences and the combination of Mask R-CNN with the DenseReg architecture to predict surface correspondences. This paper can pave the way for further research and development in this field and could be a boon to the 3D modelling field.

Referenced Research paper: DensePose: Dense Human Pose Estimation in the Wild

Github : DensePose

Hope you enjoy reading.

If you have any doubts or suggestions, please feel free to ask and I will do my best to help or improve. Goodbye until next time.
