DensePose: Dense Human Pose Estimation In The Wild

http://densepose.org/

Overview

DensePose introduces a set of 50,000 annotations that establish dense correspondences from 2D images to surface-based representations of the human body. Collecting such correspondences naively would require annotators to manipulate a 3D surface through rotations, which can be frustratingly inefficient. Instead, we construct a two-stage annotation pipeline to gather image-to-surface correspondences efficiently.

In the first stage we ask annotators to delineate regions corresponding to visible, semantically defined body parts. We instruct the annotators to estimate the body part behind the clothes, so that, for instance, a large skirt does not complicate the subsequent annotation of correspondences.

In the second stage we sample every part region with a set of roughly equidistant points and request the annotators to bring these points into correspondence with the surface. In order to simplify this task we "unfold" the part surface by providing six pre-rendered views of the same body part and allow the user to place landmarks on any of them. This allows the annotator to choose the most convenient point of view by selecting one among six options instead of manually rotating the surface. We use the SMPL model and SURREAL textures in the data gathering procedure.
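The sampling of roughly equidistant points within each part region can be approximated with greedy farthest-point sampling over the part mask. The NumPy sketch below illustrates the idea; the function name, the number of points, and the choice of farthest-point sampling are illustrative assumptions, not the actual annotation tooling.

```python
import numpy as np

def sample_equidistant_points(part_mask, k=14, seed=0):
    """Greedy farthest-point sampling of k roughly equidistant pixels
    inside a binary part mask (H x W, True for pixels of the part)."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(part_mask)
    coords = np.stack([ys, xs], axis=1).astype(np.float64)
    # Start from a random pixel of the part.
    chosen = [int(rng.integers(len(coords)))]
    # Distance of every part pixel to the nearest chosen point so far.
    dist = np.linalg.norm(coords - coords[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())  # farthest pixel from the current set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(coords - coords[nxt], axis=1))
    return coords[chosen]  # (k, 2) array of (y, x) sample locations
```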

Sample annotated image:

Associated Paper or Article

More information can be found by reading DensePose: Dense Human Pose Estimation In The Wild.

Download

As this dataset is part of two challenges, the data can be downloaded through either the DensePose-COCO track or the DensePose-PoseTrack track.

Model

One model is associated with this dataset: the DensePose-RCNN system, described below.

We adopt the Mask-RCNN architecture with Feature Pyramid Network (FPN) features and ROI-Align pooling to obtain dense part labels and coordinates within each of the selected regions. As shown below, we introduce a fully-convolutional network on top of the ROI pooling that is entirely devoted to two tasks:

- generating per-pixel classification results for the selection of the surface part;
- for each part, regressing local coordinates within the part.
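A minimal PyTorch sketch of such a head is given below; the layer widths, the number of parts, and the module name are assumptions for illustration rather than the released DensePose-RCNN configuration.

```python
import torch.nn as nn

class DensePoseHead(nn.Module):
    """Fully-convolutional head applied to ROI-Align features
    (illustrative sketch; sizes are placeholders)."""
    def __init__(self, in_channels=256, num_parts=24, hidden=512):
        super().__init__()
        self.tower = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Task 1: per-pixel classification over body parts (+ background).
        self.part_logits = nn.Conv2d(hidden, num_parts + 1, 1)
        # Task 2: per-part regression of local (U, V) surface coordinates.
        self.u_coords = nn.Conv2d(hidden, num_parts + 1, 1)
        self.v_coords = nn.Conv2d(hidden, num_parts + 1, 1)

    def forward(self, roi_features):
        x = self.tower(roi_features)
        return self.part_logits(x), self.u_coords(x), self.v_coords(x)
```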

During inference, our system operates at 25 fps on 320x240 images and 4-5 fps on 800x1100 images using a GTX 1080 graphics card.

The DensePose-RCNN system can be trained directly using the annotated points as supervision. However, we obtain substantially better results by "inpainting" the values of the supervision signal at positions that are not originally annotated. To achieve this, we adopt a learning-based approach in which we first train a "teacher" network: a fully-convolutional neural network (depicted below) that reconstructs the ground-truth values given scale-normalized images and the segmentation masks.
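The sketch below illustrates the two ingredients of this scheme, assuming a teacher network that maps a scale-normalized image and a foreground mask to dense U and V fields: a regression loss restricted to annotated positions for training the teacher, and the subsequent "inpainting" step that fills unannotated positions with teacher predictions. The function names, the smooth-L1 loss, and the tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F

def sparse_supervision_loss(pred_u, pred_v, gt_u, gt_v, annotated_mask):
    """Regression loss evaluated only at annotated positions
    (used to train the 'teacher'; the smooth-L1 choice is an assumption)."""
    m = annotated_mask.bool()
    return F.smooth_l1_loss(pred_u[m], gt_u[m]) + F.smooth_l1_loss(pred_v[m], gt_v[m])

def inpaint_supervision(teacher, image, fg_mask, gt_u, gt_v, annotated_mask):
    """Fill unannotated foreground positions with the teacher's predictions,
    keeping the human-annotated values where they exist."""
    with torch.no_grad():
        t_u, t_v = teacher(image, fg_mask)  # hypothetical dense teacher outputs
    dense_u = torch.where(annotated_mask.bool(), gt_u, t_u)
    dense_v = torch.where(annotated_mask.bool(), gt_v, t_v)
    return dense_u, dense_v
```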

We further improve the performance of our system using cascading strategies. Via cascading, we exploit information from related tasks, such as keypoint estimation and instance segmentation, which have already been addressed successfully by the Mask-RCNN architecture. This allows us to exploit task synergies and the complementary merits of different sources of supervision.
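One way to realise such a cascade is a second refinement stage whose input concatenates the RoI features with the first-stage outputs of the mask, keypoint, and DensePose heads. The following PyTorch sketch shows this idea; the channel counts and the module name are placeholders, not the exact cross-cascading architecture used in the paper.

```python
import torch
import torch.nn as nn

class CrossTaskCascade(nn.Module):
    """Second-stage refinement that concatenates RoI features with the
    first-stage outputs of the mask, keypoint and DensePose heads
    (a sketch of the idea; channel counts are placeholders)."""
    def __init__(self, roi_channels=256, mask_ch=2, kpt_ch=17, dp_ch=75, hidden=256):
        super().__init__()
        in_ch = roi_channels + mask_ch + kpt_ch + dp_ch
        self.refine = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, roi_features, mask_out, kpt_out, dp_out):
        # All inputs are assumed to be resampled to the same spatial size.
        x = torch.cat([roi_features, mask_out, kpt_out, dp_out], dim=1)
        return self.refine(x)  # features fed to a second DensePose head
```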

Benchmarks

No benchmarks have been provided for this dataset.

Associated Challenges

This dataset was associated with two challenges at the COCO and Mapillary Joint Recognition Challenge Workshop at ICCV 2019.

Licence

The dataset is licensed under the CC BY-NC 2.0 licence.