2.1 Overview
As shown in Fig. 1, the proposed approach estimates the pose and shape parameters of the Skinned Multi-Person Linear (SMPL) model [6] under clothing, taking only a single RGB image as input. Similar to SMPLify [5], our approach optimizes the parameters of an SMPL model, which consist of 72 pose parameters (3 orientation parameters for each of 23 joints plus 3 for the root orientation) and 10 linear shape coefficients. Starting from the joint-based optimization result of SMPLify [5], we further optimize the parameters using the silhouette shape and a cloth-skin displacement model created for each clothing category.
2.2 Clothing segmentation
Given an input image, such as a photograph, we first perform semantic segmentation to extract both a human silhouette and a clothing category. For this step, we use RefineNet [8], a semantic segmentation approach that outputs high-resolution results for human part estimation. To train RefineNet, we use image data with clothing segmentation from the Clothing Co-Parsing (CCP) dataset [9] and the Fashionista dataset [10], in which each pixel is labeled with a clothing category. We re-classify the clothing labels into 11 categories: “background,” “skin,” “hair,” “inner wear,” “outer wear,” “skirt,” “dress,” “pants,” “shoes,” “bag,” and “others,” and train RefineNet using 1500 images from this data.
2.3 Cloth-skin displacement modeling
Modeling the displacement between clothing and skin is a core part of this study. Because collecting a large dataset of paired clothed and naked person images is impractical, we employ an artificial dataset generated by the conditional sketch module (CSM) of [11], a conditional variational auto-encoder, as shown in Fig. 2. We create image pairs of clothed and naked person silhouettes by inputting various silhouettes of the SMPL human body to the CSM network. For each image pair, we compute the displacement between the clothed and naked silhouette contours. We then build a distribution of the displacement for each clothing category, based on the category labels of the clothed silhouettes. For each clothing category c, we fit a truncated normal distribution \(\text{pdf}_{c}\) to the displacement distribution using maximum-likelihood estimation. The probability returned by \(\text{pdf}_{c}(d)\) is zero when d is smaller than the lower bound \(\alpha _{\text {pdf}_{c}}\), which is also optimized via the maximum-likelihood estimation, since naked body contours are never on the exterior of clothing.
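The fitting step can be illustrated with a minimal sketch, assuming a lower-truncated normal density and a list of measured displacements for one category. The function names and the coarse grid search over the mean and standard deviation are hypothetical stand-ins for whatever optimizer is actually used; only the estimate of the lower bound as the smallest observed displacement follows directly from maximizing the likelihood.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_normal_pdf(d, mu, sigma, alpha):
    """Density of normal(mu, sigma) truncated below at alpha (zero for d < alpha)."""
    if d < alpha:
        return 0.0
    tail = 1.0 - normal_cdf((alpha - mu) / sigma)
    if tail <= 0.0:
        # Numerically degenerate parameters; treated as zero likelihood here.
        return 0.0
    return normal_pdf((d - mu) / sigma) / (sigma * tail)


def negative_log_likelihood(samples, mu, sigma, alpha):
    """Sum of -log pdf over the samples; infinite if any sample has zero density."""
    total = 0.0
    for d in samples:
        p = truncated_normal_pdf(d, mu, sigma, alpha)
        if p <= 0.0:
            return math.inf
        total -= math.log(p)
    return total


def fit_truncated_normal(samples, grid=60):
    """Crude maximum-likelihood fit (illustrative only).

    alpha is set to the smallest sample, which maximizes the likelihood for a
    lower truncation point; mu and sigma come from a coarse grid search.
    """
    lo, hi = min(samples), max(samples)
    span = hi - lo
    if span == 0.0:
        raise ValueError("samples must not all be identical")
    alpha = lo
    best = None
    for i in range(grid + 1):
        mu = lo - span + 2.0 * span * i / grid  # mu ranges over [lo - span, hi]
        for j in range(1, grid + 1):
            sigma = span * j / grid
            nll = negative_log_likelihood(samples, mu, sigma, alpha)
            if best is None or nll < best[0]:
                best = (nll, mu, sigma)
    return best[1], best[2], alpha
```

Calling `fit_truncated_normal(displacements)` on the displacements of a single clothing category returns `(mu, sigma, alpha)`; a practical implementation would replace the grid search with a proper numerical optimizer.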
2.4 Fitting parametric human 3D model
Given a person silhouette with its associated clothing category (see Section 2.2), joint locations, and cloth-skin displacement models (see Section 2.3), the proposed approach estimates the pose and shape by optimizing the SMPL model parameters. The initial SMPL parameters are obtained from a joint-based optimization method, SMPLify [5], where joint locations on the input image are estimated by a CNN-based 2D joint estimation approach, OpenPose [12], trained on the MS COCO dataset [13].
Here, SMPL consists of 72-dimensional pose (joint angles and root orientation) parameters θ and 10-dimensional linear shape coefficients β. The “ideal” pose can change during the optimization of the shape parameters β; we therefore jointly optimize both β and θ.
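For orientation, the parameter layout can be sketched as a small container. This is an illustrative structure only: the class name, the method names, and the assumption that the 3 root-orientation values precede the 23 joint triplets in θ are not taken from the text above.

```python
from dataclasses import dataclass, field
from typing import List

NUM_BODY_JOINTS = 23      # joints other than the root
NUM_POSE_PARAMS = 3 * NUM_BODY_JOINTS + 3   # 72
NUM_SHAPE_COEFFS = 10


@dataclass
class SmplParams:
    """Hypothetical container for the pose (theta) and shape (beta) parameters."""
    theta: List[float] = field(default_factory=lambda: [0.0] * NUM_POSE_PARAMS)
    beta: List[float] = field(default_factory=lambda: [0.0] * NUM_SHAPE_COEFFS)

    def root_orientation(self) -> List[float]:
        # Assumption: the root orientation occupies the first three entries.
        return self.theta[:3]

    def joint_rotation(self, joint_index: int) -> List[float]:
        # Three orientation parameters for body joint 0..22.
        if not 0 <= joint_index < NUM_BODY_JOINTS:
            raise IndexError("joint_index must be in 0..22")
        start = 3 + 3 * joint_index
        return self.theta[start:start + 3]
```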
2.4.1 Optimization
Our objective function for the optimization is defined as follows:
$$\begin{array}{*{20}l} E_{\text{shape}}(\beta) &=&\lambda_{\mathrm{s}}E_{\mathrm{s}}(\beta,\theta)+\lambda_{\mathrm{c}}E_{\mathrm{c}}(\beta,\theta), \end{array} $$
(1)
$$\begin{array}{*{20}l} E_{\text{pose}}(\theta) &= & \lambda_{\mathrm{s}}E_{\mathrm{s}}(\beta,\theta)+\lambda_{\mathrm{c}}E_{\mathrm{c}}(\beta,\theta)+\lambda_{\mathrm{j}}E_{\mathrm{j}}(\beta,\theta) \\ && + \lambda_{\mathrm{a}}E_{\mathrm{a}}(\theta)+\lambda_{\text{sp}}E_{\text{sp}}(\beta,\theta)+\lambda_{\boldsymbol{\theta}}E_{\boldsymbol{\theta}}(\theta). \end{array} $$
(2)
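\(E_{\text{shape}}\) and \(E_{\text{pose}}\) denote the objective functions for optimizing the shape parameters β and the pose parameters θ, respectively. \(\lambda_{\mathrm{s}}\), \(\lambda_{\mathrm{c}}\), \(\lambda_{\mathrm{j}}\), \(\lambda_{\mathrm{a}}\), \(\lambda_{\text{sp}}\), and \(\lambda_{\boldsymbol{\theta}}\) are the weights of the individual terms. We minimize the two objectives alternately: we minimize \(E_{\text{shape}}(\beta)\) with θ fixed, then \(E_{\text{pose}}(\theta)\) with β fixed, and repeat.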
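The alternating scheme can be sketched as follows, assuming generic callables for the two objectives. The finite-difference gradient descent, the step size, the number of rounds, and the toy energies are all hypothetical choices made for illustration; they are not the optimizer or the cost terms of Eqs. (1) and (2).

```python
def numeric_grad(f, x, h=1e-6):
    """Central-difference gradient of f at the list of floats x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def descend(f, x, steps=50, lr=0.1):
    """Plain gradient descent on f starting from x."""
    for _ in range(steps):
        g = numeric_grad(f, x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def alternate_minimize(e_shape, e_pose, beta, theta, rounds=10):
    """Alternately minimize e_shape over beta and e_pose over theta."""
    for _ in range(rounds):
        # Shape step: theta is held fixed.
        beta = descend(lambda b: e_shape(b, theta), beta)
        # Pose step: beta is held fixed.
        theta = descend(lambda t: e_pose(beta, t), theta)
    return beta, theta


# Toy energies that share one term, mirroring how both objectives share the
# silhouette terms while only the pose objective has extra terms.
def toy_e_shape(beta, theta):
    return (beta[0] - 1.0) ** 2 + (beta[0] - theta[0]) ** 2


def toy_e_pose(beta, theta):
    return (beta[0] - theta[0]) ** 2 + 0.5 * (theta[0] - 3.0) ** 2
```

For example, `alternate_minimize(toy_e_shape, toy_e_pose, [0.0], [0.0])` settles at the point where neither step can improve its own objective, which is the behavior the alternating scheme relies on.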
\(E_{\mathrm{j}}(\beta,\theta)\), \(E_{\mathrm{a}}(\theta)\), \(E_{\text{sp}}(\beta,\theta)\), and \(E_{\boldsymbol{\theta}}(\theta)\) are cost terms identical to those used in SMPLify. The term \(E_{\mathrm{j}}(\beta,\theta)\) is a distance between the 2D joints on the input image and the joints of the estimated SMPL model projected onto the image plane. For the other terms, refer to [5] for details.
The proposed approach employs cost terms for skin contours \(E_{\mathrm{s}}(\beta,\theta)\) and clothed contours \(E_{\mathrm{c}}(\beta,\theta)\). Let \({\mathcal {S}}_{\text {in}}\) be a point set on the person silhouette contour of the input image, which is fixed during the optimization, and \({\mathcal {S}}_{\text {SMPL}}\) be a point set on the corresponding SMPL silhouette contour, which varies with the parameters β and θ being optimized. The cost terms use nearest-neighbor correspondences from \({\mathcal {S}}_{\text {in}}\) to \({\mathcal {S}}_{\text {SMPL}}\),
$$ {\mathcal{S}}_{\text{SMPL,c}} = \bigcup_{\mathbf{p}}\left(\text{NN}_{{\mathcal{S}}_{\text{SMPL}}} \left(\mathbf{p} \in{\mathcal{S}}_{\text{in,c}} \right) \right), $$
(3)
where \(c\in {\mathcal {C}}\) is a region label from the foreground categories \({\mathcal {C}} = \{ {\text {skin, hair,...}} \}\) (see Footnote 1). Thus, \({\mathcal {S}}_{\text {in,c}}\subset {\mathcal {S}}_{\text {in}}\) denotes the input contour points labeled as category c. The mapping function \(\text {NN}_{{\mathcal {S}}_{\text {SMPL}}}(\mathbf {p})\) returns the point in \({\mathcal {S}}_{\text {SMPL}}\) nearest to \(\mathbf {p} \in {\mathcal {S}}_{\text {in,c}}\).
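The correspondence of Eq. (3) can be sketched with a brute-force nearest-neighbor search, assuming contour points are stored as 2D tuples. The function names are hypothetical, and a spatial index such as a k-d tree would normally replace the linear scan for large contours.

```python
import math


def nearest_neighbor(p, s_smpl):
    """Return the point of s_smpl closest to p in 2D Euclidean distance."""
    return min(s_smpl, key=lambda q: math.dist(p, q))


def corresponding_smpl_points(s_in_c, s_smpl):
    """Eq. (3): union of the nearest SMPL contour points over all input
    contour points labeled with category c."""
    return {nearest_neighbor(p, s_smpl) for p in s_in_c}
```

Because several input points may share the same nearest SMPL point, the result is a set and can be smaller than `s_in_c`.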
2.4.1.1 Skin contour cost \(E_{\mathrm{s}}\)
This term controls the behavior of SMPL silhouette contours \({\mathcal {S}}_{\text {SMPL,skin}}\), where the corresponding input contour points \({\mathcal {S}}_{\text {in,skin}}\) are labeled as skin region. The cost term penalizes the 2D Euclidean distance between the corresponding points in \({\mathcal {S}}_{\text {SMPL,skin}}\) and \({\mathcal {S}}_{\text {in,skin}}\):
$$\begin{array}{@{}rcl@{}} E_{\mathrm{s}} = \frac{1}{ n_{{\mathcal{S}}_{\text{in}}}} \sum_{\mathbf{p} \in {\mathcal{S}}_{\text{in,skin}}} || \mathbf{p} - \text{NN}_{{\mathcal{S}}_{\text{SMPL}}}(\mathbf{p}) ||, \end{array} $$
(4)
where \(n_{{\mathcal {S}}_{\text {in}}}\) denotes the number of points in \({\mathcal {S}}_{\text {in}}\), which normalizes the cost.
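A minimal sketch of Eq. (4), assuming the input contour is given as a dictionary from category label to a list of 2D points; the function name and this data layout are illustrative assumptions.

```python
import math


def skin_contour_cost(s_in_by_label, s_smpl):
    """Eq. (4): summed distance from each skin-labeled input contour point to
    its nearest SMPL contour point, normalized by the size of the whole input
    contour (all labels)."""
    n_in = sum(len(points) for points in s_in_by_label.values())
    if n_in == 0:
        raise ValueError("input contour is empty")
    total = 0.0
    for p in s_in_by_label.get("skin", []):
        total += min(math.dist(p, q) for q in s_smpl)
    return total / n_in
```

Note that the normalizer counts every input contour point, not only the skin-labeled ones, so an image with little visible skin contributes a proportionally small skin cost.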
2.4.1.2 Cloth contour cost \(E_{\mathrm{c}}\)
The cost function \(E_{\mathrm{c}}\) controls the behavior of the SMPL contours whose corresponding input contour points are not labeled as “skin,” so that they are located inside the input contour by as much as the cloth-skin displacement described in Section 2.3. Letting \({\mathcal {C}}_{\text {cloth}} = {\mathcal {C}} - \{ \text {skin} \}\), \(E_{\mathrm{c}}\) is defined as the sum of the cost terms for each clothing category, calculated from contour distances:
$$ E_{\mathrm{c}} = \frac{1}{n_{{\mathcal{S}}_{\text{in}}}} \sum_{c\in {\mathcal{C}}_{\text{cloth}}} \sum_{\mathbf{p} \in {\mathcal{S}}_{\text{in,c}}} d_{\mathbf{p}}. $$
(5)
Here, let \(d_{\mathrm{s}}\) denote the signed Euclidean distance between p and \(\text {NN}_{{\mathcal {S}}_{\text {SMPL}}}(\mathbf {p})\), where the distance is positive if the input contour point p lies outside the SMPL contour at the corresponding point \(\text {NN}_{{\mathcal {S}}_{\text {SMPL}}}(\mathbf {p})\). Accordingly, our distance function \(d_{\mathbf{p}}\), which accounts for the cloth-skin displacement, is defined as follows:
$$\begin{array}{@{}rcl@{}} && d_{\mathbf{p}} = \\ && \left\{ \begin{array}{ll} -\log \left(\text{pdf}_{c} \left(d_{\mathrm{s}} \right)+\epsilon\right) & \left(d_{\mathrm{s}} \geq \alpha_{\text{pdf}_{\text{c}}}\right)\\ \lambda_{l}(\alpha_{\text{pdf}_{c}} - d_{\mathrm{s}})-\log(\text{pdf}_{c}(\alpha_{\text{pdf}_{c}}) + \epsilon) & \left(d_{\mathrm{s}} < \alpha_{\text{pdf}_{c}}\right)\\ \end{array} \right., \end{array} $$
(6)
where ε is a small constant that avoids \(\log(0) = -\infty\). Here, \(\text {pdf}_{c}\) for \(c\in {\mathcal {C}}_{\text {cloth}}\) denotes the truncated normal distribution modeled in Section 2.3, which returns the probability of a given cloth-skin displacement and is truncated at \(\alpha _{\text {pdf}_{c}}\). For \(d_{\mathrm{s}}\) smaller than \(\alpha _{\text {pdf}_{c}}\), where the probability is zero, we instead use a linear penalty, the Euclidean distance weighted by \(\lambda_{l}\), to penalize SMPL contour points that lie outside the input silhouette. In the cost function, \(d_{\mathbf{p}}\) takes the form of a negative log-likelihood. Therefore, \(E_{\mathrm{c}}\) drives the SMPL parameters so that the contour displacement fits the pre-constructed displacement model.
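Equations (5) and (6) can be sketched as below, assuming the signed distances have already been computed and grouped by category. The function names, the default value of ε, and the way each category's model is passed as a `(pdf, alpha)` pair are illustrative assumptions rather than details given in the text.

```python
import math


def make_truncated_normal_pdf(mu, sigma, alpha):
    """Return pdf(d) of normal(mu, sigma) truncated below at alpha."""
    tail = 1.0 - 0.5 * (1.0 + math.erf((alpha - mu) / (sigma * math.sqrt(2.0))))

    def pdf(d):
        if d < alpha:
            return 0.0
        z = (d - mu) / sigma
        return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma * tail)

    return pdf


def displacement_penalty(d_s, pdf, alpha, lambda_l, eps=1e-8):
    """Eq. (6): negative log-likelihood at or above alpha, and a linear
    extension with slope lambda_l below alpha."""
    if d_s >= alpha:
        return -math.log(pdf(d_s) + eps)
    return lambda_l * (alpha - d_s) - math.log(pdf(alpha) + eps)


def cloth_contour_cost(signed_dists_by_label, models, n_in, lambda_l, eps=1e-8):
    """Eq. (5): sum of penalties over all non-skin categories, normalized by
    the number of input contour points n_in.

    models maps a category label to a (pdf, alpha) pair.
    """
    total = 0.0
    for label, dists in signed_dists_by_label.items():
        if label == "skin":
            continue
        pdf, alpha = models[label]
        total += sum(displacement_penalty(d, pdf, alpha, lambda_l, eps) for d in dists)
    return total / n_in
```

The two branches agree at \(d_{\mathrm{s}} = \alpha_{\text{pdf}_{c}}\), so the penalty is continuous, and below the bound it grows linearly, which gives an optimizer a usable slope where the log-likelihood alone would be flat at \(-\log \epsilon\).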