 Express Paper
 Open Access
Estimating 3D human shape under clothing from a single RGB image
IPSJ Transactions on Computer Vision and Applications volume 10, Article number: 16 (2018)
Abstract
Estimation of naked human shape is essential in several applications such as virtual try-on. We propose an approach that estimates naked human 3D pose and shape, including non-skeletal shape information such as musculature and fat distribution, from a single RGB image. The proposed approach optimizes a parametric 3D human model using person silhouettes with clothing category, and statistical displacement models between clothed and naked body shapes associated with each clothing category. Experiments demonstrate that our approach estimates human shape more accurately than a prior method.
Introduction
We propose a novel approach to estimate three-dimensional (3D) human body shapes under clothing from a single RGB image. Human shape is an essential element in the computer vision field. Humans are usually captured with clothing; that is, the actual (naked) human shape is concealed by the clothing and usually differs from the appearance. Virtual try-on systems, an important application of naked shapes, typically superimpose 3D models of clothing onto a human shape [1]. Although this allows consumers to visualize whether an item of clothing suits them in design and color, it is difficult to determine how well the size of the item fits their actual body shape.
The estimation of naked human shape has been studied, but most approaches employ multi-view images or 3D scanners (e.g., [2–4]) to acquire 3D shapes. These approaches are impractical for at-home shape acquisition for virtual try-on, which requires easy input using commodity cameras. Our approach requires only a single image as input and outputs the parameters of a 3D human body shape.
The proposed approach builds on a previous single-image human 3D modeling method (SMPLify [5]), which optimizes a parametric model of 3D human pose and shape [6] to fit joint positions acquired by a joint estimation method [7]. Joint-based optimization does capture aspects of shape related to the human skeleton, such as the lengths of the arms and legs. In principle, however, joint locations do not carry non-skeletal information such as musculature and fat distribution.
The proposed method estimates 3D human pose and shape under clothing, including non-skeletal shape information. We optimize the parametric 3D human model using a single-image human silhouette with clothing region segmentation, while considering pre-constructed statistics of the displacement caused by clothing (i.e., the distance between the naked and clothed contours) for each clothing category. Modeling this displacement is a significant challenge in our study, since collecting a dataset of image pairs of clothed and naked people is infeasible. We therefore model the displacement based on clothed person shapes generated from naked silhouettes by a clothing simulator.
3D shape estimation under clothing
Overview
As shown in Fig. 1, the proposed approach optimizes the Skinned Multi-Person Linear (SMPL) [6] pose and shape parameters under clothing, with only a single RGB image as input. Similar to SMPLify [5], our approach optimizes the parameters of an SMPL model [6], which consist of 72 pose parameters (3 rotation parameters for each of 23 joints, plus 3 for the root orientation) and 10 linear shape coefficients. After initializing with a joint-based optimization result from SMPLify [5], we further optimize the parameters using the silhouette shape and the cloth-skin displacement model created for each clothing category.
Clothing segmentation
Given an input image, such as a photograph, we first perform semantic segmentation to extract both a human silhouette and a clothing category. For this step, we utilize RefineNet [8], a semantic segmentation approach that successfully outputs high-resolution results for human part estimation. To train RefineNet, we use two image datasets with clothing segmentation, the Clothing Co-Parsing (CCP) dataset [9] and the Fashionista dataset [10], in which each pixel is labeled with a clothing category. We reclassify the clothing labels into 11 categories: “background,” “skin,” “hair,” “inner wear,” “outer wear,” “skirt,” “dress,” “pants,” “shoes,” “bag,” and “others,” and train RefineNet using 1500 images from these datasets.
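The relabeling step can be sketched as a lookup from fine-grained dataset labels to the 11 categories. The source label names and their groupings below are illustrative assumptions, not the actual mapping used in the paper.

```python
# Hypothetical remapping from fine-grained CCP/Fashionista labels to the
# paper's 11 categories; the source label names here are assumptions.
CATEGORY_MAP = {
    "background": "background",
    "skin": "skin", "face": "skin",
    "hair": "hair",
    "t-shirt": "inner wear", "blouse": "inner wear", "shirt": "inner wear",
    "coat": "outer wear", "jacket": "outer wear", "blazer": "outer wear",
    "skirt": "skirt",
    "dress": "dress",
    "pants": "pants", "jeans": "pants", "leggings": "pants",
    "shoes": "shoes", "boots": "shoes",
    "bag": "bag", "purse": "bag",
}

def remap_label(raw_label: str) -> str:
    """Collapse a fine-grained clothing label into one of the 11 categories;
    anything unrecognized falls into "others"."""
    return CATEGORY_MAP.get(raw_label, "others")
```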
Cloth-skin displacement modeling
Modeling the displacement between clothing and skin is a core part of this study. Given the impracticality of collecting a large dataset of paired clothed and naked person images, we employ an artificial dataset generated by a conditional variational autoencoder, the conditional sketch module (CSM) in [11], as shown in Fig. 2. We create image pairs of clothed and naked person silhouettes by feeding various silhouettes of the SMPL human body to the CSM network. For each image pair of clothed and naked person silhouettes, we compute the displacement on the clothed and naked silhouette contours. We then build a distribution of the amount of displacement for each clothing category, based on the category labels of the clothed silhouettes. We fit a truncated normal distribution pdf_{c} to the displacement distribution of each clothing category c using maximum-likelihood estimation. The probability returned by pdf_{c}(d) becomes zero when d is smaller than the lower bound \(\alpha_{\text{pdf}_{c}}\), which is also optimized via the maximum-likelihood estimation, since naked body contours are never exterior to the clothing.
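The per-category maximum-likelihood fit can be sketched as follows, using a normal distribution truncated from below. The choice of optimizer and the exact parameterization are assumptions; only the model family (lower-truncated normal fitted by MLE) comes from the text.

```python
import numpy as np
from scipy.stats import truncnorm
from scipy.optimize import minimize

def fit_truncated_normal(displacements):
    """Maximum-likelihood fit of a lower-truncated normal distribution.

    Returns (lower, mu, sigma) modeling the cloth-skin displacement of one
    clothing category; `lower` plays the role of alpha_pdf_c in the text.
    """
    d = np.asarray(displacements, dtype=float)

    def nll(params):
        lower, mu, sigma = params
        # The lower bound cannot exceed the smallest observed displacement.
        if sigma <= 0 or lower > d.min():
            return np.inf
        a = (lower - mu) / sigma  # standardized truncation point
        return -np.sum(truncnorm.logpdf(d, a, np.inf, loc=mu, scale=sigma))

    x0 = np.array([d.min() - 1e-3, d.mean(), d.std() + 1e-6])
    res = minimize(nll, x0, method="Nelder-Mead")
    lower, mu, sigma = res.x
    return lower, mu, sigma
```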
Fitting parametric human 3D model
Given a person silhouette with an associated clothing category (see Section 2.2), joint locations, and the cloth-skin displacement models (see Section 2.3), the proposed approach estimates pose and shape through an optimization of the SMPL model parameters. The initial SMPL parameters are acquired from a joint-based optimization method, SMPLify [5], where the joint locations on the input image are estimated by a CNN-based 2D joint estimation approach, OpenPose [12], trained on the MS COCO dataset [13].
Here, SMPL consists of 72-dimensional pose parameters θ (joint angles and root orientation) and 10-dimensional linear shape coefficients β. The “ideal” pose can change during the optimization of the shape parameters β; we therefore jointly optimize both β and θ.
Optimization
Our objective function for the optimization is defined as follows:
E_{shape} and E_{pose} denote the objective functions for optimizing the shape β and pose θ parameters, respectively. The weights λ_{s}, λ_{c}, λ_{j}, λ_{a}, λ_{sp}, and λ_{θ} balance the terms. We alternately minimize the two objectives: minimizing E_{shape}(β) with θ fixed, and vice versa.
E_{j}(β,θ), E_{a}(θ), E_{sp}(β,θ), and E_{θ}(θ) are cost terms identical to those used in SMPLify. The term E_{j}(β,θ) measures the distance between the 2D joints on the input image and the joints of the estimated SMPL model projected onto the image plane. For the other terms, refer to [5] for details.
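Given the six weighted terms above, the alternating objectives can be summarized as follows; the exact grouping of terms into E_{shape} and E_{pose} is an assumption, inferred from which parameters each term depends on.

```latex
% Assumed grouping of the six weighted terms into the two objectives:
E_{\text{shape}}(\beta) = \lambda_{s} E_{s}(\beta,\theta) + \lambda_{c} E_{c}(\beta,\theta)
  + \lambda_{j} E_{j}(\beta,\theta) + \lambda_{sp} E_{sp}(\beta,\theta)

E_{\text{pose}}(\theta) = \lambda_{j} E_{j}(\beta,\theta) + \lambda_{a} E_{a}(\theta)
  + \lambda_{sp} E_{sp}(\beta,\theta) + \lambda_{\theta} E_{\theta}(\theta)
```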
The proposed approach employs cost terms for skin contours E_{s}(β,θ) and clothed contours E_{c}(β,θ). Let \({\mathcal {S}}_{\text {in}}\) be a point set on the person silhouette contour of the input image, which is fixed during the optimization, and \({\mathcal {S}}_{\text {SMPL}}\) be a point set on the corresponding SMPL silhouette contour, which is a variable that depends on the β and θ to be optimized. The cost terms utilize nearest-neighbor correspondences from \({\mathcal {S}}_{\text {in}}\) to \({\mathcal {S}}_{\text {SMPL}}\),
where \(c\in {\mathcal {C}}\) is a region label among the foreground categories \({\mathcal {C}} = \{ {\text {skin, hair,...}} \}\)^{Footnote 1}. Thus, \({\mathcal {S}}_{\text {in,c}}\subset {\mathcal {S}}_{\text {in}}\) denotes the input contour points labeled with category c. The mapping function \(\text {NN}_{{\mathcal {S}}_{\text {SMPL}}}(\mathbf {p}\in {\mathcal {S}}_{\text {in,c}})\) returns the nearest-neighbor point of \(\mathbf {p} \in {\mathcal {S}}_{\text {in,c}}\) in \({\mathcal {S}}_{\text {SMPL}}\).
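The nearest-neighbor mapping NN can be implemented efficiently with a k-d tree; a minimal sketch, assuming the contour point sets are given as 2D coordinate arrays:

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_neighbor_correspondences(s_in, s_smpl):
    """Map each input-contour point to its nearest SMPL-contour point.

    s_in: (N, 2) input silhouette contour points (fixed).
    s_smpl: (M, 2) SMPL silhouette contour points (depend on beta, theta).
    Returns indices into s_smpl and the corresponding 2D distances.
    """
    tree = cKDTree(s_smpl)
    dists, idx = tree.query(s_in)  # one nearest neighbor per input point
    return idx, dists
```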
Skin contour cost E_{s}
This term controls the behavior of SMPL silhouette contours \({\mathcal {S}}_{\text {SMPL,skin}}\), where the corresponding input contour points \({\mathcal {S}}_{\text {in,skin}}\) are labeled as skin region. The cost term penalizes the 2D Euclidean distance between the corresponding points in \({\mathcal {S}}_{\text {SMPL,skin}}\) and \({\mathcal {S}}_{\text {in,skin}}\):
where \(n_{{\mathcal {S}}_{\text {in}}}\) denotes the number of points in \({\mathcal {S}}_{\text {in}}\), which normalizes the cost.
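Under these definitions, E_s can be sketched as below; the use of squared distances and of a scipy k-d tree for the nearest-neighbor search are assumptions, since the text specifies only a penalized Euclidean distance normalized by the number of input contour points.

```python
import numpy as np
from scipy.spatial import cKDTree

def skin_contour_cost(s_in_skin, s_smpl, n_in):
    """E_s sketch: squared 2D distances from skin-labeled input contour
    points to their nearest SMPL contour points, normalized by the total
    number of input contour points n_in."""
    dists, _ = cKDTree(s_smpl).query(s_in_skin)
    return float(np.sum(dists ** 2)) / n_in
```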
Cloth contour cost E_{c}
The cost function E_{c} controls the behavior of the contours not labeled as “skin,” so that they lie inside the input contour by an amount consistent with the cloth-skin displacement described in Section 2.3. Letting \({\mathcal {C}}_{\text {cloth}} = {\mathcal {C}} \setminus \{ \text {skin} \}\), E_{c} is defined as the sum of cost terms for each clothing category, calculated from contour distances:
Here, let d_{s} denote the signed Euclidean distance between p and \(\text {NN}_{{\mathcal {S}}_{\text {SMPL}}}(\mathbf {p})\), where the distance is positive if an input contour point p is outside the contour of the corresponding SMPL contour point \(\text {NN}_{{\mathcal {S}}_{\text {SMPL}}}(\mathbf {p})\). Accordingly, our distance function d_{p}, which considers the cloth-skin displacement, is defined as follows:
where ε is a small constant that avoids log(0) = −∞. Here, \(\text {pdf}_{c\in {\mathcal {C}}_{\text {cloth}}}\) denotes the truncated normal distribution modeled in Section 2.3, which returns the probability of a given cloth-skin displacement, truncated at \(\alpha _{\text {pdf}_{c}}\). When d_{s} is smaller than \(\alpha _{\text {pdf}_{c}}\), we instead use the Euclidean distance weighted by λ_{l}, which penalizes SMPL contour points lying outside the input silhouette. In the cost function, d_{p} forms a negative log-likelihood; E_{c} therefore drives the SMPL parameters so that the contour displacement fits the pre-constructed displacement model.
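The per-point distance d_p can be sketched as below, combining the truncated-normal negative log-likelihood with the λ_l-weighted penalty. The exact form of the penalty below the truncation bound is an assumption.

```python
import numpy as np
from scipy.stats import truncnorm

def cloth_point_cost(d_s, alpha, mu, sigma, lam_l=1.0, eps=1e-9):
    """d_p sketch for one cloth-labeled contour point.

    d_s: signed displacement between the input point and its SMPL match.
    alpha, mu, sigma: per-category truncated-normal parameters (Section 2.3).
    lam_l and the linear penalty form below alpha are assumptions.
    """
    if d_s < alpha:
        # SMPL contour point lies outside the clothed silhouette: penalize
        # with a lambda_l-weighted Euclidean distance.
        return lam_l * abs(d_s - alpha)
    a = (alpha - mu) / sigma  # standardized lower truncation point
    p = truncnorm.pdf(d_s, a, np.inf, loc=mu, scale=sigma)
    return -np.log(p + eps)  # negative log-likelihood of the displacement
```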
Experiments
We performed qualitative and quantitative experiments to evaluate the effectiveness of the proposed approach.
Quantitative evaluation
Experimental settings
For evaluation, we used a dataset consisting of time-series textured 3D human shapes acquired by 3D scanners [4], which includes the ground-truth shapes of the unclothed subjects. We used 3D videos of four subjects in motion, where each subject wears two clothing variations: (1) T-shirt/long pants and (2) a soccer outfit^{Footnote 2}. We generated input images by projecting selected frames from the dataset to a virtual camera of 860 × 860 resolution placed in front of the subjects. From each 3D video sequence, we sampled five frames as single-image inputs. While the dataset provides detailed 3D human shapes, the proposed approach uses SMPL models. As the ground-truth 3D models for this experiment, we therefore generated SMPL models fitted to the provided shapes by minimizing the distances between the 3D surfaces of the two models.
We compared the following two approaches:

1. Optimization using joint positions [5] (SMPLify).

2. Optimization using joint positions, silhouette contours, and the cloth-skin displacement model (proposed).
To evaluate shape accuracy, we deformed the estimated SMPL human model to the unit pose, which is the same pose as the ground-truth shapes provided in the dataset. We measured accuracy as the average error over the ten body measurements (shown in Fig. 3) between the estimated and ground-truth 3D models. For the evaluation, we unified the overall height of each model, because the two approaches do not estimate scale information.
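The height unification can be sketched as a simple linear rescaling of all body measurements by the ratio of heights; this exact protocol is an assumption, as the text states only that overall heights were unified.

```python
def normalize_by_height(measurements, model_height, reference_height):
    """Rescale body measurements so that models share the same overall
    height, removing the global scale that neither approach estimates.

    measurements: dict of measurement name -> length (same unit as heights).
    """
    scale = reference_height / model_height
    return {name: value * scale for name, value in measurements.items()}
```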
Results
Table 1 shows the relative error of shape estimation by each approach. Our approach yielded better accuracy than SMPLify. Beyond non-skeletal measurements (e.g., chest circumference), we found that several skeleton-related measurements, such as shoulder length, were also estimated more accurately than by SMPLify.
Visual comparison
While the results in the previous section demonstrate the improvements achieved by the proposed approach, the dataset employed for the quantitative evaluation [4] does not include loose clothing or a variety of body shapes. One important advantage of the proposed approach is its adaptability to a variety of clothing types and body shapes. We therefore describe a visual comparison using a variety of collected fashion photographs.
Figure 4 shows one result of the two approaches described in Section 3.1. Comparing the joint-based approach (SMPLify) with the proposed approach, the joint-based optimization does not produce a body shape that reflects musculature and fat distribution.
Conclusions
This paper has described the first approach that estimates human 3D pose and shape, including non-skeletal information, from a single RGB image. We model the displacement between clothed and naked contours for each clothing category, using an artificial dataset created by an autoencoder-based image generation method. The proposed approach optimizes an SMPL parametric human model through a likelihood-based cost function, using a cloth-skin displacement model, silhouette shape, and joint locations. In the experiments, the proposed approach estimated shape coefficients more accurately than the joint-based approach [5]. Extending the proposed approach to unsynchronized multi-view input is an interesting and viable research direction.
Notes
 1.
Because contours must belong to foreground regions, \({\mathcal {C}}\) does not include the “background” label.
 2.
The original version of [4] publicly includes five subjects; footage of one subject, who wears only a single clothing combination, was not used in this experiment.
References
 1
Yuan M, Khan IR, Farbiz F, Yao S, Niswar A, Foo MH (2013) A mixed reality virtual clothes try-on system. IEEE Trans Multimed 15(8):1958–1968.
 2
Balan A, Black MJ (2008) The naked truth: estimating body shape under clothing In: Proc. European Conf. on Computer Vision (ECCV’08), 15–29. Springer-Verlag Berlin Heidelberg, Marseille.
 3
Song D, Tong R, Chang J, Yang X, Tang M, Zhang JJ (2016) 3D body shapes estimation from dressed-human silhouettes. Comput Graph Forum 35(7):147–156.
 4
Zhang C, Pujades S, Black MJ, Pons-Moll G (2017) Detailed, accurate, human shape estimation from clothed 3D scan sequences In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR’17), 4191–4200. IEEE, Honolulu.
 5
Bogo F, Kanazawa A, Lassner C, Gehler P, Romero J, Black MJ (2016) Keep it SMPL: automatic estimation of 3D human pose and shape from a single image In: Proc. European Conf. on Computer Vision (ECCV’16), 561–578. Springer, Amsterdam.
 6
Loper M, Mahmood N, Romero J, Pons-Moll G, Black MJ (2015) SMPL: a skinned multi-person linear model. ACM Trans Graph 34(6):248.
 7
Pishchulin L, Insafutdinov E, Tang S, Andres B, Andriluka M, Gehler PV, Schiele B (2016) DeepCut: joint subset partition and labeling for multi-person pose estimation In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR’16), 4929–4937. IEEE, Las Vegas.
 8
Lin G, Milan A, Shen C, Reid I (2017) RefineNet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR’17), 1925–1934. IEEE, Honolulu.
 9
Yang W, Luo P, Lin L (2014) Clothing co-parsing by joint image segmentation and labeling In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR’14), 3182–3189. IEEE, Columbus.
 10
Yamaguchi K, Kiapour MH, Ortiz LE, Berg TL (2012) Parsing clothing in fashion photographs In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR’12), 3570–3577. IEEE, Providence.
 11
Lassner C, Pons-Moll G, Gehler PV (2017) A generative model of people in clothing In: Proc. IEEE Int’l Conf. on Computer Vision (ICCV’17), 853–862. IEEE, Venice.
 12
Cao Z, Simon T, Wei S, Sheikh Y (2017) Real-time multi-person 2D pose estimation using part affinity fields In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR’17), 7291–7299. IEEE, Honolulu.
 13
Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context In: Proc. European Conf. on Computer Vision (ECCV’14), 740–755. Springer International Publishing, Zurich.
Acknowledgements
This work was partly supported by cooperative research with Daikin Industries, Ltd.
Funding
The research reported in this publication was supported by Daikin Industries, Ltd., as cooperative research.
Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. Note that the datasets analyzed in the experiments should be obtained from their original authors.
Author information
Contributions
YS played the key role in the implementation, experiments, and paper editing. FO conducted the algorithm design and mainly wrote and edited the paper. IM supported the experiments and played an important role in editing the paper. YY played an important role in the research design. All authors have read and approved the final manuscript.
Corresponding author
Correspondence to Fumio Okura.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Keywords
 Human shape modeling
 Parametric 3D human model
 Cloth-skin displacement modeling