- Express Paper
- Open Access

# Learning 3D joint constraints from vision-based motion capture datasets

*IPSJ Transactions on Computer Vision and Applications*
**volume 11**, Article number: 5 (2019)

## Abstract

Realistic estimation and synthesis of articulated human motion must satisfy anatomical constraints on joint angles. We use a data-driven approach to learn human joint limits from 3D motion capture datasets. We represent joint constraints with a new formulation (*s*_{1},*s*_{2},*τ*) based on the swing-twist representation in exponential map form. Our parameterization is applied to the Human3.6M dataset to create a look-up map for each joint. These maps enable us to generate ‘synthetic’ datasets covering the entire rotation space of a given joint. A set of neural network discriminators is then trained on the synthetic datasets to learn valid/invalid joint rotations. The discriminators achieve accuracies between 94.4% and 99.4% for different joints. We validate the precision-accuracy trade-off of the discriminators and qualitatively evaluate classified poses with an interactive tool. The learned discriminators can be used as ‘priors’ for human pose estimation and motion synthesis.

## Introduction

3D human pose estimation from monocular images using kinematic models [1–3] has produced state-of-the-art, *robust* results compared with direct 2D-to-3D regression models [4, 5]. A large collection of human motion capture (mocap) datasets is now publicly available for training deep neural networks. Recently, a synthetic human motion dataset has even been employed to improve the training of deep neural networks [6]. However, the synthesis of *realistic* human motion over long durations remains an open problem [7].

Often, a *‘penalty term’* is used for exceeding joint limits, but these joint limits are worked out from a limited dataset [8, 9]. To the best of our knowledge, no recent attempt has been reported to obtain more realistic and statistically valid joint limits from publicly available mocap datasets.

Artifacts and outliers in mocap data can be pruned using joint constraints inferred directly from the respective datasets. Imposing accurate joint constraints as *priors* is valuable for realistic 3D human pose estimation and motion synthesis. Data-driven learning of joint constraints from mocap datasets is, however, a challenging task. The presence of outliers, and of voids caused by sparse data, makes it imperative to choose a representation of joint space that is robust to such artifacts. Moreover, instances of the class representing *invalid joint rotations* are missing from existing motion capture datasets. One-class learning on these datasets using positive-unlabelled (PU) algorithms has its own challenges. Instead, we *synthesize* both valid and invalid samples for balanced learning.

The main contributions of our paper are (i) a set of neural network discriminators that classify valid and invalid rotations for different joints, (ii) a new formulation of the swing-twist representation, (*s*_{1},*s*_{2},*τ*), for joint rotations, and (iii) a discretized look-up map based on this parameterization to synthesize labels in a given joint space.

An overview of the current work for the left shoulder joint is depicted in Fig. 1. Our three-parameter representation is unambiguous and free of anomalies, which makes it easier to learn a joint rotation manifold. Thus, neural network discriminators trained on our representation yield good accuracy and qualitative precision with very few parameters.

## Related works

Joint constraints depend on the choice of representation for joint rotations. A straightforward way to apply joint constraints in terms of Euler angles is to assume minimum and maximum limits on each angle (box model) [10]. However, this approach is too crude to capture realistic anatomical joint limits. The Euler angle representation also leads to discontiguous clusters of valid poses in joint space and often suffers from gimbal lock [11]. Because of these limitations, joint limits based on the box model are not an optimal choice. The swing-twist representation of joint rotation in [11–14] was devised to classify the validity of a joint rotation efficiently. However, the interdependence of swing and twist (intra-joint dependence) has not been satisfactorily addressed in earlier works.

In a data-driven approach, Herda et al. [15] proposed a parameterization in terms of an implicit surface fitted to joint trajectory data in quaternion space. The authors extend their method to deal with inter-joint dependencies^{1}, but it is computationally expensive and the resulting joint limits are not intuitive to interpret. Further, the limits are imposed directly on captured data, which might lead to unrealistic limits, especially in the presence of voids and sparse mocap data. Also, only the vector part (sin*θ*) of the quaternion is used in [15] to represent joint limits. This ignores the information in the scalar part (cos*θ*), leading to ambiguity in the rotation angle of the joint. The quaternion also has a dual representation (+q and −q) for a given joint rotation. A preprocessing step over the motion trajectory is often necessary to mitigate the effects of this duality.

Recently, Akhter et al. [9] learned pose-conditioned joint angle limits of trained gymnasts and athletes from a more controlled mocap dataset. Their work encodes only swing of a body segment in a discretized binary occupancy map. The occupancy map is defined using local spherical coordinates of the joint on a unit sphere. A table-look-up function for validity of pose is conditioned for every child joint based on its immediate parent in a kinematic chain while ignoring the intra-joint dependency^{1}.

The aforementioned limitations of previous works form the motivation of our research. In the present work, we instead use an exponential map parameterization to represent joint rotations. The exponential map representation avoids singularities as well as non-contiguous clusters [10]. It also avoids the duality problem described earlier.

We derive our joint constraints using the more comprehensive Human3.6M (H36M) dataset [16]. To overcome the limitation of intra-joint dependency, an additional parameter representing twist is learned. We encode the lower and upper bounds of the twist *τ* around a joint axis on the occupancy map for a given swing (*s*_{1},*s*_{2}).

## Proposed methodology

The constraints on 3D human joint rotations for routine activities are learned using the Human3.6M (H36M) dataset [16]. Our joint constraint parameterization is based on the swing-twist formulation. First, we decompose the joint rotation (expressed in quaternion form) into swing and twist parts. Then, we express each of the swing and twist parts as an exponential map and arrive at (*s*_{1},*s*_{2},*τ*).

In comparison, Akhter et al. [9] learn joint rotation constraints from a motion capture dataset based on the extreme range of joint movements. Moreover, they ignore twist around the limb axis because they convert the joint rotation from local 3D Cartesian coordinates to spherical coordinates (*ϕ*,*θ*), which represent only the swing of the body segment. In the current work, we compare our joint limits (swing part), derived from a range of normal activities performed by human subjects (in H36M), with the extreme range limits provided by [9].

### Swing-twist decomposition

The decomposition of a joint rotation into orthogonal components, i.e., swing and twist, is achieved by defining the twist with respect to an axis of the intrinsic (rotated) frame of reference. For example, if the *y*-axis of the reference frame is aligned with the child limb, the swing-twist decomposition of a rotation in terms of quaternions is expressed as follows. The quaternion *q*:=(*q*_{s},*q*_{x},*q*_{y},*q*_{z}) can be decomposed as *q*_{twist}∗*q*_{swing}, where *q*_{twist}:=(*c*_{y},0,*s*_{y},0) and *q*_{swing}:=(*c*_{zx},*s*_{x},0,*s*_{z}). Using the norm constraint of the unit quaternion and following [13], we obtain,
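Under the Hamilton product convention, the definitions above lead to the component expressions below. This is a sketch derived from *q* = *q*_{twist}∗*q*_{swing} together with the unit-norm constraint; it is not a verbatim restatement of the form used in [13], and it assumes \(q_s^2+q_y^2>0\) (when this quantity vanishes the swing is a half turn and the twist is not uniquely defined).

\[
\begin{aligned}
c_y &= \frac{q_s}{\sqrt{q_s^2+q_y^2}}, &
s_y &= \frac{q_y}{\sqrt{q_s^2+q_y^2}}, &
c_{zx} &= \sqrt{q_s^2+q_y^2},\\
s_x &= \frac{q_s q_x - q_y q_z}{\sqrt{q_s^2+q_y^2}}, &
s_z &= \frac{q_s q_z + q_x q_y}{\sqrt{q_s^2+q_y^2}}.
\end{aligned}
\]

As a consistency check, \(c_{zx}^2+s_x^2+s_z^2 = q_s^2+q_x^2+q_y^2+q_z^2 = 1\), so the swing quaternion is again of unit norm.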

Then, we introduce *η*_{twist}:=(0,*τ*,0) and *η*_{swing}:=(*s*_{1},0,*s*_{2}) as respective exponential map counter-parts, and get,
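For a unit quaternion \((\cos(\theta/2), \sin(\theta/2)\,\hat{n})\), the exponential map is the vector \(\theta\,\hat{n}\). Applying this standard relation to the twist and swing quaternions separately gives one possible explicit form. It is a sketch only: the use of atan2 and the symbol \(\theta_{swing}\) are assumptions of this illustration.

\[
\tau = 2\,\operatorname{atan2}(s_y, c_y), \qquad
\theta_{swing} = 2\arccos(c_{zx}), \qquad
(s_1, s_2) = \frac{\theta_{swing}}{\sin(\theta_{swing}/2)}\,(s_x, s_z),
\]

with the limit \((s_1, s_2) \to 2\,(s_x, s_z)\) as \(\theta_{swing} \to 0\).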

Using Eq. (2), we get the swing components (*s*_{1},*s*_{2}) and the twist component *τ* of a joint rotation in exponential map form. This representation is convenient because it (i) is unambiguous (no duality), (ii) avoids non-contiguous regions and jumps in the joint trajectory, and (iii) requires no additional constraints such as unit norm. Compared with the swing-twist decomposition in quaternion form [13], it requires fewer parameters (3 parameters in *η*_{twist},*η*_{swing} versus 5 in *q*_{twist} and *q*_{swing}). Consequently, it provides a compact joint representation that is well suited to learning valid/invalid poses.
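The decomposition and the conversion to (*s*_{1},*s*_{2},*τ*) can be sketched in Python as follows. The function names `quat_mul` and `swing_twist_expmap`, the (w, x, y, z) component ordering, the sign flip that maps *q* and −*q* to the same output, and the zero-twist convention for a half-turn swing are all assumptions made for this illustration rather than details taken from the paper. Angles are returned in radians.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def swing_twist_expmap(q, eps=1e-12):
    """Return (s1, s2, tau) in radians for a unit quaternion q = q_twist * q_swing.

    The twist is taken about the y-axis; the swing axis lies in the x-z plane.
    """
    qs, qx, qy, qz = q
    if qs < 0.0:
        # q and -q encode the same rotation; pick one representative (assumption).
        qs, qx, qy, qz = -qs, -qx, -qy, -qz

    n = math.hypot(qs, qy)
    if n < eps:
        # Half-turn swing: the twist is not unique, so assume zero twist.
        c_y, s_y = 1.0, 0.0
    else:
        c_y, s_y = qs / n, qy / n

    # q_swing = conj(q_twist) * q, written out component-wise.
    c_zx = c_y * qs + s_y * qy
    s_x = c_y * qx - s_y * qz
    s_z = c_y * qz + s_y * qx

    tau = 2.0 * math.atan2(s_y, c_y)
    half_sin = math.hypot(s_x, s_z)           # equals sin(theta_swing / 2)
    theta = 2.0 * math.atan2(half_sin, c_zx)  # swing angle in [0, pi]
    scale = theta / half_sin if half_sin > eps else 2.0
    return scale * s_x, scale * s_z, tau
```

Composing a known twist about *y* with a known swing about an axis in the *x*-*z* plane via `quat_mul` and passing the result to `swing_twist_expmap` should recover the original twist angle and swing vector, which gives a simple round-trip check of the sketch.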

## Implementation

The Euler angle representations of joint angles in the H36M dataset and in the dataset of [9] are transformed into the (*s*_{1},*s*_{2},*τ*) representation, as described in the previous section. The resulting joint rotation space is continuous, contiguous, and well-behaved for discriminative learning (Fig. 2).

We found a considerable number of outliers in the mocap dataset provided by Akhter et al. [9]. For instance, the left and right knee flexion of all subjects performing various activities is shown in Fig. 3. It clearly shows an impractical knee extension (upward) of up to −20^{∘} for certain motion sequences in the dataset. Moreover, we believe that the limited set of mocap data in [9] is not enough to learn inter-joint dependency, i.e., there are insufficient instances of swing of the child limb conditioned on a prior swing of the parent limb.

The comparison of swing limits for the right hip joint (right up leg) from both datasets in swing space (*s*_{1},*s*_{2}) is illustrated in Fig. 2. The swing limits obtained using H36M, although more restrictive, are truly representative of the human motion involved in routine activities. These limits also lie well within the extreme range of motion of [9]. Therefore, we restrict our learning to the H36M dataset.

### Look-up maps for synthetic dataset

Let \(\mathbb {M}\) be the set of all samples of H36M dataset for joint J, represented in (*s*_{1},*s*_{2},*τ*) space. We define a 2D grid (*s*^{i},*s*^{j}) on swing subspace, where *s*^{i},*s*^{j}∈{− 180,…,180} and a discrete mapping as follows

Now for each grid square (*s*^{i},*s*^{j}), if at least one mocap sample \(\in \mathbb {M}\) lies in it, we term it as valid swing bin. The discretized swing occupancy map is thus defined by

With each valid occupancy bin, we further associate minimum and maximum bounds of *τ*, based on all the instances of mocap data occurring in that bin

In order to obtain smooth twist angle bounds, we apply 3 × 3 Gaussian kernel(s) on (*s*_{1},*s*_{2},*τ*_{min}) and (*s*_{1},*s*_{2},*τ*_{max}) surfaces. Now given any (*s*_{1},*s*_{2},*τ*), we define a validity function in the following way
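In symbols, the definitions of this subsection can be sketched as below. The bin width \(\Delta\), the binning function \(b\), the occupancy symbol \(O\), and the tilde marking the Gaussian-smoothed bounds are notation assumed for this illustration; the exact form of the discrete mapping and the bin size are not specified here.

\[
\begin{aligned}
b(s_1, s_2) &= \left(\Delta\left\lfloor s_1/\Delta \right\rfloor,\; \Delta\left\lfloor s_2/\Delta \right\rfloor\right),\\
O(s^i, s^j) &=
\begin{cases}
1 & \text{if } \exists\,(s_1,s_2,\tau)\in\mathbb{M} \text{ with } b(s_1,s_2)=(s^i,s^j),\\
0 & \text{otherwise,}
\end{cases}\\
\tau_{min}(s^i,s^j) &= \min\{\tau : (s_1,s_2,\tau)\in\mathbb{M},\; b(s_1,s_2)=(s^i,s^j)\},\\
\tau_{max}(s^i,s^j) &= \max\{\tau : (s_1,s_2,\tau)\in\mathbb{M},\; b(s_1,s_2)=(s^i,s^j)\},\\
\text{isValid}(s_1,s_2,\tau) &=
\begin{cases}
1 & \text{if } O(b(s_1,s_2))=1 \text{ and } \tilde{\tau}_{min}(b(s_1,s_2)) \le \tau \le \tilde{\tau}_{max}(b(s_1,s_2)),\\
0 & \text{otherwise.}
\end{cases}
\end{aligned}
\]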

We generate synthetic ground truth of labeled samples in (*s*_{1},*s*_{2},*τ*) space of a joint J by uniform random sampling on *s*_{1},*s*_{2},*τ*∈[− 180,180] and using isValid() function defined in Eq. 6.
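The look-up map construction and the synthetic labelling can be sketched with NumPy as follows. Several choices here are assumptions made for illustration and are not taken from the paper: the bin width `bin_deg` (10° by default), the kernel weights [1 2 1; 2 4 2; 1 2 1], the normalised smoothing that averages only over occupied neighbouring bins, and all function names. Angles are in degrees.

```python
import numpy as np


def gaussian3x3_smooth(grid, mask):
    """Smooth `grid` with a 3x3 Gaussian kernel, averaging only over masked cells."""
    kernel = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float)
    vals = np.pad(np.where(mask, grid, 0.0), 1)
    wts = np.pad(mask.astype(float), 1)
    n, m = grid.shape
    num = np.zeros((n, m))
    den = np.zeros((n, m))
    for di in range(3):
        for dj in range(3):
            num += kernel[di, dj] * vals[di:di + n, dj:dj + m]
            den += kernel[di, dj] * wts[di:di + n, dj:dj + m]
    out = np.full((n, m), np.nan)             # unoccupied bins stay NaN
    np.divide(num, den, out=out, where=mask)  # den > 0 wherever mask is True
    return out


def _bin_index(s, bin_deg, n_bins):
    """Map swing values in [-180, 180] to integer bin indices."""
    return np.clip(np.floor((np.asarray(s) + 180.0) / bin_deg).astype(int), 0, n_bins - 1)


def build_lookup(samples, bin_deg=10.0):
    """Build (occupancy, tau_min, tau_max, bin_deg) from (N, 3) rows of (s1, s2, tau)."""
    n_bins = int(round(360.0 / bin_deg))
    i = _bin_index(samples[:, 0], bin_deg, n_bins)
    j = _bin_index(samples[:, 1], bin_deg, n_bins)
    occ = np.zeros((n_bins, n_bins), dtype=bool)
    tau_min = np.full((n_bins, n_bins), np.inf)
    tau_max = np.full((n_bins, n_bins), -np.inf)
    occ[i, j] = True                          # at least one mocap sample in the bin
    np.minimum.at(tau_min, (i, j), samples[:, 2])
    np.maximum.at(tau_max, (i, j), samples[:, 2])
    return occ, gaussian3x3_smooth(tau_min, occ), gaussian3x3_smooth(tau_max, occ), bin_deg


def is_valid(lookup, s1, s2, tau):
    """True where the swing bin is occupied and tau lies within the smoothed bounds."""
    occ, tau_lo, tau_hi, bin_deg = lookup
    n_bins = occ.shape[0]
    i = _bin_index(s1, bin_deg, n_bins)
    j = _bin_index(s2, bin_deg, n_bins)
    with np.errstate(invalid="ignore"):       # NaN bounds compare as False
        return occ[i, j] & (tau >= tau_lo[i, j]) & (tau <= tau_hi[i, j])


def sample_synthetic(lookup, n, rng):
    """Draw n uniform samples in [-180, 180]^3 and label them with is_valid."""
    pts = rng.uniform(-180.0, 180.0, size=(n, 3))
    return pts, is_valid(lookup, pts[:, 0], pts[:, 1], pts[:, 2])
```

A call such as `sample_synthetic(build_lookup(samples), 100000, np.random.default_rng(0))` would return uniformly distributed points together with boolean validity labels; because the valid region usually covers a small part of the cube, most labels are expected to be `False`.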

### Training discriminators

For any given joint, invalid instances occupy a large portion of the complete joint rotation space. Therefore, the sets representing the valid and invalid classes are imbalanced. We perform random sampling of valid and invalid instances to obtain sets in a 25*%*/75*%* ratio, respectively. The sampled sets are equally spaced ‘labeled’ training instances over the entire joint space. An example of such a sampled joint space for the right hip joint (right up leg) is shown in Fig. 4.

A balanced (50%/50%) subset is selected from these sampled valid/invalid instances using random sampling. The subset is then divided into training, validation, and test samples in a ratio of (70*%*,15*%*,15*%*). For comparison, we also independently tested over-sampling of the valid class and under-sampling of the invalid class to obtain a balanced subset. However, the sampling proposed above leads to the least variation in performance when evaluation is performed on multiple test sets. The neural network discriminators are trained for each joint using an equal number of valid/invalid instances in each batch. We train an ensemble of 20 discriminators for each joint to evaluate performance.
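The balanced subset selection and the 70/15/15 split can be sketched as follows. The function name `balanced_split`, the choice to draw as many instances per class as the smaller class provides, and the use of a NumPy random generator are assumptions of this illustration.

```python
import numpy as np


def balanced_split(points, labels, rng, ratios=(0.70, 0.15, 0.15)):
    """Return (train, val, test), each a (points, labels) pair with 50/50 classes overall."""
    valid = np.flatnonzero(labels)
    invalid = np.flatnonzero(~labels)
    k = min(len(valid), len(invalid))  # equal count from each class
    chosen = np.concatenate([
        rng.choice(valid, size=k, replace=False),
        rng.choice(invalid, size=k, replace=False),
    ])
    rng.shuffle(chosen)                # mix the two classes before splitting
    n_train = round(ratios[0] * len(chosen))
    n_val = round(ratios[1] * len(chosen))
    parts = np.split(chosen, [n_train, n_train + n_val])
    return [(points[p], labels[p]) for p in parts]
```

Note that this sketch balances the subset as a whole; the individual training, validation, and test parts are only approximately balanced because the split is random.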

We also qualitatively evaluate the ability of neural network to discriminate valid/invalid joint rotations by means of an interactive validation tool. The tool also helps us to ascertain any anomalous examples found in motion capture datasets contradictory to the learned joint limits.

## Results and discussion

The neural network learns a smooth implicit function that represents the valid range of rotation for a specific joint, i.e., the valid swing and twist components of the joint rotation.

We achieve over 95% accuracy (on all joint discriminators) using only one hidden layer in a fully connected neural network (FCN); the results are shown in Table 1. The table reports the result obtained by the best of the 20 classifiers trained for each joint. The number of hidden nodes that gives the best result is also reported for each joint in Table 1. We tested 4, 8, 16, and 32 hidden nodes and chose the number at which performance reaches a plateau. The network with eight hidden nodes is found to be optimal for most of the joints.
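To give a sense of how small such a discriminator is, the forward pass of a one-hidden-layer network on (*s*_{1},*s*_{2},*τ*) can be sketched with NumPy. The tanh hidden activation, the sigmoid output, the scaling of inputs by 180, the bias terms, and the random initialisation are assumptions of this illustration, since these details are not stated here; the training loop is omitted.

```python
import numpy as np


def init_discriminator(n_hidden, rng):
    """Randomly initialised weights for a 3 -> n_hidden -> 1 network."""
    return {
        "W1": rng.normal(0.0, 0.5, size=(3, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.5, size=(n_hidden, 1)),
        "b2": np.zeros(1),
    }


def predict_valid_prob(params, x):
    """Probability that each row (s1, s2, tau), given in degrees, is a valid rotation."""
    x = np.atleast_2d(x) / 180.0  # scale inputs to roughly [-1, 1]
    h = np.tanh(x @ params["W1"] + params["b1"])
    z = h @ params["W2"] + params["b2"]
    return 1.0 / (1.0 + np.exp(-z[:, 0]))


def n_parameters(params):
    """Total number of trainable scalars."""
    return sum(v.size for v in params.values())
```

Under these assumptions, a network with eight hidden nodes has 3·8 + 8 + 8·1 + 1 = 41 trainable parameters.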

In our view, this is only feasible because of our choice of joint rotation parameterization, which is free of ambiguities and discontinuities over the joint rotation manifold. The discretization scheme proposed on the swing map (*s*_{1},*s*_{2}) for the complete range of *τ* rotation is also more intuitive. It avoids the limitations noted by Herda et al. [15] in their earlier work on quaternion-based joint limits. The representation of the joint manifold in quaternion space and in Euler angle space is further highlighted in Figs. 5 and 6, respectively. The anomalies of both representations can be seen by comparison with our proposed representation, shown in Fig. 4. As these figures make evident, discretizing the joint space in quaternions or Euler angles to label the invalid region is not straightforward. Therefore, no further analysis of these representations is offered in this paper.

The output of the discriminator also shows that learning is smooth and effectively accounts for outliers in the data. The corresponding outliers in the *synthetic* dataset that the discriminator learns as invalid are shown in Fig. 7. However, accuracy on the ‘synthetic’ ground truth is not an absolute figure of merit because of voids and sparse data. Hence, the precision of joint rotation classification in a given pose is further evaluated qualitatively using an interactive tool. The output of our interactive pose validation tool is depicted in Fig. 8. Increasing the number of neurons or hidden layers of the joint discriminators brought no further improvement in precision or in the precision-accuracy trade-off. As this is a pilot study, we perform only limited qualitative tests to check the veracity of our method. A detailed evaluation with multiple participants, to validate synthetic invalid poses and evaluate discriminator performance, will be undertaken in extended work.

The *inter-joint* (parent-child) dependency of the range of rotation in a hierarchy of joints [9, 15] is implicitly learned in our approach. Our approach emphasizes the parent-child coupling that arises from twist itself. For example, a constraint on the twist of the upper limb avoids an *unrealistic* pose of the lower limb through twist propagation along the kinematic chain. Thus, presented with any (*s*_{1},*s*_{2},*τ*) parameterized joint rotation, our trained neural networks can classify each joint as valid or invalid independently.

Table 2 shows a discriminative comparison matrix on test poses of the left shoulder using the method of Akhter et al. [9] and our trained network. All poses classified as valid by our network are also found valid by [9]. However, since [9] does not take twist into account, the joint rotations classified as invalid by our network on account of twist are still valid in [9]. Figure 9 further highlights that the isValid() function of Akhter et al. allows unrealistic swing of the left shoulder, as almost the whole (90%) of the swing space is classified as valid. In contrast, the valid region in swing space learned by our discriminator is more plausible. These two factors account for the high proportion (36.65%) of left shoulder poses classified as invalid by our network while still being classified as valid by [9] in Table 2.

## Conclusion

The articulated movement of the human body is constrained by limits on 3D joint rotations. We present a formulation of the swing and twist components of joint rotations in exponential maps to learn joint limits with a data-driven approach. Our representation is used to encode valid poses in an intuitive and physically meaningful way. A discretized look-up map is derived to label valid/invalid rotations in the parametrized joint space. We also train a set of neural network discriminators to classify valid and invalid rotations for different joints. Used as priors, our trained discriminators can support more accurate 3D human pose estimation and motion synthesis. The learned joint constraints are useful for representing the normal range of human motion, which is especially vital for social affordance applications such as vision-based human-robot interaction. The data-driven approach also provides meaningful limits for monitoring the rehabilitation of patients with joint injuries. Since the learning is data-driven, the quality of the learned constraints depends on the range of motion available in the mocap data. Therefore, we will extend the current method by capturing more data and using other publicly available motion capture datasets. Our visualization tool can guide the collection of motion data close to the constraints. A more comprehensive qualitative study that validates synthetic invalid poses with the help of human evaluation would also result in better ground truth generation.

## Notes

Inter-joint dependency is defined between the parent joint and the subsequent children joint angles in a kinematic chain, while intra-joint dependency exists between swing and twist of the same joint.

## References

1. Kanazawa A, Black MJ, Jacobs DW, Malik J (2018) End-to-end recovery of human shape and pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 19-21 June 2018; Salt Lake City Utah, USA, 7122–7131. IEEE.
2. Mehta D, Sridhar S, Sotnychenko O, Rhodin H, Shafiei M, Seidel H-P, Xu W, Casas D, Theobalt C (2017) Vnect: Real-time 3d human pose estimation with a single RGB camera. ACM Trans Graph (TOG) 36(4):44.
3. Zhou X, Sun X, Zhang W, Liang S, Wei Y (2016) Deep kinematic pose regression. In: European Conference on Computer Vision (ECCV): 8-16 October 2016; Amsterdam, Netherlands, 186–201. Springer, Cham.
4. Park S, Hwang J, Kwak N (2016) 3d human pose estimation using convolutional neural networks with 2d pose information. In: European Conference on Computer Vision (ECCV): 8-16 October 2016; Amsterdam, Netherlands, 156–169. Springer, Cham.
5. Tekin B, Marquez Neila P, Salzmann M, Fua P (2017) Learning to fuse 2d and 3d image cues for monocular body pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV): 22-29 October 2017; Venice, Italy, 3941–3950. IEEE.
6. Varol G, Romero J, Martin X, Mahmood N, Black MJ, Laptev I, Schmid C (2017) Learning from synthetic humans. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 21-26 July 2017; Hawaii, United States, 109–117. IEEE.
7. Pavllo D, Grangier D, Auli M (2018) QuaterNet: A Quaternion-based Recurrent Model for Human Motion. In: British Machine Vision Conference (BMVC): 3-6 September 2018; Newcastle, UK, 299. BMVA.
8. Elhayek A, Kovalenko O, Murthy P, Malik J, Stricker D (2018) Fully automatic multi-person human motion capture for vr applications. In: EuroVR International Conference on Virtual Reality and Augmented Reality: 22–23 October 2018; London, UK, 28–47. Springer, Cham.
9. Akhter I, Black MJ (2015) Pose-conditioned joint angle limits for 3d human pose reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 8-10 June 2015; Boston Massachusetts, USA, 1446–1455. IEEE.
10. Engell-Nørregård M, Erleben K (2009) Estimation of joint types and joint limits from motion capture data. In: International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG): 2-5 February 2009; Plzen, Czech Republic, 9–16. Eurographics.
11. Baerlocher P, Boulic R (2001) Parametrization and range of motion of the ball-and-socket joint. In: Thalmann NM, Thalmann D (eds) Deformable Avatars, 180–190. Springer.
12. Shao W, Ng-Thow-Hing V (2003) A general joint component framework for realistic articulation in human characters. In: Proceedings of the 2003 symposium on Interactive 3D graphics (I3D): 27-30 April 2003; Monterey California, USA, 11–18. ACM.
13. Dobrowolski P (2015) Swing-twist decomposition in clifford algebra. arXiv: Computing Research Repository (CoRR) 1506.05481. https://arxiv.org/abs/1506.05481v1.
14. Grassia FS (1998) Practical parameterization of rotations using the exponential map. J Graph Tools 3(3):29–48.
15. Herda L, Urtasun R, Fua P, Hanson A (2003) Automatic determination of shoulder joint limits using quaternion field boundaries. Int J Robot Res 22(6):419–436.
16. Ionescu C, Papava D, Olaru V, Sminchisescu C (2014) Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Trans Pattern Anal Mach Intell 36(7):1325–1339.

## Acknowledgements

The authors would like to acknowledge the support provided by Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Kaiserslautern, Germany and National University of Sciences and Technology (NUST), Pakistan for this work.

### Funding

The research in this publication was supported under the VIDETE project (grant number 01IW18002) by the Federal Ministry of Education and Research, Germany, and by Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Kaiserslautern, Germany.

### Availability of data and materials

The Human3.6M dataset [16] used in this research is publicly available at http://vision.imar.ro/human3.6m/

## Author information

### Contributions

PM processed the H36M dataset and analyzed different joint orientation representations to synthesize the look-up maps. HTB worked on the swing-twist formulation and performed the training of the discriminators. Both authors contributed equally to writing the manuscript. SH contributed improvements to the mathematical formalism and the structure of the paper. AK designed the visualization tool with input and evaluation from DS. DS also made a valuable contribution in reviewing the final manuscript. All authors read and approved the final manuscript.

## Ethics declarations

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## About this article

### Cite this article

Murthy, P., Butt, H.T., Hiremath, S. *et al.* Learning 3D joint constraints from vision-based motion capture datasets.
*IPSJ T Comput Vis Appl* **11**, 5 (2019). https://doi.org/10.1186/s41074-019-0057-z

DOI: https://doi.org/10.1186/s41074-019-0057-z

### Keywords

- Human pose estimation
- Motion capture
- Joint constraint