Human pose motion capture is widely used in many applications, such as computer graphics for movies and games, sports science, and sign language recognition. For these purposes, easy and low-cost methods are needed to capture human pose motion. One of the main methods is human pose estimation. In recent years, human pose estimation has been actively researched, and deep neural network (DNN) approaches have attracted considerable attention.
In human pose estimation research, RGB or RGB-D cameras are commonly used as input devices, capturing videos, images, or depth data. The input data are typically taken from the second-person perspective and cover almost the whole body of the target person. DNN models estimate 2D or 3D joint positions from these data.
Pose estimation for images taken from the first-person perspective is called “egocentric-view pose estimation.” In the egocentric-view setting, images or depth data are captured by body-attached devices. 3D pose estimation from egocentric-view inputs is portable and can track a specific person. However, it usually captures only a limited set of joints because the input devices have a bounded angle of view. It is therefore extremely difficult for a body-attached camera to obtain images that contain enough information to estimate all joints. Additionally, the dynamic parts of the body (e.g., the hands or feet) frequently move out of the camera's angle of view. These conditions make the estimation difficult.
We intend to apply 3D human pose estimation to sign language recognition and translation. Sign language is the visual communication method used by deaf people across the world; however, each region has different signs, as with oral languages. Sign language is composed of several elements: handshapes, movements, positions, facial expressions, and peripheral information. For example, when the index finger points to something in a sign language sentence, the meaning changes depending on what is being pointed to.
Some methods have been proposed for sign language recognition and translation from images [1, 2]. Most existing studies handle regular images taken from the second-person perspective; thus, these approaches require a camera to be installed in front of the signer in the usage scene. To overcome this restriction, we use a wearable camera as the input device, because such a system is available to signers anytime and anywhere.
The angle of view of a typical wearable camera is not wide enough to obtain the information needed for sign language recognition, because sign language expresses meaning using the space reachable by the hands and the peripheral information. For example, the sign for “head” is expressed by pointing to the head with the index finger. However, if the index finger points to another person, the sign means “you” or “that person.” For these reasons, the ability to capture the whole surrounding view is necessary for the wearable camera in our approach. Additionally, tracking the signer's pose is also important for sign language recognition because some signs express meaning through the relative positions of body parts. For example, if the index finger points to the signer's own face or chest, the sign means “myself.”
We aim to develop a sign language recognition system that uses a wearable omnidirectional camera as the input device, which is portable for daily mobile use and capable of capturing enough of the elements required for sign language recognition. As a first step toward this system, this paper studies 3D human pose estimation models for RGB images taken by the wearable omnidirectional camera.
An omnidirectional camera captures the entire surrounding view in a single planar image, which is converted to an equirectangular image in our setting. We attach the omnidirectional camera in front of the person's neck to obtain images that include the sign language elements: the face, the hands, and the peripheral environment. Figure 1 shows the omnidirectional camera setting and an equirectangular image taken by the device. An omnidirectional camera closely attached to the human body captures images with the following characteristics, which differ from regular images. Distortion: objects located around the polar points of the camera appear stretched wider than they actually are. Disconnection: objects located on the image border are split across both edges of the image; therefore, some parts of the human body often appear disconnected. For instance, when a hand moves out of one edge, it reappears from the opposite edge.
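To make these two properties concrete, the following minimal sketch projects a 3D point in the camera frame onto equirectangular pixel coordinates under an idealized spherical camera model; the image size, coordinate conventions, and function name are illustrative assumptions rather than the exact specification of our device.

```python
import numpy as np

def project_to_equirectangular(point_3d, width=1920, height=960):
    """Project a 3D point (camera frame) to equirectangular pixel coordinates.
    Idealized sketch; the conventions and image size are assumptions."""
    x, y, z = point_3d
    lon = np.arctan2(x, z)                         # azimuth in (-pi, pi]
    lat = np.arcsin(y / np.linalg.norm(point_3d))  # elevation in [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * width          # horizontal pixel
    v = (0.5 - lat / np.pi) * height               # vertical pixel
    # Disconnection: horizontal coordinates wrap around the image border,
    # so a hand leaving one edge reappears at the opposite edge.
    return u % width, v

# Distortion: near the poles (|lat| -> pi/2), a small lateral displacement
# sweeps a large range of u, so objects around the polar points appear
# stretched horizontally in the equirectangular image.
print(project_to_equirectangular((0.10, 0.00, 1.00)))  # near the equator
print(project_to_equirectangular((0.10, 0.99, 0.05)))  # near a pole
```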
Our approach is based on a convolutional neural network (CNN), similar to most recent monocular 3D human pose estimation methods. The existing methods, however, do not apply well to our setting. First, their training data were captured with regular cameras placed so that the cameras could capture almost the whole body from the second-person perspective. Second, most of the existing methods assume a skeletal structure on the image plane when extending 2D joint locations to 3D positions. For these reasons, those methods fail not only at 3D pose estimation but also at 2D estimation, the basic step toward 3D pose estimation, on our distorted and disconnected images. Figure 2 shows the estimation results of an existing 2D pose estimation method and our model.
To overcome these difficulties, we collect a new dataset captured by a wearable omnidirectional camera. More importantly, we introduce the location-maps method used in VNect, the 3D human pose estimation model proposed by Mehta et al. [4], to extend 2D joint locations to 3D positions. The method does not assume a human body structure and derives the x, y, and z coordinates of each joint separately from its 2D location. Therefore, the location-maps method can reduce the impact of the optical properties of equirectangular images, namely distortion and disconnection. Xu et al. [5] reported valid results of VNect for distorted images taken by a fish-eye camera placed close to the body.
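As a rough illustration of why this readout is independent of any image-plane skeleton, the sketch below reads the x, y, and z location maps of each joint at the maximum of its 2D heatmap, in the style of VNect; the tensor shapes and function name are illustrative assumptions.

```python
import numpy as np

def read_location_maps(heatmaps, loc_x, loc_y, loc_z):
    """Read 3D joint positions from per-joint location maps.

    heatmaps, loc_x, loc_y, loc_z: arrays of shape (num_joints, H, W),
    as produced by a VNect-style network (shapes are assumptions here).
    Returns an array of shape (num_joints, 3) with 3D coordinates.
    """
    num_joints = heatmaps.shape[0]
    joints_3d = np.zeros((num_joints, 3))
    for j in range(num_joints):
        # 2D joint location: the pixel with the maximum heatmap response.
        v, u = np.unravel_index(np.argmax(heatmaps[j]), heatmaps[j].shape)
        # The x, y, and z coordinates are read out of the location maps at
        # that pixel, independently for each axis and without any skeletal
        # prior on the (possibly distorted or disconnected) image plane.
        joints_3d[j] = (loc_x[j, v, u], loc_y[j, v, u], loc_z[j, v, u])
    return joints_3d
```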
We validate that the location-maps method can estimate 3D joint positions not only under the distortion but also under the disconnection caused by the wearable omnidirectional camera. Furthermore, we propose a new estimation model that uses the location-maps method, replacing the backbone network with a state-of-the-art 2D pose estimation model [6]. Our model has a simpler architecture than VNect, in which the location-maps method was proposed. In Section 5, we evaluate our model and VNect in terms of accuracy and computational complexity. In Section 6, we analyze the characteristics of location maps from two perspectives: the map variance and the map scale.
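For intuition only, the following PyTorch sketch outlines one possible way to attach heatmap and location-map heads to a 2D pose backbone; the backbone choice, head design, and channel counts are assumptions rather than the exact architecture evaluated in Section 5.

```python
import torch.nn as nn

class LocationMapPoseNet(nn.Module):
    """Hypothetical sketch: a 2D pose backbone followed by a convolutional
    head that predicts, for each joint, a 2D heatmap plus x/y/z location
    maps (4 channels per joint)."""

    def __init__(self, backbone: nn.Module, feat_channels: int, num_joints: int):
        super().__init__()
        self.backbone = backbone  # e.g., a 2D pose estimation network
        self.head = nn.Conv2d(feat_channels, num_joints * 4, kernel_size=1)
        self.num_joints = num_joints

    def forward(self, image):
        feats = self.backbone(image)        # (B, feat_channels, H, W)
        maps = self.head(feats)             # (B, num_joints * 4, H, W)
        b, _, h, w = maps.shape
        maps = maps.view(b, self.num_joints, 4, h, w)
        heatmaps = maps[:, :, 0]            # 2D joint heatmaps
        loc_xyz = maps[:, :, 1:]            # x, y, z location maps
        return heatmaps, loc_xyz
```

The 3D joint positions can then be read from `loc_xyz` with the same per-joint readout sketched above.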
To the best of our knowledge, our work is the first approach to estimate 3D human poses from an omnidirectional camera closely attached to the body. Although the proposed method is a combination of existing techniques, our work is practical and useful from an application viewpoint. The contributions of this paper are summarized as follows:
- We collect a new sign language dataset composed of equirectangular images and synchronized 3D joint positions. The equirectangular images are taken by a wearable omnidirectional camera in our setting.
- We propose a new 3D human pose estimation model using the location-maps method for distorted and disconnected images. The model has a simpler architecture than VNect, the reference model, yet performs better with respect to accuracy and computational complexity.
- We reveal two characteristics of location maps: (1) the map variance affects the robustness of extending 2D joint locations to 3D positions against 2D estimation errors (a toy sketch below illustrates this point), and (2) the 3D position accuracy is related to the accuracy of the 2D locations relative to the map scale.
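As a toy illustration of point (1) only, and not the analysis in Section 6 itself, the sketch below compares how a smooth and a sharp location map amplify a small 2D estimation error in the 3D readout; all numbers are arbitrary assumptions.

```python
import numpy as np

# Toy 1D "location maps" around a joint's true 2D position (index 10):
# a smooth map (low local variance) and a sharp map (high local variance).
coords = np.arange(21, dtype=float)
smooth_map = 0.50 + 0.002 * (coords - 10)  # value changes 2 mm per pixel
sharp_map = 0.50 + 0.050 * (coords - 10)   # value changes 50 mm per pixel

true_idx, noisy_idx = 10, 13               # a 3-pixel 2D estimation error
for name, loc_map in [("smooth", smooth_map), ("sharp", sharp_map)]:
    err = abs(loc_map[noisy_idx] - loc_map[true_idx])
    print(f"{name} map: 3D readout error = {err * 1000:.0f} mm")
# smooth map: 6 mm, sharp map: 150 mm -> a lower map variance around the
# joint makes the 3D readout more robust to 2D estimation errors.
```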