
MultiQ: single sensor-based multi-quality multi-modal large-scale biometric score database and its performance evaluation


Single sensor-based multi-modal biometrics is a promising approach that offers simple system construction, low cost, and wide applicability to real situations such as CCTV footage-based criminal investigations. In multi-modal biometrics, fusion at the score level is a popular and promising approach, and data qualities that affect the matching score of each modality are often incorporated in a quality-dependent score-level fusion framework. This paper presents a very large-scale single sensor-based multi-quality multi-modal biometric score database called MultiQ Score Database version 2 to advance research into the evaluation, comparison, and benchmarking of score-level fusion approaches using both quality-independent and quality-dependent protocols. We extracted gait, head, and height modalities from the OU-ISIR Gait Database and introduce spatial resolution (SR), temporal resolution (TR), and view as quality measures that significantly affect biometric system performance. We considered seven and 10 scaling factors for SR and TR, respectively, with four view variations. We then constructed a database comprising approximately 4 million genuine scores and 7.5 billion imposter scores. To evaluate this database, we set two different protocols and provide a set of recognition accuracies for state-of-the-art approaches under both the quality-independent and quality-dependent protocols. This database and the evaluation results will be beneficial for score-level fusion research. Additionally, we provide a detailed analysis of the recognition accuracies associated with the gait, head, and height modalities at different spatial/temporal resolutions and views. These analyses may be useful in criminal investigation research.

1 Introduction

Biometrics technology recognizes a person based on their physiological and/or behavioral traits [1], including their DNA, fingerprint, palmprint, finger veins, face, signature, and gait. Biometrics has been widely used in applications such as access control, security, surveillance, and forensics [2, 3]. Biometric systems that use a single biometric trait for recognition are called uni-modal biometric systems, and are regarded as mainstream biometric system approaches because of their simple configurations and high usability. These systems are, however, commonly affected by practical problems such as noisy sensor data, non-universality, and vulnerability to spoofing attacks [4]. One solution to these problems is to use multiple biometric traits for recognition, and systems of this type are known as multi-modal biometric systems [5]. A wide variety of biometric modality combinations have been discussed for recognition in the literature, including fingerprint and iris [6], iris and face [7], face and ocular [8], face and gait [9, 10], and face with gait and height [11].

The most important issue in multi-modal biometrics is how to fuse these individual modalities. Fusion of multiple modalities can be performed at various levels, including the sensor, feature, score, rank, and decision levels. In sensor-level fusion, the raw data that are acquired from multiple sensors are integrated (e.g., by mosaicing, where multiple individual 2D images are combined to generate a single image); in feature-level fusion, the features that are extracted from multiple modalities are concatenated into a single feature vector and are then used for classification; in score-level fusion, the multiple matching scores are fused into a single score and a decision is then made on the basis of the fused score; in decision-level fusion, multiple binary outputs are merged to produce a final decision using a specified rule such as a majority voting rule. Among these methods, score-level fusion is the most popular approach because it offers reasonable recognition accuracy and comparatively simple implementation, and thus many researchers in the multi-modal research community work in the score-level fusion field [12–17]. These fusion approaches are summarized in Table 1.

Table 1 Fusion approaches at the various multi-modal biometric system levels

In a biometric system, some of the auxiliary information or quality measures [18] that are associated with the biometric samples affect the system’s discrimination capabilities, even though they do not provide the capability to identify the subject by themselves. Many quality measures that significantly affect the recognition accuracy have been reported in the literature to date. These quality measures mainly fall into two families: sample-based quality, and sensor-based quality. Examples of sample-based quality measures include the degree of occlusion and/or blur, and texture richness in iris recognition [19, 20]; the signal-to-noise ratio (SNR) for speech recognition [21]; brightness, contrast, and illumination for fingerprint and face recognition [20]; and the view for face and gait biometrics [22]. Corresponding examples for sensor-based quality include image size or spatial resolution (SR) for image-based biometric systems such as face and fingerprint recognition systems [20], and frame rate (i.e., temporal resolution) for video-based biometrics such as gait recognition systems [23].

Quality measures play an important role in score-level fusion systems and have been used to improve recognition accuracy [18, 24, 25]. In the majority of these approaches, the appropriate quality measures are calculated independently for each modality; this is a natural approach because each modality sample is captured using an individual and independent sensor, and the sensor’s properties are a major factor in determining the sample quality. For example, in multi-modal biometrics with face and fingerprint modalities, face and fingerprint samples are captured using a digital camera and a fingerprint scanner, respectively, and thus the fingerprint scanner never affects the quality of the face modality. In contrast, a single sensor-based system captures multiple modality samples using a single common sensor, and the qualities of multiple modalities are therefore affected by the same sensor property. For example, face and gait images captured simultaneously using a single camera [26] are both affected by the same sensor property (e.g., the SR of the sensor affects both the face and the gait samples, and it also affects the matching scores as a result). In general, the correlation between matching scores that originate from different modalities is not high in multiple sensor-based multi-modal biometrics, but this does not hold for single sensor-based multi-modal biometrics, because the same sensor property affects the sample qualities of multiple modalities, and these qualities then affect the matching scores.

In the multi-modal research community, there are few multi-quality benchmark score databases. Poh et al. [20] generated a score database with modality-specific quality measures (e.g., brightness and contrast for the face modality, with texture richness for fingerprint and iris modalities) and some general image quality measures, but this was a multiple sensor-based score database. In contrast, we consider a single sensor-based multi-modal score database in this study, where the qualities commonly affect all the modality scores that can be captured by a single sensor.

With the growing trend for machine-learning approaches, a large-scale dataset is essential to enable efficient training of the model, and it is also required for statistically reliable performance evaluation. Additionally, in density estimation for the likelihood ratio test (e.g., in a forensic case), a large-scale database enables more accurate estimation of the densities, which are dependent on the relatively large score databases.

In this paper, we introduce a very-large-scale multi-quality multi-modal biometric score database to advance the state-of-the-art of quality-dependent score-level fusion research. More specifically, we consider single sensor-based multi-modal biometrics from gait, head, and height modalities and focus on the sensor-based qualities of spatial resolution (SR), temporal resolution (TR) and sample-based quality of view, which all significantly affect the scores of the individual biometric modalities. The contributions made by the present paper are as follows.

  1. A benchmark database is constructed and is composed of gait, head, and height biometric scores, including three types of qualities (SR, TR, and view) drawn from a publicly available database: the OU-ISIR Gait Database, Large Population Dataset [27]. Additionally, our database contains a very large number of scores because of the large numbers of subjects and qualities that are included, with totals of 3,908,128 genuine scores and 7,468,432,608 imposter scores, and thus will serve as a benchmark for score-level fusion approaches for the multi-modal biometrics research community.

  2. We provide two protocols (i.e., quality dependent and quality independent) and have conducted a performance evaluation using a variety of score-level fusion approaches, including transformation-based approaches [12], classification-based approaches [16, 28], probability density-based approaches [13–15], and some of the quality-dependent versions of these approaches. The results of this performance evaluation are intended for use as a baseline for future research on quality-dependent score-level fusion.

The score database that was constructed with detailed protocol settings is publicly available (see footnote 1) for research purposes. An earlier preliminary version of this work was published in [29], and the extensions from it can be summarized in the following points:

  • We constructed our database by introducing the multi-view concept and discuss the use of the view as a quality measure, whereas the previously released database [29] considered only the side view; in addition, we applied some pre-processing steps at the feature level and call this database “MultiQ Score Database version 2”.

  • We have added an analysis of how the quality measures affect the recognition accuracy of each modality.

  • We have added the RankSVM as a new benchmark to enable further discussion of the evaluation results. Consideration of the RankSVM enables analysis of the recognition accuracy in greater depth.

2 Related work

2.1 Existing score database

Initially, researchers developed multi-modal fusion approaches using chimeric datasets because the collection of multi-modal data captured from the same subjects is a laborious task. Because data of different modalities were assumed to be independent of each other, researchers believed that the properties of chimeric datasets are similar to those of real datasets. Therefore, a set of multiple modality features that are captured from different subjects was used as a set of features from a single virtual subject. For example, a face sample from one person and a fingerprint sample from another person are used as a multi-modal feature set for a virtual subject [30]. While these types of databases are accepted to a degree in score-level fusion research, serious questions remain as to whether this practice is valid. Poh et al. demonstrated experimentally that the performance of a database of real multi-modal users is not equivalent to that of a database of chimeric users [31], while Wayman mentioned that multi-biometric data may necessarily be correlated [32], and thus the use of chimeric databases should be avoided.

Recently, several real multi-modal biometric databases and score databases have been released for research purposes. The major real multi-modal score databases that exist at present are summarized in Table 2. These databases are briefly described here.

Table 2 Existing major score databases

The BioSecure DS2 score database [20] was constructed using the desktop scenario dataset of the BioSecure DS2 database [33] with face, fingerprint, and iris modalities. Originally, the desktop scenario dataset contained voice, face, signature, fingerprint, hand, and iris modalities, and data were acquired in a desktop-based office environment. A total of 333 subjects with equal male and female distributions are included in this score database, although the age distribution is somewhat biased; two thirds of the subjects are in the 18–40-years-old range, while the others are over 40 years old. The main characteristic of the BioSecure DS2 score database is that it considers 14 quality measures for the face modality; six of these measures are face-related quality measures that are set on the basis of face detection (i.e., detection reliability, number of pixels between the eyes, face with or without glasses, rotation in the plane, rotation in depth, and degree of frontal face); the remaining eight measures are related to the image quality. Additionally, the database considers one quality measure for the fingerprint modality (i.e., texture richness), and three quality measures for the iris modality (i.e., texture richness, difference between the iris and pupil diameters, and proportion of the iris used for matching). In [20], multiple evaluations were reported, including quality-dependent, client-specific, and cost-sensitive fusion.

The BA-Fusion score database [34] was built from the XM2VTS database [35] using face and speech modalities. This score database is composed of eight matching scores; five of these scores are related to the face, while the remaining three are related to speech. For face score calculations, multiple feature extractors and multiple classifiers are used, while for speech scores, multiple feature extractors, and only single classifiers are used. This score database is composed of scores from 295 subjects; both genders were included, but the age distributions are biased (in that the subjects are all adults). No quality measures were provided in this database.

The BANCA score database [36] is composed of face and speech modalities from 52 subjects in two groups. This score database was generated using a set of state-of-the-art baseline classifiers along with template-based approaches. While this score database does consider controlled (clean), adverse (under challenging conditions), and degraded scenarios, no quality measures are provided in this database.

The NIST-Multimodal score database [37] is composed of two face and two fingerprint scores from 517 subjects. Two fingerprint scores were obtained by comparing a pair of left index fingers and a pair of right index fingers. Two face scores were generated using two separate face matchers. The numbers of generated genuine and imposter scores are 517 and 517 × 516 = 266,772, respectively. Again, no quality measures were provided in this database.

In contrast to the existing score databases, the proposed MultiQ Score Database version 2 contains wide variation in terms of subjects and quality measures, together with very large numbers of genuine and imposter scores. Each of the existing major multi-modal score databases includes fewer than 600 subjects, as shown in Table 2, whereas the proposed database contains 1912 subjects with an approximately equal distribution of male and female subjects, and with ages ranging from 2 to 82 years. All the modalities of the proposed database are extracted using only a single sensor, while the other databases mentioned above used multiple sensors.

2.2 Score-level fusion

In the score-level fusion of multi-biometric systems, the most important issue is how to fuse the scores of the different modalities. Many fusion techniques have been proposed in the literature to date. As shown in Table 1, these approaches can be classified into three generic categories: (1) transformation-based, (2) classification-based, and (3) probability density-based approaches.

In a transformation-based approach, the scores are usually normalized to a common domain by one of several normalization techniques (e.g., z-normalization [38], F-normalization [39], and EER-normalization [40]) and the normalized scores are then combined. Kittler et al. focused on classifier combination and developed a theoretical framework for classifier combination [12]. They used a sum rule, a product rule, a minimum rule, a maximum rule, a median rule, and majority voting as the basis for the classifier combination scheme.
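As a minimal sketch of the transformation-based idea, the following example applies z-normalization per modality and then combines the normalized scores with a fixed rule. The function names, the toy training scores, and the choice of the population standard deviation are assumptions made for illustration; they are not taken from the cited works.

```python
from statistics import mean, pstdev

def z_normalize(score, train_scores):
    """Shift and scale a score using the mean and standard deviation of training scores."""
    return (score - mean(train_scores)) / pstdev(train_scores)

def fuse(normalized_scores, rule="sum"):
    """Combine the normalized scores of one comparison with a fixed rule."""
    rules = {"sum": sum, "min": min, "max": max}
    return rules[rule](normalized_scores)

# Hypothetical training scores (distances) for each modality.
train = {
    "gait": [2.0, 4.0, 6.0],
    "head": [0.1, 0.3, 0.5],
    "height": [1.0, 3.0, 5.0],
}
# Hypothetical scores of one probe/gallery comparison.
comparison = {"gait": 4.0, "head": 0.5, "height": 1.0}

normalized = [z_normalize(comparison[m], train[m]) for m in train]
fused_sum = fuse(normalized, "sum")
fused_max = fuse(normalized, "max")
```

Normalization matters here because the raw gait, head, and height scores live on different scales; without it, the sum rule would be dominated by the modality with the largest numeric range.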

In the classification-based approach, multiple scores that are derived from multiple matchers are treated as a feature vector and the classifier is then constructed to discriminate genuine scores from imposter scores. A support vector machine (SVM) is one such classifier and the signed distance from the decision boundary is usually regarded as a fused score [16, 41]. Additionally, the ranking SVM (RankSVM) [28] is a well-known extension of the conventional SVM that focuses more on the relative distance between two classes. RankSVM has been used in many research fields, including person re-identification and gait recognition [42]. Because ranking statistics play an important role in identification scenarios [43], RankSVM can be considered to be a promising approach to score-level fusion for identification scenarios.

The probability density-based approach is further divided with respect to two specific aspects: generative vs. discriminative approaches, and parametric vs. nonparametric approaches. Parametric and generative approaches explicitly or implicitly model the distributions for each class separately and subsequently estimate the model parameters from the training data. Nandakumar et al. proposed a framework for a likelihood ratio-based fusion rule and estimated the genuine and imposter distributions in the form of a finite Gaussian mixture model (GMM). It was shown that a likelihood ratio-based approach led to high performance for quality-based biometrics when using three multibiometric databases [15].
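The likelihood ratio rule can be illustrated with a deliberately simplified sketch. It assumes a single Gaussian per class and per modality and treats the modalities as independent, whereas the framework in [15] uses a finite GMM; the training scores and function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def fit_gaussian(samples):
    """Estimate (mean, standard deviation) from training scores of one class."""
    return mean(samples), pstdev(samples)

def log_pdf(x, mu, sigma):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def log_likelihood_ratio(scores, genuine_models, imposter_models):
    """Sum of per-modality log ratios (independence between modalities is assumed)."""
    return sum(
        log_pdf(s, *g) - log_pdf(s, *i)
        for s, g, i in zip(scores, genuine_models, imposter_models)
    )

# Hypothetical distance scores for two modalities: genuine pairs are close, imposters are far.
genuine_train = [[1.0, 2.0, 3.0], [0.1, 0.2, 0.3]]
imposter_train = [[7.0, 8.0, 9.0], [0.7, 0.8, 0.9]]
genuine_models = [fit_gaussian(s) for s in genuine_train]
imposter_models = [fit_gaussian(s) for s in imposter_train]

llr_close = log_likelihood_ratio([2.0, 0.2], genuine_models, imposter_models)
llr_far = log_likelihood_ratio([8.0, 0.8], genuine_models, imposter_models)
```

A positive fused value favors the genuine hypothesis and a negative value favors the imposter hypothesis. The independence assumption is questionable for single sensor-based data, where a shared sensor property can correlate the modality scores.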

In contrast, discriminative approaches model the posterior probabilities directly. Linear logistic regression (LLR) combines discriminative and parametric approaches. In [14], the log likelihood ratio of genuine and imposter scores is expressed as a linear combination of these scores, and the weight of each modality score is then optimized to minimize the loss function that is derived from the logistic functions of the log likelihood ratio. The main limitation of these parametric approaches is that they can produce poor predictive performance if the chosen model does not fit the actual distribution well.
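A small sketch of the LLR idea follows: the fused score is a weighted sum of the modality scores plus a bias, and the weights are fitted by minimizing the logistic loss. Plain batch gradient descent, the learning rate, the epoch count, and the toy data are all assumptions for illustration and do not reproduce the optimization used in [14].

```python
import math

def train_llr(X, y, lr=0.1, epochs=2000):
    """Fit weights w and bias b of a logistic model by batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, t in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - t  # predicted probability minus label
            for d in range(len(w)):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def fused_score(x, w, b):
    """Linear combination of modality scores, read as a log likelihood ratio."""
    return b + sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical two-modality distance scores; label 1 = genuine, 0 = imposter.
X = [[0.2, 0.3], [0.3, 0.1], [0.1, 0.2], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_llr(X, y)
```

Because the scores in this toy set are distances, the learned weights come out negative: a larger distance lowers the fused score.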

The nonparametric approaches represent distributions using histogram bins or control points. Kernel density estimation (KDE) is one of the generative and nonparametric approaches. Dass et al. proposed an approach that computed the generalized densities that were estimated from genuine and imposter training samples for each modality and combined them using product rule or copula methods [13]. For discriminative and nonparametric approaches, lattice-type control points are used to represent the Bayes error gradient distribution in a nonparametric manner [44]. In another approach, floating control points are used in conjunction with generalized Delaunay triangulation for a more efficient representation [17]. In both methods, the estimation of individual genuine and imposter densities is bypassed and the discriminative function is trained directly.
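The generative, nonparametric route can be sketched with a Gaussian-kernel KDE per class and per modality, combined with the product rule. This is only an illustration: the fixed bandwidth, the toy scores, and the function names are assumptions, and the generalized densities and copula models of [13] are not reproduced.

```python
import math

def kde(x, samples, bandwidth):
    """Gaussian kernel density estimate at x from one-dimensional training samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

def product_rule_ratio(scores, genuine, imposter, bandwidth=0.5):
    """Multiply the per-modality genuine/imposter density ratios."""
    ratio = 1.0
    for s, g, i in zip(scores, genuine, imposter):
        ratio *= kde(s, g, bandwidth) / kde(s, i, bandwidth)
    return ratio

# Hypothetical distance scores for two modalities.
genuine = [[1.0, 2.0, 3.0], [1.5, 2.5, 3.5]]
imposter = [[7.0, 8.0, 9.0], [6.5, 7.5, 8.5]]

ratio_close = product_rule_ratio([2.0, 2.5], genuine, imposter)
ratio_far = product_rule_ratio([8.0, 7.5], genuine, imposter)
```

In practice the bandwidth would be chosen from the data, and a very small denominator density would need guarding against division by zero; both are omitted here for brevity.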

As shown above, a wide variety of approaches have been proposed for each category in score-level fusion. In this study, we have considered at least two approaches from each generic category to evaluate the performance of the proposed score database.

2.3 Quality-dependent approaches

Quality can be considered to be auxiliary information that affects the matching score. Generally speaking, if biometric samples are of good quality, the matching scores of genuine users and imposters are more easily separable. Therefore, quality measurement/assessment algorithms and the use of quality measures to improve recognition accuracy have a large body of literature in the biometrics community [45–47].

Quality measures can be used at various stages in the recognition pipeline to improve the recognition accuracy. During the enrollment phase, a quality measure is used as the criterion for sample recapture [48]. In the preprocessing phase, both quality-dependent feature enhancement and quality-dependent target region selection are considered [46]. In the matching phase, different matching algorithms are used to calculate the scores for uni-modal and multi-modal biometrics. In this phase, classifier or distance metrics are selected adaptively depending on the sample quality [49]. Another direction is to directly stack the quality measures into a score vector, i.e., Q-stack vector and to treat it as a feature vector for classification [50, 51]. Moreover, in [52, 53], biometric samples are classified into clusters based on the sample quality, and score normalization or fusion are done in a cluster-dependent way.
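The Q-stack idea mentioned above amounts to appending the quality measures to the score vector before classification. The snippet below is a minimal sketch; the particular scores and the use of SR scale, TR, and view as the stacked qualities are illustrative assumptions.

```python
def q_stack(scores, qualities):
    """Concatenate matching scores and quality measures into one feature vector."""
    return list(scores) + list(qualities)

scores = (12.5, 0.31, 2.0)     # hypothetical gait, head, and height scores
qualities = (0.5, 15.0, 85.0)  # hypothetical SR scale, TR [fps], and view [deg]
vector = q_stack(scores, qualities)
```

The resulting vector can then be passed to any classifier, which lets the decision boundary in score space vary with the quality values.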

3 Single sensor-based multi-quality multi-modal biometric score database

3.1 Overview of the OU-ISIR Gait Database, Large Population Dataset

The OU-ISIR Gait Database, Large Population Dataset [27] was collected with the aim of enabling statistically reliable performance evaluation of large-scale gait recognition. We used a subset of 1912 subjects from this dataset to generate scores. For this dataset, each subject was asked to walk straight along a predetermined corridor in a natural manner twice, and two walking image sequences were captured for each subject using a single camera placed at a distance of 5 m. The image size was 640 × 480 pixels, and 30 images (frames) were recorded per second (30 fps). Each image sequence is divided into four segments based on observation azimuth angles of 55°, 65°, 75°, and 85°. Examples of the captured images are shown in Fig. 1.

Fig. 1 Examples of captured images in the OU-ISIR Gait Database, Large Population Dataset

Silhouette image sequences were generated initially from the captured image sequences using a background subtraction-based graph-cut segmentation method [54]. Head region image features were extracted from the captured image sequences using the information from the silhouette images. Lens distortion removal, rectification, and cropping processes were used in the processing to generate the gait and height features. Rectification was performed using the camera calibration parameters such that the x and y axes of the image plane are located parallel to the walking and vertical directions of the real 3D world, respectively. As a result of the rectification process, calculation of the subject’s height becomes simple; the vertical positions of the foot and the top of the head can be obtained from the bounding box information, and the actual height of the subject in the real 3D world can then be calculated via a simple conversion process using the camera calibration parameters because the distance between the camera and the subject on the walking course was the same for all subjects in the dataset.
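The conversion from the bounding box to a real-world height can be sketched with a pinhole camera model. This is an assumption about the form of the "simple conversion": the focal length in pixels and the bounding-box rows are hypothetical values, and only the 5 m camera distance comes from the dataset description.

```python
def real_height(top_px, bottom_px, depth_m, focal_px):
    """Pinhole model: real height = apparent height [px] * depth [m] / focal length [px]."""
    apparent_px = bottom_px - top_px
    return apparent_px * depth_m / focal_px

# Hypothetical bounding box rows and focal length; depth of 5 m as in the dataset.
height_m = real_height(top_px=100, bottom_px=440, depth_m=5.0, focal_px=1000.0)
```

Because the depth is the same for every subject after rectification, the conversion reduces to a single constant scale factor applied to the apparent height.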

3.2 Quality measures

We focus on the SR, the TR, and the view as factors that affect the matching score of each feature, and generate a score database with different qualities using image sequences with different values of these factors. To generate such a score database, we first generated datasets with various qualities by scaling the image sequences down in terms of their SRs. We also downsampled in terms of the TRs and then generated image sequences using different TRs. We then extracted the required gait, head, and height features. Finally, we calculated the matching scores for the gait, the head, and the height using these extracted features, and saved them along with the associated SR, TR, and view values. We considered the following SRs and TRs.

SR: To simulate image sequences with different image sizes and/or image sequences of subjects at different distances (see footnote 2), we set scaling factors to downsample the original captured image sequences such that the average height of all subjects does not fall below 20 pixels, because feature extraction is difficult when the SR is too low. Specifically, we downscaled the original images using factors of 1/2, 1/3, 1/4, 1/5, 1/6, and 1/8, which, together with the original images, gave image sequences with sizes of 640 × 480, 320 × 240, 213 × 160, 160 × 120, 128 × 96, 106 × 80, and 80 × 60 pixels. Consequently, image sequences with seven different SRs were used to construct the score database.
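The seven image sizes follow directly from the scaling factors. The short check below assumes that fractional sizes are truncated, which is consistent with the listed widths of 213 and 106 pixels; the variable names are illustrative.

```python
ORIGINAL_SIZE = (640, 480)         # width, height of the captured images
DIVISORS = [1, 2, 3, 4, 5, 6, 8]   # a scaling factor of 1/k for each divisor k

# Integer division truncates, e.g. 640 / 3 = 213.33 becomes 213.
sizes = [(ORIGINAL_SIZE[0] // k, ORIGINAL_SIZE[1] // k) for k in DIVISORS]
```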

We applied several preprocessing steps to obtain better simulated image sequences. First, while the previously released score database [29] used nearest neighbor interpolation, we used the area interpolation implemented in OpenCV [55] for a more appropriate down-sampling simulation. For a silhouette sequence, because area interpolation introduces gray-scale values in addition to the binary values (i.e., background or foreground), we applied thresholding after down-sampling to keep the silhouette image sequence binary. Moreover, because the down-sampling boundaries of the area interpolation (e.g., pixels whose horizontal or vertical position is a multiple of k for down-sampling with a factor of 1/k) do not necessarily coincide with the bottom of the foot or the top of the head, we randomly shifted the down-sampling boundary up or down for each subject to better simulate the walking position differences among subjects. More specifically, we generated a random number for each subject and shifted the silhouette image sequence up or down such that neither the top of the subject’s head nor the bottom of their feet moved outside the image. The entire silhouette image sequence for each subject was then shifted up or down by this value, and the same process was applied to every subject.
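The interaction between area interpolation, thresholding, and the vertical shift can be sketched in plain Python. Block averaging stands in for OpenCV's area interpolation when the image size is a multiple of k; the threshold of 0.5 and the shift range limited to fewer than k rows are assumptions, since the source states neither.

```python
import random

def shift_rows(img, dy):
    """Shift a binary image down by dy rows (up if negative), filling with background."""
    h, w = len(img), len(img[0])
    return [list(img[r - dy]) if 0 <= r - dy < h else [0] * w for r in range(h)]

def downsample_area(img, k):
    """Average each k x k block (image sides assumed to be multiples of k)."""
    h, w = len(img), len(img[0])
    return [
        [sum(img[r + i][c + j] for i in range(k) for j in range(k)) / (k * k)
         for c in range(0, w, k)]
        for r in range(0, h, k)
    ]

def binarize(img, threshold=0.5):
    """Restore a binary silhouette; the threshold value is an assumption."""
    return [[1 if v >= threshold else 0 for v in row] for row in img]

def pick_shift(top, bottom, height, k, rng):
    """Random per-subject shift that keeps rows top..bottom inside the image."""
    low = max(-top, -(k - 1))
    high = min(height - 1 - bottom, k - 1)
    return rng.randint(low, high)

silhouette = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
unshifted = binarize(downsample_area(silhouette, 2))
shifted = binarize(downsample_area(shift_rows(silhouette, 1), 2))
dy = pick_shift(top=1, bottom=2, height=4, k=2, rng=random.Random(0))
```

In this toy case the unshifted silhouette straddles the block boundaries and vanishes after thresholding, whereas the version shifted by one row survives, which shows why the boundary position matters at low SR.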

TR: To simulate image sequences at different frame rates, we prepared 10 different TRs, including one at the original frame rate. We selected images at specific frame intervals from the original image sequences (which we call frame-skipped image sequences), and thereby generated image sequences with frame rates of 15, 10, 7.5, 6, 5, 3.75, 3, 2, and 1 fps. Some examples of the normalized silhouette sequences with frame rates of 30, 10, 6, and 5 fps are shown in Fig. 2.
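Frame skipping can be sketched as a simple slice. The frame intervals below are inferred by dividing the original 30 fps by each listed frame rate and are therefore an assumption, as are the function and variable names.

```python
ORIGINAL_FPS = 30
# Inferred frame intervals: keep every k-th frame of the 30 fps sequence.
INTERVALS = [1, 2, 3, 4, 5, 6, 8, 10, 15, 30]
frame_rates = [ORIGINAL_FPS / k for k in INTERVALS]

def frame_skip(frames, interval, start=0):
    """Keep every interval-th frame, beginning at index start."""
    return frames[start::interval]

frames = list(range(60))          # stand-in for two seconds of video at 30 fps
skipped = frame_skip(frames, 6)   # simulates 5 fps
```

Changing `start` selects a different phase of the same frame-skipped sequence, which matters at low frame rates where only a few frames cover a gait period.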

Fig. 2 Full period normalized silhouette sequences at various TRs from the 85° view. The top row shows normalized silhouette sequences at 30 fps; the second row shows normalized silhouette sequences at 10 fps; the third row shows normalized silhouette sequences at 6 fps; and the bottom row shows normalized silhouette sequences at 5 fps. The silhouette sequences were taken from frame-skipped image sequences that started from the first frame

View: The image sequences of the OU-ISIR Gait Database, Large Population Dataset are divided into four segments based on the observation azimuth angle, as shown in Fig. 1. We therefore used the image sequences with observation azimuth angles of 55°, 65°, 75°, and 85°, and used the view information as a quality.

3.3 Matching algorithm for score calculation

We calculated matching scores for the gait, head, and height features to construct the score database. We briefly explain the feature extraction and score calculation processes used here for the gait, head, and height features. Detailed explanations of these processes can be found in [11].

Gait matching: Because the gait energy image (GEI) [56] is the most widely used feature in gait recognition, and because it can achieve good recognition accuracy [27], we used the GEI as a gait feature. For GEI extraction, we first obtained the top, bottom, and horizontal center (i.e., the median of the horizontal axis) of the subject’s silhouette for all frames, and then applied a moving average filter of three frames to obtain smooth positional changes. We subsequently clipped a silhouette image from each original silhouette image based on the associated calculated position while maintaining the aspect ratio, and the clipped silhouette image was then normalized so that the image size was 88 × 128 pixels. These normalized silhouette image sequences were used to calculate the gait period using normalized autocorrelation (NAC) along the temporal axis. Finally, we averaged the normalized gait silhouette images over a gait period and generated a GEI. Some examples of these GEIs with different views and SRs are shown in Fig. 3 and GEIs with different TRs are shown in Fig. 4. For the gait-based matching scores, we calculated the Euclidean distance between the gallery and the probe gait features. Let \(G_{g}\) and \(G_{p}\) be the feature vectors that are associated with the GEIs extracted from the gallery and from the probe image sequences, respectively, and let \(S_{gait}\) be the gait-based matching score that is calculated using the gallery and probe GEIs. The gait-based matching score is calculated using

$$ S_{gait}= ||G_{p} - G_{g}||_{2}, $$

where \(||\cdot||_{2}\) denotes the \(L_{2}\) norm.

Fig. 3 GEIs extracted from 30 fps images of various sizes from the 85° and 55° views. The top row shows GEIs with the 85° view and the bottom row shows GEIs with the 55° view. Image sizes: (a) 640 × 480, (b) 320 × 240, (c) 213 × 160, (d) 160 × 120, (e) 128 × 96, (f) 106 × 80, and (g) 80 × 60 pixels. The sizes given are not for the bounding box, but are for the original/scaled-down images

Fig. 4 GEIs extracted from fixed size (640 × 480 pixels) normalized silhouette sequences at various TRs [fps]
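The GEI averaging and the Euclidean distance score for gait matching can be sketched as follows. The example uses tiny 2 × 2 binary frames instead of 88 × 128 normalized silhouettes and skips the clipping, size normalization, and gait period detection steps; the function names are illustrative.

```python
import math

def gait_energy_image(silhouettes):
    """Average the binary silhouette frames of one gait period pixel by pixel."""
    n = len(silhouettes)
    h, w = len(silhouettes[0]), len(silhouettes[0][0])
    return [[sum(f[r][c] for f in silhouettes) / n for c in range(w)] for r in range(h)]

def gait_score(gei_probe, gei_gallery):
    """Euclidean (L2) distance between the flattened probe and gallery GEIs."""
    p = [v for row in gei_probe for v in row]
    g = [v for row in gei_gallery for v in row]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, g)))

# Toy gait periods of two frames each.
probe_period = [[[1, 0], [0, 0]], [[1, 1], [0, 0]]]
gallery_period = [[[1, 0], [0, 1]], [[1, 1], [0, 1]]]

gei_p = gait_energy_image(probe_period)
gei_g = gait_energy_image(gallery_period)
s_gait = gait_score(gei_p, gei_g)
```

Since the score is a distance, smaller values indicate a better match between the probe and the gallery.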

Head matching: Many sophisticated face recognition methods have been proposed in the literature, and they generally use the inner region of the face for recognition. However, these methods often cannot achieve reasonable accuracy when the targeted face is extremely small [57, 58]. We therefore used the texture information from the head region, including the inner face region, the hair and the face contour parts in this paper, and call it the head feature of the target subject because our feature differs from the general face feature. Specifically, we have defined the upper region above the neck of a target person as the head region; this concept is similar to that of [59], which extends the iris feature into a periocular feature.

We calculated the head-based matching score using the following steps. First, we used the silhouette image associated with the target image as a mask and localized the head region. We then extracted the image of the head region and set it as a template for the head feature of the image. A template was extracted in this way from each frame of the probe image sequences, separately for each quality. Finally, we applied a conventional template matching algorithm, using each template against the gallery image sequences, and calculated a head-based matching score. In contrast, the previously released score database [29] extracted both the template and the gallery image from the original captured image sequences only, and then downsampled them to generate the different SR qualities. For the score calculation process, we used the color texture information. Let \(F_{p_{i}}\) be the template of the head feature associated with the ith frame of the probe. Let \(F_{g_{j,k}}\) be a masked image of the same size that is associated with the jth frame and the kth spatial displacement within the gallery search regions. We then calculated the head-based matching score by correlation-based template matching using

$$ S_{head} = \min_{i,j,k} \left[1-f_{NCC}(F_{p_{i}},F_{g_{j,k}})\right]. $$

Here, \(f_{NCC}(F_{p_{i}},F_{g_{j,k}})\) is an operator that is used to calculate the normalized cross-correlation (NCC) between \(F_{p_{i}}\) and \(F_{g_{j,k}}\), and template matching was performed within the region of interest that is defined by the silhouette mask of the gallery.
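The head score can be sketched as a minimum of one minus the NCC over all probe templates, gallery frames, and displacements. The example treats each patch as a flattened list of intensities and uses the NCC form without mean subtraction; both choices are assumptions, since the source does not specify the NCC variant and works on masked color images.

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equally sized, flattened patches."""
    numerator = sum(x * y for x, y in zip(a, b))
    denominator = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return numerator / denominator

def head_score(probe_templates, gallery_candidates):
    """S_head: minimum of 1 - NCC over probe frame i, gallery frame j, displacement k."""
    return min(
        1.0 - ncc(template, patch)
        for template in probe_templates   # index i
        for frame in gallery_candidates   # index j
        for patch in frame                # index k
    )

# Toy data: one probe template, one gallery frame with two candidate displacements.
probe_templates = [[1, 2, 3, 4]]
gallery_candidates = [[[4, 3, 2, 1], [1, 2, 3, 4]]]
s_head = head_score(probe_templates, gallery_candidates)
```

The second candidate is identical to the template, so its NCC is 1 and the resulting score is 0, the best possible value for this distance-like score.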

Note here that we did not use the advanced techniques that are associated with face recognition, including the feature extraction process [60], pose normalization [61], and/or face alignment [62]. In this paper, we consider multiple image sequences over a wide SR range, from mid-level SR (see Fig. 5a) to extremely low-level SR (see Fig. 5d); the advanced techniques do not work consistently well on image sequences with such a wide SR range, whereas the NCC-based technique works stably for these image sequences. While the NCC-based technique is simple and not a state-of-the-art algorithm, its stability is essential for the work in this paper because the same technique should be applied to image sequences with different SR values.

Fig. 5

Head templates extracted from both the original images and the scaled-down images. The top row shows template images with the 85° view and the bottom row shows template images with the 55° view. The head template size is dependent on both subject and view; in this figure, the template sizes of the original image are 30 × 26 and 26 × 20 pixels for the 85° and 55° views, respectively. a Original. b Half. c Quarter. d One-eighth

Height matching: Because of the rectification process and the fact that each subject walked in a straight line at a fixed depth from the calibrated camera, the actual height of each subject in the real 3D world can easily be computed from the apparent height in the image by multiplying it by the ratio of the subject's depth to the focal length. Note that the actual height was computed in a frame-by-frame manner and the height of the subject can vary with the up-and-down motion caused by walking; we therefore define the height feature as the average of the height values computed over the image sequence. Let \(Z_{i}\) be the computed height from the ith frame of a target image sequence, and let \(h\) be the height feature (a scalar). We assume that the target image sequence is composed of \(N_{f}\) frames; the height feature can then be computed using

$$ h=\frac{1}{N_{f}}\sum\limits^{N_{f}}_{i=1}Z_{i}. $$

Let \(h_{p}\) and \(h_{g}\) be the heights of the subjects in the probe and in the gallery, respectively. The height-based matching score \(S_{height}\) was calculated as

$$ S_{height}=|h_{p}-h_{g}|. $$
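The two height equations translate directly into code. The sketch below is illustrative only: the helper `real_height` assumes a simple pinhole relation (apparent height in pixels times depth, divided by focal length in pixels), and all function names and numeric values are hypothetical.

```python
def real_height(apparent_px, depth, focal_px):
    """Pinhole-model conversion from apparent height (pixels) to real height (assumed form)."""
    return apparent_px * depth / focal_px

def height_feature(per_frame_heights):
    """h = (1 / N_f) * sum of Z_i over the frames of one sequence."""
    return sum(per_frame_heights) / len(per_frame_heights)

def height_score(h_probe, h_gallery):
    """S_height = |h_p - h_g| (a dissimilarity: smaller means more similar)."""
    return abs(h_probe - h_gallery)

# Hypothetical per-frame heights in centimeters.
h_p = height_feature([170.0, 172.0, 171.0])
h_g = height_feature([168.0, 170.0])
s_height = height_score(h_p, h_g)
```

Averaging over the frames is what makes the feature sensitive to the TR: with few sampled frames, the mean depends strongly on which postures happen to be sampled.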

3.4 Constructed score database

Because we considered three independent quality measures, we could then set several different quality settings by combining different values of the SR, TR, and view. Using these combinations, we constructed a multi-quality, multi-modal score database.

When we generate image sequences with different TRs by selecting frames at specific intervals, multiple image sequences can be generated from a single image sequence because the selected frames differ depending on the starting frame. For example, two different image sequences with frame rates of 15 fps, one composed of the odd-numbered frames and one composed of the even-numbered frames, can be generated from a single image sequence with a rate of 30 fps. In this paper, all frame-skipped image sequences that started from different frames were considered as probes, while only the single frame-skipped image sequence that started from the first frame was considered as the gallery to avoid a quadratic increase in the number of scores. We therefore used 2, 3, 4, 5, 6, 8, 10, 15, and 19 (see footnote 3) frame-skipped image sequences per probe for frame rates of 15, 10, 7.5, 6, 5, 3.75, 3, 2, and 1 fps, respectively. We denote the number of frame-skipped image sequences by \(N_{TR}\).
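The frame-skipping scheme can be illustrated with a short sketch. The function name `frame_skipped_sequences` is hypothetical, and the sketch simply enumerates every possible starting frame for a given interval; it does not reproduce the count of 19 sequences used at 1 fps, which is governed by a constraint not modeled here.

```python
def frame_skipped_sequences(frames, step):
    """Return one subsampled sequence per starting offset 0 .. step - 1."""
    return [frames[start::step] for start in range(step)]

# A hypothetical 30 fps sequence with 12 frames, numbered from 1.
frames_30fps = list(range(1, 13))

# 30 fps -> 15 fps: keep every second frame, giving two sequences.
probes_15fps = frame_skipped_sequences(frames_30fps, 2)
gallery_15fps = probes_15fps[0]  # only the sequence starting at the first frame

# 30 fps -> 10 fps: keep every third frame, giving three sequences.
probes_10fps = frame_skipped_sequences(frames_30fps, 3)
```

Using all offsets for the probes but only the first offset for the gallery keeps the number of scores linear, rather than quadratic, in the number of frame-skipped sequences.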

We then generated features for each of the \(N_{TR}\) probes and for the galleries separately for each modality, for example, by generating normalized silhouette sequences for each of the \(N_{TR}\) probes and for the first frame-skipped image sequences for the galleries. The first frame-skipped image sequences for the same subject with different TRs are shown in Fig. 2. Finally, we calculated the score matrices for the gait, the head, and the height between \(1912 N_{TR}\) probes and 1912 galleries for all combinations of the SRs (seven variations), the TRs (10 variations), and the views (four variations), which gives a total of 280 different quality settings. It should be noted that the score matrices were calculated between probes and galleries with the same quality values. The constructed score database (see footnote 4) includes the matching score distance matrices and the subject ID list along with the quality measures (i.e., the SRs, TRs, and views). As a result, we can draw on \(N_{TR}\) score distance matrices with \(1912 N_{TR}\) genuine (true match) scores and \(1912 N_{TR} \times 1911\) = 3,653,832 \(N_{TR}\) imposter (false match) scores for each biometric modality and quality, which results in a very large-scale database containing 3,908,128 genuine scores and 7,468,432,608 imposter scores for each modality.
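The totals quoted above can be checked with a few lines of arithmetic. The per-frame-rate sequence counts are taken from the text; the count of 1 for the original 30 fps sequences is an assumption, chosen because no frame skipping is involved at that rate.

```python
N_SUBJECTS = 1912
N_SR = 7      # spatial resolution variations
N_VIEW = 4    # view variations

# Number of frame-skipped probe sequences per frame rate (fps -> N_TR).
# The entry for 30 fps is an assumption; the others are quoted in the text.
N_TR_PER_FPS = {30: 1, 15: 2, 10: 3, 7.5: 4, 6: 5, 5: 6, 3.75: 8, 3: 10, 2: 15, 1: 19}

total_n_tr = sum(N_TR_PER_FPS.values())
quality_settings = N_SR * len(N_TR_PER_FPS) * N_VIEW

# One genuine score per probe; each probe is compared against 1911 other subjects.
genuine = N_SUBJECTS * total_n_tr * N_SR * N_VIEW
imposter = genuine * (N_SUBJECTS - 1)

print(quality_settings, genuine, imposter)
```

Under this assumption, the computed counts agree with the 280 quality settings and the genuine and imposter totals stated in the text.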

4 Performance evaluation

4.1 Overview

We performed two evaluations of the constructed score database: an evaluation of each modality and an evaluation of score-level fusion. The purpose of the first evaluation is to analyze the properties of each modality score under the various quality conditions, i.e., the impact of quality on each modality; we therefore evaluated the recognition accuracy for each modality independently for different quality settings. The second evaluation was performed to give baseline accuracies for score-level fusion. Because we expect the score database constructed here to be used in research to improve score-level fusion approaches, the recognition accuracies of benchmark algorithms, including state-of-the-art algorithms, are useful. For this purpose, we set two different protocols, i.e., quality-independent and quality-dependent protocols, and provide the recognition accuracies of the benchmarks under both protocols.

4.2 Accuracy analysis of each modality and impact analysis of each quality

In this section, we analyze the recognition accuracy of each modality under different quality conditions. We evaluate the recognition accuracy in two different modes: verification and identification. For this purpose, we used the receiver operating characteristic (ROC) curve, which indicates the trade-off between the false rejection rate (FRR) of genuine and the false acceptance rate (FAR) of imposter with varying thresholds for verification and a cumulative matching curve (CMC) for identification. Because of space limitations, we only report the ROC and CMC curves of all modalities under typical settings for the quality considered in Figs. 6 and 7. Additionally, we summarize the equal error rates (EERs) and the Rank-1 identification rates in Table 3 with Fig. 8 and Table 4 with Fig. 9, respectively. We can see that each quality affects the accuracy of each modality.
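As a reminder of how the two summary measures relate to a score distance matrix, the following sketch computes an approximate EER and a Rank-1 identification rate. It is a simplified illustration and not the evaluation code used for the reported results: scores are treated as distances (accept when the distance is at or below the threshold), the EER is approximated at the candidate threshold where FAR and FRR are closest, and the function names are hypothetical.

```python
def far_frr(genuine, imposter, threshold):
    """FAR and FRR for distance scores: accept when score <= threshold."""
    frr = sum(g > threshold for g in genuine) / len(genuine)
    far = sum(i <= threshold for i in imposter) / len(imposter)
    return far, frr

def approx_eer(genuine, imposter):
    """Approximate EER: mean of FAR and FRR where their gap is smallest."""
    best_gap, best_rate = None, None
    for t in sorted(set(genuine) | set(imposter)):
        far, frr = far_frr(genuine, imposter, t)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate

def rank1_rate(dist):
    """Fraction of probes whose nearest gallery entry is the true match.

    dist[p][g] is the distance between probe p and gallery g; the true
    match of probe p is assumed to be gallery p (the diagonal).
    """
    hits = 0
    for p, row in enumerate(dist):
        nearest = min(range(len(row)), key=row.__getitem__)
        hits += nearest == p
    return hits / len(dist)
```

The EER depends only on the pooled genuine and imposter distributions, whereas the Rank-1 rate depends on the ordering within each probe's row, which is why the two measures can rank methods differently.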

Fig. 6

ROC curves for individual modalities. The left column is for the 85° view and the right is for the 55° view. Note that the scales differ among the graphs. a 640 × 480 pixels SR with 30 fps TR. b 640 × 480 pixels SR with 1 fps TR. c 80 × 60 pixels SR with 30 fps TR. d 80 × 60 pixels SR with 1 fps TR

Fig. 7

CMC curves for individual modalities. The left column is for the 85° view and the right is for the 55° view. Note that the scales differ among the graphs. a 640 × 480 pixels SR with 30 fps TR. b 640 × 480 pixels SR with 1 fps TR. c 80 × 60 pixels SR with 30 fps TR. d 80 × 60 pixels SR with 1 fps TR

Fig. 8

EERs [%] for the individual modalities and qualities. Note that the scales used differ among the graphs

Fig. 9

Rank-1 identification rate [%] for individual modalities and qualities. Note that the scales differ among the graphs

Table 3 EERs [%] for individual modalities and qualities
Table 4 Rank-1 identification rates [%] for individual modalities and qualities

A gait feature includes shape information combined with motion information, and thus the recognition accuracy of gait features is affected not only by the SR, but also by the TR. When the TR is 30 fps, the gait recognition accuracy degrades as the SR decreases, but the degradation is moderate during verification. In this range, we can see that the motion information provides a reasonable discriminative ability, and thus, even with a low SR, reasonable accuracy is maintained. In the middle TR range, the gait accuracy degrades as the SR decreases much more clearly than that at 30 fps during both verification and identification. It can be considered that, in this range, some of the motion information is missing, and the shape information plays a much more important role than it does in the high TR scenario. The shape information is missing when the SR is low, and this missing information affects the recognition accuracy directly. In contrast, under low TR conditions, the gait recognition accuracy does not vary much as the SR changes. In this range, the gait feature only provides a low level of information for discrimination, and this information does not decrease as the SR decreases during verification, but some useful information is lost for the identification process.

For the TR, the gait recognition accuracy generally degrades as the TR decreases, but there are some exceptions. We can see that the Rank-1 identification rate at 6 fps was lower than that at 5 fps and the rate at 3.75 fps was lower than that at 3 fps for the 85° view. This occurred because of temporal aliasing when an image sequence is generated by downsampling the original image sequence. Specifically, we downsampled the original image sequences to simulate image sequences with different frame rates. Gait is an almost bilaterally symmetrical motion, and thus if we downsample the image sequence of the side view (i.e., 85°) with a specific frame interval, images with the same stance appear repeatedly. As shown in Fig. 2, the first, second, and third silhouette images were similar to the fourth, fifth, and sixth images, respectively, for the 6 fps image sequences, and therefore the GEI at 6 fps contains less gait information than that at 5 fps. This is the cause of the inconsistency. With regard to the view, no large differences can be observed, but the recognition accuracy at 55° seems to be slightly worse than that of the other view.

Head features are essentially static information, and therefore their recognition accuracy is severely affected by the SR. For example, when the TR is 1 fps, the recognition accuracy degrades drastically as the SR decreases. The head feature is also affected by the TR. When the SR is 640 × 480 pixels, the recognition accuracy improves as the TR increases. The head modality is affected by a number of factors, such as illumination changes and pose. Because the head features were extracted from walking image sequences, the illumination and pose can differ in each frame. Consequently, data with a high TR can include more head features with a variety of illumination and pose conditions than data with a low TR, and this leads to the accuracy improvement. We also observe that the recognition accuracy of the head modality differs depending on the view. We think that three factors may be the cause. The first factor is that the texture information included in the head modality changes depending on the observation view. The second factor is that the observed shape of the head region also differs depending on the observation view. The third factor is the size of the head region. Because data with different views were captured from a single walking person by a single camera in this data set, the distance between the sensor and the subject differs for each view, as shown in Fig. 1, and this difference leads to the size difference.

For the height modality, we observe that the accuracy is affected not only by the SR but also by the TR. This is because the height of the subject is calculated by averaging the per-frame heights, and the height in each frame is affected by the posture. When the TR is low, the calculated height is heavily affected by the postures in the sampled frames, and this can lead to accuracy degradation. For the view quality, the impact of the view difference is not very large because height is essentially a view-invariant feature.

4.3 Evaluation protocols for score-level fusion

Protocol 1: The constructed score database can be used to develop and evaluate quality-independent score-level fusion approaches. Because our constructed score database contains many subsets with different quality settings, and the properties of each subset can therefore differ, it is possible to compare multiple algorithms on subsets with different properties. This is the main reason why we set Protocol 1 for the constructed score database. The purpose of this protocol is to enable benchmarking of quality-independent score-level fusion approaches. To prepare subsets with different properties, we selected two typical settings for each quality; i.e., we selected high and low SRs of 640 × 480 pixels and 80 × 60 pixels, respectively, high and low TRs of 30 fps and 1 fps, respectively, and two views of 85° (almost a side view) and 55° (an oblique view). Having considered all of the combinations of these qualities, we consequently prepared subsets with eight typical settings, denoted by \(Q_{HH85}\), \(Q_{HL85}\), \(Q_{LH85}\), \(Q_{LL85}\), \(Q_{HH55}\), \(Q_{HL55}\), \(Q_{LH55}\), and \(Q_{LL55}\), where the first and the second subscripts denote high (H) or low (L) for the SR and TR, respectively, and the third subscript denotes the view (e.g., \(Q_{HH85}\) corresponds to 640 × 480 pixels at 30 fps from the 85° view). For the evaluation, each subset was randomly divided into training and test sets that were disjoint with respect to the subjects, and twofold cross-validation was performed. This twofold cross-validation was repeated 10 times to reduce the effects of the random divisions. The scores were normalized before fusion (see footnote 5).

A variety of normalization schemes have been introduced in the literature, including min-max normalization, z-score normalization [38], and F-normalization [39], but some of these schemes (e.g., min-max normalization) are sensitive to outliers [16]. We therefore used a more statistically-based scheme, i.e., z-score normalization, because it is widely used in research into both multi-modal biometrics and score-level fusion [11]. Specifically, we computed an average \(\mu_{m}\) and a standard deviation \(\sigma_{m}\) for each modality \(m \in \{gait, head, height\}\) over the training set, and computed a normalized score \(\bar {S}_{m}\) from a raw score \(S_{m}\) as

$$ \bar{S}_{m}=\frac{S_{m}-\mu_{m}}{\sigma_{m}}. $$

Because we intend to perform normalization with respect to the modality, we simply picked a specific quality, i.e., \(Q_{HH85}\), to compute the average and the standard deviation for the normalization process and used these values for the different subsets.
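A minimal sketch of this normalization step is given below. It assumes that the population standard deviation is used (the text does not specify population versus sample) and that the statistics are fitted once on the training scores of a reference quality and then reused for every subset; the function names and score values are hypothetical.

```python
from statistics import mean, pstdev

def fit_zscore(train_scores):
    """Estimate (mu_m, sigma_m) for one modality from its training scores."""
    return mean(train_scores), pstdev(train_scores)

def apply_zscore(score, mu, sigma):
    """Normalized score: (S_m - mu_m) / sigma_m."""
    return (score - mu) / sigma

# Hypothetical raw training scores for one modality at the reference quality.
train_scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = fit_zscore(train_scores)

# The same (mu, sigma) is applied to scores from any other quality subset.
normalized = apply_zscore(9.0, mu, sigma)
```

Fitting the statistics on a single reference quality keeps the normalization a per-modality operation, so that quality-related shifts in the score distributions remain visible to the fusion stage.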

To evaluate the performance, we selected two settings: multi-modal fusion (for gait, head, and height) and bi-modal fusion (for gait and head). We then evaluated the accuracy in both verification and identification scenarios with typical measures such as ROC curves, EERs, FRRs at specific FARs, areas under the curve (AUCs), half total error rates (HTERs), where the HTER is the average of the FAR and the FRR, CMC curves, and rank-n identification rates for each subset. Here, the HTER is calculated based on [34]. We select an optimal threshold Δ under the assumption that genuine and imposter accesses are equally weighted, i.e., the threshold is set at a value that minimizes \(\frac {1}{2}\left (FAR\left (\Delta \right)+FRR\left (\Delta \right)\right)\). We set the threshold using the training dataset.
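The threshold selection for the HTER can be sketched as follows. This is an illustrative reading of the procedure and not the evaluation code behind the reported numbers: scores are treated as distances, candidate thresholds are taken from the observed training scores, and the function names and sample values are hypothetical.

```python
def far_frr(genuine, imposter, threshold):
    """FAR and FRR for distance scores: accept when score <= threshold."""
    frr = sum(g > threshold for g in genuine) / len(genuine)
    far = sum(i <= threshold for i in imposter) / len(imposter)
    return far, frr

def select_threshold(train_genuine, train_imposter):
    """Pick the threshold that minimizes (FAR + FRR) / 2 on the training set."""
    candidates = sorted(set(train_genuine) | set(train_imposter))
    return min(candidates,
               key=lambda t: sum(far_frr(train_genuine, train_imposter, t)) / 2)

def hter(test_genuine, test_imposter, threshold):
    """Half total error rate on the test set at a threshold fixed in advance."""
    far, frr = far_frr(test_genuine, test_imposter, threshold)
    return (far + frr) / 2

# Hypothetical fused scores.
delta = select_threshold([0.1, 0.2, 0.3, 0.6], [0.4, 0.5, 0.7, 0.8])
test_hter = hter([0.15, 0.35], [0.25, 0.9], delta)
```

Unlike the EER, the HTER is an a priori measure: the threshold is fixed on the training data, so any mismatch between training and test score distributions shows up directly in the reported error.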

Protocol 2: A major characteristic of the constructed score database is that this database is composed of multi-modal scores with multiple different qualities. Therefore, this score database is appropriate for evaluation of quality-dependent score-level fusion approaches. This protocol is intended for benchmarking of quality-dependent multi-modal score-level fusion approaches where the score database is disjoint in terms of both subjects and qualities, unlike Protocol 1, where the same qualities are shared between the training and test sets. The quality settings for the training and test sets are defined as shown in Table 12. Specifically, we selected three settings for the SRs: “640 × 480, 213 × 160, 128 × 96, 80 × 60”, “640 × 480, 80 × 60”, and “213 × 160, 128 × 96”. For the settings of “640 × 480, 213 × 160, 128 × 96, 80 × 60”, and “640 × 480, 80 × 60”, all test set qualities lie between the training qualities; the difference between these two spatial settings is the density of the quality. For the TRs, we considered four settings: “30, 10, 7.5, 5, 3, 1”, “30, 1”, “10, 7.5, 5, 3”, and “7.5, 5”. For the settings of “30, 10, 7.5, 5, 3, 1” and “30, 1”, all test set qualities lie between the training qualities; however, some of the test qualities do not lie between the training qualities in “10, 7.5, 5, 3”, and “7.5, 5”. Using a combination of these settings for the SRs and TRs, we designed the experiment for Protocol 2. We believe that this protocol is both interesting and important for score-level fusion research. In this protocol, the test sets are composed of scores with different quality settings. For the evaluation, we do not report on the accuracy of each test set separately, but report on the total accuracy based on merging of the test scores with different quality settings.

4.4 Benchmarks for score-level fusion

Protocol 1: Score-level fusion approaches mainly fall into three generic categories: transformation-based, classification-based, and probability density-based approaches, as described in subsection 2.2. We therefore provide a total of seven benchmarks based on a wide variety of score-level fusion approaches for Protocol 1. Specifically, we provide the sum rule (Sum) and the minimum rule (Min) for the transformation-based approaches [12]; an SVM with a radial basis function kernel [16] and RankSVM [28] for the classification-based approaches; and the GMM (see footnote 6) [15], LLR [14], and KDE [63] for the probability density-based approaches.
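The two transformation-based baselines are simple enough to state in code. The sketch below assumes that the inputs are already z-score-normalized distance scores, one per modality; the function names and values are hypothetical.

```python
def sum_rule(normalized_scores):
    """Sum rule: the fused score is the sum of the normalized modality scores."""
    return sum(normalized_scores)

def min_rule(normalized_scores):
    """Minimum rule: the fused score is the smallest normalized modality score."""
    return min(normalized_scores)

# Hypothetical normalized scores for (gait, head, height).
scores = [-1.0, 0.5, 0.25]
fused_sum = sum_rule(scores)
fused_min = min_rule(scores)
```

Both rules weight the modalities equally and independently of quality, which is why they serve as baselines against which adaptive methods are compared.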

Protocol 2: We provide both quality-independent and quality-dependent score-level fusion approaches to discuss the effectiveness of the quality-dependent approaches when compared with the quality-independent approaches. Specifically, we first select the three best approaches with respect to their verification performances in a quality-independent protocol: they are GMM, LLR, and RankSVM, in addition to Sum, which is used as a baseline. To estimate the parameters of these models in a quality-independent manner, we trained the parameters using the training set with the highest SR and TR from the side view (i.e., \(Q_{HH85}\)). Using these parameters, we can then realize quality-independent approaches. We also introduce the Q-stack [51], which is a concatenated vector of the multi-modal biometric scores and the quality measures. Using this Q-stack, we can then realize quality-dependent approaches. In our database, the Q-stack is defined as a six-dimensional vector that is composed of three biometric scores, i.e., \(S_{gait}\), \(S_{head}\), and \(S_{height}\), along with three quality measures: \(q_{SR}\) for SR, \(q_{TR}\) for TR, and \(q_{View}\) for view. Note that we define \(q_{SR}\) and \(q_{TR}\) based on a logarithmic scale of the spatial reduction and the frame rate, respectively, for numerical stability, and \(q_{View}\) is based on the view in units of radians. The Q-stack vectors of the training sets are then fed into each of the GMM, LLR, and RankSVM methods to create quality-dependent versions, which are called GMM (Q-stack), LLR (Q-stack), and RankSVM (Q-stack), respectively.
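Constructing a Q-stack vector amounts to concatenating the three scores with the three quality measures. In the sketch below, the base-2 logarithm and the exact arguments of the logarithm (a spatial scaling factor such as 1/8, and the frame rate in fps) are assumptions, since the text only states that a logarithmic scale is used; the function name `q_stack` is hypothetical.

```python
import math

def q_stack(s_gait, s_head, s_height, sr_scale, fps, view_deg):
    """Six-dimensional Q-stack vector: three scores plus three quality measures.

    sr_scale is the spatial scaling factor relative to the original image
    (e.g., 1/8 for 80 x 60 from 640 x 480); the log base is an assumption.
    """
    q_sr = math.log2(sr_scale)        # logarithmic scale of spatial reduction
    q_tr = math.log2(fps)             # logarithmic scale of frame rate
    q_view = math.radians(view_deg)   # view angle in radians
    return [s_gait, s_head, s_height, q_sr, q_tr, q_view]

# Hypothetical normalized scores at 80 x 60 pixels, 30 fps, 85-degree view.
vector = q_stack(0.5, 0.2, 1.3, 1 / 8, 30, 85)
```

A classifier or density model trained on such vectors can learn quality-dependent decision boundaries, which is what distinguishes the Q-stack variants from their quality-independent counterparts.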

4.5 Evaluation result for score-level fusion

Protocol 1: The performance evaluation results for Protocol 1 for multi-modal fusion are shown as ROC and CMC curves in Figs. 10 and 11. In addition, we selected some typical measures: the EERs, the FRRs at 1% and 10% FARs, and the HTERs, along with the AUCs of the ROC curves, which are related to verification and are summarized in Tables 5, 6, 7, 8 and 9, and the Rank-1 and Rank-5 identification rates, which are summarized in Table 10, for both the multi-modal and bi-modal fusion types.

Fig. 10

ROC curves for Protocol 1 (the first cross-validation set) for multi-modal fusion. Note that the scales differ from graph to graph. a \(Q_{HH85}\). b \(Q_{HL85}\). c \(Q_{LH85}\). d \(Q_{LL85}\). e \(Q_{HH55}\). f \(Q_{HL55}\). g \(Q_{LH55}\). h \(Q_{LL55}\)

Fig. 11

CMC curves for Protocol 1 (the first cross-validation set) for multi-modal fusion. Note that the scales differ from graph to graph. a \(Q_{HH85}\). b \(Q_{HL85}\). c \(Q_{LH85}\). d \(Q_{LL85}\). e \(Q_{HH55}\). f \(Q_{HL55}\). g \(Q_{LH55}\). h \(Q_{LL55}\)

Table 5 EER [%] for Protocol 1
Table 6 FRR [%] at 1% FAR (\(FRR_{1\%}\)) for Protocol 1
Table 7 FRR [%] at 10% FAR (\(FRR_{10\%}\)) for Protocol 1
Table 8 AUC [%] for Protocol 1
Table 9 HTER [%] for Protocol 1
Table 10 Rank-1/5 identification rates for Protocol 1

From these multi-modal and bi-modal fusion results, we observed that the accuracy of multi-modal fusion is better than that of bi-modal fusion in all cases for verification and in the majority of cases for identification. Moreover, we see that some of the probability density-based methods, such as GMM and LLR, and a classification-based approach, RankSVM, perform stably and well in verification scenarios (see Fig. 10); probability density-based approaches guarantee optimality in terms of the ROC [64] on the condition that the estimated probability densities are correct. One interesting observation is that the performance of the KDE, which is also a probability density-based approach, is unstable and is heavily dependent on the quality settings. The KDE assumes that each score is independent. We therefore believe that this independence assumption may be a cause of the unstable performance. From the viewpoint of the quality settings, we can see that both the SRs and the TRs have major effects on the accuracy of the fusion approaches. In the case of both high SR and high TR (e.g., \(Q_{HH85}\)), because all the modalities work relatively well, the improvements in accuracy when compared with the Sum as a baseline are not large (e.g., 1.9% EER for Sum and 1.4% EER for the LLR for multi-modal fusion). In the case of high SR and low TR (e.g., \(Q_{HL85}\)), because the head modality still works well while the gait modality does not work because of the low frame rate, the accuracy improvement when compared with the Sum is significant. While the EER of the Sum is 23.0%, the EERs of the RankSVM and the LLR are 7.0 and 7.3%, respectively, for multi-modal fusion. These results indicate the importance of adaptive weighting of the modalities. Similarly, we can see accuracy improvements in the case of low SR with high TR, in which case the head modality does not work but the gait modality still works. The view also affects the accuracy of the fusion approaches. When comparing the EERs for the 85° and 55° views under low SR and low TR conditions for multi-modal fusion, the GMM, LLR and RankSVM achieve better accuracy for the 85° view, but the LLR and RankSVM achieve better accuracy for the 55° view.

With regard to the identification scenarios (see Fig. 11 and Table 10), the accuracy trend for each benchmark is, however, slightly different from that for the verification scenario, i.e., the RankSVM achieves the best or second best results for all qualities except for the Rank-1 identification rate of \(Q_{LL85}\) for multi-modal fusion. This point will be discussed in greater depth in the discussion section. We then select the best methods with respect to both verification and identification for the subsequent evaluation of Protocol 2. Specifically, we selected the three best methods, GMM, LLR, and RankSVM, and used them in the following accuracy analysis for the quality-dependent settings, i.e., for Protocol 2.

Protocol 2: First, we evaluated the accuracies of the quality-independent and quality-dependent approaches (e.g., using training Set 1), drew the ROC and CMC curves shown in Fig. 12, and summarized the EERs and the FRRs at FARs of 1 and 10%, along with the AUCs, HTERs, and Rank-1 and Rank-5 identification rates, in Table 11. In the verification scenario, we see that the quality-independent approaches such as Sum, GMM, LLR, and RankSVM do not perform well, with EERs of more than 25%. Conversely, the quality-dependent approaches, such as LLR (Q-stack) and RankSVM (Q-stack), achieve much higher accuracies than the quality-independent approaches (e.g., the EERs for the LLR (Q-stack) and the RankSVM (Q-stack) are 13.0 and 20.9%, respectively, while the EERs of the LLR and RankSVM are 25.8 and 28.8%, respectively). One exception to this trend is GMM (Q-stack). The accuracy of GMM (Q-stack) is worse than that of a quality-independent approach. This exceptionally poor performance by the GMM (Q-stack) is caused by degeneration of the GMM covariance matrices due to the discrete training qualities used (see Table 12). As a result, the probability density for a test quality that differs from the training qualities becomes zero for both positives and negatives, and the fused score therefore loses its discrimination capability.

Fig. 12

ROC (left) and CMC (right) curves for the quality-independent and quality-dependent approaches for Protocol 2

Table 11 EER, \(FRR_{1\%}\), \(FRR_{10\%}\), AUC, HTER and Rank-1/5 identification rates of quality-independent and quality-dependent approaches for Protocol 2
Table 12 Qualities for training and test sets in Protocol 2

In the identification scenario, we can see that the quality-independent approaches generally work better than the quality-dependent approaches. This result may be related to the evaluation settings. In this paper, although biometric data with multiple qualities were prepared, cross-quality matching was not considered; in other words, the qualities of the probe gait features are always the same as those of the gallery gait features. Under this setting, the quality information given by the Q-stack cannot be useful in identification because all gallery gait features matched against a probe gait feature have the same quality measures. Therefore, we cannot improve the identification accuracy by employing the Q-stack alone under this setting. This implies that additional quality-dependent approaches are required to improve the identification accuracy in scenarios where the qualities of the probe and the gallery are the same.

Second, we evaluated the quality-dependent fusion performance when using different training sets to analyze how the performance varied. For this purpose, we selected LLR (Q-stack), the best approach for Set 1, to act as a benchmark. Figure 13 shows the ROC and CMC curves, and Table 13 summarizes the EERs and the FRRs at FARs of 1 and 10%, along with the AUCs, the HTERs, and the Rank-1 and Rank-5 identification rates.

Fig. 13

ROC (left) and CMC (right) curves for the quality-dependent approach LLR (Q-stack) when using different training sets for Protocol 2

Table 13 EER, \(FRR_{1\%}\), \(FRR_{10\%}\), AUC, HTER and Rank-1/5 identification rates of the quality-dependent approach using the different training sets for Protocol 2

From these results, we can make some interesting observations. When we fix the TR quality setting for training, the SR quality setting for training does not have much impact on the accuracy for either verification or identification. For example, the EERs of Set 1, Set 5, and Set 9 are 13.0, 13.1, and 12.9%, respectively, and the Rank-1 rates of Set 1, Set 5, and Set 9 are 31.1, 32.2, and 29.5%, respectively. When we fix the SR quality setting for training, the TR quality setting for training has a large impact on the recognition accuracy, and the impacts on verification and identification differ. For verification, in cases where the test TR qualities lie between the training TR qualities, the EERs of Set 1 and Set 2 are both 13.0%, and the EERs of Set 9 and Set 10 are almost the same (i.e., 12.9 and 13.0%) even though the densities of the TR training qualities are different; in cases where some test TR qualities lie outside the training qualities, however, the verification accuracy becomes worse. For example, the EERs of Set 4 and Set 12 are 15.8 and 15.2%, respectively. For identification, on the other hand, when the TR qualities of “7.5, 5” are used for training, we observed that the Rank-1 rate is high even though some test TR qualities lie outside the training TR qualities. As discussed in the first experiment of Protocol 2, this may be related to the evaluation settings, where the qualities of the gallery and the probe are the same. We conjecture, however, that some specific fusion parameters may work well across many quality settings for identification.

5 Discussion and future work

Inconsistency between verification and identification scenarios: As described in the section on the evaluation results for score-level fusion, the benchmark performance depends on whether the scenario is verification or identification. Since Nandakumar et al. [65] proposed a method to extend the fusion for the verification scenario (i.e., using a likelihood ratio-based approach) into that for the identification scenario, it may be expected that the best benchmark for verification will also achieve the highest accuracy in the identification scenario. Their method [65] imposes the strong assumption that the genuine and imposter scores are drawn from the same distribution, regardless of the subjects. However, subject dependence of the score distribution has been observed in our score database, and the good benchmarks are therefore not consistent between the verification and identification scenarios on our score database. In fact, a recent study [66] performed a preliminary experiment to demonstrate that a low-performance verification system may still achieve good performance in an identification scenario. Additionally, DeCann and Ross demonstrated in [43] that sets of genuine and imposter scores that generate the same ROC curve can generate different CMC curves. This is because the verification performance is dependent on the aggregated distributions of the genuine and imposter scores, while the identification performance is dependent on probe-dependent ranking statistics. Suitable approaches for the verification and identification scenarios can therefore differ in principle. It was also reported in [43] that this type of difference between the ROC and CMC curves tends to be particularly pronounced for soft biometrics such as gait. It is therefore reasonable in this case that different benchmarks yielded the higher accuracies for the verification and identification scenarios. Specifically, the probability density-based approaches consider the aggregated score distributions directly and thus yielded higher accuracies for verification. In contrast, because the RankSVM considers the probe-dependent rank statistics directly, it yielded the best or second best accuracies for the identification scenarios for multi-modal fusion, which is consistent with the discussion above.

Multiple algorithms: While we considered multiple scores that were derived from multi-modal biometrics and provided a single matcher for each modality in this work, it is also possible to introduce multiple scores that are derived from multiple matchers, as in the existing biometric score database NIST-Multimodal [37], which contains scores that are derived from multiple face matchers. From a quality-dependent score-level fusion viewpoint, it is particularly interesting to introduce matchers with different sensitivities to the quality measures that were used in this paper (i.e., SR and TR). For example, gait feature representations that are encoded with more temporal and/or motion information (e.g., [67, 68]) may be sensitive to TR variations (i.e., yielding higher accuracies for higher TRs), while those that are encoded with more static (shape) information (e.g., [69]) may be insensitive to the TR.

Additionally, we can improve the sensitivity to both the SR and the TR by incorporating spatial and/or temporal super-resolution techniques [23]. In fact, spatially super-resolved face images are used to fuse gait with side-view face in [70]. Construction of an advanced multi-modal biometric score database that includes both multiple matchers and super-resolution techniques would therefore be an interesting future research direction.

Extension to real scenes: While we treat the SR and TR as quality measures in the context of the fusion of gait, head, and height biometrics, many more qualities can be considered in real situations. In particular, because our biometric score database was built upon the OU-ISIR Gait Database, Large Population Dataset [27], which was collected in a relatively controlled situation (e.g., indoors, with controlled illumination and a predefined course) with a limited observation view, we must consider the variety of covariate factors that may arise in a real situation. For example, background motion artifacts caused by trees or additional persons may degrade the silhouette qualities required for gait biometrics, and illumination changes caused by cloud cover could dramatically change head textures. It would therefore be useful to collect biometric scores in real situations while using additional quality measures (e.g., illumination changes, view changes, clothing, and silhouette qualities) to achieve a more sophisticated fusion approach. This will require not only data collection but also the development of algorithms to measure the associated quality, specifically sample-based quality, and will be included in our future research. We expect that Q-stack-based quality-dependent fusion can improve accuracy even for identification if sample-based quality measures are employed, because the qualities of the probe and the gallery are not always the same. Because fusion of a high-dimensional Q-stack vector containing a plethora of multi-modal multi-matcher biometric scores along with the quality measures described above is a more challenging task, it is also essential to add a benchmark for score-level fusion that incorporates dimensionality reduction of the quality measures as a preprocessing step (e.g., [71]) to avoid both the risk of overfitting and the curse of dimensionality.
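As a rough illustration of why the Q-stack vector grows quickly, the following sketch concatenates the matching scores of one probe-gallery comparison with sample-based quality measures of both the probe and the gallery. The function name `q_stack_vector`, the dictionary layout, and all numeric values are hypothetical and are not part of the database or its protocols.

```python
def q_stack_vector(scores, probe_quality, gallery_quality):
    """Concatenate matching scores with probe and gallery quality measures."""
    if sorted(probe_quality) != sorted(gallery_quality):
        raise ValueError("probe and gallery must report the same quality measures")
    modalities = sorted(scores)        # fixed ordering keeps all vectors aligned
    measures = sorted(probe_quality)
    return ([scores[m] for m in modalities]
            + [probe_quality[q] for q in measures]
            + [gallery_quality[q] for q in measures])


# Hypothetical values for one probe-gallery comparison.
scores = {"gait": 0.42, "head": 0.77, "height": 0.10}
probe_quality = {"SR": 0.25, "TR": 5.0}      # e.g., low-resolution, low frame-rate probe
gallery_quality = {"SR": 1.0, "TR": 30.0}    # e.g., full-quality gallery

vector = q_stack_vector(scores, probe_quality, gallery_quality)
print(len(vector), vector)
```

Three scores and two quality measures per sample already give a seven-dimensional input, and every additional matcher or quality measure lengthens the vector further, which is where dimensionality reduction becomes relevant.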

6 Conclusion

We constructed a single sensor-based multi-quality multi-modal biometric score database built on the OU-ISIR Gait Database, Large Population Dataset. We extracted the gait, head, and height biometrics from a single walking image sequence and considered the SR, the TR, and the view as quality measures. As a result, the database contains an extremely large number of biometric scores: approximately 4 million genuine scores and 7.5 billion imposter scores for a total of 280 combinations of SRs, TRs, and views (seven SRs × 10 TRs × four views). We defined two protocols for quality-independent and quality-dependent score-level fusion and provided performance evaluation results with several benchmarks for each protocol, which will advance research into quality-dependent score-level fusion. The constructed score database has been made available to the research community together with the benchmark results. Because score-level fusion is a promising technique for improving recognition accuracy, we believe that the database and its benchmark results will be highly beneficial for biometric researchers.

7 Endnotes

1 The database and evaluation protocol settings are available at

2 Because the distance from the camera to the subject is sufficiently large compared with the subject size, the assumption of weak perspective projection with respect to the subject holds approximately, and we can thus simulate image sequences of the same subjects at different distances.

3 While 30 frame-skipped image sequences can in principle be generated by downsampling an original 30 fps video to 1 fps, only 19 frame-skipped image sequences are used. Because the gait image sequences of some subjects contain only 19 frames, which is the minimum across the dataset, we maintain consistency among all subjects by limiting the number to 19.
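One plausible reading of this frame-skipping procedure is sketched below: each starting offset within the skip interval yields one downsampled sequence, and the number of sequences is capped at 19. The function name `frame_skipped_sequences` and the 45-frame example are assumptions made for illustration, not the authors' implementation.

```python
def frame_skipped_sequences(frames, original_fps=30, target_fps=1, max_sequences=19):
    """Return temporally downsampled copies of `frames`, one per starting offset."""
    step = original_fps // target_fps                  # e.g., 30 for 30 fps -> 1 fps
    n_offsets = min(step, len(frames), max_sequences)  # cap as described in the endnote
    return [frames[offset::step] for offset in range(n_offsets)]


sequences = frame_skipped_sequences(list(range(45)))   # hypothetical 45-frame sequence
print(len(sequences), sequences[0], sequences[-1])
```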

4 Matching score distance matrices for each modality and quality setting will be published separately in comma-separated values (CSV) format, where the rows and columns correspond to the probes and the galleries, respectively.
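A minimal sketch of how such a matrix might be consumed is given below. It assumes a header-less, purely numeric CSV layout and separately supplied subject IDs for the probes and galleries; the actual file layout, the function names, and the toy values are assumptions and are not specified in this endnote.

```python
import csv
import io


def read_distance_matrix(text):
    """Parse header-less CSV text into rows of floats (rows: probes, columns: galleries)."""
    return [[float(v) for v in row] for row in csv.reader(io.StringIO(text)) if row]


def genuine_and_imposter(dist, probe_ids, gallery_ids):
    """Split matrix entries by whether the probe and gallery subject IDs agree."""
    genuine, imposter = [], []
    for i, row in enumerate(dist):
        for j, d in enumerate(row):
            (genuine if probe_ids[i] == gallery_ids[j] else imposter).append(d)
    return genuine, imposter


csv_text = "0.1,0.9,0.7\n0.8,0.2,0.6\n"       # hypothetical 2 probes x 3 galleries
dist = read_distance_matrix(csv_text)
gen, imp = genuine_and_imposter(dist, ["s1", "s2"], ["s1", "s2", "s3"])
print(gen, imp)
```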

5 Note that score normalization does not essentially affect the performance for training-based approaches (with the exceptions of Sum and Min) because any differences in the score scales are absorbed in the training process.

6 The number of mixture components is selected between 1 and 20 so as to optimize the minimum description length criterion, and the random selection of the initial seeds is repeated 10 times for each number of mixture components to mitigate the effects of randomness.
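The selection loop described in this endnote can be sketched as follows. This is a simplified illustration under stated assumptions: it fits one-dimensional mixtures by plain EM (the benchmark operates on multi-dimensional score vectors), and it uses a common two-part form of the description length, the negative log-likelihood plus (p/2) log N for p free parameters, which may differ from the exact criterion used by the authors. All function names are hypothetical.

```python
import math
import random


def _e_step(data, weights, means, variances):
    """Return per-sample responsibilities and the total log-likelihood."""
    resp, log_lik = [], 0.0
    for x in data:
        p = [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
             for w, m, v in zip(weights, means, variances)]
        total = sum(p) or 1e-300           # guard against numerical underflow
        log_lik += math.log(total)
        resp.append([pc / total for pc in p])
    return resp, log_lik


def fit_gmm_1d(data, k, rng, n_iter=50, var_floor=1e-3):
    """Fit a k-component 1-D Gaussian mixture by EM and return its log-likelihood."""
    n = len(data)
    mean_all = sum(data) / n
    var_all = max(sum((x - mean_all) ** 2 for x in data) / n, var_floor)
    # Random initial seeds: k distinct samples serve as the initial means.
    weights, means, variances = [1.0 / k] * k, rng.sample(data, k), [var_all] * k
    for _ in range(n_iter):
        resp, _ll = _e_step(data, weights, means, variances)
        for c in range(k):                 # M-step: re-estimate each component
            nc = sum(r[c] for r in resp) or 1e-300
            weights[c] = nc / n
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            var_c = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / nc
            variances[c] = max(var_c, var_floor)
    return _e_step(data, weights, means, variances)[1]


def mdl(log_lik, k, n):
    """Two-part description length: k means, k variances, and k - 1 free weights."""
    return -log_lik + 0.5 * (3 * k - 1) * math.log(n)


def select_components(data, max_k=20, restarts=10, seed=0):
    """Pick the component count with the smallest MDL over several random restarts."""
    rng = random.Random(seed)
    best_k, best_score = None, math.inf
    for k in range(1, max_k + 1):
        best_ll = max(fit_gmm_1d(data, k, rng) for _ in range(restarts))
        score = mdl(best_ll, k, len(data))
        if score < best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Hypothetical bimodal scores; smaller limits than in the endnote keep the demo fast.
demo_rng = random.Random(0)
demo = [demo_rng.gauss(-5, 1) for _ in range(60)] + [demo_rng.gauss(5, 1) for _ in range(60)]
print(select_components(demo, max_k=4, restarts=3))
```

Repeating the fit from several random seeds and keeping the best likelihood for each component count reduces the chance that a poor initialization distorts the comparison between counts.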

References

  1. Jain AK, Ross A, Pankanti S (2006) Biometrics: a tool for information security. IEEE Trans Inf Forensics Secur 1(2): 125–143.

  2. Bouchrika I, Goffredo M, Carter J, Nixon M (2011) On using gait in forensic biometrics. J Forensic Sci 56(4): 882–889.

  3. Iwama H, Muramatsu D, Makihara Y, Yagi Y (2013) Gait verification system for criminal investigation. IPSJ Trans Comput Vis Appl 5: 163–175.

  4. Jain AK, Ross A (2004) Multibiometric systems. Commun ACM 47(1): 34–40. doi:10.1145/962081.962102.

  5. Ross AA, Nandakumar K, Jain AK (2006) Handbook of Multibiometrics. Int. Series on Biometrics. Springer-Verlag New York, Inc., Secaucus.

  6. Besbes F, Trichili H, Solaiman B (2008) Multimodal biometric system based on fingerprint identification and iris recognition In: Proc. of the 3rd International Conference on Information and Communication Technologies: From Theory to Applications (ICTTA 2008), 1–5. IEEE, Damascus.

  7. Fierrez-Aguilar J, Ortega-Garcia J, Gonzalez-Rodriguez J, Bigun J (2005b) Discriminative multimodal biometric authentication based on quality measures. Pattern Recog 38(5): 777–779.

  8. Jillela R, Ross A (2012) Mitigating effects of plastic surgery: Fusing face and ocular biometrics In: Proc. of the 5th IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS 2012), 402–411. doi:10.1109/BTAS.2012.6374607.

  9. Hofmann M, Schmidt SM, Rajagopalan A, Rigoll G (2012) Combined face and gait recognition using alpha matte preprocessing In: Proc. of the 5th IAPR International Conference on Biometrics, 1–8. IEEE, New Delhi, India.

  10. Zhou X, Bhanu B (2008) Feature fusion of side face and gait for video-based human identification. Pattern Recog 41(3): 778–795.

  11. Muramatsu D, Iwama H, Makihara Y, Yagi Y (2013) Multi-view multi-modal person authentication from a single walking image sequence In: Proc. of the International Conference on Biometrics, 1–8. doi:10.1109/ICB.2013.6612979.

  12. Kittler J, Hatef M, Duin RPW, Matas J (1998) On combining classifiers. IEEE Trans Pattern Anal Mach Intell 20(3): 226–239.

  13. Dass SC, Nandakumar K, Jain AK (2005) A principled approach to score level fusion in multimodal biometric systems In: Proc. of the Audio and Video based Biometric Person Authentication, 1049–1058. Springer Berlin Heidelberg, NY.

  14. Alonso-Fernandez F, Fierrez J, Ramos D, Ortega-Garcia J (2007) Dealing with sensor interoperability in multi-biometrics: the upm experience at the biosecure multimodal evaluation In: Proc. of the SPIE Defense and Security Symposium, Biometric Technologies for Human Identification. BTHI, vol 6944. SPIE, Orlando, FL, USA.

  15. Nandakumar K, Chen Y, Dass SC, Jain AK (2008) Likelihood ratio-based biometric score fusion. IEEE Trans Pattern Anal Mach Intell 30(2): 342–347.

  16. He M, Horng SJ, Fan P, Run RS, Chen RJ, Lai JL, Khan M, Sentosa KO (2010) Performance evaluation of score level fusion in multimodal biometric systems. Pattern Recogn 43(5): 1789–1800.

  17. Makihara Y, Muramatsu D, Hossain M, Iwama H, Ngo T, Yagi Y (2014) Score-level fusion by generalized delaunay triangulation. Proc of the 2nd International Joint Conference on Biometrics. IEEE, Clearwater.

  18. Bengio S, Marcel C, Marcel S, Mariethoz J (2002) Confidence measures for multimodal identity verification. Inf Fusion 3(4): 267–276.

  19. Krichen E, Garcia-Salicetti S, Dorizzi B (2007) A new probabilistic iris quality measure for comprehensive noise detection In: Proc. of the International Conference on Biometrics: Theory, Applications, and Systems, 1–6. doi:10.1109/BTAS.2007.4401906.

  20. Poh N, Bourlai T, Kittler J (2010) A multimodal biometric test bed for quality-dependent, cost-sensitive and client-specific score-level fusion algorithms. Pattern Recogn 43(3): 1094–1105.

  21. Harriero A, Ramos D, Gonzalez-Rodriguez J, Fierrez J (2009) Analysis of the utility of classical and novel speech quality measures for speaker verification. In: Tistarelli M, Nixon MS (eds) ICB 2009. LNCS, vol 5558, 434–442. Springer, Heidelberg.

  22. Muramatsu D, Shiraishi A, Makihara Y, Uddin M, Yagi Y (2015) Gait-based person recognition using arbitrary view transformation model. IEEE Trans Image Proc 24(1): 140–154.

  23. Akae N, Mansur A, Makihara Y, Yagi Y (2012) Video from nearly still: an application to low frame-rate gait recognition In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1537–1543. IEEE, Providence.

  24. Kimura T, Makihara Y, Muramatsu D, Yagi Y (2014) Quality-dependent score-level fusion of face, gait, and the height biometrics. IPSJ Trans Comput Vis Appl 6: 543–547.

  25. Toh K, Yau W, Lim E, Chen L, Ng C (2004) Fusion of auxiliary information for multi-modal biometrics authentication. In: Zhang D, Jain AK (eds) ICBA 2004. LNCS, vol 3072, 678–685. Springer, Heidelberg.

  26. Zhou X, Bhanu B (2006) Feature fusion of face and gait for human recognition at a distance in video In: Proc. of the 18th International Conference on Pattern Recognition, vol 4. Hong Kong, China, 529–532. IEEE.

  27. Iwama H, Okumura M, Makihara Y, Yagi Y (2012) The OU-ISIR gait database comprising the large population dataset and performance evaluation of gait recognition. IEEE Trans Inf Forensics Secur 7(5): 1511–1521.

  28. Chapelle O, Keerthi SS (2010) Efficient algorithms for ranking with SVMs. Inf Retr 13(3): 201–215.

  29. Kimura T, Makihara Y, Muramatsu D, Yagi Y (2015) Single sensor-based multi-quality multi-modal biometric score database and its performance evaluation In: Proc. of the International Conference on Biometrics, 1–8. IEEE, Phuket.

  30. Jain AK, Nandakumar K, Ross A (2005) Score normalization in multimodal biometric systems. Pattern Recogn 38(12): 2117–2127.

  31. Poh N, Bengio S (2005a) Can chimeric persons be used in multimodal biometric authentication experiments? In: Renals S, Bengio S (eds) MLMI 2005. LNCS, vol 3869, 87–100. Springer, Heidelberg.

  32. Wayman JL (2006) A path forward for multi-biometrics In: Proc. of the IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 1069–1072. IEEE, Toulouse.

  33. Ortega-Garcia J, Fiérrez J, Alonso-Fernandez F, Galbally J, Freire MR, Gonzalez-Rodriguez J, García-Mateo C, Alba-Castro JL, González-Agulla E, Muras EO, Garcia-Salicetti S, Allano L, Ly VB, Dorizzi B, Kittler J, Bourlai T, Poh N, Deravi F, Ng MWR, Fairhurst MC, Hennebert J, Humm A, Tistarelli M, Brodo L, Richiardi J, Drygajlo A, Ganster H, Sukno F, Pavani SK, Frangi AF, Akarun L, Savran A (2010) The multiscenario multienvironment biosecure multimodal database (BMDB). IEEE Trans Pattern Anal Mach Intell 32(6): 1097–1111.

  34. Poh N, Bengio S (2006) Database, protocols and tools for evaluating score-level fusion algorithms in biometric authentication. Pattern Recogn 39(2): 223–233.

  35. Messer K, Matas J, Kittler J, Lüttin J, Maitre G (1999) XM2VTSDB: The Extended M2VTS Database In: Second international conference on audio and video-based biometric person authentication, 72–77, Washington, D.C.

  36. The banca score database. Accessed 15 Dec 2016.

  37. NIST biometric scores set. This database used to be available on website but currently it is unavailable.

  38. Auckenthaler R, Carey M, Lloyd-Thomas H (2000) Score normalization for text-independent speaker verification systems. Digit Signal Proc 10(1): 42–54.

  39. Poh N, Bengio S (2005b) F-ratio client-dependent normalisation on biometric authentication tasks In: Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 721–724. IEEE, Philadelphia.

  40. Fierrez-Aguilar J, Ortega-Garcia J, Gonzalez-Rodriguez J (2004) ICBA 2004. LNCS, vol 3072. In: Zhang D, Jain AK (eds), 498–504. Springer, Heidelberg.

  41. Fierrez-Aguilar J, Ortega-Garcia J, Gonzalez-Rodriguez J (2005a) Target dependent score normalization techniques and their application to signature verification. IEEE Trans Syst Man Cybern Part C Appl Rev 35(3): 418–425.

  42. Martín-Félez R, Xiang T (2012) Gait recognition by ranking In: Proc. of the European conference on Computer Vision, 328–341. Springer Berlin Heidelberg, Florence.

  43. DeCann B, Ross A (2013) Relating ROC and CMC curves via the biometric menagerie In: Proc. of the IEEE sixth international conference on biometrics: theory, applications and systems (BTAS), 1–8. doi:10.1109/BTAS.2013.6712705.

  44. Makihara Y, Hossain M, Muramatsu D, Yagi Y (2011) Score-level fusion based on the direct estimation of the bayes error gradient distribution In: Proc. of the International Joint Conference on Biometrics, 1–8. IEEE, Washington D.C., USA.

  45. Poh N, Kittler J (2012) A unified framework for biometric expert fusion incorporating quality measures. IEEE Trans Pattern Anal Mach Intell 34(1): 3–18.

  46. Grother P, Tabassi E (2007) Performance of biometric quality measures. IEEE Trans Pattern Anal Mach Intell 29(4): 531–543.

  47. Alonso-Fernandez F, Fierrez J, Ortega-Garcia J (2012) Quality measures in biometric systems. Secur Priv IEEE 10(6): 52–62.

  48. Wong R, Poh N, Kittler J, Frohlich D (2010) Interactive quality-driven feedback for biometric systems In: Proc. of the IEEE 4th International Conference on Biometrics: Theory, Applications, and Systems, 1–6. IEEE, Washington D.C, USA.

  49. Bhatt HS, Bharadwaj S, Vatsa M, Singh R, Ross A, Noore A (2011) A framework for quality-based biometric classifier selection In: Proc. of the International Joint Conference on Biometrics, 1–7. IEEE, Washington D.C.

  50. Nandakumar K, Chen Y, Dass S, Jain A (2006) Quality-based score level fusion in multibiometric systems In: Proc. of the International Conference on Pattern Recognition, vol 4, 473–476. IEEE, Hong Kong.

  51. Kryszczuk K, Drygajlo A (2007) Improving classification with class-independent quality measures: Q-stack in face verification In: Proc. of the International Conference on Biometrics, 1124–1133. Springer Berlin Heidelberg, Seoul.

  52. Poh N, Kittler J, Bourlai T (2007b) Improving biometric device interoperability by likelihood ratio-based quality dependent score normalization In: Proc. of the IEEE 3rd International Conference on Biometrics: Theory, Applications and Systems, 1–5. IEEE, Crystal City, VA.

  53. Poh N, Heusch G, Kittler J (2007a) On combination of face authentication experts by a mixture of quality dependent fusion classifiers In: Proc. of the 7th International Workshop on Multiple Classifier Systems, 344–356. Springer-Verlag, Prague, Czech Republic.

  54. Makihara Y, Yagi Y (2008) Silhouette extraction based on iterative spatio-temporal local color transformation and graph-cut segmentation In: Proc. of the 19th International Conference on Pattern Recognition. IEEE, Tampa, Florida USA.

  55. Intel open source computer vision library. Accessed 10 Aug 2016.

  56. Han J, Bhanu B (2006) Individual recognition using gait energy image. IEEE Trans Pattern Anal Mach Intell 28(2): 316–322.

  57. Zou WW, Yuen PC (2012) Very low resolution face recognition problem. IEEE Trans Image Process 21(1): 327–340.

  58. Wang Z, Miao Z, Wu QJ, Wan Y, Tang Z (2014) Low-resolution face recognition: a review. Vis Comput 30(4): 359–386.

  59. Tan C, Kumar A (2013) Towards online iris and periocular recognition under relaxed imaging constraints. IEEE Trans Image Process 22(10): 3751–3765.

  60. Ito K, Aoki T, Hosoi T, Kobayashi K (2011) Face recognition using phase-based correspondence matching In: Proc. of the IEEE International Conference on Automatic Face Gesture Recognition and Workshops, 173–178. FG. doi:10.1109/FG.2011.5771393.

  61. Asthana A, Jones M, Marks T, Tieu K, Goecke R (2011) Pose normalization via learned 2D warping for fully automatic face recognition In: Proc. of the British Machine Vision Conference, 1–11.

  62. Tzimiropoulos G, Pantic M (2014) Gauss-newton deformable part models for face alignment in-the-wild In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1851–1858.

  63. Ulery B, Fellner W, Hallinan P, Hicklin A, Watson C (2006) Studies of biometric fusion. appendix c. evaluation of selected biometric fusion techniques. NIST Interagency Rep 7346: 1–14.

  64. Neyman J, Pearson E (1933) On the problem of the most efficient tests of statistical hypotheses. Philos Trans R Soc Lond 231: 289–337.

  65. Nandakumar K, Jain AK, Ross A (2009) Fusion in multibiometric identification systems: What about the missing data? In: Proc. of the 3rd International Conference on Biometric, 743–752.

  66. DeCann B, Ross A (2012) Can a “poor” verification system be a “good” identification system? a preliminary study In: Proc. of the IEEE International Workshop on Information Forensics and Security, 31–36. doi:10.1109/WIFS.2012.6412621.

  67. Wang C, Zhang J, Wang L, Pu J, Yuan X (2012) Human identification using temporal information preserving gait template. IEEE Trans Pattern Anal Mach Intell 34(11): 2164–2176. doi:10.1109/TPAMI.2011.260.

  68. Bashir K, Xiang T, Gong S (2009) Gait representation using flow fields In: Proc. of the 20th British Machine Vision Conference, 1–11.

  69. Guan Y, Li CT, Choudhury S (2013) Robust gait recognition from extremely low frame-rate videos In: Proc. of the International Workshop on Biometrics and Forensics (IWBF), 1–4. doi:10.1109/IWBF.2013.6547319.

  70. Zhou X, Bhanu B (2007) Integrating face and gait for human recognition at a distance in video. IEEE Trans Syst Man Cybern Part B Cybern 37(5): 1119–1137. doi:10.1109/TSMCB.2006.889612.

  71. Kryszczuk K, Poh N (2010) Handling high dimensionality in biometric classification with multiple quality measures using locality preserving projection In: IEEE Computer Society and IEEE Biometrics Council Workshop on Biometrics 2010, 1–8. IEEE, San Francisco.


This work was partly supported by JSPS Grants-in-Aid for Scientific Research (A) 15H01693, “R&D Program for Implementation of Anti-Crime and Anti-Terrorism Technologies for a Safe and Secure Society”, Strategic Funds for the Promotion of Science and Technology of the Ministry of Education, Culture, Sports, Science and Technology, the Japanese Government, and the JST CREST “Behavior Understanding based on Intention-Gait Model” project.

MZU generated score database. MZU and TK participated in the accuracy evaluation for each modality and score-level fusion. MZU, DM, and YM prepared the manuscript. MZU, DM, YM, and YY analyzed and discussed the evaluated accuracy. YM and YY designed the study. All authors read and approved the manuscript.

Corresponding author

Correspondence to Md. Zasim Uddin.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Cite this article

Uddin, M., Muramatsu, D., Kimura, T. et al. MultiQ: single sensor-based multi-quality multi-modal large-scale biometric score database and its performance evaluation. IPSJ T Comput Vis Appl 9, 18 (2017).
