Pedestrian segmentation based on a spatiotemporally consistent graph-cut with optimal transport
IPSJ Transactions on Computer Vision and Applications volume 11, Article number: 10 (2019)
Abstract
We present a method for segmenting pedestrians in a video in a spatiotemporally consistent way. Given a bounding box sequence for each pedestrian, obtained by a conventional pedestrian detector and tracker, we construct a spatiotemporal graph on the video and segment each pedestrian using the well-established graph-cut segmentation framework. More specifically, the energy function for the graph-cut segmentation comprises three terms: (1) a data term, (2) a spatial pairwise term, and (3) a temporal pairwise term. To maintain temporal consistency of the segmentation even under relatively large motions, we introduce a transportation minimization framework that provides a temporal correspondence. Moreover, we introduce the edge-sticky superpixel to maintain the spatial consistency of object boundaries. In experiments, we demonstrate that the proposed method improves segmentation accuracy indices, such as the average and weighted intersection over union, on the TUD datasets and the PETS2009 dataset at both the instance level and the semantic level.
Introduction
Silhouette extraction or human body segmentation is widely conducted as the first step in many high-level computer vision tasks of video surveillance systems, such as human tracking [1–4], human action recognition [5–8], and gait-based identification and recognition [9–11]. In human tracking, the extracted human silhouette is used for human full-body localization or human part localization [1–4]. In human action recognition, studies [5, 7, 8] have directly extracted features from a silhouette sequence; Chaaraoui et al. [6] used contour points of the human silhouette for action representation. For gait-based identification and verification, Collins et al. [9] used the silhouette for shape matching; Chen et al. [10] extracted features from the spatiotemporal silhouette for gait recognition, while Liu et al. [11] proposed the average silhouette as a feature for recognition.
Pedestrian silhouette extraction has long been studied. This research mainly falls into three categories: supervised methods, unsupervised methods, and semisupervised methods.
Supervised methods [12, 13] have performed well in recent years. A typical approach of supervised pedestrian silhouette extraction requires a manually annotated mask of the target in the first frame and propagates the mask frame by frame. An automatic surveillance system, however, cannot adopt manual annotation.
Unsupervised methods, including methods based on background subtraction (e.g., [14, 15]) and motion segmentation (e.g., [16–19]), are the most popular approaches because they do not require manual annotation. Methods based on background subtraction model the background using statistical models (e.g., a Gaussian mixture model) and extract the silhouettes of moving targets as the foreground. However, methods based on background subtraction only classify the moving target and background and do not realize instance-level silhouette extraction. Multilabel motion segmentation assigns human labels to sparse points or pixels according to motion information (e.g., optical flow), allowing targets with different motion patterns to be discriminated. However, because of the lack of object detection information, motion segmentation still cannot discriminate pedestrians with the same motion pattern (e.g., pedestrians walking in the same direction side by side) and may sometimes assign different labels to human parts with different motion patterns. Motion segmentation therefore suffers from undersegmentation and oversegmentation.
Semisupervised methods, which require only a bounding box trajectory rather than a manually annotated silhouette in the first frame, are more suitable for pedestrian silhouette extraction by an automatic surveillance system, because the trajectory of the bounding box can be automatically extracted using recently advanced approaches of object detection [20–22] and multiple-object tracking [23–25]. To the best of our knowledge, semisupervised methods use optical flow to maintain temporal consistency (e.g., [26]). Because optical flow sometimes fails in handling large displacement, optical-flow-based semisupervised approaches often suffer segmentation errors for human parts having large displacement (e.g., a pedestrian’s legs and arms). Moreover, a conditional random field (CRF) framework that uses a color-based Gaussian mixture model (GMM) for the background data term and simple linear iterative clustering (SLIC) superpixels [27] as nodes in the CRF has been adopted [26]. However, color information is not enough for modeling a nonhuman region (e.g., when a pedestrian and the background have similar colors), and the SLIC superpixel sometimes cannot preserve the object boundary well, which is vital for construction of the spatial pairwise term.
We therefore propose a semisupervised method that not only handles large displacement but also better preserves the pedestrian’s boundary. Given the pedestrian bounding box tracklets, we construct a conditional random field for silhouette extraction that involves a data term, a spatial pairwise term, and a temporal pairwise term. The contributions of this paper are as follows.
Optimal transport (OT)-based temporal consistency. In contrast to most related work, we adopt OT to maintain temporal consistency. A main drawback of optical flow is its limited capacity to handle large displacement. Although there are methods that improve the handling of large displacement (e.g., the pyramid strategy [28]), the motion of leg and arm parts still cannot be described correctly. Compared with conventional optical flows, the proposed method successfully handles large displacement between two frames thanks to the global optimality of the OT framework. As far as we know, the OT framework is usually used to measure the difference between two discrete distributions (e.g., a dissimilarity measure between two color histograms), which is also known as the earth mover’s distance. The proposed method does not use the final outcome of the OT framework (i.e., a distance) but the "process" of the OT framework (i.e., the flow (or correspondence) between two frames), which is the primary novelty of the proposed method.
Combination of the edge-sticky superpixel (ESS) and OT. The time complexity of the OT increases as the dimension of the discrete distributions (e.g., the number of bins of histograms) increases, and direct application of the OT to a pixel-wise image representation is computationally prohibitive. We therefore need to appropriately transform the input image into a discrete distribution with a relatively low dimension. Superpixel segmentation is one effective way to represent an image as a discrete distribution while keeping information, that is, compressing redundancy. More specifically, we regard an input image as a histogram, where the number of superpixels is the number of bins, the gravity center of a superpixel is the representative value of a bin, and the number of pixels (area) of a superpixel is the frequency (or vote) for a bin. Moreover, superpixel segmentation needs to preserve object boundaries well for our final goal, that is, pedestrian silhouette extraction. State-of-the-art superpixel segmentation methods (e.g., the SLIC superpixel [27] and the superpixels extracted via energy-driven sampling (SEEDS) superpixel [29]) provide a balance between appearance and shape regularity, and usually perform well in computer vision tasks. However, this balance between appearance and shape regularity does not always guarantee that the object boundary is well preserved. Our ultimate target is to extract pedestrians’ silhouettes, and we thus need to adopt a superpixel segmentation method that better preserves object boundaries. We therefore adopt the ESS, which introduces edge detection information explicitly into the process of superpixel generation. As a result, the object boundary can be preserved well while balancing the appearance and shape regularity.
Performance improvement on segmentation benchmarks. We demonstrate that the proposed method improves the performance of pedestrian silhouette extraction at both the instance level and the semantic level on public datasets compared with state-of-the-art methods.
Related work
The silhouette extraction or human segmentation of multiple pedestrians has been addressed in the literature [12, 13, 16, 26, 30–32]. We categorize typical approaches as follows:
∙ Supervised methods. Supervised methods perform well in video segmentation. The most popular framework [12, 13] is to manually annotate the target’s mask in the first frame and propagate the target mask to other frames. In [13], a two-branch approach was proposed whereby the features from ResNet-101 [33] and FlowNet [34] were combined for joint object segmentation and optical flow estimation. In [12], a method of frame-by-frame object segmentation was implemented by learning the appearance of the annotated object. However, because mask annotation imposes a manual burden, it is difficult to apply supervised methods to pedestrian silhouette extraction in an automatic surveillance system.
∙ Unsupervised methods. Unsupervised methods require no manual annotation and hence can be applied directly to an automatic surveillance system. Most unsupervised methods are based on motion information. The temporal superpixel [35] incorporates optical flow into a superpixel segmentation framework to realize a temporally consistent superpixel. Ochs et al. [16] adopted a two-step approach: generate sparse segments by clustering long-term trajectories and then obtain dense segments according to color. However, the temporal superpixel is a superpixel segmentation and thus requires a manual annotator to specify the pedestrian’s superpixels, which is again not possible for an automatic surveillance system. Ochs’s approach [16] is also prone to undersegmentation because multiple pedestrians walking in the same direction are likely to be merged into a single segment.
∙ Semisupervised methods. Compared with supervised and unsupervised methods, semisupervised methods that only require a bounding box annotation are more suitable for silhouette extraction by a real-world surveillance system. Milan et al. [26] proposed a joint tracking and segmentation method that first applies superpixel segmentation and multiple-pedestrian tracking. A CRF is then constructed and all superpixels are assigned the labels of pedestrian trajectories. Because optical flow is used in the construction of the CRF, Milan’s approach sometimes fails for pedestrians’ legs, for which there is large spatial displacement.
∙ Pedestrian segmentation methods for a single frame. In recent years, great strides have been made in convolutional neural network (CNN)-based image semantic segmentation and instance segmentation. In [31], a multipath refinement network was presented in which CNN features with multiple resolutions are fused so that semantic features can be refined using lower-level features. In [32], an object detection network [20] is combined with a fully convolutional network [36] so that object detection and instance-level segmentation can be achieved jointly. Single-frame segmentation methods can therefore be easily extended to pedestrian silhouette extraction in video using bounding box trajectories.
Proposed method
Problem setting
The present study presents a method of extracting silhouettes of multiple pedestrians from a video. We assume that the cameras are static and the bounding box trajectories are given by well-established detectors [20] and trackers [23].
Framework
We adopt a two-step framework that consists of superpixel segmentation and superpixel-wise labeling. The whole framework is shown in Fig. 1.
Superpixel segmentation. Given an input image sequence, superpixel segmentation is first applied frame by frame to reduce the computational cost. We adopt the ESS, which better preserves object boundaries.
Superpixel-wise labeling. Given the superpixel segmentation result and pedestrian trajectories (i.e., a bounding box sequence for each pedestrian), each superpixel is assigned a trajectory label (i.e., a pedestrian label) in this step, resulting in instance-level segmentation as shown in Fig. 1f.
The label assignment problem has been well studied for decades and recent progress expanded its application area to many computer vision tasks. As an example, Wu [37] proposed an adaptive label assignment method to handle the “one example human reidentification” problem where there is only one example available for each human identity, that is, the labeled data. The adaptive label assignment method can both select a set of candidates from the unlabeled data and assign labels of the candidates using a nearest neighbors (NN) classifier in the feature space extracted by the CNN model.
However, in the present work, we cannot generate a set of "labeled data" as in [37] owing to the different problem settings. Furthermore, spatiotemporal consistency is strongly required in the present work, and the features that maintain spatiotemporal consistency (e.g., edge-based features) can only be extracted in a pairwise manner rather than independently for each superpixel. As a result, the approach in [37] cannot be applied directly in the superpixel-wise labeling step of the present work.
To better handle the features extracted in a pairwise manner, we adopt the well-established CRF for superpixel-wise labeling. The label assignment problem is then formulated as a CRF problem and solved using the graph-cut with α-expansion algorithm.
Details are discussed in the following subsections.
ESS
The superpixel is a popular technology used to reduce the redundancy of an image and is employed in many computer vision applications. We use the superpixel because not only does it reduce the computational complexity but also it preserves object boundaries.
State-of-the-art approaches (e.g., the SEEDS superpixel [29] and SLIC superpixel [27] approaches) balance the spatial and appearance consistency. However, such balance sometimes affects the capacity to preserve object boundaries. It is therefore necessary to involve edge information when there is a strong need to preserve the object boundary. In this research, we adopt the ESS, which is an extension of Piotr Dollár’s work [38]. Because there is no corresponding publication^{Footnote 1}, we provide a brief description of the ESS. We describe the details of the ESS along with Fig. 2 in the following paragraphs.
We denote a set of pixels in frame t by \(\mathcal {P}^{t}=\{p_{i} \vert i \in \mathcal {L}_{P}^{t}\}\), where \(\mathcal {L}_{P}^{t}\) is a set of the indices of pixels in frame t (i.e., the number of elements of \(\mathcal {L}_{P}^{t}\) is the image size), t∈{1,2,...,T}, where T is the total number of frames and p_{i} is the ith pixel. Moreover, a set of superpixel indices in frame t is denoted \(\mathcal {L}_{\text {SP}}^{t}\). The superpixel segmentation in frame t can then be formulated as
\[X_{\text{SP}}^{t}: \mathcal{L}_{P}^{t} \rightarrow \mathcal{L}_{\text{SP}}^{t}, \tag{1}\]
where each pixel is assigned with the label of a superpixel (i.e., the index of a superpixel).
We first initialize each pixel as a superpixel; i.e., \(X_{\text {SP}}^{t}(i) = i ; \forall i \in \mathcal {L}_{P}^{t}\). Then, for each pixel (e.g., the ith pixel), we calculate the cost c(i,l) of assigning a neighboring superpixel’s label l to the ith pixel considering the spatial proximity, appearance similarity, edge consistency, and superpixel size as
where α, β, and γ are hyperparameters. The location and appearance vector for the ith pixel are denoted v_{loc}(i) and v_{app}(i), while the mean location and appearance vector for the lth superpixel are denoted μ_{loc}(l) and μ_{app}(l). Moreover, c_{edge} is the edge cost and A_{l} is the size of the lth superpixel.
The first and second terms of Eq. (2) maintain the spatial consistency of the superpixel, while the third term controls the size of the superpixel.
The last term helps to preserve the object boundary by involving the edge probability. The edge probability is calculated using structured edge detection (SED) [38]. SED is briefly introduced together with Fig. 3 below.
SED first separates an input image into a set of image patches. A pretrained random forest is then applied to the set of image patches to obtain a set of binary edge masks as shown in Fig. 3b. Finally, the set of edge masks is aggregated to generate the edge probability (i.e., the edge detection result) as shown in Fig. 3c. We refer the reader to [38] for more details.
The edge probability of the ith pixel in frame t is denoted \(p_{\text {edge}}^{t}(i)\) and the edge cost function c_{edge}(i,l) is then defined as
where the set of four-connected neighbors of the ith pixel is denoted n_{4}(i) and the set of corresponding superpixel labels is \(l_{4}(i) = \left \{X_{\text {SP}}^{t}(j)\vert j \in n_{4}(i)\right \}\). Details of the edge cost function are described along with Fig. 4.
Figure 4 shows that the ith pixel’s four-connected neighbors are j_{1} (whose superpixel label is l_{1}) and j_{2}, j_{3}, and j_{4} (whose superpixel label is l_{2}). The edge probability is represented in pseudocolor, where the edge probability for a red pixel is 0.9 while that for a blue pixel is 0.1; i.e., there is an edge on the left side of the ith pixel. According to Eq. (3), c_{edge}(i,l_{1}) = − 0.1 and c_{edge}(i,l_{2}) = − 0.9, so it is more difficult to assign the label l_{1} than the label l_{2} to the ith pixel. As a result, the edge cost function helps preserve the object boundary.
We repeat this process until \(X_{\text {SP}}^{t}\) stops changing. An example of an ESS result is shown in Fig. 5. We see that the object boundaries (e.g., boundaries between a pedestrian and background) are well preserved.
After obtaining the superpixels for each frame independently, the set of all superpixel labels is defined as \(\mathcal {L}_{\text {SP}} = \mathop {\bigcup }_{t=1}^{T}{\mathcal {L}^{t}_{\text {SP}}}\). Moreover, we denote the set of all pixels as \(\mathcal {L}_{\mathrm {P}} = \mathop {\bigcup }_{t=1}^{T}{\mathcal {L}^{t}_{\mathrm {P}}}\). For simplicity, the superpixel segmentation for all frames is defined as
\[X_{\text{SP}}: \mathcal{L}_{\mathrm{P}} \rightarrow \mathcal{L}_{\text{SP}}. \tag{4}\]
Superpixel-wise labeling
Given superpixel segmentation results and a set of bounding box sequences for n_{TR} pedestrians \(\text {TR} = \left \{tr_{i} \vert i \in \mathcal {L}_{\text {TR}}\right \}\), where tr_{i} is the bounding box trajectory for the ith pedestrian, we consider mapping each of the superpixel labels \(\mathcal {L}_{\text {SP}}\) to one of the pedestrian labels \(\mathcal {L}_{\text {TR}} = \left \{l_{1}^{\text {TR}},..., l_{n_{\text {TR}}}^{\text {TR}}\right \}\), where \(l_{m}^{\text {TR}}\) is the mth pedestrian’s label, or to a background label \(l_{\text {BG}}^{\text {TR}}\). For simplicity, we denote all labels by \(\hat {\mathcal {L}}_{\text {TR}} = \mathcal {L}_{\text {TR}} \cup \left \{l_{\text {BG}}^{\text {TR}}\right \}\). The problem of mapping from superpixels’ labels \(\mathcal {L}_{\text {SP}}\) to \(\hat {\mathcal {L}}_{\text {TR}}\) (i.e., the superpixel-wise labeling problem) can be formulated as
\[X_{\text{CRF}}: \mathcal{L}_{\text{SP}} \rightarrow \hat{\mathcal{L}}_{\text{TR}}. \tag{5}\]
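We then formulate the problem of optimizing X_{CRF} as a multilabel CRF problem:
\[X_{\text{CRF}}^{*} = \mathop{\arg\min}_{X_{\text{CRF}}} E(X_{\text{CRF}}), \tag{6}\]
where the energy function E(X_{CRF}) is defined as
\[E(X_{\text{CRF}}) = \sum_{p \in \mathcal{L}_{\text{SP}}} E_{\text{Data}}\left(p, X_{\text{CRF}}(p)\right) + \omega_{S} \sum_{(p,q) \in \mathcal{N}_{\mathrm{S}}} E_{S}(p,q) + \omega_{T} \sum_{(p,q) \in \mathcal{N}_{\mathrm{T}}} E_{T}(p,q). \tag{7}\]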
Here, the first term is the data term while the second and third terms are respectively spatial and temporal pairwise terms. ω_{S} and ω_{T} are respectively the weights of spatial and temporal pairwise terms. The definitions of \( \mathcal {N}_{\mathrm {S}}\), \(\mathcal {N}_{\mathrm {T}}\), E_{Data}, E_{S}, and E_{T} are explained in the following sections.
The multilabel CRF problem can then be solved using the graph-cut with α-expansion algorithm [39], which is widely used for CRF inference. The algorithm iterates over each possible label (i.e., the label α in a given CRF), and in each iteration, it separates the α and non-α components with the graph-cut. Because the energy function of the CRF in this work contains spatial and temporal pairwise terms, the graph-cut with α-expansion algorithm is applied in a spatiotemporally consistent way.
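To make the structure of the energy concrete, the following minimal Python sketch evaluates the energy of one candidate labeling. It is an illustration under stated assumptions, not the authors' implementation: the function name `crf_energy`, the dictionary-based inputs, and the Potts-style pairwise form (an edge weight is paid only when the two neighboring superpixels receive different labels) are all hypothetical. Only the default weights ω_{S}=8 and ω_{T}=12 come from the implementation details reported later in the text.

```python
def crf_energy(labels, data_cost, spatial_edges, temporal_edges, w_s=8.0, w_t=12.0):
    """Evaluate a CRF energy for one candidate labeling (illustrative sketch).

    labels:         dict superpixel id -> label
    data_cost:      dict (superpixel id, label) -> data-term cost
    spatial_edges:  dict (p, q) -> pairwise weight between spatial neighbours
    temporal_edges: dict (p, q) -> pairwise weight between temporal neighbours
    """
    # Data term: one cost per superpixel for the label it currently holds.
    energy = sum(data_cost[(p, lab)] for p, lab in labels.items())

    # ASSUMPTION: Potts-style pairwise terms, i.e. the edge weight is paid
    # only when the two connected superpixels carry different labels.
    for (p, q), weight in spatial_edges.items():
        if labels[p] != labels[q]:
            energy += w_s * weight
    for (p, q), weight in temporal_edges.items():
        if labels[p] != labels[q]:
            energy += w_t * weight
    return energy


# Toy example: three superpixels, one spatial edge and one temporal edge.
labels = {0: "bg", 1: "ped1", 2: "ped1"}
data_cost = {(0, "bg"): 1.0, (1, "ped1"): 2.0, (2, "ped1"): 0.5, (2, "bg"): 1.5}
spatial_edges = {(0, 1): 0.25}
temporal_edges = {(1, 2): 1.0}
energy = crf_energy(labels, data_cost, spatial_edges, temporal_edges)
```

An α-expansion solver would call such an evaluation implicitly: each expansion move is accepted only if it lowers this energy.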
Data term
The data term defined as
contains two components, namely a pedestrian term \(E_{\text {Data}}\left (p,X_{\text {CRF}}(p)\not =l_{\text {BG}}^{\text {TR}}\right)\) and background term \(E_{\text {Data}}\left (p, X_{\text {CRF}}(p)=l_{\text {BG}}^{\text {TR}}\right)\) for an arbitrary superpixel p.
We use RefineNet [31], a CNN-based semantic segmentation method, for the background term. Given an input image, RefineNet predicts the pixel-wise probability distribution over a set of object classes. In this work, we adopt a Residual Net (ResNet)-based [33] model pretrained on the Cityscapes dataset [40], whose output covers 20 object classes. We extract the probability of the label "person" in the input image, denoted p_{Hm}(i) for the ith pixel. The pixel-wise human score of the ith pixel is then defined as
The superpixel-wise human score of the pth superpixel, denoted h_{Hm}(p), is defined as the mean pixel-wise human score of the pixels inside the pth superpixel. An example of the pixel-wise and superpixel-wise human score map is shown in Fig. 6. It is clear that the superpixel-wise human score map can be directly used as the background data term:
\[E_{\text{Data}}\left(p, X_{\text{CRF}}(p)=l_{\text{BG}}^{\text{TR}}\right) = h_{\text{Hm}}(p). \tag{10}\]
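The superpixel-wise averaging described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the superpixel labels and the pixel-wise human scores are given as flat sequences with one entry per pixel; the function name `superpixel_mean_score` is hypothetical, and the pixel-wise score itself is taken as an input rather than computed.

```python
from collections import defaultdict


def superpixel_mean_score(sp_labels, pixel_scores):
    """Average a pixel-wise score over each superpixel.

    sp_labels:    sequence with the superpixel label of every pixel
    pixel_scores: sequence with the pixel-wise human score of every pixel
    Returns a dict: superpixel label -> mean score of its pixels.
    """
    if len(sp_labels) != len(pixel_scores):
        raise ValueError("labels and scores must have one entry per pixel")
    total = defaultdict(float)
    count = defaultdict(int)
    for label, score in zip(sp_labels, pixel_scores):
        total[label] += score
        count[label] += 1
    return {label: total[label] / count[label] for label in total}


# Five pixels split into two superpixels (labels 0 and 1).
scores = superpixel_mean_score([0, 0, 1, 1, 1], [0.25, 0.75, 1.0, 0.5, 0.0])
```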
We subsequently sample superpixels and train a GMM for each pedestrian to define the pedestrian term. We denote the set of pixels belonging to the kth superpixel as u_{k}={i | X_{SP}(i)=k} and the set of pixels inside the bounding box trajectory tr_{i} of the ith pedestrian as \(\mathcal {U}_{i}\). If the kth superpixel overlaps with the bounding box sequence tr_{i} of the ith pedestrian (i.e., \(u_{k} \cap \mathcal {U}_{i}\not =\emptyset \)), it is sampled for the GMM training of the ith pedestrian. A superpixel may sometimes overlap with multiple trajectories, and we thus adopt a winner-takes-all strategy by which the pedestrian closest to the camera (i.e., the pedestrian whose bounding box has the lowest bottom edge) takes the superpixel.
After the superpixel sampling, we train the GMM for each trajectory according to the mean colors of the sampled superpixels. θ_{i} denotes the GMM parameters of the ith pedestrian. Moreover, we assume that superpixels outside the bounding box tr_{i} are unlikely to belong to the ith pedestrian and should not be assigned the pedestrian label \(l_{i}^{\text {TR}}\); the pedestrian term for those superpixels is therefore set to a sufficiently large constant. Finally, the pedestrian term is defined as
where C is a sufficiently large constant and p_{GMM}(μ_{app}(p);θ_{i}) is the probability density of the mean appearance μ_{app}(p) of the pth superpixel for the ith pedestrian. An example of the pedestrian term is shown in Fig. 7.
Spatial pairwise term
The spatial pairwise term
is used to maintain the spatial consistency of X_{CRF}. A set of spatial neighbors \(\mathcal {N}_{\mathrm {S}}\) is first defined as
where conn_{S}(p,q) is the spatial connectivity function and is defined as
We then use the color and edge probability to formulate the spatial pairwise energy function E_{S}.
A color-based pairwise energy function is defined as
and following previous work [41], a parameter λ is subsequently defined as
to adapt to high and low color contrast. An example of the color-based pairwise energy is shown in Fig. 8.
The color-based pairwise energy function may sometimes fail to maintain spatial consistency when the colors of different pedestrians, or of a pedestrian and the background, are similar, as shown in the white bounding box in Fig. 8. We therefore further include the edge probability in the spatial pairwise energy function.
We denote by p_{edge}(j) the edge probability at the jth pixel. An edge-based pairwise energy function is subsequently defined as
where \(p, q\in \mathcal {L}_{\text {SP}}\) and <.>_{p, q} denotes the expectation over the pixels on the boundary between two spatially neighboring superpixels p and q. An example of the edge-based pairwise energy function is shown in Fig. 9. The boundary of the pedestrian inside the white bounding box in Fig. 9 is better preserved than that in Fig. 8.
Finally, the spatial pairwise energy is defined as the weighted sum of the color-based and edge-based pairwise energies:
where ω_{e} is a hyperparameter that controls the weight of the edge-based spatial pairwise energy.
OT-based temporal pairwise term
A temporal pairwise term defined as
is introduced for the purpose of maintaining the temporal consistency of superpixelwise labeling. \(\mathcal {N}_{\mathrm {T}}\) is a set of temporal neighbors defined as
where conn_{T} is the temporal connectivity function.
Unlike spatial connectivity, which can be easily defined according to the pixel lattice structure, temporal connectivity must involve object motion information. To the best of our knowledge, optical flow is the most popular motion information used to define temporal connectivity. However, optical flow usually fails to handle the large displacement that often occurs for pedestrians’ legs and arms. We therefore introduce OT-based temporal connectivity for better motion estimation.
The OT distance, also known as the earth mover’s distance, is a useful distance with which to compare two probability distributions. The OT problem is described as follows.
Given two probability distributions r=[r_{1},...,r_{m}]^{T} and c=[c_{1},...,c_{n}]^{T} and a cost matrix \(M \in \mathbb {R}_{+}^{m\times n}\), the OT problem is to find a transportation matrix P^{∗} such that
\[P^{*} = \mathop{\arg\min}_{P \in \mathcal{P}(r, c)} \langle P, M \rangle_{F},\]
where 〈·,·〉_{F} denotes the Frobenius dot product and \(\mathcal {P}(r, c)=\left \{ P\in \mathbb {R}_{+}^{m\times n} \vert P\boldsymbol {1}_{n}=\boldsymbol {r}, P^{T}\boldsymbol {1}_{m}=\boldsymbol {c}\right \}\) is the set of feasible transportation matrices, with 1_{m} and 1_{n} being m- and n-dimensional vectors of ones.
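As a hedged illustration of the OT problem stated above, the sketch below computes an approximate transportation matrix for a tiny example with Sinkhorn iterations. Note the assumptions: Sinkhorn solves an entropy-regularized variant of the problem, so the result only approximates P^{∗}; the text does not state which solver the authors use; and the function name `sinkhorn_plan` and the parameters `reg` and `n_iter` are hypothetical. Both input distributions are assumed to be strictly positive and to sum to one.

```python
import math


def sinkhorn_plan(r, c, M, reg=0.5, n_iter=200):
    """Approximate an OT plan between histograms r and c for cost matrix M.

    ASSUMPTION: entropy-regularized OT (Sinkhorn), used here only because it
    is short and dependency-free; it is not the exact linear-programming
    solution and not necessarily the solver used in the paper.
    """
    m, n = len(r), len(c)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-M[i][j] / reg) for j in range(n)] for i in range(m)]
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(n_iter):
        # Rescale rows so that the row sums of the plan match r.
        u = [r[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(m)]
        # Rescale columns so that the column sums of the plan match c.
        v = [c[j] / sum(K[i][j] * u[i] for i in range(m)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(m)]


# Two bins per frame; moving mass between different bins costs 1.
r = [0.5, 0.5]
c = [0.25, 0.75]
M = [[0.0, 1.0], [1.0, 0.0]]
P = sinkhorn_plan(r, c, M)
```

The entries of `P` play the role of the flow between bins: `P[i][j]` is the mass moved from bin i of the first distribution to bin j of the second, which is the "process" of OT that the method exploits rather than the resulting distance.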
In this study, we formulate motion estimation as an OT problem. We denote superpixel labels in frame t by \(\mathcal {L}_{\text {SP}}^{t} = \left \{l_{1}^{t},..., l_{\vert \mathcal {L}_{\text {SP}}^{t}\vert }^{t}\right \}\) and then define a superpixel size vector in frame t as \(\hat {\boldsymbol {A}}^{t} = \left [A^{t}_{1},..., A^{t}_{\vert \mathcal {L}_{\text {SP}}^{t}\vert }\right ]\), where \(A^{t}_{i}\) is the size of the \(l_{i}^{t}\)th superpixel. The normalized size vector is then defined as \(\boldsymbol {A}^{t}=\hat {\boldsymbol {A}}^{t}/\vert \mathcal {L}_{P}^{t}\vert \). Because ∥A^{t}∥_{1}=1 and \(\forall i\in \left \{1,..., \vert \mathcal {L}_{\text {SP}}^{t}\vert \right \}, \boldsymbol {A}^{t}(i)\geq 0\), A^{t} is a probability distribution. We therefore treat the normalized size vectors of two consecutive frames, A^{t} and A^{t+1}, as the input of an OT problem.
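The conversion from a superpixel label map to the histogram-like input of the OT problem can be sketched as follows. This is a minimal illustration, assuming the label map is a list of rows of integer superpixel labels; the function name `superpixel_histogram` is hypothetical. It returns, for every superpixel, its gravity center (the representative value of the bin) and its normalized area (the entry of the normalized size vector).

```python
def superpixel_histogram(label_map):
    """Summarize a superpixel label map as centroids and normalized areas.

    label_map: list of rows, each row a list of superpixel labels.
    Returns (labels, centroids, weights) where centroids[k] is the (x, y)
    gravity center and weights[k] the normalized area of labels[k].
    """
    area, sum_x, sum_y = {}, {}, {}
    n_pixels = 0
    for y, row in enumerate(label_map):
        for x, label in enumerate(row):
            area[label] = area.get(label, 0) + 1
            sum_x[label] = sum_x.get(label, 0) + x
            sum_y[label] = sum_y.get(label, 0) + y
            n_pixels += 1
    labels = sorted(area)
    centroids = [(sum_x[k] / area[k], sum_y[k] / area[k]) for k in labels]
    # Dividing by the image size makes the weights sum to one.
    weights = [area[k] / n_pixels for k in labels]
    return labels, centroids, weights


# A 2 x 3 image split into two superpixels.
labels, centroids, weights = superpixel_histogram([[0, 0, 1],
                                                   [0, 1, 1]])
```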
Moreover, the cost matrix between frames t and t+1 is defined as
where m(i,j) is defined as
The first term of m(i,j) encourages transportation between spatially nearer superpixels, while the second term encourages transportation between superpixels that appear similar. Furthermore, we include the third term to encourage transportation between superpixels in the pedestrian region.
The OT between frames t and t+1 is defined as
Subsequently, the temporal connectivity is defined as
where th_{temp} is the threshold of temporal connectivity, \(l_{i}^{t} \in \mathcal {L}_{\text {SP}}^{t}\) and \(l_{j}^{t+1} \in \mathcal {L}_{\text {SP}}^{t+1}\). An example of OT-based temporal connectivity is shown in Fig. 10. In Fig. 10a, we manually select the superpixels belonging to a pedestrian in frame t. All the temporally connected superpixels are shown in Fig. 10b. The temporal consistency is well preserved by the OT-based temporal connectivity.
Finally, the temporal pairwise energy is defined as
where the definition of λ is the same as in Eq. (16).
Experiments
Experimental setting
Datasets
We test our proposed method on four publicly available image sequences: TUD-Stadtmitte, TUD-Campus, TUD-Crossing, and PETS2009 S2L1. Each sequence contains a long-term occlusion that makes segmentation highly challenging. Furthermore, TUD-Stadtmitte and TUD-Campus present the challenges of low contrast and similar clothing.
We use manually annotated pedestrian bounding box trajectories for each dataset when we test the proposed method as well as the other baseline methods. We also annotate ground-truth pedestrian silhouettes (instance segmentation) for the evaluation of pedestrian silhouette extraction.
Evaluation metrics
For the instance-level evaluation, we adopt the mean and weighted intersections over union (M.IoU and W.IoU) to evaluate experimental results. M.IoU is the instance-wise IoU for each ground-truth instance averaged over all frames, while W.IoU further weights each IoU by the size of the segment.
For simplicity, we denote the mapping from the pixel to the pedestrian’s label as the composition of superpixel segmentation and superpixel-wise labeling:
\[X = X_{\text{CRF}} \circ X_{\text{SP}}: \mathcal{L}_{\mathrm{P}} \rightarrow \hat{\mathcal{L}}_{\text{TR}}.\]
The set of pixels assigned the pedestrian’s label i (i.e., the mask-type result of pedestrian i) is then denoted y_{i}={p | X(p)=i}. Correspondingly, the ground-truth set of the pixels of pedestrian i is \(y_{i}^{*}\). M.IoU and W.IoU are defined as
where n_{TR} is the number of pedestrian trajectories.
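The two instance-level metrics can be illustrated with the short Python sketch below for a single frame. The assumptions are stated explicitly: masks are represented as sets of pixel coordinates, W.IoU weights each instance IoU by the size of its ground-truth segment, and the averaging over frames is omitted. The function names `iou` and `mean_and_weighted_iou` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two pixel sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_and_weighted_iou(pred, gt):
    """Single-frame M.IoU and W.IoU (illustrative sketch).

    pred, gt: dict pedestrian label -> set of (x, y) pixel coordinates.
    ASSUMPTION: the weight of each instance is its ground-truth size.
    """
    ious = {i: iou(pred.get(i, set()), gt[i]) for i in gt}
    m_iou = sum(ious.values()) / len(ious)
    total_size = sum(len(gt[i]) for i in gt)
    w_iou = sum(len(gt[i]) * ious[i] for i in gt) / total_size
    return m_iou, w_iou


gt = {1: {(0, 0), (0, 1), (1, 0), (1, 1)}, 2: {(5, 5), (5, 6)}}
pred = {1: {(0, 0), (0, 1)}, 2: {(5, 5), (5, 6)}}
m_iou, w_iou = mean_and_weighted_iou(pred, gt)
```

In this toy case the larger pedestrian is segmented worse than the smaller one, so the weighted score is lower than the plain mean.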
For the semantic-level evaluation, we use the pedestrian IoU (P.IoU) to illustrate that the proposed method improves the semantic-level segmentation performance as
Furthermore, we compute IoUs along the boundary regions to verify that the object boundaries are well preserved. For this purpose, we define the boundary region of a pedestrian silhouette as the difference between a dilated region and an eroded region (see Fig. 12 for examples) and then define the IoU for the boundary region. More specifically, for instance-level evaluation, given the ith pedestrian region y_{i}, we compute the dilated region \(y_{i}^{Di}\) and the eroded region \(y_{i}^{Er}\) and then compute the boundary region \(y_{i}^{B}\) as \(y_{i}^{B}=\ y_{i}^{Di}\setminus y_{i}^{Er}\). We similarly define the boundary region \(y_{i}^{B\ast }\) of the ground-truth region \(y_{i}^{\ast }\) for the ith pedestrian. We then define the mean IoU along the boundary regions (denoted M.IoU_{B}) as
where n_{TR} is the number of pedestrian trajectories. For semantic-level evaluation, we similarly define the pedestrian IoU for the boundary region (denoted P.IoU_{B}) as
Finally, we adopt the computational time as an evaluation metric with which to quantitatively analyze the efficiency of the proposed method.
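The boundary region used for M.IoU_{B} and P.IoU_{B} (a dilated region minus an eroded region) can be sketched as follows. This is an illustration under assumptions that the text does not specify: a four-connected structuring element with a radius of one pixel, regions stored as sets of pixel coordinates, and no clipping at the image border. All function names are hypothetical.

```python
# ASSUMPTION: four-connected structuring element of radius one pixel.
N4 = ((1, 0), (-1, 0), (0, 1), (0, -1))


def dilate(region):
    """Add every four-connected neighbour of the region's pixels."""
    return region | {(x + dx, y + dy) for (x, y) in region for dx, dy in N4}


def erode(region):
    """Keep only pixels whose four neighbours all lie inside the region."""
    return {(x, y) for (x, y) in region
            if all((x + dx, y + dy) in region for dx, dy in N4)}


def boundary_region(region):
    """Boundary band: dilated region minus eroded region."""
    return dilate(region) - erode(region)


# A 3 x 3 square: only the centre pixel survives erosion.
square = {(x, y) for x in range(3) for y in range(3)}
band = boundary_region(square)
```

The boundary IoU is then the ordinary IoU computed between the band of the predicted silhouette and the band of the ground-truth silhouette.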
Baseline methods
For instance-level segmentation, we adopt the methods of Milan et al. [26], He et al. [32], and Ochs et al. [16] as baseline methods. For fair comparison, we modify the baseline methods as follows.
Milan’s method generates an overcomplete set of trajectory hypotheses and then assigns superpixels to trajectories. We substitute the trajectory hypothesis with the trajectory ground truth and eliminate the update of the trajectory.
He’s method and Ochs’s method have different pedestrian labeling schemes and thus need to be relabeled. We use a greedy assignment method by which, from the largest ground-truth segment to the smallest, we assign label i of trajectory tr_{i} to the segment with the highest IoU with \(y_{i}^{*}\). Moreover, because He’s method generates multicategory instance-level segmentation, we apply the greedy assignment to both human segments and bag segments, because the ground truth of a pedestrian contains both human and bag regions.
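The greedy relabeling described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, segments are sets of pixel coordinates, and two details that the text leaves open are fixed by assumption, namely that each predicted segment is used at most once and that a trajectory with no overlapping segment stays unassigned.

```python
def iou(a, b):
    """Intersection over union of two pixel sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def greedy_relabel(gt, segments):
    """Greedily map trajectory labels to predicted segments.

    gt:       dict trajectory label -> ground-truth pixel set
    segments: dict segment id -> predicted pixel set
    Returns a dict trajectory label -> segment id (None if nothing overlaps).
    """
    assignment = {}
    free = set(segments)
    # Visit ground-truth segments from the largest to the smallest.
    for i in sorted(gt, key=lambda k: len(gt[k]), reverse=True):
        best, best_iou = None, 0.0
        for s in sorted(free):  # sorted only for deterministic tie-breaking
            value = iou(gt[i], segments[s])
            if value > best_iou:
                best, best_iou = s, value
        assignment[i] = best
        # ASSUMPTION: a predicted segment is consumed once it is assigned.
        if best is not None:
            free.discard(best)
    return assignment


gt = {"tr1": {(0, 0), (0, 1), (1, 0), (1, 1)}, "tr2": {(5, 5), (5, 6)}}
segments = {"s1": {(0, 0), (0, 1), (1, 0)}, "s2": {(5, 5), (5, 6)}, "s3": {(9, 9)}}
mapping = greedy_relabel(gt, segments)
```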
We adopt Lin’s method (i.e., RefineNet [31]) as a baseline method for the semantic-level segmentation. We use a model pretrained on the Cityscapes dataset [40], whose output contains 20 labels. We focus only on the quality of the pedestrian silhouette and thus convert the original RefineNet output into a binary mask that only contains the "human" label and the "nonhuman" label. An example of the binary mask is shown in the second column of Fig. 15.
Implementation details
The pedestrian bounding box trajectories used in the experiment are manual annotations. For the ESS, we set α=0.7 and β=0.7, and to keep the average size of superpixels the same, we set γ=545 for TUD-Stadtmitte, γ=560 for TUD-Campus, γ=475 for TUD-Crossing and γ=300 for PETS2009; i.e., there are approximately 2000 superpixels per frame for the TUD datasets and 2850 per frame for PETS2009.
Thresholds th_{Hm} and th_{temp} are set to 0.5. In the spatial pairwise term, ω_{e} is set to 300, while for the CRF, ω_{S} is set to 8 and ω_{T} is set to 12. Finally, to handle videos of arbitrary length, we use a batch process with a batch length of 20 frames.
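A minimal sketch of the batch split is given below. It assumes consecutive, non-overlapping batches; the text does not say whether adjacent batches overlap or how labels are carried across batch borders, and the function name is hypothetical.

```python
def split_into_batches(num_frames, batch_length=20):
    """Split frame indices 0..num_frames-1 into consecutive batches."""
    return [range(start, min(start + batch_length, num_frames))
            for start in range(0, num_frames, batch_length)]
```

Each batch would then be segmented as an independent spatiotemporal graph, which bounds the graph size regardless of the video length.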
Both instance-level and semantic-level evaluations are conducted on a personal computer with an Intel Core i7 CPU, 64 GB of memory and an NVIDIA GTX 1080 Ti GPU. The use of the GPU for each method is as follows.
For Ochs’s method and Milan’s method, GPUs are not used in the computation because no GPU version of the code was provided. For He’s method, the experiments are conducted using GPUs. For the proposed method, we use a GPU only for the RefineNet-based background term and not for the other parts.
Component comparison
Superpixel
To demonstrate the merits of the ESS, we run a component comparison experiment in which the SLIC superpixel [27] is used in a baseline method. We tune the number of SLIC superpixels to be the same as the number of ESSs. The experimental results presented in Table 1 show that the ESS outperforms the SLIC superpixel.
Temporal pairwise term
We run another component comparison experiment to demonstrate the merits of the OT-based temporal pairwise term compared with the optical-flow-based temporal pairwise term. We follow Liu’s work [42] for the optical-flow calculation. We then define an optical-flow-based connectivity function conn_{flow}(p,q) with which to substitute conn_{T}(p,q).
We denote the average motion vector of superpixel p in frame t as μ_{flow}(p), where \(p\in \mathcal {L}^{t}_{\text {SP}}\) and the integral rounding of the vector is [μ_{flow}(p)] with [·] being the integral rounding function. The set of pixel location vectors of the pth superpixel is \(\mathcal {V}_{p}=\left \{\boldsymbol {v}_{\text {loc}}(j) \vert X_{\text {SP}}(j) = p\right \}\), and the corresponding locations in frame t+1 obtained via [μ_{flow}(p)] are denoted \(\hat {\mathcal {V}_{p}}=\left \{ \boldsymbol {v}_{\text {loc}}(j) + \left [\boldsymbol {\mu }_{\text {flow}}(p)\right ] \vert X_{\text {SP}}(j) = p\right \}\). Moreover, denoting by \(q\in \mathcal {L}^{t+1}_{\text {SP}}\) a superpixel whose pixel location vector set is \(\mathcal {V}_{q}\), the optical-flow-based temporal connectivity function is then defined as
A set of optical-flow-based temporal neighbors is then defined as
Subsequently, the optical-flow-based temporal pairwise term is defined similarly to Eq. 19:
We then substitute the OT-based temporal pairwise term with the optical-flow-based term and run the component comparison experiment without changing other settings on the TUD-Campus dataset.
The experimental results are also given in Table 1. The OT-based temporal term performs better than the optical-flow-based temporal term.
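The construction of the optical-flow-based temporal neighbors can be sketched as follows. The shift of a superpixel by its rounded mean motion vector follows the description above, but the exact form of conn_{flow}(p,q) is not reproduced in this text. The sketch therefore assumes one plausible form, namely the fraction of shifted pixels of p that fall inside q, assumes that neighbors are the pairs whose connectivity exceeds a threshold of 0.5, and rounds ties upward; all three choices and all function names are assumptions.

```python
import math


def rounded_mean_flow(pixels, flow):
    """Mean motion vector of a superpixel, rounded to integer components.

    pixels: set of (x, y) locations; flow: dict (x, y) -> (dx, dy).
    Ties are rounded upward (an assumption).
    """
    n = len(pixels)
    dx = sum(flow[p][0] for p in pixels) / n
    dy = sum(flow[p][1] for p in pixels) / n
    return math.floor(dx + 0.5), math.floor(dy + 0.5)


def shifted_pixels(pixels, flow):
    """Locations of the superpixel in the next frame under its mean motion."""
    rx, ry = rounded_mean_flow(pixels, flow)
    return {(x + rx, y + ry) for x, y in pixels}


def conn_flow(pixels_p, pixels_q, flow):
    """Assumed connectivity: share of shifted pixels of p lying inside q."""
    shifted = shifted_pixels(pixels_p, flow)
    return len(shifted & pixels_q) / len(shifted)


def temporal_neighbors(superpixels_t, superpixels_t1, flow, threshold=0.5):
    """Pairs (p, q) across frames t and t+1 whose connectivity is high enough."""
    return {(p, q)
            for p, pixels_p in superpixels_t.items()
            for q, pixels_q in superpixels_t1.items()
            if conn_flow(pixels_p, pixels_q, flow) > threshold}
```

Under this construction every pixel of a superpixel moves by the same vector, so a superpixel that deforms or covers differently moving parts can lose its match; this is the kind of case in which a correspondence based on optimal transport may be more tolerant.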
Experimental results
Instance-level evaluation
The instance-level experimental results are presented in Table 2, while examples of mask-type and edge-type visualizations are shown in Fig. 11 and Fig. 12, respectively.
The proposed method outperforms Ochs’s and Milan’s methods for all metrics. On the TUD datasets, the proposed method outperforms He’s method in terms of W.IoU and M.IoU_{B} while underperforming He’s method in terms of M.IoU. Furthermore, on the PETS2009 dataset, the proposed method fails to outperform He’s method.
The performance of the proposed method compared with He’s method is explained below.
The ESSs are almost equal in size because the third term in Eq. 2 controls the size of the superpixels. A larger pedestrian is therefore represented by more superpixels, so a single mislabeled superpixel affects a smaller fraction of its silhouette; i.e., a larger pedestrian is more robust against superpixel labeling errors. As a result, the proposed method works better on large pedestrians than on small pedestrians, as shown in Fig. 13 and Fig. 14. Because the TUD datasets have a higher proportion of large pedestrians, the proposed method attains a higher W.IoU than He’s method, as W.IoU gives a higher weight to large pedestrians, but a lower M.IoU, as M.IoU normalizes for the size of the pedestrian. However, on the PETS2009 dataset, because most pedestrians are small, the proposed method fails to outperform He’s method in terms of both M.IoU and W.IoU. The equally sized ESSs are considered the main drawback of the proposed method.
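The opposite behavior of the two metrics can be reproduced with a small worked example. The numbers below are invented purely for illustration and are not experimental results, and W.IoU is assumed here to weight each pedestrian's IoU by the size of its ground-truth region, which may differ in detail from the definition used in the evaluation.

```python
def mean_iou(ious):
    """M.IoU-style score: every pedestrian counts equally."""
    return sum(ious) / len(ious)


def weighted_iou(ious, sizes):
    """W.IoU-style score: each IoU weighted by region size (an assumption)."""
    return sum(s * v for s, v in zip(sizes, ious)) / sum(sizes)


# Hypothetical scene: one large pedestrian (9000 px) and one small (1000 px).
sizes = [9000, 1000]
method_a = [0.9, 0.5]  # strong on the large pedestrian, weak on the small one
method_b = [0.8, 0.7]  # more balanced across sizes

print(mean_iou(method_a), weighted_iou(method_a, sizes))
print(mean_iou(method_b), weighted_iou(method_b, sizes))
```

In this toy case the first method has the higher weighted score and the lower mean score, which is the same pattern as that reported for the proposed method relative to He’s method on the TUD datasets.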
Another drawback of our proposed method is a lack of ability to handle occlusion. Figure 15 shows that the proposed method fails to segment the two pedestrians in frame t because of heavy occlusion. This relates to our adoption of a color-based GMM for pedestrian modeling, which may fail when the appearances of two pedestrians are similar.
Semantic-level evaluation
We also run a semantic-level experiment to illustrate that the proposed method improves the performance of semantic segmentation. Results are presented in Table 3. On all datasets, the proposed method has a much better P.IoU and P.IoU_{B} than the RefineNet baseline. This is because not only does the ESS preserve the pedestrian boundary better but also the OT-based temporal pairwise term eliminates the temporally inconsistent segments. Examples of semantic-level mask-type and edge-type results are shown in Fig. 16 and Fig. 17.
Sensitivity analysis
We analyze the sensitivity of the segmentation accuracy to the number of superpixels. We test the performance for an approximately exponentially increasing number of superpixels on the TUD-Campus dataset and present the results in Table 4. Although the performance increases with the number of superpixels, the computational time becomes unacceptable when the number is too large (for example, 5000 superpixels per frame). In conclusion, 2000 superpixels per frame is considered a good trade-off between segmentation quality and processing time.
Conclusion
We proposed a method of extracting multiple pedestrian silhouettes. The proposed method is formulated as a CRF inference problem that incorporates the ESS, semantic-segmentation-based human score, and OT-based temporal pairwise term. In addition, we tested the proposed method on public datasets and achieved competitive performance.
A detector of human parts [43] and multiple-detector fusion for the tracking of multiple objects [44] have recently been developed, and a future avenue of research will apply the human-part detector to occlusion reasoning.
Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.
Notes
Code for the ESS is released at https://github.com/pdollar/edges.
Abbreviations
CRF: Conditional random field
GMM: Gaussian mixture model
OT: Optimal transport
ESS: Edge-sticky superpixel
SLIC: Simple linear iterative clustering
SEEDS: Superpixels extracted via energy-driven sampling
M.IoU: Mean intersection over union
W.IoU: Weighted intersection over union
P.IoU: Pedestrian intersection over union
References
Plaenkers R, Fua P (2002) Model-based silhouette extraction for accurate people tracking. In: European Conference on Computer Vision, 325–339. Springer, Berlin.
Chen X, He Z, Anderson D, Keller J, Skubic M (2006) Adaptive silhouette extraction and human tracking in complex and dynamic environments. In: Image Processing, 2006 IEEE International Conference On, 561–564. IEEE, New York.
Ahn JH, Choi C, Kwak S, Kim K, Byun H (2009) Human tracking and silhouette extraction for human–robot interaction systems. Patt Anal Appl 12(2):167–177.
Howe NR (2004) Silhouette lookup for automatic pose tracking. In: Computer Vision and Pattern Recognition Workshop, 2004. CVPRW’04. Conference On, 15–22. IEEE, New York.
Wang L, Suter D (2007) Recognizing human activities from silhouettes: Motion subspace and factorial discriminative graphical model. In: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference On, 1–8. IEEE, New York.
Chaaraoui AA, Climent-Pérez P, Flórez-Revuelta F (2013) Silhouette-based human action recognition using sequences of key poses. Patt Recogn Lett 34(15):1799–1807.
Wang L, Suter D (2007) Learning and matching of dynamic shape manifolds for human action recognition. IEEE Trans Image Process 16(6):1646–1661.
Ikizler N, Duygulu P (2009) Histogram of oriented rectangles: a new pose descriptor for human action recognition. Image Vision Comput 27(10):1515–1526.
Collins RT, Gross R, Shi J (2002) Silhouette-based human identification from body shape and gait. In: Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE International Conference On, 366–371. IEEE, New York.
Wang L, Tan T, Ning H, Hu W (2003) Silhouette analysis-based gait recognition for human identification. IEEE Trans Patt Anal Mach Intell 25(12):1505–1518.
Liu Z, Sarkar S (2004) Simplest representation yet for gait recognition: averaged silhouette. In: Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference On, 211–214. IEEE, New York.
Caelles S, Maninis KK, Pont-Tuset J, Leal-Taixé L, Cremers D, Van Gool L (2017) One-shot video object segmentation. In: CVPR 2017. IEEE, New York.
Cheng J, Tsai YH, Wang S, Yang MH (2017) Segflow: joint learning for video object segmentation and optical flow. In: 2017 IEEE International Conference on Computer Vision (ICCV), 686–695. IEEE, New York.
Migdal J, Grimson WEL (2005) Background subtraction using Markov thresholds. In: Application of Computer Vision, 2005. WACV/MOTIONS’05 Volume 1. Seventh IEEE Workshops On, 58–65. IEEE, New York.
Zivkovic Z (2004) Improved adaptive Gaussian mixture model for background subtraction. In: Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference On, 28–31. IEEE, New York.
Ochs P, Malik J, Brox T (2014) Segmentation of moving objects by long term video analysis. IEEE Trans Patt Anal Mach Intell 36(6):1187–1200.
Narayana M, Hanson A, Learned-Miller E (2013) Coherent motion segmentation in moving camera videos using optical flow orientations. In: Computer Vision (ICCV), 2013 IEEE International Conference On, 1577–1584. IEEE, New York.
Unger M, Werlberger M, Pock T, Bischof H (2012) Joint motion estimation and segmentation of complex scenes with label costs and occlusion modeling. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference On, 1878–1885. IEEE, New York.
Chen YM, Bajic IV (2011) A joint approach to global motion estimation and motion segmentation from a coarsely sampled motion vector field. IEEE Trans Circ Syst Vid Technol 21(9):1316–1328.
Ren S, He K, Girshick RB, Sun J (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. In: IEEE Transactions on pattern analysis and machine intelligence, 39, 1137–1149.
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, New York.
Girshick R (2015) Fast R-CNN. In: Proceedings of the International Conference on Computer Vision (ICCV). IEEE, New York.
Kim C, Li F, Ciptadi A, Rehg JM (2015) Multiple hypothesis tracking revisited. In: Proceedings of the IEEE International Conference on Computer Vision, 4696–4704. IEEE, New York.
Choi W (2015) Near-online multi-target tracking with aggregated local flow descriptor. In: Proceedings of the IEEE International Conference on Computer Vision, 3029–3037. IEEE, New York.
Keuper M, Tang S, Zhongjie Y, Andres B, Brox T, Schiele B (2016) A multi-cut formulation for joint segmentation and tracking of multiple objects. Computing Research Repository (CoRR):1–14.
Milan A, Leal-Taixé L, Schindler K, Reid I (2015) Joint tracking and segmentation of multiple targets. In: Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference On, 5397–5406. IEEE, New York.
Achanta R, Shaji A, Smith K, Lucchi A, Fua P, Süsstrunk S (2012) SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans Pattern Anal Mach Intell 34(11):2274–2282.
Brox T, Bruhn A, Papenberg N, Weickert J (2004) High accuracy optical flow estimation based on a theory for warping. In: European Conference on Computer Vision, 25–36. Springer, Berlin.
Van den Bergh M, Boix X, Roig G, Van Gool L (2015) SEEDS: Superpixels extracted via energy-driven sampling. Int J Comput Vis 111(3):298–314.
Makihara Y, Tanoue T, Muramatsu D, Yagi Y, Mori S, Utsumi Y, Iwamura M, Kise K (2015) Individuality-preserving silhouette extraction for gait recognition. IPSJ Trans Comput Vis Appl 7:74–78.
Lin G, Milan A, Shen C, Reid I (2017) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, New York.
He K, Gkioxari G, Dollár P, Girshick R (2017) Mask R-CNN. In: Computer Vision (ICCV), 2017 IEEE International Conference On, 2980–2988. IEEE, New York.
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778. IEEE, New York.
Dosovitskiy A, Fischer P, Ilg E, Hausser P, Hazirbas C, Golkov V, van der Smagt P, Cremers D, Brox T (2015) FlowNet: Learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, 2758–2766. IEEE, New York.
Chang J, Wei D, Fisher III JW (2013) A video representation using temporal superpixels. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference On, 2051–2058. IEEE, New York.
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–3440. IEEE, New York.
Wu Y, Lin Y, Dong X, Yan Y, Bian W, Yang Y (2019) Progressive learning for person re-identification with one example. IEEE Trans Image Process 28(6):2872–2881.
Dollár P, Zitnick CL (2013) Structured forests for fast edge detection. In: Computer Vision (ICCV), 2013 IEEE International Conference On, 1841–1848. IEEE, New York.
Boykov Y, Veksler O, Zabih R (2001) Fast approximate energy minimization via graph cuts. IEEE Trans Patt Anal Mach Intell 23(11):1222–1239.
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3213–3223. IEEE, New York.
Rother C, Kolmogorov V, Blake A (2004) GrabCut: interactive foreground extraction using iterated graph cuts. In: ACM Transactions on Graphics (TOG), 309–314. ACM, New York.
Liu C, et al (2009) Beyond pixels: exploring new representations and applications for motion analysis. PhD Thesis:48–50.
Cao Z, Simon T, Wei SE, Sheikh Y (2017) Realtime multi-person 2D pose estimation using part affinity fields. In: CVPR. IEEE, New York.
Henschel R, Leal-Taixé L, Cremers D, Rosenhahn B (2017) Fusion of head and full-body detectors for multi-object tracking. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 1509–150909.
Acknowledgments
We thank Glenn Pennycook, MSc, from Edanz Group (www.edanzediting.com/ac) for editing a draft of this manuscript.
Funding
This work was supported by a JSPS Grant-in-Aid for Scientific Research (A) JP18H04115.
Author information
Contributions
YY executed the experiments, analyzed results, and wrote the initial draft of the manuscript. MY managed the advisor position for the collection of data, designed the experiment, and reviewed the manuscript. YY supervised the design of the work and provided technical support and conceptual advice. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Yu, Y., Makihara, Y. & Yagi, Y. Pedestrian segmentation based on a spatiotemporally consistent graphcut with optimal transport. IPSJ T Comput Vis Appl 11, 10 (2019). https://doi.org/10.1186/s4107401900622
Keywords
 Pedestrian segmentation
 Edge sticky superpixel
 Optimal transport
 Conditional random field