As mentioned in the Introduction, our work is a natural extension of UCM, the state-of-the-art 2D superpixel extraction approach. We now briefly describe the steps in 2D UCM since these are closely followed (for the most part) in our 3D effort.
UCM begins by introducing the Pb gradient-based detector which assigns a “probability” value Pb(x,y,θ) at every location (x,y) and orientation θ. This is obtained from an oriented gradient detector G(x,y,θ) applied to an intensity image. A multiscale extension mPb(x,y,θ) to the Pb detector is deployed by executing the gradient operator at multiple scales followed by cue combination over color, texture, and brightness channels. Thus far, only local information (relative to scale) has been used. In the next globalization step, a weighted graph is constructed with graph edge weights being set proportional to the evidence for a strong contour connecting the nodes. This is performed for all pairs of pixels, resulting in an \(N^{2}\times N^{2}\) graph given \(N^{2}\) pixels. The top eigenvectors of the graph are extracted, placed in image coordinates, and followed by gradient computation. This results in the sPb(x,y,θ) detector which carries global information as it is derived from eigenvector “images.” Finally, the globalized probability detector gPb(x,y,θ) is computed via a weighted linear combination of mPb and sPb. While this completes the pipeline in terms of information accrued for segmentation, UCM then proceeds to obtain a set of closed regions using gPb as the input via the application of the oriented watershed transform (OWT). Watershed-based flood filling is performed at the lowest level of a hierarchy, leading to an oversegmentation. A graph-based region merging algorithm (with nodes, edges, and weights corresponding to regions, separating arcs, and measures of dissimilarity respectively) is deployed, resulting in an entire hierarchy of nested segmentations (respecting an ultrametric). The UCM pipeline can be broadly divided into (i) the gPb detector, (ii) the oriented watershed transform, and (iii) graph-based agglomeration.
The pipeline in our 3D UCM framework closely follows that of 2D gPb-UCM with two major exceptions which will be clarified and elaborated upon below. The greater voxel cardinality forces us to revamp the globalization step above, wherein a graph is constructed from every pair of voxels: if the volumetric dataset is \(N\times N\times N\), the graph is \(N^{3}\times N^{3}\), which is too large for eigenvector computation. Therefore, computational considerations force us to adopt reduced order eigensystem solvers. The second exception concerns the agglomeration step. 2D UCM merges regions by considering only contour edge pixels at the base level, not all pixels. In 3D, this approach leads to the creation of fragmented surfaces. To overcome this problem, we perform graph-based agglomeration using all voxels, following recent work. With these changes to the pipeline, the 3D UCM framework is broadly subdivided into (i) local volume gradient detection, (ii) globalization using reduced order eigensolvers, and (iii) graph-based agglomeration, reflecting the emphasis on the changed subsystems. The upside is that 3D UCM becomes scalable to sizable datasets.
3.1 Local gradient feature extraction
The UCM framework begins with gradient-based edge detection to quantify the presence of boundaries. Most gradient-based edge detectors in 2D [43, 44] can be extended to 3D for this purpose. The 3D gradient operator used in this work is based on the mPb detector proposed in [1, 45] which has been empirically shown to have superior performance in 2D.
The building block of the 3D mPb gradient detector is an oriented gradient operator G(x,y,z,θ,φ,r) described in detail in Fig. 1. To be more specific, in a 3D volumetric or spatiotemporal intensity field, we place a sphere centered at each voxel to denote its neighborhood. An equatorial plane specified by its normal vector \(\vec {t}(\theta,\varphi)\) splits the sphere into two half spheres. We compute the intensity histograms for both half spheres, denoted as g and h. Then we define the gradient magnitude in the direction \(\vec {t}(\theta,\varphi)\) as the \(\chi^{2}\) distance between g and h:
$$ \chi^{2}(\mathbf{g},\mathbf{h})=\frac{1}{2}\sum_{i}\frac{(g(i)-h(i))^{2}}{g(i)+h(i)}. $$
(1)
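To make the construction concrete, the following is a minimal sketch (not the authors' implementation) of the half-sphere \(\chi^{2}\) gradient for a single voxel and direction, assuming a single-channel numpy volume `vol` with intensities normalized to [0, 1]; the helper names are hypothetical.

```python
import numpy as np

def chi2(g, h, eps=1e-10):
    # Chi-squared distance between two normalized histograms, Eq. (1);
    # eps guards against empty bins in the denominator.
    return 0.5 * np.sum((g - h) ** 2 / (g + h + eps))

def oriented_gradient(vol, x, y, z, t, r, bins=32):
    # G(x,y,z,theta,phi,r): chi^2 distance between the intensity histograms
    # of the two half-spheres split by the plane with unit normal t.
    rr = int(np.ceil(r))
    g_vals, h_vals = [], []
    for dx in range(-rr, rr + 1):
        for dy in range(-rr, rr + 1):
            for dz in range(-rr, rr + 1):
                if dx * dx + dy * dy + dz * dz > r * r:
                    continue  # outside the neighborhood sphere
                v = vol[x + dx, y + dy, z + dz]
                # The sign of the offset along t picks the half-sphere.
                if dx * t[0] + dy * t[1] + dz * t[2] >= 0:
                    g_vals.append(v)
                else:
                    h_vals.append(v)
    g, _ = np.histogram(g_vals, bins=bins, range=(0.0, 1.0))
    h, _ = np.histogram(h_vals, bins=bins, range=(0.0, 1.0))
    g = g / max(g.sum(), 1)  # normalize counts to probabilities
    h = h / max(h.sum(), 1)
    return chi2(g, h)
```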
In order to capture gradient information at multiple scales, this gradient detector is executed for different radius values r of the neighborhood sphere. Gradients obtained from different scales are then linearly combined together using
$$ G_{s}(x,y,z,\theta,\varphi)\equiv\sum_{r}\alpha_{r}G(x,y,z,\theta,\varphi,r) $$
(2)
where \(\alpha_{r}\) weighs the gradient contribution at each scale. For multi-channel 3D images such as video sequences, \(G_{s}(x,y,z,\theta,\varphi)\) is calculated separately for each channel and the results are summed with equal weights. Finally, the measure of boundary strength at (x,y,z) is computed as the maximum response over all directions \(\vec {t}(\theta,\varphi)\):
$$ mPb(x,y,z)\equiv\max_{\theta,\varphi}G_{s}(x,y,z,\theta,\varphi). $$
(3)
In our experiments, θ and φ take values in \(\left \{ 0,\frac {\pi }{4},\frac {\pi }{2},\frac {3\pi }{4}\right \} \) and \(\left \{ -\frac {\pi }{4},0,\frac {\pi }{4}\right \} \) respectively, plus the one special case \(\varphi =\frac {\pi }{2}\). Therefore, we compute local gradients in 13 different directions. Neighborhood radii r of 2, 4, and 6 voxels were used, and equal weights \(\alpha_{r}\) were used to combine gradients across the different scales. Also, as is standard, we always apply an isotropic Gaussian smoothing filter with σ=3 voxels before any gradient operation.
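Under these stated parameter choices, a direct (if slow) sketch of Eqs. (2) and (3) could look as follows; `oriented_gradient` is the hypothetical helper from the previous sketch, and the spherical parameterization of \(\vec{t}(\theta,\varphi)\) is our assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def direction_set():
    # The 12 (theta, phi) combinations plus the special case phi = pi/2,
    # giving 13 unit direction vectors t(theta, phi).
    dirs = []
    for theta in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
        for phi in (-np.pi / 4, 0.0, np.pi / 4):
            dirs.append((np.cos(phi) * np.cos(theta),
                         np.cos(phi) * np.sin(theta),
                         np.sin(phi)))
    dirs.append((0.0, 0.0, 1.0))  # phi = pi/2: the polar direction
    return dirs

def mpb(vol, radii=(2, 4, 6)):
    vol = gaussian_filter(vol, sigma=3)  # isotropic pre-smoothing
    out = np.zeros(vol.shape)
    m = max(radii)
    # Interior voxels only, so every neighborhood sphere stays in bounds.
    for x in range(m, vol.shape[0] - m):
        for y in range(m, vol.shape[1] - m):
            for z in range(m, vol.shape[2] - m):
                best = 0.0
                for t in direction_set():
                    # Eq. (2): equal-weight combination across scales.
                    gs = sum(oriented_gradient(vol, x, y, z, t, r)
                             for r in radii) / len(radii)
                    best = max(best, gs)  # Eq. (3): max over directions
                out[x, y, z] = best
    return out
```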
3.2 Globalization using a reduced order eigensystem
The globalization core of gPb-UCM is driven by nonlinear dimensionality reduction (closely related to spectral clustering). The local cues obtained from the gradient feature detection phase are globalized (and therefore emphasize the most salient boundaries in the image) by computing generalized eigenvectors of a graph Laplacian (obtained from the normalized cuts principle) [2]. However, this approach depends on solving a sparse eigensystem at the scale of the number of pixels in the image. Thus, as the size of the image grows larger, the globalization step becomes the computational bottleneck of the entire process. This problem is even more severe in the 3D setting because the voxel cardinality far exceeds the pixel cardinality of our 2D counterparts. An efficient approach was proposed in [4] to reduce the size of the eigensystem while maintaining the quality of the eigenvectors used in globalization. We generalize this method to 3D so that our approach becomes scalable to handle sizable datasets.
In the following, we describe the globalization steps: (i) graph construction and oriented intervening contour cue, (ii) reduced order normalized cuts and eigenvector computation, (iii) scale-space gradient computation on the eigenvector image, and (iv) the combination of local and global gradient information. For the most part, this pipeline mirrors the transition from mPb to gPb with the crucial difference being the adoption of reduced order normalized cuts.
3.2.1 Graph construction and the oriented intervening contour cue
In 2D, the normalized cuts approach begins with sparse graph construction obtained by connecting pixels that are spatially close to each other. gPb-UCM [1] specifies a sparse symmetric affinity matrix W using the intervening contour cue [5], which is the maximal value of mPb along the line connecting the two pixels i and j associated with entry \(W_{ij}\). However, this approach does not utilize all of the useful information obtained from the previous gradient feature detection step. Figure 2 describes a potential problem and our resolution. To improve the accuracy of the affinity matrix, we take the direction vector of the maximum gradient magnitude into consideration when calculating the pixel-wise affinity value. This new variant is termed the oriented intervening contour cue. For any spatially close voxels i and j, we use \(\bar {ij}\) to denote the line segment connecting i and j, and \(\vec {d}\) is defined as the unit direction vector of \(\bar {ij}\). Let P be a set of voxels that lie close to \(\bar {ij}\). For any p∈P, \(\vec {n}\) is the unit direction vector associated with its mPb value. We define the affinity value \(W_{ij}\) between i and j as follows:
$$ W_{ij}=\exp\left(-\max_{p\in P}\{mPb(p)|\langle\vec{d},\vec{n}\rangle|\}/\rho\right) $$
(4)
where 〈·,·〉 is the inner product operator of the vector space and ρ is a scaling constant. In our experiments, the set P contains the voxels that are at most 1 voxel away from \(\bar {ij}\). ρ is set to 0.1. In the affinity matrix W, each voxel is connected to voxels that fall in the 5×5×5 cube centered at that voxel. Thus the graph defined by W is very sparse.
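A sketch of Eq. (4) for one voxel pair follows, assuming precomputed arrays `mpb_val` (the mPb magnitude) and `mpb_dir` (the unit direction of the maximizing gradient at each voxel); approximating the set P by sampled points along the segment is our simplification.

```python
import numpy as np

def affinity(i, j, mpb_val, mpb_dir, rho=0.1, n_samples=10):
    i = np.asarray(i, dtype=float)
    j = np.asarray(j, dtype=float)
    d = (j - i) / np.linalg.norm(j - i)  # unit direction of segment ij
    worst = 0.0
    # Approximate P by rounding evenly spaced samples along the segment.
    for s in np.linspace(0.0, 1.0, n_samples):
        p = tuple(np.round(i + s * (j - i)).astype(int))
        n = mpb_dir[p]  # unit direction of the max gradient at p
        worst = max(worst, mpb_val[p] * abs(np.dot(d, n)))
    return np.exp(-worst / rho)  # Eq. (4)
```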
3.2.2 Reduced order normalized cuts and eigenvector computation
At this point, standard 2D gPb-UCM solves for the generalized eigenvectors of the sparse eigensystem
$$ (D-W)\vec{v}=\lambda D\vec{v} $$
(5)
where D is a diagonal matrix defined by \(D_{ii}=\sum_{j}W_{ij}\). However, this eigenvector problem is computationally very intensive. It becomes the bottleneck, in both time and memory, of normalized cuts-based segmentation algorithms. To overcome this problem, an efficient and highly parallel GPU implementation was provided in [3]. However, this approach requires GPU-based hardware and software suites, an unnecessary restriction at this stage of development. A clever alternative in [4, 10] builds the graph on superpixels instead of pixels to reduce the size of the eigensystem. We chose to generalize Taylor’s [4] approach to 3D because (i) the superpixel solution is more scalable than the GPU solution in terms of memory requirements, (ii) specialized GPU co-processors are not commonly available on many computing platforms such as smart phones and wearable devices, and (iii) the approach in [10] is specifically designed for per-frame superpixels in video segmentation and thus not easily generalizable. Finally, the approach in [4] constructs a reduced order normalized cuts system which is easier to solve. Let m denote the number of supervoxels and n the number of voxels. The reduced order eigensystem is
$$ \left(L^{T}(D-W)L\right)\vec{x}=\lambda'L^{T}DL\vec{x} $$
(6)
where \(L\in \mathbb {R}^{n\times m}\), \(\vec {x}\in \mathbb {R}^{m}\), and \(L\vec {x}=\vec {v}\). The purpose of L is to assign each voxel to a supervoxel. In our approach, the supervoxels are generated by a watershed transform on the mPb image obtained from the volumetric gradient feature detection step. Obviously, the number of supervoxels m is much smaller than the number of voxels n in the 3D volumetric/spatiotemporal image. In practice, there is usually a two to three order of magnitude reduction in the size of the eigensystem (from millions of voxels to a few thousand supervoxels). Therefore, it is much more efficient to solve Eq. (6) than Eq. (5).
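A minimal sketch of Eq. (6) using scipy's sparse eigensolver, assuming a sparse n×n affinity W and a length-n array `labels` assigning each voxel to one of m supervoxels (e.g., from a watershed on the mPb volume); the small negative shift is our numerical workaround for the singular leading matrix, not part of the published method.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def reduced_eigensystem(W, labels, k=16):
    n = W.shape[0]
    m = int(labels.max()) + 1
    # L is the n x m assignment matrix: L[i, labels[i]] = 1, so L @ x
    # prolongs a supervoxel-level vector x to a voxel-level vector v.
    L = sp.csr_matrix((np.ones(n), (np.arange(n), labels)), shape=(n, m))
    D = sp.diags(np.asarray(W.sum(axis=1)).ravel())
    A = (L.T @ (D - W) @ L).tocsc()  # m x m instead of n x n
    B = (L.T @ D @ L).tocsc()
    # Shift-invert around a tiny negative sigma: A itself is singular
    # (lambda'_0 = 0), so factorizing A - sigma*B needs sigma != 0.
    vals, xs = eigsh(A, k=k + 1, M=B, sigma=-1e-6, which='LM')
    order = np.argsort(vals)  # ascending, lambda'_0 first
    return vals[order], (L @ xs)[:, order]  # eigenvalues and lifted v_i
```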
3.2.3 Scale space gradient computation on the eigenvector image
We solve for the generalized eigenvectors \(\{\vec {x}_{0},\vec {x}_{1},\ldots,\vec {x}_{K}\}\) of the system in (6) corresponding to the smallest eigenvalues \(\left \{\lambda _{0}^{\prime },\lambda _{1}^{\prime },\ldots,\lambda _{K}^{\prime }\right \}\). As stated in [4], \(\lambda_{i}\) in (5) will be equal to \(\lambda _{i}^{\prime }\) and \(L\vec {x}_{i}\) will match \(\vec {v}_{i}\) modulo an irrelevant scale factor, where \(\vec {v}_{i}\) are the eigenvectors of the original eigensystem (5). Similar to the 2D scenario [1], the eigenvectors \(\vec {v}_{i}\) carry surface information. Figure 3 shows several example eigenvectors obtained from two types of 3D volumetric datasets. In both cases, the eigenvectors distinguish salient aspects of the original image. Based on this observation, we apply the gradient operator mPb defined in (3) to the eigenvector images. The outcome of this procedure is denoted as sPb because it represents the spectral component of the boundary detector, following the convention established in [1]:
$$ sPb(x,y,z)=\sum_{i=1}^{K}\frac{1}{\sqrt{\lambda_{i}}}{mPb}_{\vec{v_{i}}}(x,y,z). $$
(7)
Note that this weighted summation starts from i=1 because \(\lambda_{0}\) always equals 0 and \(\vec {v}_{0}\) is a constant image carrying no boundary information. The weighting by \(1/\sqrt {\lambda _{i}}\) is inspired by the mass-spring system in mechanics [1, 46]. In our experiments, we use 16 eigenvectors, i.e., K=16.
3.2.4 The combination of local and global gradient information
The last step is to combine local cues mPb and global cues sPb. mPb tries to capture local variations while sPb aims to obtain salient boundary surfaces. By linearly combining them together, we get a globalized boundary detector gPb:
$$ gPb(x,y,z)=\omega mPb(x,y,z)+(1-\omega)sPb(x,y,z). $$
(8)
In practice, we use equal weights for mPb and sPb. After obtaining the gPb values, we apply non-maximum suppression [43] as a post-processing step to obtain thinned boundary surfaces when the edges contributed by mPb are too thick. Figure 4 shows examples of mPb, sPb, and gPb.
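Putting Eqs. (7) and (8) together, a sketch of the spectral cue and the final combination could look as follows, reusing the hypothetical `mpb` operator from Section 3.1 and the `(vals, vs)` output of the previous sketch.

```python
import numpy as np

def gpb(mpb_vol, vals, vs, shape, K=16, omega=0.5):
    spb = np.zeros(shape)
    for i in range(1, K + 1):  # skip i = 0: lambda_0 = 0, constant v_0
        ev_img = vs[:, i].reshape(shape)       # eigenvector as a volume
        spb += mpb(ev_img) / np.sqrt(vals[i])  # Eq. (7)
    return omega * mpb_vol + (1 - omega) * spb  # Eq. (8), equal weights
```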
3.3 Supervoxel agglomeration
At this point, 2D gPb-UCM proceeds with the oriented watershed transform (OWT) [1, 47, 48] to create a hierarchical segmentation of the image, resulting in the ultrametric contour map. However, we find that the same strategy does not work well in 3D, for two reasons. First, because of the presence of irregular topologies, it is more difficult to approximate boundary surfaces with square or triangular meshes in 3D than to approximate boundary curves with line segments in 2D. Second, following OWT, only the information from pixels on the boundaries is used during greedy boundary removal in superpixel merging. This is not a robust design, especially when we take into account the fragmentation of boundary surfaces in 3D.
For the above reasons, we turn to the popular graph-based image and video segmentation methods [6, 7] to create the segmentation hierarchy. We first apply a watershed transform to the gPb field obtained from the previous step to get an oversegmentation. Next, we iteratively merge adjacent segments beginning with this oversegmentation. The output of this procedure is a segmentation hierarchy represented by a tree structure whose lower-level segments are always contained in higher-level segments. As in [6], the merge rules run on a graph whose nodes are regions and whose edges represent region-to-region relations. First, for any two adjacent regions \(R_{i}\) and \(R_{j}\), we assign an edge \(e_{ij}\) connecting them on the graph. The weight of \(e_{ij}\) is set to the \(\chi^{2}\) distance between the (Lab space) intensity value histograms of \(R_{i}\) and \(R_{j}\), computed with 20 bins. Also, for any region R, a quantity named the relaxed internal variation RInt(R) is defined as follows:
$$ \text{RInt}(R)\equiv\text{Int}(R)+\frac{\tau}{|R|} $$
(9)
where Int(R) is defined as the maximum edge weight of the minimum spanning tree (MST) of R. For the lowest-level regions, i.e., the regions of the oversegmentation obtained from the watershed transform, Int(R) is set to 0. |R| is the voxel cardinality of region R, and τ is a parameter which triggers the merging process and controls the granularity of the regions. In each merging iteration, all edges are traversed in ascending order of weight. For any edge \(e_{ij}\), we merge the incident regions \(R_{i}\) and \(R_{j}\) if the weight of \(e_{ij}\) is less than the minimum of the relaxed internal variations of the two regions. Thus the merging condition is written as
$$ \text{weight}(e_{ij})<\min\{\text{RInt}(R_{i}),\text{RInt}(R_{j})\}. $$
(10)
In practice, we increase the granularity parameter τ by a factor of 1.1 in each iteration. This agglomeration process continues until no edge meets the merging criterion. The advantage of graph-based merging is that it makes use of the information in all voxels of the merged regions. Furthermore, as shown in the experiments below, it avoids the fragmented supervoxels produced by previous graph-based methods, which are built on voxel-level graphs.
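The merging loop can be summarized by the following sketch of Eqs. (9) and (10), assuming `edges` is a list of (weight, i, j) tuples holding the \(\chi^{2}\) histogram distances between adjacent base regions and `sizes` their voxel counts; the union-find bookkeeping is our simplification.

```python
def agglomerate(edges, n_regions, sizes, tau0=1.0, growth=1.1, max_iters=50):
    parent = list(range(n_regions))
    internal = [0.0] * n_regions  # Int(R): 0 for watershed base regions
    size = list(sizes)

    def find(a):  # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tau = tau0
    for _ in range(max_iters):
        merged = False
        for w, i, j in sorted(edges):  # ascending order of weight
            ri, rj = find(i), find(j)
            if ri == rj:
                continue  # already in the same region
            rint_i = internal[ri] + tau / size[ri]  # Eq. (9)
            rint_j = internal[rj] + tau / size[rj]
            if w < min(rint_i, rint_j):  # merging condition, Eq. (10)
                parent[rj] = ri
                size[ri] += size[rj]
                # The merging edge joins the two MSTs, so it bounds
                # the new maximum MST edge weight.
                internal[ri] = max(internal[ri], internal[rj], w)
                merged = True
        tau *= growth  # coarsen granularity each iteration
        if not merged:
            break
    return [find(i) for i in range(n_regions)]
```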
Finally, we obtain a supervoxel hierarchy represented by a bottom-up tree structure. This is the final output of the 3D UCM algorithm. The granularity of the segmentation is a user-driven choice guided by the application.