A learned sparseness and IGMRF-based regularization framework for dense disparity estimation using unsupervised feature learning
- Sonam Nahar^{1} and
- Manjunath V. Joshi^{2}
https://doi.org/10.1186/s41074-016-0013-0
© The Author(s) 2017
Received: 15 April 2016
Accepted: 27 December 2016
Published: 9 February 2017
Abstract
In this work, we propose a new approach for dense disparity estimation in a global energy minimization framework. We propose to use a feature matching cost which is defined using the learned hierarchical features of the given left and right stereo images, and we combine it with the pixel-based intensity matching cost in our energy function. Hierarchical features are learned using a deep deconvolutional network which is trained in an unsupervised way using a database consisting of a large number of stereo images. In order to perform the regularization, we propose to use inhomogeneous Gaussian Markov random field (IGMRF) and sparsity priors in our energy function. A sparse autoencoder-based approach is proposed for learning and inferring the sparse representation of disparities. The IGMRF prior captures the smoothness and preserves sharp discontinuities, while the sparsity prior captures the sparseness in the disparity map. Finally, an iterative two-phase algorithm is proposed to estimate the dense disparity map: in phase one, the sparse representation of disparities is inferred from the trained sparse autoencoder and the IGMRF parameters are computed, keeping the disparity map fixed; in phase two, the disparity map is refined by minimizing the energy function using graph cuts, with the other parameters fixed. Experimental results on the Middlebury stereo benchmarks demonstrate the effectiveness of the proposed approach.
Keywords
Stereo, Disparity, IGMRF, Sparsity, Unsupervised feature learning
1 Introduction
Stereo vision has been an active research area in the field of computer vision for more than three decades. It aims to find the 3D information of a scene by using two or more 2D images captured from different viewpoints. Stereo vision has a wide range of applications, including 3D reconstruction, video coding, view synthesis, object recognition, and safe navigation in spatial environments. The main goal of binocular stereo vision is to find corresponding pixels, i.e., pixels resulting from the projection of the same 3D point onto the two image planes. The displacement between corresponding pixels is called disparity, and obtaining disparity at each pixel location forms a dense disparity map. For simplicity, the stereo images are rectified so that the corresponding points lie on the same horizontal epipolar line and this reduces the correspondence search to 1D.
In general, disparities are found by comparing pixel intensities or their features in the two images. However, estimation of disparities is an ill-posed problem due to depth discontinuities, photometric variation, lack of texture, occlusions, etc., and a variety of approaches have been proposed to address it [1]. A comparison of current dense stereo algorithms is given on the Middlebury website [2]. Dense stereo matching algorithms can be classified into local and global methods. Local approaches aggregate the matching cost within a finite window and find the disparity by selecting the lowest aggregated cost. These methods assume that the disparity is the same over the entire window and hence produce unreliable matches in textureless regions and near depth discontinuities. The use of adaptive windows [3], multiple windows [4], adaptive weights [5], or bilateral filtering [6] in local methods reduces these effects but cannot avoid them completely. Global approaches tackle such problems by incorporating regularization, such as an explicit smoothness assumption, and estimate the dense disparity map by minimizing an energy function. The most prominent stereo algorithms for minimizing the global energy function are based on graph cuts [7] and belief propagation [8] optimization methods. In general, the energy function represents a combination of a data term and a regularization term that restricts the solution space. Global approaches perform well in textured and textureless areas as well as at depth discontinuities. In this paper, we solve the dense disparity estimation problem in a global energy minimization framework.
1.1 Motivation and related work
Global stereo methods mainly focus on minimizing energy functions efficiently to improve performance. However, solutions with lower energy do not always correspond to better performance [9]. Therefore, defining a proper energy function is more important than searching for better optimization techniques in order to improve the performance. Hence, in our work, we propose a new and suitable energy function for estimating the dense disparity map in an energy minimization framework.
In global stereo methods, the data term is generally defined by using the pixel-based matching cost between the corresponding pixels in the left and right images [1]. A pixel-based cost function determines the matching cost for a disparity on the basis of a descriptor that is defined for one single pixel. A pixel-based cost function can be extended to a patch (window)-based matching cost by integrating pixel-based costs within a certain neighborhood; such costs are based on the census transform, normalized cross correlation, etc. [10]. Most of the pixel-based matching costs are built on the brightness constancy assumption and include absolute differences (AD), squared differences (SD), sampling-insensitive absolute differences of Birchfield and Tomasi (BT), or truncated costs [10]. They rely on raw pixel values and are less robust to illumination changes, viewpoint variation, noise, occlusion, etc. One can represent stereo images in a better way by using a feature space in which they are robust, distinct, and transformation invariant [11, 12]. Feature-based stereo methods use features such as edges, gradients, corners, and segments, or hand-crafted features such as the scale-invariant feature transform (SIFT) [13, 14]. In order to obtain dense disparities, feature matching has been used in the global stereo framework. In [15] and [16], nonoverlapping segments of stereo images are used as features, and the dense stereo matching problem is cast as an energy minimization in the segment domain instead of the pixel domain, where a disparity plane is assigned to each segment via graph cuts or belief propagation. These approaches assume that the disparities in a segment vary smoothly, which is not true in practice due to depth discontinuities. Also, the solution here relies on the accuracy of segmentation, which is itself a nontrivial task.
In [17], the sparse correspondences are found by feature points and then the dense correspondences are obtained from these sparse matches using the propagation and seed growing methods. In such approaches, the accuracy depends on the initial support points. In [18], the mutual information (MI)-based feature matching is used in a Markov random field (MRF) framework for estimating the dense disparities. However, matching with basic image features still results in ambiguities in correspondence search, especially for textureless areas and wide baseline stereo. Hence, to reduce these ambiguities, one needs to use more descriptive features. Recently in [19], authors proposed a SIFT flow algorithm for finding the dense correspondences by matching the SIFT descriptors while preserving spatial discontinuities using MRF regularization. In [20], a deformable spatial pyramid model is proposed in a regularization framework for estimating dense disparities using multiple SIFT features. Hand-crafted features of stereo images are designed and then embedded in an MRF model in [21]. The drawback of these approaches is that designing such features is computationally expensive, time consuming, and requires domain knowledge of the data.
In recent years, learning features from unlabeled data using unsupervised feature learning and deep learning approaches has achieved superior performance in solving many computer vision problems [22–25]. Feature learning is attractive as it exploits the availability of large amounts of data and avoids the need for feature engineering. It has also attracted the attention of stereo vision researchers in recent years. The method proposed in [26] uses a deep convolutional neural network for learning a similarity measure on small image patches, and the training is carried out in a supervised manner by constructing a binary classification dataset with examples of similar and dissimilar pairs of patches. Based on the learned similarity measure, the disparity map is estimated using state-of-the-art local stereo methods. Here, the learning is done on small patches instead of the entire image, i.e., the global contextual constraint is not taken into account while learning the similarity measure. The method does not provide a single framework for dense disparity estimation, though it improves the results of state-of-the-art stereo methods. In this work, we focus on the approaches which use a feature matching cost in a global energy minimization framework for estimating the dense disparities. In [27], the authors proposed unsupervised feature learning for dense stereo matching within an energy minimization framework. They learn the features from a large number of image patches using the K-singular value decomposition (K-SVD) dictionary learning approach. The limitation of their approach is that the features are learned from a set of image patches and do not consider the entire image, i.e., the global contextual constraint is not taken into account while learning the features. Also, higher level features are not learned; instead, they are estimated using a simple max pooling operation from the layer beneath.
Here, the higher layer correspondence matches are used to initialize the lower layer matching and hence the accuracy depends on the higher layer matches only. Recently, unsupervised feature learning and deep learning methods have shown superior performance in learning efficient representation of images at multiple layers [24, 28–33].
In this paper, we propose to use a feature matching cost which is defined using the learned hierarchical features of the stereo image pair. In order to learn these hierarchical features, we propose to use a deep deconvolutional network [31], an unsupervised feature learning method. The deep deconvolutional network is trained over a large set of stereo images in an unsupervised way, which in turn results in a diverse set of filters. These learned filters capture image information at different levels in the form of low-level edges, mid-level edge junctions, and high-level object parts. Features at each layer of the deconvolutional network are learned in a hierarchy using the features in the previous layer. The deep deconvolutional network is quite different from the deep convolutional neural network (CNN). A deep CNN is a bottom-up approach where an input image is subjected to multiple layers of convolutions, nonlinearities, and subsampling, whereas a deep deconvolutional network is a top-down approach where an input image is generated by a sum over convolutions of the feature maps with learned filters. Unlike a deep CNN [33], the deep deconvolutional network does not spatially pool features at successive layers and hence preserves the mid-level cues emerging from the data, such as edge intersections, parallelism, and symmetry. Deconvolutional networks scale well to complete images and hence learn the features for the entire input image instead of small patches. This allows them to take the global contextual constraint into account while learning. In order to estimate the dense disparity map, we combine our learning-based multilayer feature matching cost with the pixel-based intensity matching cost, and hence our data term is the sum of these costs.
Since disparity estimation is an ill-posed problem, global stereo matching makes it better posed by incorporating a regularization prior in the energy function. Selection of an appropriate prior leads to a better solution. One common feature of disparities is that they are piecewise smooth, i.e., they vary smoothly except at discontinuities, thus making them inhomogeneous. This spatial correlation among disparities can be captured by MRF-based models. It is well known that MRFs are the most general models used as priors during regularization when solving ill-posed problems [34]. Hence, many of the current better-performing global stereo methods are based on MRF formulations, as noted in [1]. Homogeneous MRF models tend to oversmooth the disparity map and fail to preserve the discontinuities [35]. Hence, a better model would be one that reconstructs the smooth disparities while preserving the sharp discontinuities. In order to achieve this, a variety of discontinuity-preserving MRF priors are used in global stereo methods, as proposed in [36–40]. Many of these techniques use a single global MRF parameter or a set of global MRF parameters which are either manually tuned or estimated. These global parameters may not adapt to the local structure of the disparity map and hence fail to better capture the spatial dependence among disparities. We need a prior that considers the spatial variation among disparities locally. This motivates us to use an inhomogeneous Gaussian Markov random field (IGMRF) prior in our energy function, which was first proposed in [41] for solving the satellite image deblurring problem. The IGMRF can handle smooth as well as sharp changes in the disparity map because the local variation among disparities is captured using IGMRF parameters at each pixel location. In our approach, the IGMRF parameters are not known and are estimated.
Although the IGMRF prior captures the smoothness with discontinuities, it fails to capture additional structure such as sparseness in the disparity map. In general, disparity maps are made up of homogeneous regions with a limited number of discontinuities, resulting in redundancy. Because of this, one can represent the disparities in a domain in which they are sparse. This transform domain representation can be obtained using a fixed basis such as the discrete cosine transform (DCT) or the discrete wavelet transform (DWT), or it can be learned as an overcomplete dictionary using a large number of true disparities. In [42], the disparities are reconstructed from a few disparity measurements using the concepts of compressive sensing. Here, the sparseness is represented over a fixed wavelet basis, and the accuracy of disparity estimation depends on the reliability of the measurements. Learned sparseness using an overcomplete dictionary has been successfully used as regularization for solving inverse problems [43, 44]. The advantage of using a learned dictionary is that the representation is more accurate than that obtained with a fixed basis, because its atoms are adapted to fit the given training data [45]. Recently in [46], the authors proposed a two-layer graphical model for inferring the disparity map by including a sparsity prior over a learned sparse representation of disparities in an existing MRF-based stereo matching framework. Here, the sparse representation of disparities is inferred by a dictionary which is learned using a sparse coding technique that can cope with nonstationary depth estimation errors. Although it performs better when compared to a discontinuity-preserving homogeneous MRF prior, the solution can be improved by using an inhomogeneous MRF prior. Also, their method is complex and computationally intensive.
A practical problem with dictionary learning techniques is that they are computationally expensive because the dictionaries are learned by iteratively recovering sparse vectors and updating the dictionary atoms [45, 46]. Though these methods perform well in practice, they use a linear structure. Recent research suggests that nonlinear neural networks can achieve superior performance in learning efficient representations of images [22, 24, 28, 29]. One example of these networks is the sparse autoencoder. It encodes the input data with a sparse representation in the hidden layer and is trained using a large database of unlabeled images [29]. Sparse autoencoders are very efficient, and they can be easily generalized to represent complicated models. In this paper, we propose to use the sparse autoencoder for learning and inferring the sparse representation of the disparity map. The sparse autoencoder is trained using a large set of true disparities. We define a sparsity prior using the learned sparseness of disparities and incorporate this prior in addition to the IGMRF prior in our energy function. Such sparsity priors capture higher order dependencies in the disparity map and complement the IGMRF prior.
In order to obtain the dense disparity map, we propose an iterative two-phase algorithm. In phase one, sparseness is inferred using the learned weights of the sparse autoencoder, and IGMRF parameters are computed based on the current estimate of the disparity map, while in phase two, the disparity map is refined by minimizing the energy function with the other parameters fixed. We use graph cuts [7] as the optimization technique for minimizing our proposed energy function. Our experimental results demonstrate the effectiveness of our learning-based feature matching cost, IGMRF prior, and sparsity prior when used in an energy minimization framework. The experiments indicate that our method generates results that compete with state-of-the-art global stereo methods.
The outline of the paper is as follows. In the “Problem formulation” section, we formulate our problem of dense disparity estimation in an energy minimization framework. In the “Deep deconvolutional network for extracting hierarchical features” section, we present the deep deconvolutional network model for learning the hierarchical features of stereo images and then discuss the formation of our learning-based multilayer feature matching cost. The IGMRF prior model and the estimation of IGMRF parameters are addressed in the “IGMRF model for disparity” section. In the “Sparse model for disparity” section, we discuss the sparse autoencoder for learning and inferring the sparse representation of disparities and then discuss the formation of the sparsity prior. The formation of the final energy function and the proposed algorithm for dense disparity estimation are discussed in the “Dense disparity estimation” section. The experimental results and the performance of the proposed approach are presented in the “Experimental results” section, and concluding remarks are drawn in the “Conclusion” section.
2 Problem formulation
In the energy function of Eq. (1), the data term E _{ D }(d) measures how well the disparity map d to be estimated agrees with the left and right images I _{ L } and I _{ R } of a scene. The prior term E _{ P }(d) measures how well d matches the prior knowledge about the disparity map. For finding the correspondences, we consider search from left to right as well as from right to left and hence relax the traditional ordering constraint used in disparity estimation.
In the intensity matching cost E _{ I }(d) of Eq. (3), τ ^{ I } is the truncation threshold which is used to make the intensity matching cost more robust against outliers. For defining the feature matching cost E _{ F }(d), we extract the features of the stereo image pair at multiple layers of a deep deconvolutional network, as discussed in the next section.
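As an illustration of a truncated intensity matching cost, the following minimal sketch sums min(|I _{ L }(x,y) − I _{ R }(x+d(x,y),y)|, τ ^{ I }) over all pixels. It is not the authors' implementation: the function name, the grayscale images scaled to [0, 1], the sign convention x+d (taken from the feature matching description), and the handling of matches that fall outside the image are all assumptions.

```python
import numpy as np

def truncated_intensity_cost(left, right, disparity, tau_i=0.08):
    """Sum over pixels of min(|I_L(x, y) - I_R(x + d(x, y), y)|, tau_i).

    Assumes grayscale images in [0, 1] and an integer disparity map.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    d = np.asarray(disparity, dtype=int)
    h, w = left.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xr = xs + d                               # matched column in the right image
    valid = (xr >= 0) & (xr < w)
    xr_safe = np.clip(xr, 0, w - 1)           # keep indices legal for lookup
    cost = np.minimum(np.abs(left - right[ys, xr_safe]), tau_i)
    cost[~valid] = tau_i                      # assumption: out-of-range matches pay the full truncated cost
    return float(cost.sum())
```

Truncation bounds the penalty of any single pixel by τ ^{ I }, so occluded or mismatched pixels cannot dominate the total cost.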
In the prior term of Eq. (4), E _{IGMRF}(d) and E _{sparse}(d) represent the IGMRF and sparsity prior terms, respectively, and γ controls the weight of the term E _{sparse}(d).
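The structure of the energy described in this section can be summarized as follows. This is a sketch reconstructed from the verbal descriptions only (a data term plus a prior term, each formed by adding two costs, with γ weighting the sparsity prior); any further weights between the terms are not recoverable from the text and are omitted here as an assumption.

```latex
\begin{align}
E(d) &= E_{D}(d) + E_{P}(d), \tag{1}\\
E_{D}(d) &= E_{I}(d) + E_{F}(d), \tag{2}\\
E_{P}(d) &= E_{\mathrm{IGMRF}}(d) + \gamma\, E_{\mathrm{sparse}}(d). \tag{4}
\end{align}
```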
3 Deep deconvolutional network for extracting hierarchical features
In this section, we first describe the method of learning the hierarchical features of a given stereo pair and then describe how these features are used to define our feature matching cost E _{ F }(d).
Here, \(|s^{i}_{p}|^{1}\) is the L _{1}-norm on the vectorized version of \(s^{i}_{p}\). The relative weighting of the reconstruction error of each I ^{ i } and the sparsity of its feature maps \(s^{i}_{p}\) is controlled by the parameter α. The network is learned by minimizing \(C_{1}(\mathcal {I})\) with respect to the feature maps \(s^{i}_{p}\) and the filters f _{ p,c } when the input to the network is \(\mathcal {I}\). Note that the filters f _{ p,c } are the parameters of the network, common to all images in the training set, while each image has its own set of feature maps \(s^{i}_{p}\).
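A plausible form of the first-layer cost, consistent with the description above (an image generated by a sum over convolutions of feature maps with filters, plus an L _{1} sparsity penalty), is sketched below. The placement of α on the reconstruction term, the channel index c, the number of feature maps P _{1}, and the convolution symbol ∗ are assumptions and may differ from the paper's own equation.

```latex
C_{1}(\mathcal{I}) = \sum_{i} \left[ \frac{\alpha}{2} \sum_{c} \Big\| \sum_{p=1}^{P_{1}} s^{i}_{p} * f_{p,c} - I^{i}_{c} \Big\|_{2}^{2} + \sum_{p=1}^{P_{1}} \big| s^{i}_{p} \big|^{1} \right]
```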
A deep deconvolutional network consisting of NL number of layers is trained upwards in a layer-wise manner starting with the first layer (l=1) where the inputs are the training images \(\mathcal {I}\). Each layer l is trained in order to learn a set of filters \(f^{l}_{p,c}\) which is shared across all images in \(\mathcal {I}\) and infer the set of feature maps \(s^{i}_{p,l}\) of each image I ^{ i } in \(\mathcal {I}\). To learn the filters, we alternately minimize \(C_{l}(\mathcal {I})\) w.r.t. the filters and feature maps by keeping one of them constant while minimizing the other. We follow the optimization scheme as proposed in [31].
3.1 Feature encoding
Once the deep deconvolutional network is trained, we can use it to infer the multilayer features of the given left I _{ L } and right I _{ R } stereo images for which we want to estimate the dense disparity map. The network described above is top-down in nature, i.e., given the latent feature maps, one can synthesize an image, but there is no direct mechanism for inferring the feature maps of a given image without minimizing the cost function given in Eq. (7). Hence, once the network is trained, we apply I _{ L } and I _{ R } separately as the input image to the trained deep deconvolutional network with the fixed set of learned filters and infer the feature maps \(s^{I_{L}}_{p,l}\) and \(s^{I_{R}}_{p,l}\) of I _{ L } and I _{ R } at layer l by minimizing the cost functions C _{ l }(I _{ L }) and C _{ l }(I _{ R }), respectively. Once they are inferred, we create a feature vector at each pixel location in I _{ L } and I _{ R } separately. In order to obtain the features of I _{ L } at a layer l, we stack the P _{ l } inferred feature maps \(s^{I_{L}}_{p,l}\) and obtain a single feature map \(Z^{I_{L}}_{l}\) where at each pixel location (x,y) in \(Z^{I_{L}}_{l}\), we get a feature vector of dimension P _{ l }×1. Similarly, using the same process, we obtain the features of I _{ R }. Thus, \(Z^{I_{L}}_{l}\) and \(Z^{I_{R}}_{l}\) represent the lth layer features of I _{ L } and I _{ R }, respectively.
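The stacking step can be sketched as below. The function name is hypothetical, and the sketch assumes that the P _{ l } inferred feature maps of a layer have already been brought to a common height and width matching the image.

```python
import numpy as np

def stack_feature_maps(feature_maps):
    """Stack P_l feature maps of shape (H, W) into Z_l of shape (H, W, P_l).

    Z_l[y, x] is then the P_l-dimensional feature vector at pixel (x, y).
    """
    return np.stack([np.asarray(m, dtype=float) for m in feature_maps], axis=-1)
```

Applying the same function to the maps inferred for I _{ L } and for I _{ R } gives the two per-layer feature arrays that are compared in the matching cost.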
3.2 Defining E _{ F }(d)
At each pixel location (x,y) having disparity d(x,y), the feature matching cost E _{ F }(d) measures the absolute distance between the feature vector \({Z^{I_{L}}_{l}}(x,y)\) and the corresponding matched feature \({Z^{I_{R}}_{l}}(x+d(x,y),y)\). Here, τ ^{ F } is the truncation threshold which is used to make the feature matching cost more robust against outliers, and NL is the number of layers in the network. This multilayer feature matching highly constrains the solution space and hence results in unambiguous and accurate disparities.
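A minimal sketch of such a multilayer truncated feature matching cost is given below. It assumes that the "absolute distance" is the L _{1} distance between feature vectors, that the truncation is applied per pixel and per layer, and that out-of-range matches pay the truncated cost; the function name and these choices are illustrative and not taken from the paper.

```python
import numpy as np

def feature_matching_cost(z_left_layers, z_right_layers, disparity, tau_f=0.04):
    """Sum over layers and pixels of min(||Z_L(x, y) - Z_R(x + d, y)||_1, tau_f).

    Each layer entry is an array of shape (H, W, P_l); disparity is (H, W), integer.
    """
    d = np.asarray(disparity, dtype=int)
    h, w = d.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xr = xs + d
    valid = (xr >= 0) & (xr < w)
    xr_safe = np.clip(xr, 0, w - 1)
    total = 0.0
    for z_l, z_r in zip(z_left_layers, z_right_layers):
        z_l = np.asarray(z_l, dtype=float)
        z_r = np.asarray(z_r, dtype=float)
        dist = np.abs(z_l - z_r[ys, xr_safe]).sum(axis=-1)  # L1 distance per pixel
        cost = np.minimum(dist, tau_f)
        cost[~valid] = tau_f          # assumption: out-of-range matches are truncated
        total += float(cost.sum())
    return total
```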
In our energy function, the data term E _{ D }(d) is not constructed using the feature matching cost E _{ F }(d) only, because the deep deconvolutional network extracts sparse (significant) features in stereo images at few locations such as edges, corners, and junctions. If one uses the feature matching cost alone as the data term, then at those pixel locations where the features are not significant, it results in ambiguous disparity estimates. One can obtain the disparities only at the pixel locations where significant features have been obtained. However, this results in a sparse disparity map. Our goal here is to estimate the dense disparity map, i.e., to find the disparity at every pixel location. Although this can be obtained simply by interpolating the sparse disparities, it leads to inaccurate disparities at occluded regions and disparity discontinuities. Since we use the intensity term as well, and intensity values are available at every pixel location, we obtain a dense disparity map. Hence, in our work, we define our data term using a combination of intensity and feature matching costs. This combination not only produces dense disparities but also better constrains the solution and hence results in an accurate disparity map.
4 IGMRF model for disparity
In order to avoid computational difficulty, we set an upper bound b=1/4 whenever the gradient becomes zero, i.e., whenever the neighboring disparities are the same.
In order to estimate the IGMRF parameters, we need the true disparity map, which is unknown and has to be estimated. Therefore, to start the regularization process, we use an initial estimate of the disparity map obtained using a suitable approach and compute these parameters, which are then used to estimate d. In our proposed algorithm, these parameters and d are refined alternately and iteratively to obtain a better d.
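The parameter estimation can be sketched as follows. The expressions used here are assumptions borrowed from common IGMRF formulations, since the paper's own equations are not reproduced in this text: a prior energy of the form Σ b^X (horizontal difference)² + b^Y (vertical difference)², and per-pixel parameters b = 1/(4·difference²), replaced by the bound 1/4 where the difference is zero. All function names are hypothetical.

```python
import numpy as np

def _differences(d):
    """First-order horizontal and vertical differences of a disparity map."""
    d = np.asarray(d, dtype=float)
    gx = np.zeros_like(d)
    gy = np.zeros_like(d)
    gx[:, 1:] = d[:, 1:] - d[:, :-1]
    gy[1:, :] = d[1:, :] - d[:-1, :]
    return gx, gy

def igmrf_parameters(d, b_max=0.25):
    """Per-pixel IGMRF parameters (assumed form b = 1 / (4 g^2), b = b_max if g == 0)."""
    params = []
    for g in _differences(d):
        b = np.full(g.shape, b_max)
        nz = g != 0
        b[nz] = 1.0 / (4.0 * g[nz] ** 2)
        params.append(b)
    return params[0], params[1]

def igmrf_energy(d, bx, by):
    """Assumed prior energy: sum of b^X * gx^2 + b^Y * gy^2 over all pixels."""
    gx, gy = _differences(d)
    return float(np.sum(bx * gx ** 2 + by * gy ** 2))
```

Under this assumed form, a large disparity jump receives a small weight, so sharp discontinuities are penalized less than under a single global smoothness parameter; for integer disparities the smallest nonzero difference is 1, which gives exactly the bound 1/4.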
5 Sparse model for disparity
In order to model the higher order dependencies in the disparity map, we include in our energy function another prior, called the sparsity prior E _{sparse}(d). The sparsity prior regularizes the solution by modeling the sparseness in d. In this work, we present a novel method for learning and inferring the sparse representation of disparities using a sparse autoencoder, which is then used to define the sparsity prior. An autoencoder is an artificial neural network (ANN) which sets the desired output to be the same as the input and has one hidden layer [29]. It comprises an encoder that maps an input vector to a hidden representation and a decoder that maps this hidden representation back to a reconstructed input. In practice, finding the sparse representation of a complete disparity map is computationally expensive, and therefore, a better choice is to find the sparse representation of small disparity patches individually and average the resultant sparse patches at the end in order to get the complete sparse representation of the disparity map.
where \(KL(\rho ||{\hat {\rho }}_{j})\) is the Kullback-Leibler (KL) divergence. This term has the value 0 if \({\hat {\rho }}_{j}\) = ρ; otherwise, it increases monotonically as \({\hat {\rho }}_{j}\) diverges from ρ.
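For sparse autoencoders, this penalty is usually the KL divergence between two Bernoulli distributions with means ρ (the target sparsity) and \({\hat {\rho }}_{j}\) (the average activation of hidden unit j). The sketch below assumes that standard form and that interpretation of the symbols; both values must lie strictly between 0 and 1.

```python
import math

def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), both in (0, 1)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))
```

The value is zero when the two means coincide and grows as \({\hat {\rho }}_{j}\) moves away from ρ in either direction, which is the behavior described above.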
Here, the first term represents the average reconstruction error over all training inputs. The second term is a regularization term on the weights that prevents overfitting by making them smaller in magnitude, and λ controls the relative importance of this term. β controls the weight of the third term, which is the sparsity penalty term. We minimize Eq. (15) w.r.t. W, U, r, and s using the well-known backpropagation algorithm [50].
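The three-term cost described above can be sketched as follows. This follows the standard sparse autoencoder formulation and is not the paper's exact Eq. (15): the roles assigned to W, r (encoder weights and bias) and U, s (decoder weights and bias), the sigmoid activations, the inputs scaled to [0, 1], and the default values of λ, β, and ρ are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_cost(X, W, r, U, s, lam=1e-4, beta=3.0, rho=0.05):
    """Reconstruction error + weight decay + KL sparsity penalty.

    X: (m, n) training patches; W: (h, n), r: (h,); U: (n, h), s: (n,).
    Returns the scalar cost and the hidden activations A of shape (m, h).
    """
    X = np.asarray(X, dtype=float)
    A = sigmoid(X @ W.T + r)                  # encoder: hidden representation
    X_hat = sigmoid(A @ U.T + s)              # decoder: reconstructed input
    m = X.shape[0]
    recon = 0.5 * np.sum((X_hat - X) ** 2) / m
    decay = 0.5 * lam * (np.sum(W ** 2) + np.sum(U ** 2))
    rho_hat = A.mean(axis=0)                  # average activation per hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return recon + decay + beta * kl, A
```

In practice, the gradient of this cost with respect to W, r, U, and s would be obtained by backpropagation and passed to a gradient-based optimizer.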
E _{sparse}(d) measures how well each disparity patch at location (x,y) in d agrees with its sparse representation. In our proposed approach, the disparity map and its sparse representation are inferred alternately.
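One way to realize such a patch-based prior is sketched below: sparse codes are inferred once from the current disparity map and then held fixed while the energy of a candidate map is evaluated. The squared-error agreement measure, the patch size and stride, and the `encode`/`decode` callables standing in for the trained autoencoder are assumptions made for illustration.

```python
import numpy as np

def _patches(d, patch, stride):
    """Yield flattened patch-by-patch blocks of a 2D array in raster order."""
    h, w = d.shape
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            yield d[y:y + patch, x:x + patch].ravel()

def infer_patch_codes(disparity, encode, patch=8, stride=8):
    """Phase-one step: sparse code of every disparity patch (encode is hypothetical)."""
    d = np.asarray(disparity, dtype=float)
    return [encode(p) for p in _patches(d, patch, stride)]

def sparse_prior_energy(disparity, codes, decode, patch=8, stride=8):
    """Sum of squared errors between each patch and the decoding of its fixed code."""
    d = np.asarray(disparity, dtype=float)
    return float(sum(np.sum((p - decode(a)) ** 2)
                     for p, a in zip(_patches(d, patch, stride), codes)))
```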
6 Dense disparity estimation
Our main goal is to estimate the dense disparity map using a given pair of stereo images in an energy minimization framework. Our data term defined in Eq. (2) is formed by adding the intensity and feature matching costs given in Eqs. (3) and (8), respectively. Similarly, our prior energy term defined in Eq. (4) is formed by adding the IGMRF and sparsity priors given in Eqs. (9) and (16), respectively. Finally, our proposed energy function defined in Eq. (1) can be rewritten as given in Eq. (17), and we minimize it using graph cuts optimization based on α-β swap moves [7]. We do not consider occlusions explicitly, but they are handled by clipping the matching costs using the thresholds τ = {τ ^{ I },τ ^{ F }}, which prevent outliers from disturbing the estimation (see Eqs. (3) and (8)).
In order to estimate the dense disparity map, we propose an iterative two-phase algorithm. It starts from an initial estimate of the disparity map and alternates between the two phases until convergence, as given in Algorithm 1. We use a classical local stereo method [1] for obtaining the initial disparity map, in which the absolute intensity differences (AD) with truncation, aggregated over a fixed window, are used as the matching cost. In order to reduce computation time, we optimize this cost by graph cuts instead of the classic winner-take-all (WTA) optimization. Postprocessing operations such as left-right consistency check, interpolation, and median filtering [1] are applied in order to obtain a better initial estimate for faster convergence while regularizing. However, any other suitable disparity estimation method can also be used for obtaining the initial estimate.
In general, for nonconvex energy functions, graph cuts result in a local minimum that is within a known factor of the global minimum. In order to steer the solution towards the global minimum, we use an iterative optimization with proper settings of the parameters. At every iteration, the IGMRF parameters and sparseness are refined in order to obtain better disparity estimates (converging towards the global optimum). The number of iterations may vary with the stereo pair and the choice of initial estimate.
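The alternation between the two phases can be sketched as the following skeleton. The three callables are placeholders for the components described in the previous sections (sparse inference, IGMRF parameter estimation, and a graph-cut minimizer of the energy), and stopping when the disparity map no longer changes is an assumed convergence criterion, not one stated in the text.

```python
import numpy as np

def two_phase_disparity(d_init, infer_sparse, estimate_igmrf, minimize_energy, gammas):
    """Alternate between parameter inference (phase one) and refinement (phase two).

    gammas supplies the sparsity weight for each iteration and bounds their number.
    Returns the final disparity map and the number of iterations performed.
    """
    d = d_init
    iterations = 0
    for gamma in gammas:
        iterations += 1
        # Phase one: disparity map fixed, infer sparse codes and IGMRF parameters.
        codes = infer_sparse(d)
        params = estimate_igmrf(d)
        # Phase two: parameters fixed, refine the disparity map.
        d_new = minimize_energy(d, params, codes, gamma)
        converged = np.array_equal(d_new, d)   # assumed stopping rule
        d = d_new
        if converged:
            break
    return d, iterations
```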
7 Experimental results
In this work, all the experiments were conducted on a computer with Core i7-3632QM, 2.20 GHz processor and 8.00 GB RAM.
7.1 Parameter settings
Note that the training of the deep deconvolutional network and the autoencoder is an offline operation, and hence, it does not add to the computational complexity of disparity estimation. In order to estimate the dense disparity map, we experimented on the Venus, Cones, and Teddy stereo pairs, belonging to the Middlebury stereo 2001 and 2003 datasets [2], which were different from the training datasets used earlier. We also performed experiments using the recently released Middlebury stereo 2014 (version 3) dataset. Our algorithm was initialized with the initial estimate of the disparity map, and it converged within five iterations for all the stereo pairs used in our experiments. While minimizing Eq. (17), the data cost thresholds {τ ^{ I },τ ^{ F }} were set as 0.08 and 0.04, respectively, and the parameter μ was chosen as 1. The parameter γ was initially set to 10^{−4} and exponentially increased at each iteration from 10^{−4} to 10^{−1}. We used the same parameters for all the experiments, and this demonstrates the robustness of our method.
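An exponential schedule for γ between the two stated end points can be sketched as below. The growth factor per iteration is not given in the text, so spacing the values evenly on a logarithmic scale over a fixed number of iterations is an assumption, as is the function name.

```python
def gamma_schedule(n_iters, start=1e-4, stop=1e-1):
    """Geometrically spaced gamma values from start to stop over n_iters iterations."""
    if n_iters == 1:
        return [start]
    ratio = (stop / start) ** (1.0 / (n_iters - 1))
    return [start * ratio ** k for k in range(n_iters)]
```

With four iterations, for instance, γ grows by roughly a factor of 10 per iteration, so the sparsity prior has little influence on the first refinement and progressively more on later ones.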
7.2 Performance evaluation using different data terms E _{ D }(d) with IGMRF prior
Performance evaluation in terms of percentage of bad matching pixels computed over the whole image with δ = 1. Here, the optimization of energy function is carried out using different data terms E _{ D }(d) with IGMRF as prior term E _{ P }(d)
E _{ D }(d) | Venus | Teddy | Cones |
---|---|---|---|
AD | 1.90 | 16.49 | 12.14 |
BT | 0.95 | 15.67 | 11.89 |
BT+gradient | 0.89 | 14.9 | 11.32 |
E _{ I }(d)+E _{ F }(d) | 0.40 | 11.41 | 9.98 |
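The error measure used in these tables, the percentage of bad matching pixels, counts the pixels whose estimated disparity differs from the ground truth by more than δ. A minimal sketch is given below; the function name and the optional mask argument (for example, to restrict the evaluation to nonoccluded pixels) are illustrative choices.

```python
import numpy as np

def bad_pixel_percentage(d_est, d_true, delta=1.0, mask=None):
    """Percentage of pixels with |d_est - d_true| > delta, optionally within a mask."""
    d_est = np.asarray(d_est, dtype=float)
    d_true = np.asarray(d_true, dtype=float)
    bad = np.abs(d_est - d_true) > delta
    if mask is not None:
        bad = bad[np.asarray(mask, dtype=bool)]
    return 100.0 * float(bad.mean())
```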
7.3 Performance evaluation using different prior terms E _{ P }(d) with proposed E _{ D }(d)
Performance evaluation using different prior terms E _{ P }(d) with proposed E _{ D }(d). The errors are shown in terms of bad matching pixels and these are computed over the whole image with δ=1
E _{ P }(d) | Venus | Teddy | Cones |
---|---|---|---|
Truncated quadratic | 1.95 | 15.38 | 11.62 |
Truncated linear | 0.91 | 12.86 | 10.96 |
Potts | 1.11 | 13.93 | 11.01 |
E _{IGMRF}(d) | 0.40 | 11.41 | 9.64 |
E _{IGMRF}(d) + E _{sparse}(d) using DCT | 0.38 | 11.1 | 9.36 |
E _{IGMRF}(d) + E _{sparse}(d) using K-SVD | 0.30 | 10.60 | 9.12 |
E _{IGMRF}(d) + E _{sparse}(d) using autoencoder | 0.20 | 9.76 | 8.46 |
7.4 Qualitative and quantitative assessment and comparison with state of the art methods
Quantitative evaluation on the Middlebury stereo 2001 and 2003 datasets [2] and comparison with state-of-the-art global dense stereo methods in terms of the percentage of bad matching pixels over the entire image (All) as well as over nonoccluded regions (Nonocc) with δ = 1
Method | Venus All | Venus Nonocc | Teddy All | Teddy Nonocc | Cones All | Cones Nonocc |
---|---|---|---|---|---|---|
Initial | 3.47 | 2.00 | 19.65 | 5.61 | 16.43 | 7.15 |
Proposed | 0.20 | 0.10 | 9.76 | 3.44 | 8.46 | 2.36 |
AdaptBP [16] | 0.21 | 0.10 | 7.06 | 4.22 | 7.92 | 2.48 |
DoubleBP [38] | 0.45 | 0.13 | 8.30 | 3.53 | 8.78 | 2.90 |
GCP [52] | 0.53 | 0.16 | 11.5 | 6.44 | 9.49 | 3.59 |
TwoStep [17] | 0.45 | 0.27 | 12.6 | 7.42 | 10.1 | 4.09 |
SemiGlob [18] | 1.57 | 1.00 | 12.2 | 6.02 | 9.75 | 3.06 |
2OP [39] | 0.49 | 0.24 | 15.4 | 10.9 | 10.8 | 5.42 |
CompSens [42] | 0.68 | 0.31 | 13.30 | 7.88 | 9.79 | 3.97 |
MultiGC [37] | 3.13 | 2.79 | 17.6 | 12.0 | 11.8 | 4.89 |
Mumford [51] | 0.76 | 0.28 | 14.3 | 9.34 | 9.91 | 4.14 |
GC [36] | 3.44 | 1.79 | 25.0 | 16.5 | 18.2 | 7.70 |
CRF [53] | 1.3 | – | 11.1 | – | 10.8 | – |
Sparse [46] | – | – | 11.98 | – | 8.14 | – |
8 Conclusion
We have presented a new approach for dense disparity map estimation based on inhomogeneous Gaussian MRF (IGMRF) and sparsity priors in an energy minimization framework. The data term combines an intensity matching cost with a learning-based multilayer feature matching cost. The feature matching cost is defined over hierarchical features of the given stereo pair, which are learned using a deep deconvolutional network. The IGMRF prior captures the smoothness in disparities and preserves discontinuities through the IGMRF parameters. The sparsity prior is defined over the learned sparseness of disparities, where the sparse representation of disparities is learned using a sparse autoencoder. We have presented an iterative two-phase algorithm for disparity estimation: in phase one, the IGMRF parameters and the sparse representation of the disparity map are obtained with the disparity map held fixed, and in phase two, the disparity map is refined by minimizing our energy function using graph cuts. Experiments conducted on several Middlebury datasets verify the effectiveness of the proposed data term and of the IGMRF and sparsity priors when used in an energy minimization framework. The performance of the proposed method is comparable to that of many of the better performing and more recent dense stereo methods.
Declarations
Authors’ contributions
Both authors have equally contributed to the manuscript. Both authors read and approved the final manuscript.
Authors’ information
Sonam Nahar received the B.E. degree in Information Technology from Manikya Lal Verma Textile Engineering College, Bhilwara, India, in 2008, and the M.Tech. degree in Information and Communication Technology from Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar, India, in 2010. She is currently pursuing the Ph.D. degree at DA-IICT, Gandhinagar, India, and serving as an Assistant Professor in the Computer Science and Engineering Department of The Laxmi Niwas Mittal Institute of Information Technology (LNMIIT), Jaipur, India. Her research interests include computer vision, image processing, and deep learning.
Manjunath V. Joshi received the B.E. degree from the University of Mysore, Mysore, India, and the M.Tech. and Ph.D. degrees from the Indian Institute of Technology Bombay (IIT Bombay), Mumbai, India. Currently, he is serving as a Professor with the Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, India. He has been involved in active research in the areas of signal processing, image processing, and computer vision. He has coauthored two books entitled Motion-Free Super Resolution (Springer, New York) and Digital Heritage Reconstruction Using Super resolution and Inpainting (Morgan and Claypool). Dr. Joshi was a recipient of the Outstanding Researcher Award in Engineering Section by the Research Scholars Forum of IIT Bombay. He was also a recipient of the Best Ph.D. Thesis Award by Infineon India and the Dr. Vikram Sarabhai Award in the field of information technology constituted by the Government of Gujarat, India.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
- Scharstein D, Szeliski R, Zabih R (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int J Comput Vis 47(1/2/3): 7–42.
- Scharstein D, Szeliski R, Zabih R (1987) Middlebury Stereo. http://vision.middlebury.edu/stereo.
- Kanade T, Okutomi M (1994) A stereo matching algorithm with an adaptive window: theory and experiment. Pattern Anal Mach Intell IEEE Trans 16(9): 920–932.
- Fusiello A, Roberto V, Trucco E (1997) Efficient stereo with multiple windowing In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 858–863. doi:http://dx.doi.org/10.1109/CVPR.1997.609428.
- Yoon KJ, Kweon IS (2006) Adaptive support-weight approach for correspondence search. Pattern Anal Mach Intell IEEE Trans 28(4): 650–656.
- Hosni A, Rhemann C, Bleyer M, Rother C, Gelautz M (2013) Fast cost-volume filtering for visual correspondence and beyond. Pattern Anal Mach Intell IEEE Trans 35(2): 504–511.
- Kolmogorov V, Zabih R (2004) What energy functions can be minimized via graph cuts? Pattern Anal Mach Intell IEEE Trans 26(2): 147–159.
- Sun J, Zheng NN, Shum HY (2003) Stereo matching using belief propagation. Pattern Anal Mach Intell IEEE Trans 25(7): 787–800.
- Tappen MF, Freeman WT (2003) Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters, vol.2 In: Proceedings Ninth IEEE International Conference on Computer Vision, 900–906. doi:http://dx.doi.org/10.1109/ICCV.2003.1238444.
- Hirschmuller H, Scharstein D (2007) Evaluation of cost functions for stereo matching In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, 1–8. doi:http://dx.doi.org/10.1109/CVPR.2007.383248.
- Tola E, Lepetit V, Fua P (2010) Daisy: An efficient dense descriptor applied to wide-baseline stereo. Pattern Anal Mach Intell IEEE Trans 32(5): 815–830.
- Joglekar J, Gedam SS, Mohan BK (2014) Image matching using sift features and relaxation labeling technique: a constraint initializing method for dense stereo matching. Geosci Remote Sensing, IEEE Trans 52(9): 5643–5652.
- Grimson WEL (1985) Computational experiments with a feature based stereo algorithm. Pattern Anal Mach Intell IEEE Trans 7(1): 17–34.
- Ayache N, Faverjon B Efficient registration of stereo images by matching graph descriptions of edge segments. International Journal of Computer Vision: 107–131.
- Hong L, Chen G (2004) Segment-based stereo matching using graph cuts In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004, Vol.1 I-74-I-81. doi:http://dx.doi.org/10.1109/CVPR.2004.1315016.
- Klaus A, Sormann M, Karner K (2006) Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure, vol.3 In: 18th International Conference on Pattern Recognition (ICPR’06), 15–18. doi:http://dx.doi.org/10.1109/ICPR.2006.1033.
- L. Wang ZL, Zhang Z (2014) Feature based stereo matching using two-step expansion. Math Probl Eng 14: 14.
- Hirschmüller H (2008) Stereo processing by semi-global matching and mutual information. Pattern Anal Mach Intell IEEE Trans 30(2): 328–341.
- Liu C, Yuen J, Torralba A (2011) Sift flow: dense correspondence across scenes and its applications. Pattern Anal Mach Intell IEEE Trans 33(5): 978–994.
- Kim J, Liu C, Sha F, Grauman K (2013) Deformable spatial pyramid matching for fast dense correspondences In: 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2307–2314. doi:http://dx.doi.org/10.1109/CVPR.2013.299.
- Saxena A, Chung SH, Ng AY (2007) 3-D depth reconstruction from a single still image. Int J Comput Vis 76: 2007.
- Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol P (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11: 3371–3408.
- Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks In: Advances in Neural Information Processing Systems 25, 1097–1105.
- Bengio Y (2009) Learning deep architectures for AI. Foundations Trends Mach Learn 2(1): 1–127.
- Dong C, Loy CC, He K, Tang X (2015) Image super-resolution using deep convolutional networks. CoRR abs/1501.00092.
- Zbontar J, LeCun Y (2014) Computing the stereo matching cost with a convolutional neural network. CoRR abs/1409.4326.
- Zhang C, Shen C (2015) Unsupervised feature learning for dense correspondences across scenes. CoRR abs/1501.00642.
- Poultney C, Chopra S, Lecun Y (2006) Efficient learning of sparse representations with an energy-based model In: Advances in Neural Information Processing Systems.
- Lee H, Ekanadham C, Ng AY (2007) Sparse deep belief net model for visual area v2 In: Neural Information Processing Systems, 873–880.
- Hinton GE, Osindero S (2006) A fast learning algorithm for deep belief nets. Neural Comput 18: 2006.
- Zeiler MD, Krishnan D, Taylor GW, Fergus R (2010) Deconvolutional networks In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2528–2535. doi:http://dx.doi.org/10.1109/CVPR.2010.5539957.
- Zeiler MD, Taylor GW, Fergus R (2011) Adaptive deconvolutional networks for mid and high level feature learning In: Computer Vision, IEEE International Conference On, 2018–2025.
- Jarrett K, Kavukcuoglu K, Ranzato MA, Lecun Y (2009) What is the best multi-stage architecture for object recognition? In: 2009 IEEE 12th International Conference on Computer Vision, 2146–2153. doi:http://dx.doi.org/10.1109/ICCV.2009.5459469.
- Li SZ (1995) Markov random field modeling in computer vision. Springer, New York.
- Roy S (1999) Stereo without epipolar lines: a maximum-flow formulation. Int J Comput Vis 34(2–3): 147–161.
- Boykov Y, Veksler O, Zabih R (2001) Fast approximate energy minimization via graph cuts. Pattern Anal Mach Intell IEEE Trans 23(11): 1222–1239.
- Kolmogorov V, Zabih R (2002) Multi-camera scene reconstruction via graph cuts In: Computer Vision, European Conference On, 82–96.
- Yang Q, Wang L, Yang R, Stewenius H, Nister D (2009) Stereo matching with color-weighted correlation, hierarchical belief propagation, and occlusion handling. Pattern Anal Mach Intell IEEE Trans 31(3): 492–504.
- Woodford O, Torr P, Reid I, Fitzgibbon A (2008) Global stereo reconstruction under second order smoothness priors In: Computer Vision and Pattern Recognition, IEEE Conference On, 1–8.
- Zhang L, Seitz SM (2005) Parameter estimation for MRF stereo, vol.2 In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), 288–295. doi:http://dx.doi.org/10.1109/CVPR.2005.269.
- Jalobeanu A, Blanc-Feraud L, Zerubia J (2004) An adaptive gaussian model for satellite image deblurring. Image Process IEEE Trans 13(4): 613–621.
- Hawe S, Kleinsteuber M, Diepold K (2011) Dense disparity maps from sparse disparity measurements In: 2011 International Conference on Computer Vision, 2126–2133. doi:http://dx.doi.org/10.1109/ICCV.2011.6126488.
- Elad M, Aharon M (2006) Image denoising via sparse and redundant representations over learned dictionaries. Image Process IEEE Trans 15(12): 3736–3745.
- Xie J, Xu L, Chen E (2012) Image denoising and inpainting with deep neural networks In: Advances in Neural Information Processing Systems 25, 350–358.
- Aharon M, Elad M, Bruckstein A (2006) K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. Signal Process IEEE Trans 54(11): 4311–4322.
- Tosic I, Olshausen BA, Culpepper BJ (2011) Learning sparse representations of depth. Selected Topics Signal Process IEEE J 5(5): 941–952.
- Birchfield S, Tomasi C (1998) A pixel dissimilarity measure that is insensitive to image sampling. Pattern Anal Mach Intell IEEE Trans 20(4): 401–406.
- Joshi M, Jalobeanu A (2010) Map estimation for multiresolution fusion in remotely sensed images using an IGMRF prior model. Geosci Remote Sensing IEEE Trans 48(3): 1245–1255.
- Gajjar PP, Joshi MV (2010) New learning based super-resolution: use of DWT and IGMRF prior. Image Process IEEE Trans 19(5): 1201–1213.
- Mitchell TM (1997) Machine learning. McGraw-Hill, New York, USA.
- Ben-Ari R, Sochen N (2010) Stereo matching with Mumford-Shah regularization and occlusion handling. Pattern Anal Mach Intell IEEE Trans 32(11): 2071–2084.
- Wang L, Yang R (2011) Global stereo matching leveraged by sparse ground control points In: CVPR 2011, 3033–3040. doi:http://dx.doi.org/10.1109/CVPR.2011.5995480.
- Scharstein D, Pal C (2007) Learning conditional random fields for stereo In: Computer Vision and Pattern Recognition, IEEE Conference On, 1–8.