Temporally coherent disparity maps using CRFs with fast 4D filtering

State-of-the-art methods for disparity estimation achieve good results for single stereo frames, but temporal coherence in stereo videos is often neglected. In this paper, we present a method to compute temporally coherent disparity maps. We define an energy over whole stereo sequences and optimize their conditional random field (CRF) distributions using the mean-field approximation. In addition, we introduce novel terms for smoothness and consistency between the left and right views. We perform CRF optimization by fast, iterative spatio-temporal filtering with linear complexity in the total number of pixels. We propose two CRF optimization techniques, using parallel and sequential updates, and compare them in detail. While parallel updates are not guaranteed to converge, we show that, in practice with appropriate initialization, they provide the same quality as sequential updates and they also lead to faster implementations. Finally, we demonstrate that the results of our approach rank among the state of the art while having significantly less flickering artifacts in stereo sequences.


Introduction
While some disparity estimation methods leverage information over several frames of stereo video sequences, most do not attempt to produce temporally coherent disparity maps.In applications like video production for 3D displays, however, temporally coherent disparity maps are crucial.While human observers are more forgiving about incorrect disparities, they easily notice flickering artifacts due to temporally incoherent disparity maps.
We address these challenges by proposing a technique that produces temporally coherent disparity maps over stereo videos.We formulate an energy minimization problem consisting of unary, smoothness, and consistency terms, which we solve using the mean-field approximation of a densely connected CRF.Our contributions are: 1) a new smoothness term that leverages both left and right images to distinguish between image edges due to disparity discontinuities, and edges due to surface texture; 2) a novel consistency term to obtain a joint left-and-right disparity estimation problem; 3) a temporal smoothness term to achieve temporally coherent disparity maps over stereo video sequences (Figure 1).Our algorithm has linear complexity in terms of image resolution and number of frames, and our GPU implementation requires only a few seconds per frame.Our method ranks among the state of the art in the KITTI benchmark [3].

Related work
Disparity estimation is mostly defined as a discrete labeling problem.Aggregation-based methods [10] share the cost of each assignment with neighboring pixels to reduce noise.They are efficient, but unable to reason about more complex assignment configurations.Optimization-based methods try to find the best assignment of disparities by minimizing an energy function.Semi Global Matching (SGM) [5] is a fast and effective approach that enforces local smoothness over many directional scan-lines using dynamic programming.While SGM is able to find a semi-global establishment of disparity labels, it is unable to capture the local structure due to the simple energy function.
On the other hand, filter-based mean-field approximation [6] supports very fast optimization over a fully-connected CRF.Many methods use a multi-scale approach to increase robustness to local minima [17].We use the SGM method to initialize our CRF-based optimization, which further incorporates other complex terms.Some methods use several stereo frames and attempt to ensure temporal coherence.Slanted plane StereoFlow [15] uses two consecutive frames to improve results.Vogel et al. [13] use a piece-wise rigid model to include consistencies in the temporal dimension.Unlike these methods we do not enforce segmentation nor local planarity on our disparity maps.In addition, our method has linear complexity with respect to the number of frames, which allows us to compute the disparity maps of the whole sequence in a single optimization.
Disparity flicker artifacts have been previously addressed [11,9].Richardt et al. [11] assumed that the pixel's disparity persist in time and aggregated the costs between temporally consecutive pixels.Min et al. [9] filtered noisy disparity maps between different frames.In addition to endto-end disparity error, we propose a quantitative measure to better evaluate the flicker artifacts in disparity sequences and compare with previous works.Here we compare disparity maps from TDCBG [11] using 8 frames, PRSM [13] using three consecutive frames, and our method using 21 frames.On the right we show the average disparity flicker index in this sequence.Our algorithm and TDCBG [11] allow controlling temporal smoothness using a temporal support parameter σt.Sequence courtesy of MEDIA LEADER Srl (www.medialeadersrl.com).

Proposed Method
We first define our energy terms, fast spatio-temporal energy minimization, initialization, and post processing, followed by a description of our GPU implementation.
We define random variables x L i for the disparity values of pixels i in the disparity field X L of the left image, and similarly x R i in X R for the right image.Our joint energy function over X L and X R includes unary (per-pixel), smoothness, and consistency terms.We omit the left and right superscripts unless necessary.
Unary Term.We denote the cost of assigning disparity d to pixel i in the left image L by the unary term φ L u (x i = d).We compute this term based on edge differences and Census transform distances similar to Yamaguchi et al. [15].
Disparity-Dependent Smoothness Term.The goal of the smoothness term is to encourage pairs of pixels that are close in some sense (defined more precisely below), to get similar disparity assignments.We define the smoothness term φ L s (x i = d i , x j = d j ) for a pair of assignments x i = d i and x j = d j in the left image as a function of both the pixel locations i, j and the disparity assignments d i , d j (similarly for the right image).We express this term as a sum of weights W L (P ) over all paths P that connect the points i, d i and j, d j in the joint pixel-disparity space, where P(i, d i , j, d j ) is the set of all paths between i, d i and j, d j in the joint space of pixel locations and disparity hypotheses, and each path P = { k, d } is a sequence of (4connected) pixels k paired with a disparity hypothesis d.We define the weight kernel W based on three length functions of the path, its length l s (P ) in the image, its length l d (P ) in the disparity label space, and a length δ L (discussed below) that takes into account potential disparity discontinuities along the path.Specifically, the weight kernel is , where σ r , σ s , and σ d control the kernel support for the three length terms.Applying a Gaussian to the weighted sum of the three distances ensures that W L (P ) decreases when the two pixels are separated by a large distance, and it increases when they are close.This choice of weight will later allow us to efficiently compute the smoothness energy.
The key ingredient in the definition of W L (P ) is the length δ L (P ), which we design to become large when the path crosses depth discontinuities.Crucially, we consider color information from both (left and right) views to compute the path length δ L (P ) such that it depends on the disparities along the path P .For each disparity on the path, we compute a pixel-wise difference of the two views where one is shifted by that disparity.At pixels where the disparity happens to be the correct one, this will cancel image edges due to surface textures, indicating that these edges are not disparity discontinuities.If the disparity is wrong, image edges typically do not cancel.We use this intuition to define a disparity discontinuity indicator for pixel k and disparity , where taking the minimum makes sure we do not introduce any spurious discontinuities.The path length δ L (P ) is now simply the sum of these disparity discontinuity indicators along the path, where L and R denote the left and right color images.This distance will be small if the pixel colors along the path have correspondences in the other image under their disparities, even if the image itself has large color dissimilarities along that path.(a,d) show ground truth disparities in red, and some estimated disparities consisting of fronto-parallel segments in green.The smoothness energy for the red and green disparity assignments are shown using the conventional (b, c) and our approach (e, f).
We visualize our approach in Figure 2. We show slices of the joint disparity-pixel space (d, i), where disparities d are along the horizontal axis, and the vertical axis corresponds to one vertical column of pixels i.The data is from a continuous, slanted surface patch that is highly textured (ground region in Figure 3, top left).Figure 2a 2a,d show the ground truth disparities in red, and some estimated disparities consisting of fronto-parallel segments in green.In Figures 2b,c,e,f we visualize the smoothness energy for the red and green disparity assignments using the conventional and our approach.That is, each point (d, i) in these figures shows the sum j φ L s (x i = d, x j = ∆ j ) where the ∆ contain either the ground truth (red) or estimated (green) disparities.We also indicate the total smoothness energy i,j φ L s (x i = ∆ i , x j = ∆ j ).This shows that in the conventional approach some pixels have high smoothness energies even with the ground truth disparity assignment, and the total smoothness energy of the piecewise fronto-parallel disparities (green) is actually lower than the ground truth here.With our approach, we obtain low smoothness energies at all pixels, and the ground truth has lower energy than the piecewise fronto-parallel assignments.
Higher Order Local Consistency Term.Each disparity assignment indicates that the corresponding pixel appears with a shift (disparity) in the other image, therefore we ex-pect that the disparity in the other view would agree with this assignment.We design the consistency energy to be low if the disparity assignments in two corresponding pixels in the left and right image agree.As a key idea, we compute this term over pixel neighborhoods, instead of individual pixels, to be more robust to per-pixel errors.We first introduce a binary consistency factor ν which is one when two corresponding pixels x L j and x R j+x L j (according to the disparity assignment in the left image) agree on their disparities up to a threshold of one disparity level, and zero otherwise.We allow for a difference of one disparity level to compensate for sub-pixel disparities and self occlusions.We now define the consistency energy as where we sum over all paths between joint pixel-disparity assignments x L i and x L j and use the same path weight W L (P ) as for the smoothness term.Intuitively, given an assignment x L i , our consistency energy is low if many assignments x L j that are close to x L i in the left image, have consistent assignments x R j+x L j in the right image.Since we cannot confirm consistency in the case of occlusions, we ignore them here and treat them later when finalizing the disparity map.
Temporal Extension.A main advantage of our filterbased CRF optimization is that we can easily extend it to the temporal domain, and simultaneously optimize disparity assignments over all frames of a stereo video sequence.We define the smoothness and consistency energies (φ c , φ s ) as before, but now with weight kernels W over paths in the joint spatio-temporal and disparity domain, , where l t (P ) is the length of the path in time, and σ t determines the kernel width along time.Our assumption here is that the disparities persist over a short time defined by σ t .As a key idea, we define the temporal dimension by following flow vectors of a precomputed flow field over the video sequence.Specifically we use the flow by Lang et al. [7], and refer the reader to their paper for more details.
Energy Minimization via Mean-Field Approximation.
We define the global energy function E as a sum of the unary, smoothness, and consistency terms, all evaluated on Algorithm 1 Optimization of the left and right disparity maps using mean-field initialize . switch L and R end loop both left and right images, with parameters λ and γ to control the influence of the smoothness and consistency terms relative to the unary term.
We minimize the energy function by following the filterbased mean-field approximation [6] (Algorithm 1).In each step we update the probability Q L i (x i = d) of assigning disparity d to variable x i .We compute the per-pixel consistencies by multiplying the two probabilities and adding to the current distribution values (line 1).Next we compute the expected value of the smoothness and consistency terms (line 2) using a single, fast filtering operation over the accumulated values Qi .A single filtering step is possible since we have the same weights W defined in φ s and φ c .The iteration ends by completing the update (lines 3, 4) and switching the target distribution (line 5).
We compute the path weights W efficiently using the Domain Transform Filter [2].We use interpolated convolution by iteratively applying a moving sum (box filter) in the transformed domain.The joint image and disparity space leads to 3D filtering, and our temporal extension to 4D filtering over two spatial, the temporal, and the disparity dimensions in line 2 of Algorithm 1.In the temporal dimension we filter along the precomputed flow vectors similar as Lang et al. [7].We obtained our best results by iterating over passes along spatio-temporal directions and filter in the disparity domain at the end.We refer to the original publication [2] for more details about filtering.
Initialization.For initializing Algorithm 1 we leverage semi-global matching (SGM) [5] with penalties P 1 = 4, P 2 = 64 in four directions.Instead of the MAP results of SGM, we rather use the obtained energies to initialize our distribution Q i (d).For a better initialization, we run the first two iterations of the optimization using a large kernel support (σ s = 7, σ r = 100, σ d = 2).
Final Disparity Map.We compute final disparities by finding the one with the minimum energy − log(Q i (d)) from Algorithm 1.For accuracy below the level of the disparity discretization we fit a quadratic to the three disparity costs centered at the minimum.We remove spikes by applying a 5 × 5 median filter.We fill occluded regions by checking for left-right consistency to find pixels with disparity differences higher than a threshold, and replacing disparities marked as occluded with the last non-occluded disparity in the left direction for the left view (similarly for the right view).
Implementation.The CPU version of the proposed pipeline supports 256 or more disparity hypotheses.We also implemented a GPU version for the whole pipeline that takes advantage of parallelism in the optimization at the pixel level.We ran our experiments on an Nvidia Titan Black graphics card with 6GB memory on board.We allocate memory for a batch of left and right images, including the disparity hypothesis layers requiring 2 × W idth × Height × F rames × Disparities floating point values.Because of the limited GPU memory we are currently restricted to batches of 14 frames at a resolution of 960 × 540 and 32 disparity layers.Note that we evaluate the unary term at a finer discretization of disparity steps, typically at one pixel steps.We then store the minimum for each of the 32 layers.At the end of the optimization the disparity is computed and finalized as described above, and by fitting the quadratic to the 32 layers we achieve finer levels of disparity.After the disparities of a batch of frames are computed, we move forward by seven frames and compute the disparities for the next batch.We finally interpolate the disparity values of the overlapping frames in consecutive batches for smoother transitions.

1Figure 1 .
Figure1.Our optimization includes the temporal dimension to achieve temporally coherent disparity maps in linear time.Here we compare disparity maps from TDCBG[11] using 8 frames, PRSM[13] using three consecutive frames, and our method using 21 frames.On the right we show the average disparity flicker index in this sequence.Our algorithm and TDCBG[11] allow controlling temporal smoothness using a temporal support parameter σt.Sequence courtesy of MEDIA LEADER Srl (www.medialeadersrl.com).

i 6 i 9 dFigure 2 .
Figure 2. Visualization of the smoothness energy of a slice of the joint disparity-pixel space.(a) shows discontinuities given by pixel differences and (d) our proposed indicator δ L function per-pixel.(a,d)show ground truth disparities in red, and some estimated disparities consisting of fronto-parallel segments in green.The smoothness energy for the red and green disparity assignments are shown using the conventional (b, c) and our approach (e, f).
shows conventional disparity discontinuity indicators given by pixel differences |L i − L i−1 |, and Figure 2d are our proposed indicators min(|L i −

Figure 3 .
Figure 3. Examples results from the KITTI dataset.First row left image, middle row our final disparity and last row shows the errors clamped to 5.