Visual saliency detection for RGB-D images under a Bayesian framework
- Songtao Wang^{1, 2},
- Zhen Zhou^{1},
- Wei Jin^{2} and
- Hanbing Qu^{2}
https://doi.org/10.1186/s41074-017-0037-0
© The Author(s) 2018
Received: 31 May 2017
Accepted: 13 December 2017
Published: 10 January 2018
Abstract
In this paper, we propose a saliency detection model for RGB-D images based on the deep features of RGB images and depth images within a Bayesian framework. By analysing 3D saliency in the case of RGB images and depth images, the class-conditional mutual information is computed for measuring the dependence of deep features extracted using a convolutional neural network; then, the posterior probability of the RGB-D saliency is formulated by applying Bayes’ theorem. By assuming that deep features are Gaussian distributions, a discriminative mixed-membership naive Bayes (DMNB) model is used to calculate the final saliency map. The Gaussian distribution parameters can be estimated in the DMNB model by using a variational inference-based expectation maximization algorithm. The experimental results on RGB-D images from the NLPR dataset and NJU-DS400 dataset show that the proposed model performs better than other existing models.
Keywords
1 Introduction
Saliency detection is a fundamental problem in computer vision that aims to highlight visually salient regions or objects in an image. Le Callet and Niebur introduced the concepts of overt and covert visual attention and the concepts of bottom-up and top-down processing [1]. Visual attention models have been successfully applied in many domains, including multimedia delivery, visual retargeting, quality assessment of images and videos, medical imaging, and 3D image applications [1]. Today, with the development of 3D display technologies and devices, various applications are emerging for 3D multimedia, such as 3D video retargeting [2], 3D video quality assessment [3, 4], and so forth. Overall, the emerging demand for visual attention-based applications for 3D multimedia has increased the need for computational saliency detection models for 3D multimedia content.
Salient object detection has attracted considerable interest in computer vision [5]. Numerous efforts have been devoted to designing low-level saliency cues for 2D saliency detection, such as contrast-based features and background priors. Because human attention is preferentially attracted by regions that contrast strongly with their surroundings, contrast-based features (such as colour, edge orientation, or texture contrast) play a crucial role in deriving salient objects [6]. The background prior leverages the fact that most salient objects are located far from image boundaries [7]. Based on the assumption that non-salient regions (i.e. background) can be explained by a low-rank matrix, salient objects can also be defined as the sparse noise in a certain feature space where the input image is represented as a low-rank matrix [8]. Most existing computational visual saliency models follow a bottom-up framework that generates an independent saliency map in each selected visual feature space and then combines the maps. To address the limitations of this scheme, Li et al. proposed a saliency map computational model based on tensor analysis [9].
The recently introduced sensing technologies, such as Microsoft Kinect, provide an excellent ability and flexibility to capture RGB-D images. In addition to RGB information, depth has been shown to be one of the practical cues for extracting saliency. Furthermore, Ju et al. proposed a novel saliency method that worked on depth images based on the anisotropic centre-surround difference [10]. In contrast to saliency detection for 2D images, the depth factor must be considered when performing saliency detection for RGB-D images. Depth cues provide additional important information about content in the visual field and can therefore also be considered relevant features for saliency detection. With the additional depth information, RGB-D co-saliency detection, which is an emerging and interesting issue in saliency detection, aims to discover the common salient objects in a set of RGB-D images [11]. The stereoscopic content carries important additional binocular cues for enhancing human depth perception [12, 13]. Therefore, two important challenges when designing 3D saliency models are how to estimate the saliency from depth cues and how to combine the saliency from depth features with those of other 2D low-level features.
In this paper, we propose a new computational saliency detection model based on the deep features of RGB images and depth images within a Bayesian framework. The main contributions of our approach consist of two aspects: (1) to estimate saliency from depth cues, we propose the creation of a depth feature based on a convolutional neural network (CNN) trained by supervision transfer, and (2) by assuming that the deep features of RGB images and depth images are conditionally independent given the classes, the discriminative mixed-membership naive Bayes (DMNB)[14] model is used to calculate the final saliency map by applying Bayes’ theorem.
2 Related work
In this section, we provide a brief survey and review of RGB-D saliency detection methods. These methods all contain a stage in which 2D saliency features are extracted. However, depending on the way in which they use depth information in terms of developing computational models, these models can be classified into three different categories:
Depth-weighting models. This type of model adopts depth information to weight a 2D saliency map to calculate the final saliency map for RGB-D images with feature map fusion [15–18]. Fang et al. proposed a novel 3D saliency detection framework based on colour, luminance, texture, and depth contrast features, and they designed a new fusion method to combine the feature maps to obtain the final saliency map for RGB-D images [15]. In [16], colour contrast features and depth contrast features were calculated to construct an effective multi-feature fusion to generate saliency maps, and multi-scale enhancement was performed on the saliency map to further improve the detection precision, focusing on 3D salient object detection. Ciptadi et al. proposed a novel computational model of visual saliency that incorporates depth information and demonstrated the method by explicitly constructing a 3D layout and shape features from depth measurements [17]. Iatsun et al. proposed a 3D saliency model that relies on 2D saliency features jointly with depth obtained from monocular cues, on the grounds that 3D perception is significantly based on monocular cues [18]. The models in this category combine 2D features with a depth feature to calculate the final saliency map, but they do not include the depth saliency map in their computation processes.
Depth-pooling models. This type of model simply combines depth saliency maps and traditional 2D saliency maps to obtain saliency maps for RGB-D images [19–22]. Peng et al. provided a simple fusion framework that combines existing RGB-produced saliency with new depth-induced saliency: the former is estimated from existing RGB models, whereas the latter is based on the multi-contextual contrast model [19]. Ren et al. presented a two-stage 3D salient object detection framework, which first integrates the contrast region with the background, depth and orientation priors to achieve a saliency map and then reconstructs the saliency map globally [20]. Xue et al. proposed an effective visual object saliency detection model via RGB and depth cues with mutually guided manifold ranking and obtained the final result by fusing RGB and depth saliency maps [21]. Wang et al. proposed two different ways to integrate depth information in the modelling of 3D visual attention, where the measures of depth saliency are derived from the eye movement data obtained from an eye tracking experiment using synthetic stimuli [22]. The models in this category rely on the existence of “depth saliency maps”. These depth saliency maps are finally combined with 2D saliency maps using a saliency map pooling strategy to obtain the final 3D saliency map.
Learning-based models. Rather than using a depth saliency map directly, this type of model uses machine learning techniques to construct a 3D saliency detection model for RGB-D images based on extracted 2D features and depth features [23–26]. Inspired by the recent success of machine learning techniques in constructing 2D saliency detection models, Fang et al. proposed a learning-based model for RGB-D images using a linear SVM [23]. Zhu et al. proposed a learning-based approach for extracting saliency from RGB-D images, in which discriminative features can be automatically selected by learning several decision trees based on the ground truth, and those features are further utilized to search the saliency regions via the predictions of the trees [24]. Bertasius et al. developed an EgoObject Representation, which encodes these characteristics by incorporating shape, location, size, and depth features from an egocentric RGB-D image, and trained a random forest regressor to predict the saliency of a region using the ground-truth salient object [25]. Qu et al. designed a new CNN to fuse different low-level saliency cues into hierarchical features for automatically detecting salient objects in RGB-D images [26].
Most existing approaches for 3D saliency detection either treat the depth feature as an indicator to weight the RGB saliency map [15–18] or consider the 3D saliency map as the fusion of saliency maps of these low-level features [19–22]. It is not clear how to integrate 2D saliency features with depth-induced saliency feature in a better way, and linearly combining the saliency maps produced by these features cannot guarantee better results. Some other more complex combination algorithms have been proposed. These methods combine the depth-induced saliency map with the 2D saliency map either directly [19] or in a hierarchical way to calculate the final RGB-D saliency map [20]. However, because they are restricted by the computed saliency values, these saliency map combination methods are not able to correct incorrectly estimated salient regions. From the above description, the key to 3D saliency detection models is determining how to integrate the depth cues with traditional 2D low-level features.
In this paper, we focus on how to integrate RGB information and the additional depth information for RGB-D saliency detection. Integration at the saliency-map level, as in the combination methods discussed above, is not optimal because it is restricted by the already-determined saliency values. In contrast, we incorporate colour and depth cues at the feature level within a Bayesian framework.
3 The proposed approach
First, we train a CNN model for depth images by teaching the network to reproduce the mid-level semantic representation learned from RGB images for which there are paired images. Then, deep features of the RGB and depth images are extracted by a CNN.
Second, the class-conditional mutual information (CMI) is computed to measure the dependence of the deep features of the RGB and depth images; then, the posterior probability of the RGB-D saliency is formulated by applying Bayes’ theorem. These two features complement each other in detecting 3D saliency cues from different perspectives and, when combined, yield the final 3D saliency value. By assuming that deep features are Gaussian distributions, the parameters of the Gaussian distribution can be estimated in the DMNB model using a variational inference-based expectation maximization (EM) algorithm.
3.1 Feature extraction using CNN
Most existing saliency detection methods focus on how to design low-level saliency cues or model background priors. Low-level saliency cues alone do not produce good saliency detection results, particularly when salient objects are present in a low-contrast background with confusing visual cues. In such cases, objects cannot be separated from the low-contrast background using either low-level saliency cues or background priors, but they are semantically salient in high-level cognition because they belong to distinct object categories. Owing to its capability of learning high-level semantic features, a CNN is effective for estimating the saliency maps of images and has been used for saliency detection [27, 28]. A CNN is able to generate representative and discriminative hyper-features, which avoids hand-designing heuristic features for saliency.
3.1.1 Deep features of RGB image
Superpixel segmentation is first performed on RGB-D images [31], and the input of the global-context CNN is a superpixel-centred large context window that includes the full RGB image. Regions that exceed the image boundaries are padded with the mean pixel value of the RGB training dataset. The padded images are then warped to 227 × 227 × 3 as input, where the three dimensions represent width, height, and number of channels. With this normalization and padding scheme, the superpixel to be classified is always located at the centre of the RGB image, and the spatial distribution of the global context is normalized in this way. Moreover, it ensures that the input covers the entire range of the original RGB image. We refer readers to [27] for further details.
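The padding scheme described above can be illustrated with a minimal sketch. It assumes one particular reading of the text: the window is the smallest rectangle centred on the superpixel that still covers the whole image, and every position outside the image is filled with a single mean value. The function names `context_window` and `pad_to_window` are hypothetical, and the final warp to the network input size is omitted.

```python
def context_window(width, height, cx, cy):
    """Smallest window centred on (cx, cy) that covers the whole image.

    Returns the inclusive window bounds in image coordinates (which may be
    negative or exceed the image) and the padding needed on each side.
    This geometry is an assumption made for illustration.
    """
    half_w = max(cx, width - 1 - cx)
    half_h = max(cy, height - 1 - cy)
    left, right = cx - half_w, cx + half_w
    top, bottom = cy - half_h, cy + half_h
    pad = {
        "left": -left,
        "right": right - (width - 1),
        "top": -top,
        "bottom": bottom - (height - 1),
    }
    return (left, top, right, bottom), pad


def pad_to_window(image, cx, cy, fill):
    """Build the centred window, filling positions outside the image with `fill`."""
    height, width = len(image), len(image[0])
    (left, top, right, bottom), _ = context_window(width, height, cx, cy)
    return [
        [image[r][c] if 0 <= r < height and 0 <= c < width else fill
         for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]


# Toy single-channel image; `fill` stands in for the dataset mean pixel value.
toy = [[1, 2, 3], [4, 5, 6]]
window = pad_to_window(toy, cx=0, cy=0, fill=0)
```

In this toy case the superpixel at the top-left corner ends up at the centre of a 3 × 5 window, and the original image occupies the lower-right part of it.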
3.1.2 Deep features of depth image
We demonstrate how we transfer supervision from RGB images to depth images obtained from a range sensor, such as the Microsoft Kinect, for the downstream task of saliency detection. We consider the domain of RGB images as \(\mathcal {M}_{s}\), for which there is a large dataset of labelled images D_{ s }, and we treat depth images as \(\mathcal {M}_{d}\), with limited labelled data D_{ d }, for which we would like to train a rich representation for saliency detection. For our layered image representation models, we use CNNs with the network architecture from the Clarifai model.
We denote the deep features of the RGB image as a corresponding K layered rich representation \(\Phi =\{\phi ^{i}_{\mathcal {M}_{s}, D_{s}}, \forall i \in \, [\!1 \cdots K]\}\). \(\phi ^{i}_{\mathcal {M}_{s}, D_{s}}\) is the i^{ th } layer of the Clarifai model for modality \(\mathcal {M}_{s}\) that has been trained on labelled images from dataset D_{ s }. Now, we want to learn the deep features of depth images from modality \(\mathcal {M}_{d}\), for which we do not have access to a large dataset of labelled depth images. We have already hand-designed an appropriate CNN architecture \(\Psi =\{\psi ^{i}_{\mathcal {M}_{d}}, \forall i \in [1 \cdots L]\}\) from the Clarifai model. The task is then to effectively learn the parameters associated with various operations in the CNN architecture without having access to a large set of annotated images for modality \(\mathcal {M}_{d}\).
where U_{ s,d } denotes the NLPR dataset, which includes paired images from modalities \(\mathcal {M}_{s}\) and \(\mathcal {M}_{d}\). For the loss function f, we use the L_{2} distance between the feature vectors, \(f(\cdot)=||\cdot ||^{2}_{2}\). Then, the deep features of depth images are extracted by CNN Ψ.
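As a minimal sketch of the loss f described above, the following computes the squared L_{2} distance between an RGB feature vector and a depth feature vector and sums it over a set of paired images. The feature vectors are made-up numbers standing in for CNN outputs, and the function names are hypothetical; the sketch does not reproduce the training objective or the optimizer.

```python
def l2_transfer_loss(rgb_features, depth_features):
    """Squared L2 distance f(a - b) = ||a - b||_2^2 between two feature vectors."""
    if len(rgb_features) != len(depth_features):
        raise ValueError("feature vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(rgb_features, depth_features))


def total_transfer_loss(paired_features):
    """Sum the loss over all paired (RGB, depth) feature vectors."""
    return sum(l2_transfer_loss(rgb, depth) for rgb, depth in paired_features)


# Toy paired features (illustrative values only, not CNN outputs).
pairs = [([1.0, 2.0], [1.0, 0.0]), ([0.5, 0.5], [0.0, 0.5])]
loss = total_transfer_loss(pairs)
```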
3.2 Bayesian framework for saliency detection
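By Bayes’ theorem, the posterior probability of saliency given the observed features is \(p(\boldsymbol {z}_{s}|\boldsymbol {x}_{c},\boldsymbol {x}_{d})=\frac {p(\boldsymbol {z}_{s},\boldsymbol {x}_{c},\boldsymbol {x}_{d})}{p(\boldsymbol {x}_{c},\boldsymbol {x}_{d})}\), where p(z_{ s }|x_{ c },x_{ d }) is shorthand for the probability of predicting whether a pixel is salient, p(x_{ c },x_{ d }) is the likelihood of the observed deep features of RGB images and depth images, and p(z_{ s },x_{ c },x_{ d }) is the joint probability of the latent class and observed features, defined as p(z_{ s },x_{ c },x_{ d })=p(z_{ s })p(x_{ c },x_{ d }|z_{ s }).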
In this paper, the class-conditional mutual information (CMI) is used as a measure of the dependence between two features x_{ c } and x_{ d }, which can be defined as I(x_{ c },x_{ d }|z_{ s })=H(x_{ c }|z_{ s })+H(x_{ d }|z_{ s })−H(x_{ c },x_{ d }|z_{ s }), where H(x_{ c }|z_{ s }) is the class-conditional entropy of x_{ c }, defined as \(-\sum _{i} p(\boldsymbol {z}_{s}=i)\sum _{\boldsymbol {x}_{c}}p(\boldsymbol {x}_{c}|\boldsymbol {z}_{s}=i)\log p(\boldsymbol {x}_{c}|\boldsymbol {z}_{s}=i)\). Mutual information is zero when x_{ c } and x_{ d } are mutually independent given class z_{ s } and increases with increasing level of dependence, reaching the maximum when one feature is a deterministic function of the other. Indeed, the independence assumption becomes more accurate with decreasing entropy, which yields an asymptotically optimal performance of the naive Bayes classifier [32].
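The entropy decomposition above can be checked numerically with a small sketch. It assumes that the features have been quantized into discrete symbols and that probabilities are estimated by simple counting; how the CMI is estimated for continuous deep features is not specified here, so this is only an illustration of the definition. The function names are hypothetical.

```python
import math
from collections import Counter


def conditional_entropy(samples):
    """H(X | Z) in nats, estimated by counting from a list of (x, z) pairs."""
    n = len(samples)
    joint = Counter(samples)                       # counts of (x, z)
    class_counts = Counter(z for _, z in samples)  # counts of z
    # H(X|Z) = -sum_{x,z} p(x, z) log p(x | z)
    return -sum((c / n) * math.log(c / class_counts[z])
                for (_, z), c in joint.items())


def class_conditional_mi(xc, xd, z):
    """I(x_c, x_d | z) = H(x_c | z) + H(x_d | z) - H(x_c, x_d | z)."""
    h_c = conditional_entropy(list(zip(xc, z)))
    h_d = conditional_entropy(list(zip(xd, z)))
    h_cd = conditional_entropy(list(zip(zip(xc, xd), z)))
    return h_c + h_d - h_cd


# Independent features within a single class: CMI is (numerically) zero.
cmi_independent = class_conditional_mi([0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0])
# One feature is a copy of the other: CMI equals the feature entropy, log 2.
cmi_dependent = class_conditional_mi([0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0])
```

The two toy cases correspond to the extremes mentioned in the text: zero for conditionally independent features and the maximum when one feature is a deterministic function of the other.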
The class-conditional mutual information between the deep features of RGB images and depth images on the NLPR dataset is visualized in Fig. 5. We employ a CMI threshold τ to discover feature dependencies. When the CMI between the deep features of RGB images and depth images is less than τ, we assume that x_{ c } and x_{ d } are conditionally independent given the class z_{ s }, that is, p(x_{ c },x_{ d }|z_{ s })=p(x_{ c }|z_{ s })p(x_{ d }|z_{ s }). This entails the assumption that, within each class, the distribution of the deep features of RGB images does not change with the deep features of depth images. Thus, the pixel-wise saliency posterior is given by p(z_{ s }|x_{ c },x_{ d })∝p(z_{ s })p(x_{ c }|z_{ s })p(x_{ d }|z_{ s }).
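A minimal sketch of the factorized posterior follows. It assumes scalar features and a single Gaussian per class and per feature, which is a simplification for illustration only: the model in this paper uses a mixture with C components, and the parameter values below are made up. The function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def saliency_posterior(x_c, x_d, prior_salient, params):
    """p(z_s = 1 | x_c, x_d) under conditional independence of x_c and x_d.

    params[z] = ((mean_c, var_c), (mean_d, var_d)) for z in {0, 1}.
    """
    priors = {0: 1.0 - prior_salient, 1: prior_salient}
    scores = {}
    for z in (0, 1):
        (mean_c, var_c), (mean_d, var_d) = params[z]
        # p(z) p(x_c | z) p(x_d | z), the unnormalized posterior
        scores[z] = (priors[z]
                     * gaussian_pdf(x_c, mean_c, var_c)
                     * gaussian_pdf(x_d, mean_d, var_d))
    return scores[1] / (scores[0] + scores[1])


# Illustrative parameters: non-salient class centred at 0, salient class at 1.
toy_params = {0: ((0.0, 1.0), (0.0, 1.0)), 1: ((1.0, 1.0), (1.0, 1.0))}
p_mid = saliency_posterior(0.5, 0.5, 0.5, toy_params)
p_high = saliency_posterior(1.0, 1.0, 0.5, toy_params)
```

Dividing by the sum of the two class scores plays the role of the normalizing term p(x_{ c },x_{ d }).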
3.3 Generative model for saliency estimation
where θ is the prior distribution over C components, \(\Omega =\{(\mu _{jk},\sigma _{jk}^{2}),[j]_{1}^{N},[k]_{1}^{C}\}\) are the parameters for the distributions of N features, and \(p(\boldsymbol {x}_{j}|\boldsymbol {z}_{j},\Omega _{j})\triangleq \mathcal {N}(\boldsymbol {x}_{j}|\mu _{jk},\sigma _{jk}^{2})\). In two-class classification, y is either 0 or 1 generated from Bern(y|η). Because the DMNB model assumes a generative process for both the labels and features, we use both \(\mathcal {X}=\{(\boldsymbol {x}_{ij}),[i]_{1}^{\mathcal {M}},[j]_{1}^{N}\}\) and \(\mathcal {Y}=\{\boldsymbol {y}_{i},[i]_{1}^{\mathcal {M}}\}\) as a collection of \(\mathcal {M}\) superpixels in trained images from the generative process to estimate the parameters of the DMNB model such that the likelihood of observing \((\mathcal {X},\mathcal {Y})\) is maximized. In practice, we may find a proper C using the Dirichlet process mixture model (DPMM)[33]. The DPMM thus provides a nonparametric prior for the parameters of a mixture model that allows the number of mixture components to increase as the training set increases, as shown in Fig. 6.
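To make the roles of θ, Ω and η concrete, the sketch below draws one sample from a generative process of this kind. It rests on several assumptions that are not spelled out in the text above: θ is drawn from a Dirichlet distribution with parameter α, each feature picks its own component from θ, and the Bernoulli probability of the label is a logistic function of η applied to the average component assignment. The function name `sample_dmnb` and all parameter values are hypothetical.

```python
import math
import random


def sample_dmnb(alpha, mu, var, eta, rng):
    """Draw (theta, z, x, y) once; mu[j][k] and var[j][k] index feature j, component k."""
    n_components, n_features = len(alpha), len(mu)
    # theta ~ Dirichlet(alpha), obtained by normalizing independent Gamma draws
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    theta = [g / total for g in gammas]
    z, x = [], []
    for j in range(n_features):
        # each feature chooses a component, then emits a Gaussian value
        k = rng.choices(range(n_components), weights=theta)[0]
        z.append(k)
        x.append(rng.gauss(mu[j][k], math.sqrt(var[j][k])))
    # average assignment over features, then a logistic link for the label (assumed)
    z_bar = [z.count(k) / n_features for k in range(n_components)]
    score = sum(e * zb for e, zb in zip(eta, z_bar))
    p_salient = 1.0 / (1.0 + math.exp(-score))
    y = 1 if rng.random() < p_salient else 0
    return theta, z, x, y


rng = random.Random(0)
theta, z, x, y = sample_dmnb(
    alpha=[1.0, 1.0],
    mu=[[0.0, 1.0], [0.0, 1.0]],
    var=[[0.01, 0.01], [0.01, 0.01]],
    eta=[-1.0, 1.0],
    rng=rng,
)
```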
The variational parameters (γ^{∗},ϕ^{∗},ξ^{∗}) from the inference step provide the optimal lower bound to the log-likelihood of (x_{ i },y_{ i }), and maximizing the aggregate lower bound \(\sum _{i=1}^{\mathcal {M}}\mathcal {L}(\gamma ^{*},\phi ^{*},\xi ^{*},\alpha,\Omega,\eta)\) over all data points with respect to α, Ω and η, respectively, yields the estimated parameters. For μ, σ^{2} and η, we have \(\mu _{jk}=\frac {\sum _{i=1}^{\mathcal {M}}\phi _{ik}\boldsymbol {x}_{ij}}{\sum _{i=1}^{\mathcal {M}}\phi _{ik}}\), \(\sigma _{jk}^{2}=\frac {\sum _{i=1}^{\mathcal {M}}\phi _{ik}(\boldsymbol {x}_{ij}-\mu _{jk})^{2}}{\sum _{i=1}^{\mathcal {M}}\phi _{ik}}\), and \(\eta _{k}=\log (\frac {\sum _{i=1}^{\mathcal {M}}\phi _{ik}\boldsymbol {y}_{i}}{\sum _{i=1}^{\mathcal {M}}\frac {\phi _{ik}}{\xi _{i}}})\).
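The three closed-form updates above translate directly into code. The following sketch evaluates them for given variational parameters ϕ and ξ; it does not implement the inference step that produces those parameters, and the toy values are made up. It assumes binary labels and that every component receives some positive-label weight, since otherwise the logarithm is undefined. The function name `m_step` is hypothetical.

```python
import math


def m_step(x, y, phi, xi):
    """Evaluate the updates for mu, the Gaussian spread term and eta.

    x[i][j]: feature j of superpixel i; y[i]: binary label;
    phi[i][k]: weight of component k for superpixel i; xi[i] > 0.
    """
    m, n, c = len(x), len(x[0]), len(phi[0])
    mu = [[0.0] * c for _ in range(n)]
    spread = [[0.0] * c for _ in range(n)]
    eta = [0.0] * c
    for k in range(c):
        weight = sum(phi[i][k] for i in range(m))
        for j in range(n):
            # phi-weighted mean of feature j under component k
            mu[j][k] = sum(phi[i][k] * x[i][j] for i in range(m)) / weight
            # phi-weighted mean squared deviation from that mean
            spread[j][k] = sum(phi[i][k] * (x[i][j] - mu[j][k]) ** 2
                               for i in range(m)) / weight
        numerator = sum(phi[i][k] * y[i] for i in range(m))
        denominator = sum(phi[i][k] / xi[i] for i in range(m))
        eta[k] = math.log(numerator / denominator)  # requires numerator > 0
    return mu, spread, eta


# Toy data: four superpixels, one feature, two components with hard assignments.
toy_x = [[0.0], [2.0], [4.0], [6.0]]
toy_y = [0, 1, 1, 1]
toy_phi = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_xi = [1.0, 1.0, 1.0, 1.0]
mu, spread, eta = m_step(toy_x, toy_y, toy_phi, toy_xi)
```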
where \(\overline {\boldsymbol {z}}\) is an average of z_{1:N} over all of the observed features. The computation for \(\boldsymbol {E}[\overline {\boldsymbol {z}}]\) is intractable; therefore, we again introduce the distribution q(z_{1:N},θ) and calculate \(\boldsymbol {E}_{q}[\overline {\boldsymbol {z}}]\) as an approximation of \(\boldsymbol {E}[\overline {\boldsymbol {z}}]\). In particular, \(\boldsymbol {E}_{q}[\overline {\boldsymbol {z}}]=\phi \); therefore, we only need to compare η^{ T }ϕ with 0.
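The decision rule at the end of this paragraph, comparing η^{ T }ϕ with 0, can be sketched in a few lines. The values of η and ϕ are illustrative, and the function name is hypothetical.

```python
def predict_salient(eta, phi):
    """Return 1 (salient) if eta^T phi > 0, otherwise 0 (non-salient)."""
    score = sum(e * p for e, p in zip(eta, phi))
    return 1 if score > 0 else 0


# A superpixel dominated by the first component versus one dominated by the second.
label_a = predict_salient([-1.0, 2.0], [0.8, 0.2])  # score = -0.8 + 0.4
label_b = predict_salient([-1.0, 2.0], [0.2, 0.8])  # score = -0.2 + 1.6
```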
3.4 Experimental evaluation
3.5 Experimental setup
3.5.1 Evaluation datasets
Comparison of the benchmark and existing 3D saliency detection datasets
Name | Size | Object no. | Scene types | Centre bias |
---|---|---|---|---|
NLPR dataset | 1000 | one (mostly) | 11 | Yes |
NJU-DS400 dataset | 400 | one (mostly) | > 10 | Yes |
Colour-depth saliency, in which both IOU values of RGB images and depth images are more than the IOU threshold, defined as \(\mathcal {D}^{b}=\{\mathcal {I}^{b}_{c}, \mathcal {I}^{b}_{d} \}\), where \(\mathcal {I}^{b}_{c}\) and \(\mathcal {I}^{b}_{d}\) denote RGB images and depth images, respectively.
Colour saliency, in which only IOU values of RGB images are more than the IOU threshold and IOU values of depth images are less than the IOU threshold, defined as \(\mathcal {D}^{c}=\{\mathcal {I}^{c}_{c}, \mathcal {I}^{c}_{d} \}\), where \(\mathcal {I}^{c}_{c}\) and \(\mathcal {I}^{c}_{d}\) denote RGB images and depth images, respectively.
Depth saliency, in which only IOU values of depth images are more than the IOU threshold and IOU values of RGB images are less than the IOU threshold, defined as \(\mathcal {D}^{d}=\left \{\mathcal {I}^{d}_{c}, \mathcal {I}^{d}_{d} \right \}\), where \(\mathcal {I}^{d}_{c}\) and \(\mathcal {I}^{d}_{d}\) denote RGB images and depth images, respectively.
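The three situations listed above amount to a simple rule on two IOU values. The sketch below assumes that an IOU value has already been computed for the RGB image and for the depth image of each pair, and that the threshold is supplied by the caller, since its value is not given here. The `iou` helper, the function names, and the "unclassified" label for pairs in which neither value exceeds the threshold are all hypothetical additions.

```python
def iou(pred, truth):
    """Intersection over union of two flat binary masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    return inter / union if union else 0.0


def saliency_situation(iou_rgb, iou_depth, threshold):
    """Assign an RGB-D pair to one of the three situations."""
    rgb_ok = iou_rgb > threshold
    depth_ok = iou_depth > threshold
    if rgb_ok and depth_ok:
        return "colour-depth"
    if rgb_ok:
        return "colour"
    if depth_ok:
        return "depth"
    return "unclassified"  # case not covered by the three definitions


example_iou = iou([1, 1, 0, 0], [1, 0, 1, 0])
example_situation = saliency_situation(0.8, 0.3, threshold=0.5)
```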
3D saliency situation in terms of the NLPR dataset
Dataset | Colour-depth saliency | Colour saliency | Depth saliency |
---|---|---|---|
NLPR | 76.7% | 20.8% | 2.5% |
3.5.2 Evaluation metrics
There are currently no specific and standardized measures for computing the similarity between the fixation density maps and saliency maps created using computational models in 3D situations. Nevertheless, there is a range of different measures that are widely used to perform comparisons of saliency maps for 2D content. We introduce two types of measures to evaluate algorithm performance on the benchmark. The first one is the gold standard: F-measure. The second is the precision-recall (PR) curve. A continuous saliency map can be converted into a binary mask using a threshold, resulting in a pair of precision and recall values when the binary mask is compared against the ground truth. A PR curve is then obtained by varying the threshold from 0 to 1. The PR curve indicates the mean precision and recall of the saliency map at various thresholds.
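The thresholding procedure described above can be sketched as follows. The saliency map and the ground truth are flat lists for brevity. The weighting β² = 0.3 in the F-measure is an assumption borrowed from common practice in salient object detection and is not stated in this section; the convention of returning a precision of 1 when no pixel is predicted salient is likewise an arbitrary choice. All function names are hypothetical.

```python
def precision_recall(saliency, truth, threshold):
    """Binarize a flat saliency map at `threshold` and compare it with a binary mask."""
    pred = [s >= threshold for s in saliency]
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    n_pred = sum(pred)
    n_true = sum(1 for t in truth if t)
    precision = tp / n_pred if n_pred else 1.0  # convention for an empty prediction
    recall = tp / n_true if n_true else 1.0
    return precision, recall


def f_measure(precision, recall, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall (beta_sq is an assumed default)."""
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom else 0.0


def pr_curve(saliency, truth, steps=11):
    """Precision-recall pairs for thresholds spaced evenly from 0 to 1."""
    return [precision_recall(saliency, truth, i / (steps - 1)) for i in range(steps)]


toy_saliency = [0.9, 0.8, 0.3, 0.1]
toy_truth = [1, 1, 0, 0]
curve = pr_curve(toy_saliency, toy_truth)
p_half, r_half = precision_recall(toy_saliency, toy_truth, 0.5)
```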
3.5.3 Implementation details
We follow the default setup of the MC procedure from [27] for training the depth CNN using the Caffe CNN library [34]. For training the depth CNN using supervision transfer, we initialize the network by copying the weights from the RGB CNN [27], which was pre-trained on the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2014 [35] and fine-tuned for saliency detection on the MSRA10K dataset [36]. We set the base learning rate to 0.001, step it down by a factor of 10 every 1000 iterations, and fine-tune all the layers. We randomly select 600 depth images \(\mathcal {I}^{b}_{d}\) for training and 100 for validation from \(\mathcal {D}^{b}\). From each depth image, we select an average of 200 superpixels; in total, approximately 120 thousand input windows for training and 20 thousand for validation are generated. We label a patch as salient if 50% of the pixels in this patch are salient; otherwise, it is labelled as non-salient. Training the depth CNN for 10 thousand iterations takes 60 h without a GPU.
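Two of the settings above, the stepped learning rate and the patch labelling rule, are easy to state as code. The sketch below is independent of any deep learning library. It reads the labelling rule as "at least 50% of the pixels are salient", which is an assumption about the boundary case, and the function names are hypothetical.

```python
def step_learning_rate(iteration, base_lr=0.001, gamma=0.1, step_size=1000):
    """Base rate multiplied by gamma once for every completed block of step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)


def label_patch(mask_pixels, min_fraction=0.5):
    """Label a patch 1 (salient) if at least min_fraction of its pixels are salient."""
    return 1 if sum(mask_pixels) / len(mask_pixels) >= min_fraction else 0


# Window counts implied by the text: images times average superpixels per image.
train_windows = 600 * 200
val_windows = 100 * 200
```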
3.5.4 Parameter settings
The parameters and their settings in this paper
Name | Range | Description |
---|---|---|
τ | (0,1) | A CMI threshold |
α | (0, 40] | The parameter of a Dirichlet distribution |
θ | (0,1) | The parameter of a multinomial distribution |
η | (− 2.0,2.0) | The parameter of a Bernoulli distribution |
Ω | ((0,1),(0,0.2)) | The parameter of a Gaussian distribution |
C | > 2 | The number of components of DMNB |
We initialize the model parameters using all data points and their labels in the training set in Algorithm 1. In particular, we use the mean and standard deviation of the data points in each class to initialize Ω and the ratio of data points in different classes to initialize α_{ i }.
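A minimal sketch of this initialization, assuming a single scalar feature for brevity: the per-class mean and standard deviation initialize the Gaussian parameters, and the class proportions initialize α. Whether the proportions are rescaled before use is not stated, so they are returned as plain ratios here; the population standard deviation is also an arbitrary choice, and the function name is hypothetical.

```python
import statistics


def init_parameters(x, y):
    """Per-class (mean, standard deviation) and class ratios from labelled data."""
    omega, alpha = {}, {}
    for c in sorted(set(y)):
        values = [xi for xi, yi in zip(x, y) if yi == c]
        omega[c] = (statistics.mean(values), statistics.pstdev(values))
        alpha[c] = len(values) / len(x)
    return omega, alpha


# Toy training set: two non-salient and two salient data points.
omega0, alpha0 = init_parameters([0.0, 2.0, 4.0, 4.0], [0, 0, 1, 1])
```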
3.5.5 The effect of the parameters
3.5.6 Compared methods
We compare our saliency model (BFSD) with a number of existing state-of-the-art methods: graph-based manifold ranking (GMR)[7]; multi-context deep learning (MC)[27]; multiscale deep CNN (MDF)[28]; anisotropic centre-surround difference (ACSD)[10]; saliency detection at low-level, mid-level, and high-level stages (LMH)[19]; and exploiting global priors (GP)[20]. Among these, GMR, MC and MDF were developed for RGB images, LMH and GP for RGB-D images, and ACSD for depth images. All results are produced using the public code released by the authors of the respective methods.
3.6 Qualitative experiment
3.6.1 Colour-depth saliency
3.6.2 Colour saliency
3.6.3 Depth saliency
3.7 Quantitative evaluation
Comparison of the average run time (seconds) on the NLPR dataset
Dataset | ACSD | GMR | MC | MDF | LMH | GP | BFSD |
---|---|---|---|---|---|---|---|
NLPR | 0.2 s | 2.9 s | 72.7 s | 2.1×10^{3} s | 2.8 s | 38.9 s | 80.1 s |
Comparison of the F-measure on the NLPR dataset
3D saliency situation | ACSD | GMR | MC | MDF | LMH | GP | BFSD |
---|---|---|---|---|---|---|---|
Colour-depth saliency | 0.5548 | 0.6540 | 0.7381 | 0.6983 | 0.6109 | 0.6891 | 0.7793 |
Colour saliency | 0.5195 | 0.6612 | 0.6684 | 0.6630 | 0.5645 | 0.6480 | 0.7658 |
Depth saliency | 0.5635 | 0.7032 | 0.7711 | 0.7689 | 0.7744 | 0.8095 | 0.9044 |
Overall | 0.5510 | 0.6652 | 0.7366 | 0.7058 | 0.6317 | 0.7082 | 0.8092 |
As shown in Fig. 12c, in the depth saliency situation, the RGB saliency methods perform relatively worse than the RGB-D saliency methods in terms of precision. However, in the colour saliency situation, the ACSD and LMH methods perform well in neither precision nor recall. Although the simple late fusion strategy achieves improvements in the depth saliency situation, as shown in Fig. 12c, it still suffers from inconsistency in the homogeneous foreground regions in the colour saliency situation, as shown in Fig. 12b, which may be attributed to treating the appearance and depth correspondence cues independently. In the colour-depth saliency situation, owing to their capability of learning high-level semantic features, MC and MDF perform relatively better than the LMH and GP methods in terms of F-measure. Although the recall values are very similar, Fig. 12b, c show that the proposed method improves the precision and F-measure when compared to MC and MDF. Within a Bayesian framework, our approach detects the pixels on the dominant objects consistently and with higher accuracy, which resolves this inconsistency. Figure 12 shows that the proposed method performs favourably against the existing algorithms, with higher precision, recall, and F-measure scores on the NLPR dataset.
3.7.1 Supervision transfer vs fine-tuning
This section investigates the effectiveness of different depth CNN learning strategies. It has been demonstrated that fine-tuning a deep CNN model trained for image classification with data from the target task (e.g. object detection) can significantly improve performance on the target task [37]. Supervision transfer enables learning of rich representations from a large labelled modality as a supervisory signal for training representations for a new unlabelled paired modality and can be used as a pre-training procedure for new modalities with limited labelled data. However, the fine-tuning task and the supervision transfer task differ in the following aspects. (1) Input data. The fine-tuning task takes the labelled depth images as inputs, whereas the supervision transfer task requires paired RGB and depth images. Fine-tuning solves the problem of domain adaptation within the same modality. In contrast, supervision transfer here tackles the problem of domain adaptation across different modalities. (2) The adapted layer. The fine-tuning task adapts the last soft-max layer to data of the same modality, whereas supervision transfer can happen at an arbitrary internal layer for a new image modality. In particular, deep model structures at the fine-tuning stage differ only in the last fully connected layer for predicting labels. Supervision transfer here allows for transfer of supervision at arbitrary semantic levels. Because of the “data-hungry” nature of CNNs, the existing training data are insufficient for training a depth CNN directly; therefore, we employed supervision transfer to resolve this issue.
3.7.2 Fusion strategy comparison
3.8 Cross-dataset generalization
3.8.1 Failure cases
3.8.2 Limitations
Because our approach must be trained on large datasets to adapt to specific environments, properly tuning the parameters for a new task is important to the performance of the DMNB model. The DMNB model performs classification in one shot via a combination of mixed-membership models and logistic regression, and the results may depend on the choice of C. The learned parameters perform well on the specific stimuli used for training but not necessarily on a new test set. Thus, a weakness of the proposed method is that, to obtain reasonable performance, the saliency model must be trained on the training set for a specific C. This problem could be addressed by using Dirichlet process mixture models to find a proper C for new datasets.
4 Conclusion
In this study, we propose a learning-based 3D saliency detection model for RGB-D images that considers the deep features of RGB images and depth images within a Bayesian framework. To better detect semantically salient objects, we employ a deep CNN to model the saliency of objects in RGB images and depth images. Rather than simply combining a depth map with 2D saliency maps as in previous studies, we propose a computational saliency detection model for RGB-D images based on the DMNB model. The experiments verify that the deep features of depth images can serve as a helpful complement to the deep features of RGB images within a Bayesian framework. The experimental results on public RGB-D saliency datasets demonstrate that the proposed model outperforms other competing 3D models.
In future work, we plan to improve the feature representation of the depth images. We are considering representing the depth image by three channels (horizontal disparity, height above ground, and angle with gravity) [38] for saliency detection because this representation allows the CNN to learn stronger features than by using disparity alone. We are also considering the application of our 3D saliency detection model to RGB-D object detection problems, e.g. 3D object proposals.
Declarations
Acknowledgements
This work was supported in part by the Beijing Municipal special financial project (PXM2016_278215_000013, ZLXM_2017C010) and by the Innovation Group Plan of Beijing Academy of Science and Technology (IG201506C2).
Authors’ contributions
SW took charge of the system coding, doing experiments, data analysis and writing the whole paper excluding variational EM algorithm part at subsection 3.3. ZZ took charge of advisor position for paper presentation and experiment design. WJ took charge of data analysis presentation as well as English revising. HQ took charge of coding and writing for variational EM algorithm part at subsection 3.3. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
- Le Callet P, Niebur E (2013) Visual attention and applications in multimedia technology. Proc IEEE 101(9):2058–2067. https://doi.org/10.1109/JPROC.2013.2265801.
- Wang J, Fang Y, Narwaria M, Lin W, Le Callet P (2014) Stereoscopic image retargeting based on 3D saliency detection. In: The IEEE International Conference on Acoustics, Speech and Signal Processing, 669–673. IEEE, Florence.
- Kim H, Lee S, Bovik C (2014) Saliency prediction on stereoscopic videos. IEEE Trans Image Process 23(4):1476–1490. https://doi.org/10.1109/TIP.2014.2303640.
- Zhang Y, Jiang G, Yu M, Chen K (2010) Stereoscopic visual attention model for 3D video. In: The 16th International Conference on Multimedia Modeling, 314–324. Springer, Chongqing.
- Borji A, Cheng M, Hou Q, Jiang H, Li J (2017) Salient object detection: a survey. arXiv preprint arXiv:1411.5878.
- Borji A, Cheng M, Jiang H, Li J (2015) Salient object detection: a benchmark. IEEE Trans Image Process 24(12):5706–5722. https://doi.org/10.1109/TIP.2015.2487833.
- Yang C, Zhang L, Lu H, Ruan X, Yang M (2013) Saliency detection via graph based manifold ranking. In: The IEEE Conference on Computer Vision and Pattern Recognition, 3166–3173. IEEE, Portland.
- Peng H, Li B, Ling H, Hu W, Xiong W, Maybank SJ (2017) Salient object detection via structured matrix decomposition. IEEE Trans Pattern Anal Mach Intell 39(4):818–832. https://doi.org/10.1109/TPAMI.2016.2562626.
- Li B, Xiong W, Hu W (2012) Visual saliency map from tensor analysis. In: Proceedings of Twenty-Sixth AAAI Conference on Artificial Intelligence, 1585–1591. AAAI, Toronto.
- Ju R, Ge L, Geng W, Ren T, Wu G (2014) Depth saliency based on anisotropic centre-surround difference. In: IEEE International Conference Image Processing, 1115–1119. IEEE, Paris.
- Song H, Liu Z, Xie Y, Wu L, Huang M (2016) RGBD co-saliency detection via bagging-based clustering. IEEE Sig Process Lett 23(12):1722–1726. https://doi.org/10.1109/LSP.2016.2615293.
- Lang C, Nguyen T, Katti H, Yadati K, Kankanhalli M, Yan S (2012) Depth matters: influence of depth cues on visual saliency. In: The 12th European Conference Computer Vision, 101–105. Springer, Florence.
- Desingh K, Madhava K, Rajan D, Jawahar C (2013) Depth really matters: improving visual salient region detection with depth. In: The British Machine Vision Conference, 98.1–98.11. BMVA, Bristol.
- Shan H, Banerjee A, Oza N (2009) Discriminative mixed-membership models. In: IEEE International Conference Data Mining, 466–475. IEEE, Miami.
- Fang Y, Wang J, Narwaria M, Le Callet P, Lin W (2014) Saliency detection for stereoscopic images. IEEE Trans Image Process 23(6):2625–2636. https://doi.org/10.1109/TIP.2014.2305100.
- Wu P, Duan L, Kong L (2015) RGB-D salient object detection via feature fusion and multi-scale enhancement. In: Chinese Conference Computer Vision, 359–368. Springer, Xi’an.
- Ciptadi A, Hermans T, Rehg J (2013) An in depth view of saliency. In: The British Machine Vision Conference, 9–13. BMVA, Bristol.
- Iatsun I, Larabi M, Fernandez-Maloigne C (2014) Using monocular depth cues for modeling stereoscopic 3D saliency. In: IEEE International Conference Acoustics, Speech and Signal Processing, 589–593. IEEE, Florence.
- Peng H, Li B, Hu W, Ji R (2014) RGBD salient object detection: a benchmark and algorithms. In: The 13th European Conference Computer Vision, 92–109. Springer, Zurich.
- Ren J, Gong X, Yu L, Zhou W (2015) Exploiting global priors for RGB-D saliency detection. In: IEEE Conference Computer Vision and Pattern Recognition Workshops, 25–32. IEEE, Boston.
- Xue H, Gu Y, Li Y, Yang J (2015) RGB-D saliency detection via mutual guided manifold ranking. In: IEEE International Conference Image Processing, 666–670. IEEE, Quebec.
- Wang J, DaSilva M, Le Callet P, Ricordel V (2013) Computational model of stereoscopic 3D visual saliency. IEEE Trans Image Process 22(6):2151–2165. https://doi.org/10.1109/TIP.2013.2246176.
- Fang Y, Lin W, Fang Z, Lei J, Le Callet P, Yuan F (2014) Learning visual saliency for stereoscopic images. In: IEEE International Conference Multimedia and Expo Workshops, 1–6. IEEE, Chengdu.
- Zhu L, Cao Z, Fang Z, Xiao Y, Wu J, Deng H, Liu J (2015) Selective features for RGB-D saliency. In: Conference Chinese Automation Congress, 512–517. IEEE, Wuhan.
- Bertasius G, Park H, Shi J (2015) Exploiting egocentric object prior for 3D saliency detection. arXiv preprint arXiv:1511.02682.
- Qu L, He S, Zhang J, Tian J, Tang Y, Yang Q (2017) RGBD salient object detection via deep fusion. IEEE Trans Image Process 26(5):2274–2285. https://doi.org/10.1109/TIP.2017.2682981.
- Zhao R, Ouyang W, Li H, Wang X (2015) Saliency detection by multi-context deep learning. In: IEEE Conference Computer Vision and Pattern Recognition Workshops, 1265–1274. IEEE, Boston.
- Li G, Yu Y (2016) Visual saliency detection based on multiscale deep CNN features. arXiv preprint arXiv:1609.02077.
- Zeiler M, Fergus R (2014) Visualizing and understanding convolutional networks. In: The 13th European Conference Computer Vision, 818–833. Springer, Zurich.
- Gupta S, Hoffman J, Malik J (2015) Cross modal distillation for supervision transfer. arXiv preprint arXiv:1507.00448.
- Wang S, Zhou Z, Qu H, Li B (2016) Visual saliency detection for RGB-D images with generative model. In: The 13th Asian Conference on Computer Vision, 20–35. Springer, Taipei.
- Rish I (2001) An empirical study of the naive Bayes classifier. J Univ Comput Sci 3(22):41–46.
- Blei D, Jordan M (2006) Variational inference for Dirichlet process mixtures. Bayesian Anal 1(1):121–143.
- Jia Y (2013) Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/. Accessed 2013.
- Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Li F (2014) ImageNet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252.
- Cheng M, Mitra N, Huang X, Torr P, Hu S (2014) Global contrast based salient region detection. IEEE Trans Pattern Anal Mach Intell 37(3):569–582. https://doi.org/10.1109/TPAMI.2014.2345401.
- Girshick R, Donahue J, Darrell T, Malik J (2013) Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524.
- Gupta S, Girshick R, Arbelaez P, Malik J (2014) Learning rich features from RGB-D images for object detection and segmentation. In: The 13th European Conference Computer Vision, 345–360. Springer, Zurich.