Deep residual coalesced convolutional network for efficient semantic road segmentation
© The Author(s) 2017
Received: 20 February 2017
Accepted: 7 March 2017
Published: 15 March 2017
This paper proposes an efficient and compact deep learning-based solution for the road scene segmentation problem, named the deep residual coalesced convolutional network (RCC-Net). The RCC-Net first performs dimensionality reduction to compress the input and extract relevant features, which are then delivered to the encoder. The encoder adopts a residual network style to keep the model size small. At the core of each residual block, three different convolutional layers are coalesced in parallel to capture broader information. The decoder then upsamples the encoder output for pixel-wise mapping from the input images to the segmented output. Experimental results demonstrate the efficacy of the proposed network over state-of-the-art methods and its capability to be deployed on an average system.
Unlike traditional object detection and classification, which work globally on an image or a patch, scene segmentation is a pixel-wise classification that requires more accurate boundary localization of each object and area inside the image. For instance, in road scene segmentation, one needs to precisely separate the pedestrian sidewalk from the road body.
Semantic road scene segmentation, which is part of the general image segmentation problem, has attracted many researchers seeking the best solution. Early works mostly depend on pixel-wise hand-crafted features (e.g., ) followed by a conditional random field (e.g., [2, 3]), the use of a dense depth map , or the exploitation of spatio-temporal parsing  to achieve the best accuracy.
Since the rise of deep learning for object classification , several attempts have been made to design a deep network architecture for the image segmentation problem. Most of them follow the encoder-decoder architecture style (e.g., [7–9]). Another approach takes advantage of image patches and a spatial prior  to attain better scene segmentation. Except  which tries to build a small model size network, all of the above works suffer from either a very large network size or a slow inference time, which makes them inconvenient for practical applications.
Here, we aim to establish a compact and effective network for segmenting the road scene. Our approach is inspired by ResNet , which utilizes residual blocks that can be stacked into a very deep architecture without a severe degradation problem. At the heart of our proposed architecture, three different types of convolutional layers are coalesced in parallel in a residual fashion and stacked into an encoder-type network to vary the receptive field. Hence, a richer family of functions can be learned, yielding richer information from the images. Subsequently, a decoder with a smaller architecture, followed by a fully connected convolutional (Full Conv.) layer, is appended to upsample the encoder output and fine-tune the result.
Our contributions are twofold. First, we introduce a coalesced style of convolutional layers within a residual-flavored network to build an efficient model for semantic road segmentation. Second, we present an asymmetric encoder-decoder network that reduces the model size even further, unlike the conventional symmetric approach used by previous methods, e.g., SegNet .
The rest of this paper is organized as follows. Section 2 explains the overall architecture of the proposed RCC-Net. Evaluations against several state-of-the-art methods are described in Section 3. We then conclude the paper and give some future directions of the research in Section 4.
2 Proposed network architecture
2.1 Initial stage
2.2 Residual coalesced convolutional blocks
As the core of our network, we introduce the residual coalesced convolutional (RCC) block, which is inspired by the Inception  and ResNet  architectures. The RCC module is composed of a projection-receptive-projection sequence with a skip connection. The projection parts are realized by 1×1 convolutions, while the receptive section consists of three different convolutional layers coalesced together.
The 1×1 convolution is meant to aggregate the activations of each feature in the previous layer. It is also useful for running inference on inputs of different sizes. An ordinary, an asymmetric , and a dilated  convolution layer are then appended in parallel. This coalesced style is motivated by the assumption that each type of convolution layer contributes a different receptive field. By coalescing them, the block is expected to learn a wider class of functions, thus increasing the amount of feature information.
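To make the receptive-field argument concrete, the following minimal Python sketch compares the span covered by a dilated kernel and the weight count of an asymmetric convolution against a full square one. It rests on assumptions that are not stated in the text: the dilated branch is taken to use a 3×3 kernel, the asymmetric convolution is taken to be a k×1 convolution followed by a 1×k convolution, and the channel width of 64 is purely hypothetical. The function names are illustrative only.

```python
def effective_kernel(k: int, dilation: int) -> int:
    """Span, in input pixels, covered by a k-tap kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


def weights_full(k: int, c_in: int, c_out: int) -> int:
    """Weight count of one k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def weights_asymmetric(k: int, c_in: int, c_out: int) -> int:
    """Weight count of a k x 1 convolution followed by a 1 x k convolution.

    Assumption: this factorized form is what 'asymmetric' refers to.
    """
    return k * c_in * c_out + k * c_out * c_out


CHANNELS = 64  # hypothetical width, not taken from the paper

# Dilation factors between 2 and 32 (powers of two assumed for illustration).
for d in (2, 4, 8, 16, 32):
    print(f"3x3 kernel, dilation {d:2d}: spans {effective_kernel(3, d)} pixels")

# Asymmetric kernel sizes 5 and 7, compared with a full k x k kernel.
for k in (5, 7):
    full = weights_full(k, CHANNELS, CHANNELS)
    asym = weights_asymmetric(k, CHANNELS, CHANNELS)
    print(f"k={k}: full {full} weights, asymmetric {asym} weights")
```

Under these assumptions, a 3×3 kernel with dilation 32 spans 65 input pixels while still using nine weights per channel pair, and a factorized 5-tap convolution needs 40960 weights instead of 102400, which illustrates why the three branches see different context at modest cost.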
It is also interesting to investigate the proper way to combine the convolutional layers. In the experimental section, we show how changing the combination, by either summing or concatenating the layer outputs, affects the results of the entire network.
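The two combination strategies differ mainly in the shape of what they hand to the next layer. The toy Python sketch below treats each branch output at one pixel as a short list of channel values; the numbers and helper names are hypothetical and only illustrate the bookkeeping, not the trained network.

```python
from typing import List

Vector = List[float]


def coalesce_sum(branches: List[Vector]) -> Vector:
    """Element-wise sum: the output keeps the channel count of a single branch."""
    return [sum(values) for values in zip(*branches)]


def coalesce_concat(branches: List[Vector]) -> Vector:
    """Channel concatenation: the output channel count is the sum over branches."""
    return [value for branch in branches for value in branch]


# Hypothetical two-channel outputs of the three parallel branches at one pixel.
ordinary = [1.0, 2.0]
asymmetric = [0.5, -1.0]
dilated = [2.0, 0.0]

summed = coalesce_sum([ordinary, asymmetric, dilated])
stacked = coalesce_concat([ordinary, asymmetric, dilated])

print(summed)        # [3.5, 1.0]
print(len(stacked))  # 6
```

Summation keeps the channel count fixed, so the following 1×1 projection is unchanged, whereas concatenation triples the number of input channels that the projection has to mix.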
The entire encoder contains three stages, where each stage is made of five RCC modules. The ordinary convolution uses a 3×3 kernel. The dilation factor of the dilated convolutions ranges from 2 to 32, while the asymmetric kernel sizes are set to 5 and 7. Between the convolutional operations inside the RCC modules, a parametric rectified linear unit (PReLU) activation layer and a batch normalization layer are added. We then place a dropout layer at the end of each RCC module for regularization. A skip connection imitating ResNet  is attached to each RCC module. A max-pooling layer is appended between stages to downsample the feature maps.
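Two of the ingredients named above, the PReLU activation and the ResNet-style skip connection, are simple enough to write out. The sketch below is a scalar toy version in plain Python: the PReLU slope is fixed rather than learned, and `toy_branch` is a hypothetical stand-in for the projection-receptive-projection path, not the actual RCC computation.

```python
from typing import Callable, List

Vector = List[float]


def prelu(x: float, slope: float) -> float:
    """Parametric ReLU: identity for positive inputs, slope * x otherwise.

    In a real network the slope is a learned parameter; it is fixed here.
    """
    return x if x > 0 else slope * x


def residual(x: Vector, branch: Callable[[Vector], Vector]) -> Vector:
    """Skip connection: add the block input to the output of the branch."""
    return [xi + fi for xi, fi in zip(x, branch(x))]


def toy_branch(x: Vector) -> Vector:
    """Hypothetical placeholder for the convolutional path of one module."""
    return [prelu(0.5 * xi, slope=0.25) for xi in x]


out = residual([2.0, -4.0], toy_branch)
print(out)  # [3.0, -4.5]
```

Because the input is added back unchanged, a module whose branch outputs zeros reduces to the identity, which is the property that lets many such modules be stacked.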
The decoder is constructed by stacking the same RCC modules as the encoder, except that the coalesced convolutional part is replaced by a deconvolutional layer and the number of stages is decreased. This setting is motivated by , where pixel recognition should be done mostly by the encoder. The task of the decoder is merely to upsample the output of the encoder and adjust the details. A fully connected convolutional (Full Conv.) layer is then appended after the decoder to perform the pixel-wise mapping.
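The interplay between downsampling in the encoder and upsampling in the decoder can be checked with the standard output-size formulas for pooling and transposed convolution. The sketch below is only an illustration: the 2×2 stride-2 pooling, the deconvolution hyperparameters (kernel 3, stride 2, padding 1, output padding 1), and the choice of two pooling and two upsampling steps are all assumptions, since the text does not give these values, and the initial dimensionality-reduction stage is ignored.

```python
def pool_out(size: int, kernel: int = 2, stride: int = 2) -> int:
    """Output length of an unpadded max-pooling layer along one dimension."""
    return (size - kernel) // stride + 1


def deconv_out(size: int, kernel: int = 3, stride: int = 2,
               padding: int = 1, output_padding: int = 1) -> int:
    """Output length of a transposed convolution along one dimension."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


height, width = 360, 480  # image size reported in the timing experiments

# Assumed: one pooling layer between each pair of the three encoder stages.
for _ in range(2):
    height, width = pool_out(height), pool_out(width)
print("encoder output:", height, width)  # encoder output: 90 120

# Assumed: two stride-2 deconvolutions in the decoder to undo the pooling.
for _ in range(2):
    height, width = deconv_out(height), deconv_out(width)
print("decoder output:", height, width)  # decoder output: 360 480
```

With these assumed settings each deconvolution exactly doubles the spatial size, so the decoder returns to the input resolution needed for pixel-wise labeling.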
Configuration of RCC-Net
3 Results and discussions
In this section, the efficacy of our proposed architecture is demonstrated against several state-of-the-art methods on road scene segmentation problems. The proposed algorithm was implemented on a Linux PC (Ubuntu 16.04, Core i7, 32 GB RAM) with a GTX 1080 GPU and Torch7. Training was performed using Adam optimization  for 200 epochs with learning rate 10e-3, momentum 0.9, and batch size 8.
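For readers unfamiliar with Adam, the update rule can be sketched in a few lines of plain Python. This is a scalar toy, not the Torch7 training code: the stated momentum of 0.9 is interpreted as Adam's first-moment decay `beta1`, the learning rate is taken literally as `10e-3` (that is, 0.01), `beta2` and `eps` are the defaults from the Adam paper, and 200 update steps on a quadratic stand in for the 200 training epochs.

```python
import math
from typing import Callable


def adam_minimize(grad: Callable[[float], float], x0: float,
                  lr: float = 10e-3, beta1: float = 0.9,
                  beta2: float = 0.999, eps: float = 1e-8,
                  steps: int = 200) -> float:
    """Minimize a scalar function with Adam, given its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g      # first-moment (momentum) estimate
        v = beta2 * v + (1 - beta2) * g * g  # second-moment estimate
        m_hat = m / (1 - beta1 ** t)         # bias correction for m
        v_hat = v / (1 - beta2 ** t)         # bias correction for v
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3); minimum at x = 3.
x_final = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=2.5)
print(round(x_final, 3))
```

Each step is roughly bounded by the learning rate regardless of the gradient scale, which is one reason Adam is a common default for training segmentation networks.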
3.1 CamVid dataset benchmark
Table 2 Comparison on the CamVid dataset  using 11 road scene categories (in percent). Methods listed: local label descriptor, boosting + pairwise CRF, dense depth map, super parsing.
As shown in Table 2, the proposed RCC-Net (concatenated version) exceeds the existing state-of-the-art methods in four class categories and in the overall class average accuracy. Three of the four winning categories are small areas and objects with less training data, which suggests that our proposed method is capable of capturing objects that are difficult to segment. The best class average accuracy and a comparable intersection-over-union (IoU) imply that the RCC-Net achieves consistently good results across categories.
As noted in the previous section, it is intriguing to examine different ways of coalescing the convolutional layers. Both the summing and the concatenated versions of the RCC-Net surpass the other methods. Overall, the concatenated version has advantages over the summing one. One interesting result, however, is that the summing version achieves the highest accuracy for pedestrian segmentation (70.6%). This may lead to promising applications in future research, e.g., determining the salient regions for pedestrian detection.
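The class average accuracy and IoU mentioned here are usually derived from a pixel-level confusion matrix. The sketch below uses the conventional definitions, which the paper is assumed (not confirmed) to follow, on a made-up two-class matrix; the counts are hypothetical and are not results from the CamVid benchmark.

```python
from typing import List

Matrix = List[List[int]]


def per_class_accuracy(conf: Matrix) -> List[float]:
    """conf[i][j] counts pixels of true class i predicted as class j."""
    return [row[i] / sum(row) for i, row in enumerate(conf)]


def per_class_iou(conf: Matrix) -> List[float]:
    """Intersection-over-union per class: TP / (TP + FP + FN)."""
    n = len(conf)
    ious = []
    for i in range(n):
        tp = conf[i][i]
        fn = sum(conf[i]) - tp
        fp = sum(conf[r][i] for r in range(n)) - tp
        ious.append(tp / (tp + fp + fn))
    return ious


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Hypothetical counts: class 0 is large (e.g. road), class 1 is small.
conf = [[90, 10],
        [5, 15]]

class_avg = mean(per_class_accuracy(conf))             # (0.90 + 0.75) / 2
global_acc = (90 + 15) / sum(sum(r) for r in conf)     # 105 / 120
mean_iou = mean(per_class_iou(conf))                   # (90/105 + 15/30) / 2

print(round(class_avg, 3), round(global_acc, 3), round(mean_iou, 3))
```

In this toy case the global accuracy (0.875) is higher than the class average (0.825) because the large class dominates the pixel count, which is why a high class average is read as evidence of consistent performance on small categories. The helpers assume every class appears at least once; an absent class would need a guard against division by zero.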
3.2 Test on wild scene
3.3 Computation time and model size
On a GTX 1080, the RCC-Net took 25.5 ms for the forward inference of a 480×360 image, including fetching and displaying the image. It is also able to run one inference on a car-deployable mini PC, the Zotac EN-761, in 67.5 ms with a network size of 4.9 MB, drawing around 62.4 W of power. This means the proposed network is fast and small enough to be used in an Advanced Driver Assistance System (ADAS). We plan to run the network on a GPU-based embedded system, such as the NVIDIA Jetson TK1, for further investigation (see footnote 1).
An efficient and compact solution for the semantic road segmentation problem has been presented. By coalescing different types of convolutional layers and stacking them in a deep residual network style, we achieve high-quality results on semantic road segmentation with a relatively small model size, surpassing the existing state-of-the-art methods. In the future, we would like to examine the performance of our RCC-Net on broader problems, such as medical images and other challenging image segmentation datasets, to understand its capability to solve more general segmentation applications.
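The reported latencies translate directly into frame rates and a rough energy cost per frame. The short Python calculation below assumes frames are processed back to back and that the 62.4 W figure is the sustained draw during inference; both are simplifications not stated in the text.

```python
def frames_per_second(latency_ms: float) -> float:
    """Throughput if one frame is processed every latency_ms milliseconds."""
    return 1000.0 / latency_ms


def energy_per_frame_joules(power_watt: float, latency_ms: float) -> float:
    """Energy used for one inference, assuming constant power draw."""
    return power_watt * latency_ms / 1000.0


gpu_fps = frames_per_second(25.5)                  # GTX 1080
mini_pc_fps = frames_per_second(67.5)              # Zotac EN-761
energy = energy_per_frame_joules(62.4, 67.5)       # mini PC, per frame

print(f"GTX 1080: about {gpu_fps:.1f} frames/s")
print(f"Mini PC:  about {mini_pc_fps:.1f} frames/s")
print(f"Mini PC:  about {energy:.2f} J per frame")
```

Under these assumptions the desktop GPU handles roughly 39 frames per second and the mini PC roughly 15, at a little over 4 J per frame.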
1 The progress of RCC-Net performance on the embedded system can be seen at http://te.ugm.ac.id/~igi/?page_id=826
This work was supported by AUN-SEED Net Collaborative Research for Alumni (CRA) Project.
IA performed the primary development and analysis for this work and the initial drafting of the manuscript. TBA played an essential role in development of this work and editing the paper. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Yang Y, Li Z, Zhang L, Murphy C, Hoeve JV, Jiang H (2012) Local label descriptor for example based semantic image labeling. In: Proc. of European Conference on Computer Vision (ECCV), 361–375.
- Sturgess P, Alahari K, Ladicky L, Torr PHS (2009) Combining appearance and structure from motion features for road scene understanding. In: Proc. of British Machine Vision Conference (BMVC).
- Ladicky L, Sturgess P, Alahari K, Russell C, Torr PHS (2010) What, where and how many? Combining object detectors and CRFs. In: Proc. of European Conference on Computer Vision (ECCV), 424–437.
- Zhang C, Wang L, Yang R (2010) Semantic segmentation of urban scenes using dense depth maps. In: Proc. of European Conference on Computer Vision (ECCV), 708–721.
- Tighe J, Lazebnik S (2013) Superparsing. Int J Comput Vision (IJCV) 101(2): 329–349.
- Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Proc. of NIPS, 1097–1105.
- Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proc. of IEEE Computer Vision and Pattern Recognition (CVPR), 3431–3440.
- Badrinarayanan V, Kendall A, Cipolla R (2015) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. arXiv: 1511.00561.
- Paszke A, Chaurasia A, Kim S, Culurciello E (2016) Enet: a deep neural network architecture for real-time semantic segmentation. arXiv: 1606.02147v1.
- Brust CA, Sickert S, Simon M, Rodner E, Denzler J (2015) Convolutional patch networks with spatial prior for road detection and urban scene understanding. In: Proc. of VISAPP.
- He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition. arXiv: 1512.03385.
- Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2015) Rethinking the inception architecture for computer vision. arXiv: 1512.00567.
- Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proc. of IEEE Computer Vision and Pattern Recognition (CVPR), 1–9.
- Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv: 1511.07122.
- Kingma D, Ba J (2014) Adam: a method for stochastic optimization. arXiv: 1412.6980.
- Brostow GJ, Shotton J, Fauqueur J, Cipolla R (2008) Segmentation and recognition using structure from motion point clouds. In: Proc. of European Conference on Computer Vision (ECCV), 44–57.