Combining deep features for object detection at various scales: finding small birds in landscape images
© The Author(s) 2016
Received: 21 April 2016
Accepted: 23 June 2016
Published: 2 August 2016
Demand for automatic bird ecology investigation is rising rapidly along with the widespread installation of wind energy plants, whose adverse environmental effects must be estimated. While deep convolutional neural networks (CNNs) have brought significant advances in general image recognition, automatically recognizing birds at small scales together with large background regions is still an open problem in computer vision. To tackle object detection at various scales, we combine a deep detector with semantic segmentation methods; namely, we train a deep CNN detector, fully convolutional networks (FCNs), and a variant of FCNs, and integrate their results with support vector machines to achieve high detection performance. Through experiments on a bird image dataset, we show the effectiveness of the method for scale-aware object detection.
Wind turbines, one of the mainstream technologies for harnessing renewable energy, are at the same time considered a serious threat to endangered bird species. Operators are now required to assess bird habitats around planned sites, yet such surveys rely on experts conducting manual observations. Automatic bird detection has hence drawn the attention of industry, as it can reduce the cost and increase the accuracy of these investigations. It may also assist automatic systems that decelerate the blades or sound an alarm when birds approach.
When conducting bird surveillance with fixed-point cameras, however, three issues arise relating to resolution and precision.
First, finding objects at various scales in large images is known to be a difficult problem because of the large differences in resolution. Second, images from surveillance cameras have different characteristics from those in general image recognition datasets, as objects captured by wide-field-of-view cameras are often ambiguous due to low resolution.
Finally, the number of flying birds is irregular and many scenes contain no birds at all; thus, for practical use, the detector must produce as few false detections of the background as possible.
To solve these problems, this paper presents a scale-aware bird detection method with practically high precision. Following the idea of scene parsing (e.g., ), we carefully select a combination of methods, each of which is suited to objects at a different scale; specifically, a successor  of convolutional neural networks (CNNs)  for small birds and two kinds of fully convolutional networks (FCNs) for larger areas: the original FCNs  and DeepLab . FCN-based methods can recognize both birds and backgrounds, with FCNs better suited to middle-sized birds and DeepLab better at backgrounds. Linear SVMs  are used to merge all the features into the final results. This paper is based on our previous work  but is improved so that the features in all the selected methods are based on deep learning.
The proposed method was experimentally evaluated on a bird dataset constructed specifically for ecological investigations around wind farms, showing that combining deep features from a detector and semantic segmentation is effective for scale-aware object detection. It achieved a precision of 97 % at a recall rate of 80 % in the bird detection task.
1.1 Related work
The advances in CNNs and the growing availability of large-scale image datasets have brought outstanding improvements in image recognition. In particular, stronger learning models [10, 11] as well as effective techniques for suppressing overfitting  and avoiding the vanishing gradient problem  have significantly improved the performance of CNNs.
Many new detection methods have been proposed along with the advances in CNNs. In popular region-based CNN methods (R-CNN) , selective search  is first used to identify potentially salient object regions (referred to as region proposals), from which image features are extracted by CNNs and classified by SVMs. We utilize ResNet , one of the most successful networks in detection, while we leave region proposals for future work and use background subtraction for candidate region selection in this study.
Significant progress has also been made in semantic segmentation. There has been much debate about how to parse both object categories (things) and background categories (stuff), which account for smaller and larger parts of images, respectively. Various methods parse stuff and things separately with region-based and detector-based methods [3, 16].
FCNs obtain a coarse object label map from the networks by combining the final prediction layer with lower layers (skip layers) , making both context and localization information available for pixel-wise labeling.
DeepLab uses the hole (atrous) algorithm , which spaces the kernel taps of a convolution so that only every other pixel is sampled. This samples the feature map more sparsely and enlarges the receptive field, which improves the ability to recognize background.
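The hole algorithm can be illustrated in NumPy as a convolution whose kernel taps are spaced `rate` pixels apart (the function name and looping implementation are our own sketch; DeepLab itself runs optimized GPU kernels):

```python
import numpy as np

def dilated_conv2d(img, kernel, rate=1):
    """Valid 2D convolution with kernel taps spaced `rate` pixels apart
    (the 'hole'/atrous trick). rate=1 reduces to ordinary convolution;
    rate=2 samples every other pixel, widening the receptive field."""
    k = kernel.shape[0]
    span = rate * (k - 1) + 1                 # effective receptive field
    H, W = img.shape
    out = np.zeros((H - span + 1, W - span + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = img[i:i + span:rate, j:j + span:rate]
            out[i, j] = (patch * kernel).sum()
    return out
```

With a 3×3 kernel, rate 2 covers a 5×5 area per output pixel without adding parameters, which is why the sparser sampling helps with large background regions.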
An input image is fed into three pipelines: (1) ResNet-based CNNs acting as a detector for small birds after background subtraction pre-processing, (2) FCNs, which work both as a detector and as semantic segmentation, and (3) DeepLab, which works as semantic segmentation. SVMs combine the class likelihoods and scores derived from the three pipelines. The output of the method is the set of regions estimated to be birds.
2.1 CNNs for bird detection
We designed the CNN network model using ResNet , which achieved the best results in the detection and classification tasks of ILSVRC 2015. In ResNet, the input of a convolutional (conv) layer bypasses one or more layers and is added to the outputs of the stacked layers. Compared with previous network structures, ResNet learns so-called residual mappings, which makes learning easier even with deeper structures.
The rest of the network follows ; here, we briefly explain it for completeness. In every conv layer, the kernel size is 3×3. The very first conv layer has 16 kernels. Subsequently, there are four blocks, each of which includes four (2n with n=2) conv layers. The number of kernels is 16, 32, 64, and 128 in the respective blocks. When the dimensions increase across a shortcut connection, we use a 1×1 convolution with a stride of 2 to equalize the input and output dimensions.
The first of the four conv layers in the second and later blocks uses a stride of 2 for subsampling, which halves the feature map size. Thus, the feature map size (initially 64×64) becomes 64, 32, 16, and 8 after the respective blocks. Finally, the convolutional stack ends with global average pooling, an eight-way fully connected layer (fc8), and softmax. We use 18 stacked weighted layers in total.
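The residual mapping can be sketched as follows, a minimal NumPy analogue using plain matrices in place of 3×3 convolutions; `W_proj` stands in for the 1×1 stride-2 projection used when dimensions change:

```python
import numpy as np

def residual_block(x, W1, W2, W_proj=None):
    """Sketch of a residual mapping: the stacked layers learn F(x) and the
    block outputs relu(F(x) + shortcut(x)). When the output dimension
    differs from the input, a projection W_proj equalizes the two
    (standing in for the 1x1 convolution with stride 2)."""
    relu = lambda z: np.maximum(z, 0.0)
    f = W2 @ relu(W1 @ x)                      # two stacked weight layers
    shortcut = x if W_proj is None else W_proj @ x
    return relu(f + shortcut)
```

Because the block only needs to learn the residual F(x) on top of the identity shortcut, gradients flow through the shortcut even when the stacked layers are near zero, which is what keeps deep stacks trainable.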
2.2 Combining class likelihoods by SVM
We modified FCNs and DeepLab to have four classes (i.e., bird, sky, forest, and wind turbine), while the CNNs have eight classes from their architecture, selected as follows: bird, blade, tower, anemometer, nacelle, hub, forest, and other. The implementation details of FCNs and DeepLab are provided in the training section.
Each of the three pipelines yields a class-wise likelihood or score: FCNs and DeepLab generate pixel-wise class likelihoods, whereas the CNNs generate a bounding box-wise score over the classes. For SVM training, we use only the pixels at the centers of the bounding boxes of candidate regions proposed by the inter-frame difference method, so that training finishes within a reasonable amount of time. After the first training, we use hard negative mining to reduce false positives and improve the overall performance. Specifically, image regions of anemometers, night lights, and the lower parts of nacelles, on which the FCNs often produce false detections, are added to the SVM training set. The pixels collected by the inter-frame difference differ statistically from the true pixel distribution. Because of this, when the CNNs are simply combined with the semantic segmentation-based methods, the whole framework tends to include many misdetections from the CNNs; thus, we also add the background regions (sky, cloud, forest, and wind turbine) inside the candidate bounding boxes to the training set.
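The feature construction can be sketched as follows. The class counts (4, 4, 8) follow the paper, but the centre-pixel sampling helper and the concatenation order are our illustrative assumptions; a linear SVM (e.g., scikit-learn's `LinearSVC`) would then be trained on these vectors:

```python
import numpy as np

def center_likelihood(prob_map, box):
    """Sample a (num_classes, H, W) likelihood map at the centre pixel of a
    candidate box (x0, y0, x1, y1) — one pixel per box keeps SVM training
    tractable."""
    x0, y0, x1, y1 = box
    return prob_map[:, (y0 + y1) // 2, (x0 + x1) // 2]

def fusion_feature(fcn_map, deeplab_map, cnn_scores, box):
    """Concatenate pixel-wise likelihoods from FCN (4 classes) and DeepLab
    (4 classes) with the box-wise CNN scores (8 classes) into one vector."""
    return np.concatenate([center_likelihood(fcn_map, box),
                           center_likelihood(deeplab_map, box),
                           cnn_scores])
```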
3 Experimental results
We implemented CNNs, FCNs, and DeepLab, as well as AdaBoost with Haar-like features [19, 20] and SuperParsing  as baselines. We also trained several combinations of methods within our proposed framework and evaluated their performance on a wide-area surveillance dataset of wild birds , which contains a set of 2806×3744-pixel images taken near a wind turbine.
For the training of SuperParsing, FCNs, and DeepLab, we picked 82 images with different weather conditions from the dataset and manually annotated them into four classes: bird, wind turbine, sky, and forest, which are all of the classes included in . Finally, 77 of the 82 images were used; 5 were omitted because they were too dark due to stormy weather. Except for SuperParsing, the images were cropped to 500×500 pixels because the original images were too large to process with FCNs and DeepLab within our GPU memory. Randomly cropping the entire image yields many frames tagged only with the sky label, because more than half of each image is occupied by sky. With this in mind, we cropped more intensively around the wind turbine area, obtaining 70 frames from each image by shifting a 500×500-pixel window over that area. Eventually, we had 77×70 = 5390 frames for training FCNs and DeepLab.
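The window shifting can be sketched as follows; the region bounds and stride here are illustrative assumptions, with the stride in practice chosen so that each image yields the 70 crops described above:

```python
def sliding_crops(x0, y0, x1, y1, crop=500, stride=250):
    """Enumerate top-left corners of crop-by-crop windows shifted over the
    region [x0, x1) x [y0, y1) surrounding the wind turbine."""
    xs = range(x0, max(x0 + 1, x1 - crop + 1), stride)
    ys = range(y0, max(y0 + 1, y1 - crop + 1), stride)
    return [(x, y) for y in ys for x in xs]
```

For example, a 1000×1000 region with a stride of 250 yields a 3×3 grid of nine overlapping 500×500 windows.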
The training images for ResNet were acquired as candidate regions of moving objects obtained by background subtraction over the entire dataset. The training images include bird and non-bird regions, and we prepared one bird class and seven background classes.
These extra classes help train the networks because they frequently appear in the candidate regions and are likely to cause misdetections. We categorized the candidate regions into these eight classes manually. To train AdaBoost with Haar-like features, we used 15,705 bird images and 18,688 non-bird images collected in the same way as for ResNet.
We used an FCN-8s model  pretrained on PASCAL-Context , which contains 59 category (+ background) segmentations. We then fine-tuned the model with the images we prepared for training by using twofold cross validation.
We used a DeepLab-MSc-LargeFOV model  pretrained on PASCAL VOC 2012 , which contains 20 category (+ background) segmentations. We modified the layer “fc8” from 21 outputs to 4: bird, forest, sky, and wind turbine. As with FCNs, we then fine-tuned the model on the prepared images using twofold cross validation.
We trained the ResNet-based model with eight-class training images from scratch. In the same way as , we used the method described in  for weight initialization. In addition, we used batch normalization  to reduce the internal covariate shift and accelerate learning.
AdaBoost with Haar-like features was trained following . Moving object regions were chosen by the inter-frame difference; the proposed regions were marked with square bounding boxes, and the detector was then trained with bird and non-bird labels.
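A minimal sketch of the inter-frame difference used for region proposal (the threshold value and the single-box simplification are our own; the actual pipeline extracts one box per connected component of the mask):

```python
import numpy as np

def moving_mask(prev, curr, thresh=20):
    """Flag pixels whose intensity changes by more than `thresh` between
    consecutive frames — the inter-frame difference."""
    return np.abs(curr.astype(np.int32) - prev.astype(np.int32)) > thresh

def mask_to_box(mask):
    """Tight bounding box (x0, y0, x1, y1) around all flagged pixels."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1
```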
We combined the class likelihoods and scores by using pixel-wise SVM training and evaluated the performances of the individual methods and their combinations.
We used 44 of the 77 labeled images, which included more birds (183 in total) than the others, for the evaluation. The performance of each method is ranked by the F-measure, i.e., the harmonic mean of precision and recall.
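The ranking metric can be computed from detection counts as follows (a straightforward sketch; the counts in the usage example are hypothetical):

```python
def f_measure(tp, fp, fn):
    """F-measure: harmonic mean of precision and recall, computed from
    true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g., 8 correct detections, 2 false alarms, 2 missed birds
score = f_measure(8, 2, 2)
```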
In the evaluation, we regarded detected bounding boxes that had any overlap with ground-truth boxes as correct detections and boxes with no overlap as misdetections.
Similarly, in segmentation-based methods, we regarded the outputs that had any region of overlap with the ground truth as correct detections and those without overlap as misdetections.
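The any-overlap criterion can be written as follows (boxes as (x0, y0, x1, y1) with exclusive right and bottom edges, a representational assumption on our part):

```python
def any_overlap(a, b):
    """True if boxes a and b share any area at all (no IoU threshold),
    matching the evaluation rule above; edge-touching boxes do not count."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]
```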
F-measure of various methods
AdaBoost with Haar-like features, SuperParsing, and DeepLab are denoted as HA, SP, and DL, respectively. In addition, SP* and FCN* denote those methods combined with SVMs. Normally, SP or FCNs output the class label with the highest likelihood, whereas the SVMs consider all of the class likelihoods for the output through training.
The upper part of Table 1 shows the results of the individual methods. SP and DL achieved the highest precision, while CNNs achieved the best recall rate. FCNs scored between SP and CNNs. As expected, CNNs substantially outperform HA. DL performed similarly to SP but with a much higher recall rate. SP* and FCN* performed better than their counterparts without SVMs.
F-measure of various methods by size
For all image sizes, the proposed method produces the best F-measure. DL is not suited to detecting tiny bird images, but CNN+FCN detects them more effectively. Adding DL improves the performance further, particularly in precision. This shows that FCN detects more birds while DL is better at backgrounds.
Regarding the region proposals obtained by background subtraction, there were about 1000 to 2000 per input image. As shown in Fig. 2, almost all of the region proposals belong to the forest class. The CNN succeeded in filtering out most of them, which contributed to precision.
Almost all bird images with ambiguous shapes were either detected only by the CNN or missed by every method. In detail, FCN and DeepLab responded too weakly to very small birds to detect them. A few bird images over the wind turbine were detected by FCN and DeepLab, but when combined with the CNN they were missed because of their low bird likelihood. There was only one bird image that was partially occluded; thus, it was hard to train for such a pattern.
We combined different types of deep features from a CNN-based detector and fully convolutional networks by using support vector machines to achieve high performance in detecting objects at various scales in large images.
Experiments on a bird image dataset intended for ecological investigations showed that our method detects birds with high precision.
We showed that combining multiple deep convolutional features is effective for scale-aware detection.
This work is in part entrusted by the Ministry of the Environment, JAPAN (MOEJ), the project of which is to examine effective measures for preventing birds, especially sea eagles, from colliding with wind turbines, and by JSPS KAKENHI Grant Number JP16K16083.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Smallwood KS, Rugge L, Morrison ML (2009) Influence of behavior on bird mortality in wind energy developments. J Wildl Manage 73(7): 1082–1098.
- Bassi S, Bowen A, Fankhauser S (2012) The case for and against onshore wind energy in the UK. Grantham Res. Inst. on Climate Change and Env. Policy Brief.
- Tighe J, Lazebnik S (2013) Finding things: image parsing with regions and per-exemplar detectors In: CVPR, 3001–3008. IEEE.
- He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition In: CVPR, 770–778. IEEE.
- Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks In: NIPS, 1097–1105.
- Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation In: CVPR. IEEE.
- Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2015) Semantic image segmentation with deep convolutional nets and fully connected CRFs In: ICLR. http://arxiv.org/abs/1412.7062.
- Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3): 273–297.
- Takeki A, Tuan Trinh T, Yoshihashi R, Kawakami R, Iida M, Naemura T (2016) Detection of small birds in large images by combining a deep detector with semantic segmentation In: ICIP. IEEE.
- Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition In: ICLR.
- Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions In: CVPR. IEEE.
- Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. JMLR 15(1): 1929–1958.
- Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift In: ICML.
- Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation In: CVPR.
- Uijlings JR, van de Sande KE, Gevers T, Smeulders AW (2013) Selective search for object recognition. IJCV 104(2): 154–171.
- Dong J, Chen Q, Yan S, Yuille A (2014) Towards unified object detection and semantic segmentation In: ECCV, 299–314. Springer.
- Hariharan B, Arbeláez P, Girshick R, Malik J (2015) Hypercolumns for object segmentation and fine-grained localization In: CVPR.
- Mallat S (1999) A wavelet tour of signal processing. Academic Press.
- Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features In: CVPR, 511. IEEE.
- Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1): 119–139.
- Tighe J, Lazebnik S (2013) Superparsing. IJCV 101(2): 329–349.
- Yoshihashi R, Kawakami R, Iida M, Naemura T (2015) Construction of a bird image dataset for ecological investigations In: ICIP, 4248–4252. IEEE.
- Mottaghi R, Chen X, Liu X, Cho NG, Lee SW, Fidler S, Urtasun R, Yuille A (2014) The role of context for object detection and semantic segmentation in the wild In: CVPR, 891–898. IEEE.
- Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A (2010) The PASCAL visual object classes (VOC) challenge. IJCV 88(2): 303–338.
- He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification In: ICCV. IEEE.
- Pepik B, Benenson R, Ritschel T, Schiele B (2015) What is holding back convnets for detection? In: Pattern Recognition, 517–528. Springer.