Effective hyperparameter optimization using Nelder-Mead method in deep learning
- Yoshihiko Ozaki^{1, 2},
- Masaki Yano^{1, 2} and
- Masaki Onishi^{1, 2}
https://doi.org/10.1186/s41074-017-0030-7
© The Author(s) 2017
Received: 9 January 2017
Published: 10 November 2017
Abstract
In deep learning, deep neural network (DNN) hyperparameters can severely affect network performance. Currently, such hyperparameters are frequently optimized by several methods, such as Bayesian optimization and the covariance matrix adaptation evolution strategy. However, it is difficult for non-experts to employ these methods. In this paper, we adapted the simpler coordinate-search and Nelder-Mead methods to optimize hyperparameters. Several hyperparameter optimization methods were compared by configuring DNNs for character recognition and age/gender classification. Numerical results demonstrated that the Nelder-Mead method outperforms the other methods and achieves state-of-the-art accuracy for age/gender classification.
1 Introduction
The evolution of deep neural networks (DNNs) has dramatically improved the accuracy of character recognition [1], object recognition [2, 3], and other tasks. However, their increasing complexity increases the number of hyperparameters, which makes hyperparameter tuning an intractable task.
Traditionally, DNN hyperparameters are adjusted using manual search, grid search, or random search [4]. However, the search space expands exponentially with the number of hyperparameters; thus, such naive methods no longer work well. Therefore, more sophisticated hyperparameter optimization methods are required.
Hyperparameter optimization can be formulated as a black-box optimization problem: minimize an objective function f(x) over a search space χ. This means that no analytical properties of the objective function, e.g., its derivatives, are available. In addition, a loss function of the target DNN is typically chosen as f(x), and its evaluation is expensive because it requires training and testing the DNN. The search space χ comprises combinations of multiple conditions such as real numbers, integers, and categories.
Currently, Bayesian optimization [5] and the covariance matrix adaptation evolution strategy (CMA-ES) [6] are considered the most promising methods for DNN hyperparameter optimization, and their optimization ability has been proven experimentally [7–10]. However, Bayesian optimization has some hyperparameters that significantly affect its optimization performance, e.g., choices of its kernel and acquisition function. Moreover, maximizing a non-convex acquisition function is required for each iteration of the optimization process. On the other hand, CMA-ES requires several populations and generations for sufficient performance. Although such calculations can be parallelized easily, significant computing resources are required.
Simple classical methods such as manual search, grid search, and random search evidently remain common. We therefore consider that most practitioners are unwilling to implement a difficult optimization method or adjust its hyperparameters, and that they do not have sufficient computing resources to optimize DNN hyperparameters.
In this paper, we describe simple alternative methods, i.e., the coordinate-search and Nelder-Mead methods, for hyperparameter optimization in deep learning. To the best of our knowledge, no report has examined the application of these methods to hyperparameter optimization.
Our numerical results indicate that these methods are more efficient than other well-known methods. In particular, the Nelder-Mead method is the most effective for deep learning.
2 Related work
2.1 Random search
Random search is one of the simplest ways to optimize DNN hyperparameters. This method iteratively generates hyperparameter settings at random and evaluates the objective function at each of them. Because the evaluations are independent, random search parallelizes well, and it handles integer and categorical hyperparameters naturally. Bergstra and Bengio demonstrated that random search outperforms a manual search by a human expert and grid search [4].
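The procedure can be sketched in a few lines. This is a minimal illustration, assuming real-valued hyperparameters with box bounds and a cheap stand-in objective; the names `random_search` and `toy_objective` are hypothetical and do not come from the paper, where each evaluation would instead train and test a DNN.

```python
import random


def random_search(objective, bounds, n_evals, seed=0):
    """Sample n_evals points uniformly inside the bounds and keep the best one."""
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(n_evals):
        # Each hyperparameter is drawn independently from its own range.
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        fx = objective(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f


def toy_objective(x):
    """Stand-in for a validation loss (assumption: minimum at 0.3 per axis)."""
    return sum((xi - 0.3) ** 2 for xi in x)


best_x, best_f = random_search(toy_objective, [(0.0, 1.0)] * 2, n_evals=200)
```

Since no sample depends on a previous one, the loop body could be distributed across workers without changing the result.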
2.2 Bayesian optimization
Bayesian optimization is one of the most remarkable hyperparameter optimization methods in recent years. Its base concept was proposed in the 1970s; however, it has been significantly improved since then due to the attention paid to DNN hyperparameter optimization.
There are several variations of Bayesian optimization, e.g., Gaussian process (GP)-based variation [11], Tree-structured Parzen Estimator (TPE) [7], and Deep Networks for Global Optimization (DNGO) [12]. The most standard one is the GP-based variation.
GP-based Bayesian optimization is shown in Algorithm 1.
The automatic relevance determination (ARD) Matérn 5/2 kernel is commonly used in Bayesian optimization [8]. Here, θ _{0}, θ _{1}, …, and θ _{ D } are the kernel’s hyperparameters.
The point that maximizes the acquisition function becomes a new sample point. Although maximizing the non-convex acquisition function is difficult, the evaluation cost of the function is considerably less than that of the original objective function. Therefore, it is easier to handle than the original problem.
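To make the two ingredients concrete, the following sketch evaluates an ARD Matérn 5/2 kernel and the expected improvement (EI) acquisition function, the combination listed in the experimental settings. The formulas are the standard textbook forms, written for minimization; the function names and the mapping of θ _{0} to an amplitude and θ _{1}, …, θ _{ D } to per-dimension length scales are assumptions, not taken from Algorithm 1.

```python
import math


def matern52_ard(x, y, theta0, lengthscales):
    """ARD Matern 5/2 kernel: theta0 * (1 + sqrt(5 r2) + 5 r2 / 3) * exp(-sqrt(5 r2))."""
    # r2 is the squared distance with one length scale per input dimension.
    r2 = sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, lengthscales))
    s = math.sqrt(5.0 * r2)
    return theta0 * (1.0 + s + 5.0 * r2 / 3.0) * math.exp(-s)


def expected_improvement(mu, sigma, f_best):
    """EI for minimization, given the GP posterior mean and standard deviation."""
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (f_best - mu) * cdf + sigma * pdf


k_near = matern52_ard([0.2, 0.4], [0.25, 0.4], theta0=2.0, lengthscales=[1.0, 0.5])
k_far = matern52_ard([0.2, 0.4], [0.9, 0.1], theta0=2.0, lengthscales=[1.0, 0.5])
```

Evaluating EI only needs the surrogate model, which is why it is far cheaper than evaluating the original objective, even though its maximization is itself a non-convex problem.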
Practically, Bayesian optimization is combined with random search to collect initial observation data.
Bergstra et al. and Snoek et al. performed several computational experiments. The results demonstrated that Bayesian optimization outperforms manual search by a human expert and random search [7, 8].
2.3 Covariance matrix adaptation evolution strategy
While Bayesian optimization has been developed in the machine learning community, CMA-ES has been developed in the optimization community. CMA-ES is a type of evolutionary computation that demonstrates outstanding performance in benchmarks as a state-of-the-art black box optimization method [13].
The (μ _{W}, λ)-CMA-ES [6] is shown in Algorithm 2. It conducts a weighted recombination from the μ best out of λ individuals. The procedure is explained as follows.
- (i) Initialize the mean \(\langle \mathbf {x} \rangle _{\mathrm {w}}^{(0)}\) and the standard deviation σ ^{(0)} of individuals. Set the evolution paths \(\mathbf {p}_{\mathrm {c}}^{(0)} = \mathbf {p}_{\sigma }^{(0)} = \mathbf {0}\) and the covariance matrix C ^{(0)}=I.
- (ii) Generate the individuals of generation g+1, \(\mathbf {x}_{k}^{(g+1)} \; (k = 1,\dots,\lambda)\):
$$\begin{aligned} \mathbf{x}^{(g + 1)}_{k} &= {\langle \mathbf{x} \rangle}_{\mathrm{w}}^{(g)} + \sigma^{(g)}\mathbf{B}^{(g)}\mathbf{D}^{(g)}\mathbf{z}_{k}^{(g + 1)},\\ \text{where}\quad {\langle \mathbf{x} \rangle}_{\mathrm{w}}^{(g)} &:= \frac{1}{\sum_{i = 1}^{\mu} w_{i}}\sum_{i = 1}^{\mu} w_{i}\mathbf{x}_{i:\lambda}^{(g)},\\ \mathbf{B}^{(g)}\mathbf{D}^{(g)}\left(\mathbf{B}^{(g)}\mathbf{D}^{(g)}\right)^{T} &= \mathbf{C}^{(g)},\\ \mathbf{z}_{k}^{(g + 1)} &\sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \end{aligned} $$(10)
Here, w _{1},…,w _{ μ } are weights, and i : λ denotes the ith best individual.
- (iii) Update the evolution path \(\mathbf {p}^{(g)}_{\mathrm {c}}\) and the covariance matrix C ^{(g)}:
$$ \mathbf{p}^{(g+1)}_{\mathrm{c}} = (1 - c_{\mathrm{c}}) \mathbf{p}_{\mathrm{c}}^{(g)} + c_{\mathrm{c}}^{\mathrm{u}} c_{\mathrm{w}} \mathbf{B}^{(g)}\mathbf{D}^{(g)} {\langle \mathbf{z} \rangle}_{\mathrm{w}}^{(g + 1)}, $$(11)
$$\begin{aligned} \mathbf{C}^{(g+1)} &= (1 - c_{\text{cov}}) \mathbf{C}^{(g)} + c_{\text{cov}} \mathbf{p}^{(g + 1)}_{\mathrm{c}}\left(\mathbf{p}_{\mathrm{c}}^{(g + 1)}\right)^{T},\\ \text{where}\quad c_{\mathrm{c}}^{\mathrm{u}} &:= \sqrt{c_{\mathrm{c}}(2 - c_{\mathrm{c}})},\\ c_{\mathrm{w}} &:= \frac{\sum_{i = 1}^{\mu} w_{i}}{\sqrt{\sum_{i = 1}^{\mu} w_{i}^{2}}},\\ {\langle \mathbf{z} \rangle}_{\mathrm{w}}^{(g + 1)} &:= \frac{1}{\sum_{i = 1}^{\mu} w_{i}}\sum_{i = 1}^{\mu} w_{i} \mathbf{z}_{i : \lambda}^{(g+1)}. \end{aligned} $$(12)
Here, c _{c} and c _{cov} are hyperparameters.
- (iv) Update the evolution path \(\mathbf {p}^{(g)}_{\sigma }\) and the step size σ ^{(g)}:
$$ \mathbf{p}_{\sigma}^{(g + 1)} = (1 - c_{\sigma}) \mathbf{p}_{\sigma}^{(g)} + c_{\sigma}^{\mathrm{u}} c_{\mathrm{w}} \mathbf{B}^{(g)} {\langle \mathbf{z} \rangle}_{\mathrm{w}}^{(g+1)}, $$(13)
$$\begin{aligned} \sigma^{(g + 1)} &= \sigma^{(g)} \exp\left(\frac{1}{d_{\sigma}} \frac{\|\mathbf{p}_{\sigma}^{(g + 1)}\| - \hat{\chi}_{\mathrm{n}}}{\hat{\chi}_{\mathrm{n}}} \right),\\ \text{where}\quad c_{\sigma}^{\mathrm{u}} &:= \sqrt{c_{\sigma} (2 - c_{\sigma})},\\ \hat{\chi}_{\mathrm{n}} &= \mathbb{E}[\|\mathcal{N}(\mathbf{0}, \mathbf{I})\|]. \end{aligned} $$(14)
Here, c _{σ} and d _{σ} are hyperparameters.
Details about the hyperparameters used to update CMA-ES parameters are provided in the literature [6].
Since the individuals of each generation can be evaluated simultaneously, CMA-ES can be parallelized easily. Watanabe and Le Roux [9] and Loshchilov and Hutter [10] demonstrated that CMA-ES outperforms manual search by a human expert and Bayesian optimization in certain cases.
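As a rough illustration of steps (ii) and (iii), the sketch below runs a single generation: it samples λ individuals as in Eq. (10), recombines the μ best with the weights, and applies the evolution-path and rank-one covariance updates of Eqs. (11) and (12). It is deliberately incomplete: the step-size update of Eqs. (13) and (14) and the decomposition of the new covariance matrix into B and D are omitted, and the weights and the values of c _{c} and c _{cov} are placeholders chosen for the example, not the settings recommended in [6].

```python
import math
import random


def cma_generation(objective, mean, sigma, BD, p_c, C, weights, lam, c_c, c_cov, rng):
    """One generation of sampling, weighted recombination, and covariance update."""
    n, mu = len(mean), len(weights)
    samples = []
    for _ in range(lam):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Eq. (10): x_k = <x>_w + sigma * B D z_k
        step = [sum(BD[i][j] * z[j] for j in range(n)) for i in range(n)]
        x = [mean[i] + sigma * step[i] for i in range(n)]
        samples.append((objective(x), x, step))
    samples.sort(key=lambda s: s[0])  # best (lowest objective) first
    best = samples[:mu]
    w_sum = sum(weights)
    # Weighted means of the mu best individuals and of their B D z steps.
    new_mean = [sum(w * s[1][i] for w, s in zip(weights, best)) / w_sum for i in range(n)]
    bd_z_w = [sum(w * s[2][i] for w, s in zip(weights, best)) / w_sum for i in range(n)]
    # Eq. (11): evolution path for the covariance matrix.
    c_u = math.sqrt(c_c * (2.0 - c_c))
    c_w = w_sum / math.sqrt(sum(w * w for w in weights))
    new_p_c = [(1.0 - c_c) * p_c[i] + c_u * c_w * bd_z_w[i] for i in range(n)]
    # Eq. (12): rank-one covariance update.
    new_C = [[(1.0 - c_cov) * C[i][j] + c_cov * new_p_c[i] * new_p_c[j]
              for j in range(n)] for i in range(n)]
    return new_mean, new_p_c, new_C


def sphere(x):
    """Stand-in objective (assumption), not a DNN loss."""
    return sum(xi * xi for xi in x)


identity = [[1.0, 0.0], [0.0, 1.0]]
mean, p_c, C = cma_generation(
    sphere, mean=[0.5, 0.5], sigma=0.2, BD=identity, p_c=[0.0, 0.0], C=identity,
    weights=[3.0, 2.0, 1.0], lam=30, c_c=0.5, c_cov=0.2, rng=random.Random(0),
)
```

The λ calls to `objective` inside the sampling loop are independent of one another, which is the property that makes each generation easy to parallelize. For real use, an existing implementation is likely a safer choice than extending this sketch.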
3 Coordinate-search and Nelder-Mead methods
In the previous section, we introduced the random search, Bayesian optimization, and CMA-ES methods. Note that the achievements of these methods have already been proven experimentally, and the results indicate that the latter two methods are very promising and considered superior to random search. However, both Bayesian optimization and CMA-ES have many hyperparameters related to their optimization performance. To set these hyperparameters appropriately, it is necessary to have sufficient knowledge about the given method. In addition, Bayesian optimization must maximize its non-convex acquisition function, and CMA-ES requires significant computing resources to exploit its advantages. These factors make it difficult for non-experts to utilize these methods.
In this section, we introduce two optimization methods, coordinate-search and Nelder-Mead, that are easy to implement. These methods have fewer hyperparameters to adjust and are practically usable with fewer computing resources.
3.1 Mathematical concepts
Before introducing the methods, we define some required mathematical concepts here.
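Definition 1
The positive span of a set of vectors \([\!\mathbf {v}_{1} \cdots \mathbf {v}_{r}] \in \mathbb {R}^{n}\) is the convex cone \(\{\mathbf {v} \in \mathbb {R}^{n} : \mathbf {v} = \alpha _{1}\mathbf {v}_{1} + \cdots + \alpha _{r}\mathbf {v}_{r},\ \alpha _{i} \geq 0,\ i = 1,\dots,r\}\).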
Definition 2
A positive spanning set in \(\mathbb {R}^{n}\) is a set of vectors whose positive span is \(\mathbb {R}^{n}\).
Definition 3
The set \([\!\mathbf {v}_{1} \cdots \mathbf {v}_{r}] \in \mathbb {R}^{n}\) is considered positively dependent if one of the vectors is in the positive span of the remaining vectors; otherwise, the set is considered positively independent.
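Definition 4
A positive basis in \(\mathbb {R}^{n}\) is a positively independent set whose positive span is \(\mathbb {R}^{n}\).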
Definition 5
A simplex of dimension m is a convex hull of an affinely independent set of points Y={y ^{0},y ^{1},…,y ^{ m }}.
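A small worked example may help with the notion of a positive spanning set. The 2n coordinate directions e _{1},…,e _{ n },−e _{1},…,−e _{ n } positively span \(\mathbb {R}^{n}\), because any vector can be written as a combination of them with nonnegative coefficients. The sketch below constructs those coefficients explicitly; the function names are hypothetical.

```python
def positive_coefficients(v):
    """Nonnegative coefficients of v over [e_1, ..., e_n, -e_1, ..., -e_n]."""
    plus = [max(vi, 0.0) for vi in v]    # weights on e_1, ..., e_n
    minus = [max(-vi, 0.0) for vi in v]  # weights on -e_1, ..., -e_n
    return plus + minus


def recombine(coeffs):
    """Rebuild the vector from its coefficients over the 2n directions."""
    n = len(coeffs) // 2
    return [coeffs[i] - coeffs[n + i] for i in range(n)]


v = [0.5, -2.0, 0.0]
coeffs = positive_coefficients(v)
```

By contrast, the n vectors e _{1},…,e _{ n } alone span \(\mathbb {R}^{n}\) in the usual linear sense but not positively, since no nonnegative combination of them has a negative component.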
3.2 Coordinate-search method
- (i) Poll step: Order the poll set P _{ k }={x _{ k }+α _{ k } d:d∈D _{⊕}}. Evaluate f at the poll points in order. If a poll point x _{ k }+α _{ k } d _{ k } that satisfies the condition f(x _{ k }+α _{ k } d _{ k })<f(x _{ k }) is found, then stop polling, set x _{ k+1}=x _{ k }+α _{ k } d _{ k }, and declare the iteration and poll step successful.
- (ii) Parameter update: If iteration k succeeds, then set α _{ k + 1} = α _{ k } (or α _{ k+1} = 2α _{ k }). Otherwise, set α _{ k + 1} =α _{ k }/2.
When the step size becomes sufficiently small, the search is terminated. Note that the evaluation of functions in the poll step can be parallelized.
The performance of this method deteriorates when the search ranges have different scales; thus, in this study, we normalize all parameters to [0,1] in our computational experiments. In addition, we adopt the updating rule α _{ k+1}=2α _{ k } on iteration success and order the vectors of the poll set randomly for each iteration.
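The poll step and parameter update can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: D _{⊕} is taken to be the 2n positive and negative coordinate directions, poll points that leave [0,1] are clipped back to the boundary, and the search stops once α falls below a hypothetical threshold `alpha_min` or the evaluation budget is used up.

```python
import random


def coordinate_search(objective, x0, alpha=0.5, alpha_min=1e-3, max_evals=500, seed=0):
    """Coordinate search on [0, 1]^n with step doubling on success, halving on failure."""
    rng = random.Random(seed)
    n = len(x0)
    # Assumed D_plus: the 2n directions +e_i and -e_i, stored as (index, sign).
    directions = [(i, s) for i in range(n) for s in (1.0, -1.0)]
    x, fx = list(x0), objective(x0)
    evals = 1
    while alpha >= alpha_min and evals < max_evals:
        rng.shuffle(directions)  # random poll order for each iteration
        success = False
        for i, s in directions:
            if evals >= max_evals:
                break
            y = list(x)
            # Clipping to [0, 1] is an assumption about boundary handling.
            y[i] = min(1.0, max(0.0, y[i] + s * alpha))
            fy = objective(y)
            evals += 1
            if fy < fx:  # first improving poll point ends the poll step
                x, fx, success = y, fy, True
                break
        alpha = 2.0 * alpha if success else alpha / 2.0
    return x, fx, evals


def toy_objective(x):
    """Stand-in for a validation loss (assumption: minimum at 0.3 per axis)."""
    return sum((xi - 0.3) ** 2 for xi in x)


x_best, f_best, n_evals = coordinate_search(toy_objective, [0.9, 0.1])
```

A failed iteration costs 2n evaluations because every direction must be polled before the step size is halved, so the cost per iteration grows with the number of hyperparameters.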
3.3 Nelder-Mead method
- (i) Order: Order the n+1 vertices Y={y ^{0},y ^{1},…,y ^{ n }} as follows:
$$ f^{0} = f(\mathbf{y}^{0}) \leq f^{1} = f(\mathbf{y}^{1}) \leq \cdots \leq f^{n} = f(\mathbf{y}^{n}). $$
- (ii) Reflect: Reflect the worst vertex y ^{ n } over the centroid \(\mathbf {y}^{c} = \sum _{i = 0}^{n - 1}\mathbf {y}^{i} / n\) of the remaining n vertices:
$$ \mathbf{y}^{r} = \mathbf{y}^{c} + \delta^{r}(\mathbf{y}^{c} - \mathbf{y}^{n}). $$
Evaluate f ^{ r }=f(y ^{ r }). If f ^{0}≤f ^{ r }<f ^{ n−1}, then replace y ^{ n } with the reflected point y ^{ r } and terminate iteration k: Y _{ k+1}={y ^{0},y ^{1},…,y ^{ n−1},y ^{ r }}.
- (iii) Expand: If f ^{ r }<f ^{0}, calculate:
$$ \mathbf{y}^{e} = \mathbf{y}^{c} + \delta^{e} (\mathbf{y}^{c} - \mathbf{y}^{n}) $$
and evaluate f ^{ e }=f(y ^{ e }). If f ^{ e }≤f ^{ r }, then replace y ^{ n } with the expansion point y ^{ e } and terminate iteration k: Y _{ k+1}={y ^{0},y ^{1},…,y ^{ n−1},y ^{ e }}. Otherwise, replace y ^{ n } with the reflected point y ^{ r } and terminate iteration k: Y _{ k+1}={y ^{0},y ^{1},…,y ^{ n−1},y ^{ r }}.
- (iv) Contract: If f ^{ r }≥f ^{ n−1}, then a contraction is performed between y ^{ c } and the better of y ^{ r } and y ^{ n }.
  - (a) Outside contraction: If f ^{ r }<f ^{ n }, perform an outside contraction:
$$ \mathbf{y}^{oc} = \mathbf{y}^{c} + \delta^{oc}(\mathbf{y}^{c} - \mathbf{y}^{n}) $$
and evaluate f ^{ oc }=f(y ^{ oc }). If f ^{ oc }≤f ^{ r }, then replace y ^{ n } with the outside contraction point y ^{ oc } and terminate iteration k: Y _{ k+1}={y ^{0},y ^{1},…,y ^{ n−1},y ^{ oc }}. Otherwise, perform a shrink.
  - (b) Inside contraction: If f ^{ r }≥f ^{ n }, perform an inside contraction:
$$ \mathbf{y}^{ic} = \mathbf{y}^{c} + \delta^{ic}(\mathbf{y}^{c} - \mathbf{y}^{n}) $$
and evaluate f ^{ ic }=f(y ^{ ic }). If f ^{ ic }<f ^{ n }, then replace y ^{ n } with the inside contraction point y ^{ ic } and terminate iteration k: Y _{ k+1}={y ^{0},y ^{1},…,y ^{ n−1},y ^{ ic }}. Otherwise, perform a shrink.
- (v) Shrink: Evaluate f at the n points y ^{0}+γ ^{ s }(y ^{ i }−y ^{0}), where i=1,…,n, replace y ^{1},…,y ^{ n } with these points, and terminate iteration k: Y _{ k+1}={y ^{0}+γ ^{ s }(y ^{ i }−y ^{0}), i=0,…,n}.
Note that some operations, e.g., the initialization and shrink operations, can be parallelized easily because their function evaluations are independent of one another.
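The five steps translate fairly directly into code. The sketch below follows the order, reflect, expand, contract, and shrink rules with the coefficient values used in our experimental settings (δ ^{ r }=1, δ ^{ e }=2, δ ^{ oc }=1/2, δ ^{ ic }=−1/2, γ ^{ s }=1/2). It stops only when an evaluation budget is exhausted and does nothing about points that leave the search range; both choices are simplifying assumptions, and the stand-in objective is not a DNN loss.

```python
def nelder_mead(objective, simplex, max_evals=600,
                delta_r=1.0, delta_e=2.0, delta_oc=0.5, delta_ic=-0.5, gamma_s=0.5):
    """Minimize objective starting from n+1 vertices; returns (best point, best value)."""
    n = len(simplex) - 1
    verts = [(objective(y), list(y)) for y in simplex]
    evals = len(verts)

    def along(yc, yn, delta):
        # Point y^c + delta * (y^c - y^n) on the line through centroid and worst vertex.
        return [c + delta * (c - w) for c, w in zip(yc, yn)]

    while evals < max_evals:
        verts.sort(key=lambda v: v[0])                      # (i) order
        f0, fn1 = verts[0][0], verts[n - 1][0]
        fn, yn = verts[n]
        yc = [sum(v[1][j] for v in verts[:n]) / n for j in range(n)]
        yr = along(yc, yn, delta_r)                         # (ii) reflect
        fr = objective(yr)
        evals += 1
        if f0 <= fr < fn1:
            verts[n] = (fr, yr)
            continue
        if fr < f0:                                         # (iii) expand
            ye = along(yc, yn, delta_e)
            fe = objective(ye)
            evals += 1
            verts[n] = (fe, ye) if fe <= fr else (fr, yr)
            continue
        if fr < fn:                                         # (iv-a) outside contraction
            yk = along(yc, yn, delta_oc)
            fk = objective(yk)
            accept = fk <= fr
        else:                                               # (iv-b) inside contraction
            yk = along(yc, yn, delta_ic)
            fk = objective(yk)
            accept = fk < fn
        evals += 1
        if accept:
            verts[n] = (fk, yk)
            continue
        y0 = verts[0][1]                                    # (v) shrink toward the best vertex
        for i in range(1, n + 1):
            yi = [a + gamma_s * (b - a) for a, b in zip(y0, verts[i][1])]
            verts[i] = (objective(yi), yi)
            evals += 1
    verts.sort(key=lambda v: v[0])
    return verts[0][1], verts[0][0]


def toy_objective(x):
    """Stand-in for a validation loss (assumption: minimum at 0.3 per axis)."""
    return sum((xi - 0.3) ** 2 for xi in x)


start = [[0.9, 0.1], [0.1, 0.8], [0.8, 0.9]]
x_best, f_best = nelder_mead(toy_objective, start, max_evals=300)
```

Most iterations cost one or two evaluations regardless of the dimension n; only the shrink step costs n evaluations, and those n evaluations are independent of one another.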
4 Poor hyperparameter setting detection
The advantage of human experts is that they can detect training failures and terminate them at an early stage. Domhan et al. proposed a method that accelerates hyperparameter optimization methods by detecting and terminating such training failures using learning curve prediction [17]. In addition, Klein et al. proposed a specialized Bayesian neural network to model DNN learning curves [18, 19]. We apply Algorithm 5 to detect training failures at an early stage.
Note that this method does not optimize hyperparameters directly, but accelerates a hyperparameter optimization method. If a large number of training iterations with poor hyperparameter settings appear in the optimization process, this detection process improves the execution time of the optimization method.
In our experiments, we apply this method to all hyperparameter optimization methods with n equal to 10% of the training iterations and t equal to 0.8. These values were chosen based on experience. As can be seen in Fig. 6, the learning curve of a poor hyperparameter setting is distinctive and easy to detect; thus, n and t need not be chosen with great care.
5 Numerical results
We perform computational experiments to optimize real and integer hyperparameters in combination with various datasets, tasks, and convolutional neural networks (CNNs) to compare the performance of the random search, Bayesian optimization, CMA-ES, coordinate-search, and Nelder-Mead methods.
Experimental setting for each method
Method | Detail |
---|---|
Random search | Perform 600 random evaluations. |
Bayesian optimization | Initialize the observation data with the first 100 evaluations of the random search, then perform the optimization with exactly 500 evaluations. The kernel is the ARD Matérn 5/2 and the acquisition function is the EI [8, 10]. |
CMA-ES | Perform 600 evaluations with 20 generations where each generation consists of 30 individuals. \(\langle \mathbf {x} \rangle _{w}^{(0)} = 0.5\), σ ^{(0)}=0.2. All variables are scaled to [0,1] [10]. |
Coordinate-search method | Initialize x _{0} as the best point of the first 100 random search evaluations, then perform optimization for up to 500 evaluations. α=0.5. All variables are scaled to [0,1]. |
Nelder-Mead method | Generate an initial simplex randomly, then perform optimization for up to 600 evaluations (including initialization). \(\gamma ^{s} = \frac {1}{2}, \delta ^{ic} = -\frac {1}{2}, \delta ^{oc} = \frac {1}{2}, \delta ^{r} = 1\ \text {and}\ \delta ^{e} = 2\). |
5.1 MNIST
These methods are also applied to the optimization of hyperparameters of the Batch-Normalized Maxout Network in Network proposed by Chang et al. [25]. Note that this network is deeper and has many more hyperparameters to optimize than LeNet.
LeNet network architecture [1]
Conv 1 | Kernel size: 5, stride: 1, pad: 0 |
Pool 1 (MAX pooling) | Kernel size: 2, stride: 2, pad: 0 |
Conv 2 | Kernel size: 5, stride: 1, pad: 0 |
Pool 2 (MAX pooling) | Kernel size: 2, stride: 2, pad: 0 |
FC 1 |
LeNet fixed parameters
Name | Description |
---|---|
Iteration | 10,000 |
Batch size | 64 |
Learning rate decay policy | inv (gamma=0.01, power=0.75) [29] |
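For readers unfamiliar with the `inv` policy named in the table, Caffe defines it as base_lr × (1 + gamma × iter)^(−power). The sketch below evaluates that schedule with the fixed gamma and power values listed above; the function name and the example base learning rate are illustrative.

```python
def inv_learning_rate(base_lr, iteration, gamma=0.01, power=0.75):
    """Caffe 'inv' policy: base_lr * (1 + gamma * iteration) ** (-power)."""
    return base_lr * (1.0 + gamma * iteration) ** (-power)


# Example with a hypothetical base learning rate of 0.01.
lr_start = inv_learning_rate(0.01, 0)
lr_end = inv_learning_rate(0.01, 10000)
```

With these values the learning rate decays smoothly over the 10,000 training iterations, ending at roughly 3% of its initial value.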
LeNet hyperparameters
Name | Description | Range |
---|---|---|
x _{1} | Learning rate (\(= 0.1^{x_{1}}\phantom {\dot {i}\!}\)) | [ 1,4] |
x _{2} | Momentum (\(= 1 - 0.1^{x_{2}}\phantom {\dot {i}\!}\)) | [ 0.5,2] |
x _{3} | L2 weight decay | [ 0.001,0.01] |
\(x_{4}^{*}\) | FC1 units | [ 256,1024] |
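The table defines the search variables indirectly: x _{1} and x _{2} are exponents, not the learning rate and momentum themselves. The following sketch shows one way to map a point of the normalized search space [0,1]^4 to concrete LeNet settings. The linear rescaling, the dictionary keys, and the rounding of the FC1 unit count are assumptions; in particular, treating the asterisk on x _{4} as marking an integer hyperparameter is a guess that the text here does not confirm.

```python
# Ranges of x_1, ..., x_4 as listed in the LeNet hyperparameter table.
LENET_RANGES = [(1.0, 4.0), (0.5, 2.0), (0.001, 0.01), (256.0, 1024.0)]


def decode_lenet(u):
    """Map u in [0, 1]^4 to LeNet hyperparameter values."""
    x = [lo + ui * (hi - lo) for ui, (lo, hi) in zip(u, LENET_RANGES)]
    return {
        "learning_rate": 0.1 ** x[0],        # x_1 in [1, 4] gives 1e-1 down to 1e-4
        "momentum": 1.0 - 0.1 ** x[1],       # x_2 in [0.5, 2] gives about 0.68 up to 0.99
        "weight_decay": x[2],
        "fc1_units": int(round(x[3])),       # rounding is an assumption
    }


low = decode_lenet([0.0, 0.0, 0.0, 0.0])
high = decode_lenet([1.0, 1.0, 1.0, 1.0])
```

Searching over the exponents rather than the raw values spreads candidate learning rates evenly on a logarithmic scale.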
Network architecture of Batch-Normalized Maxout Network in Network [25]
Conv 1 | Kernel size: 5, stride: 1, pad: 2, BN |
MMLP 1-1 | Kernel size: 1, stride: 1, pad: 0, k = 5, BN |
MMLP 1-2 | Kernel size: 1, stride: 1, pad: 0, k = 5, BN |
Pool 1 (AVE pooling) | Kernel size: 3, stride: 2, pad: 0, dropout |
Conv 2 | Kernel size: 5, stride: 1, pad: 2, BN |
MMLP 2-1 | Kernel size: 1, stride: 1, pad: 0, k = 5, BN |
MMLP 2-2 | Kernel size: 1, stride: 1, pad: 0, k = 5, BN |
Pool 2 (AVE pooling) | Kernel size: 3, stride: 2, pad: 0, dropout |
Conv 3 | Kernel size: 3, stride: 1, pad: 1, BN |
MMLP 3-1 | Kernel size: 1, stride: 1, pad: 0, k = 5, BN |
MMLP 3-2 | Kernel size: 1, stride: 1, pad: 0, k = 5, BN |
Pool 3 (AVE pooling) |
Fixed parameters of Batch-Normalized Maxout Network in Network
Name | Description |
---|---|
Iteration | 20,000 |
Batch size | 100 |
Learning rate decay policy | Multistep (gamma=0.1, step values={15,000, 18,000}) [29] |
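Under Caffe's `multistep` policy, the learning rate is multiplied by gamma each time the iteration count reaches one of the step values. A minimal sketch with the values from the table follows; the function name and the example base learning rate are illustrative.

```python
def multistep_learning_rate(base_lr, iteration, gamma=0.1, step_values=(15000, 18000)):
    """Caffe 'multistep' policy: base_lr * gamma ** (number of step values reached)."""
    reached = sum(1 for s in step_values if iteration >= s)
    return base_lr * gamma ** reached


# Example with a hypothetical base learning rate of 0.1.
schedule = [multistep_learning_rate(0.1, it) for it in (0, 14999, 15000, 18000, 19999)]
```

With 20,000 training iterations, the rate is held constant for the first 15,000, reduced tenfold for the next 3,000, and reduced tenfold again for the final 2,000.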
Batch-Normalized Maxout Network in Network hyperparameters
Name | Description | Range |
---|---|---|
x _{1} | Learning rate (\(= 0.1^{x_{1}}\phantom {\dot {i}\!}\)) | [ 0.5,2] |
x _{2} | Momentum (\(= 1 - 0.1^{x_{2}}\phantom {\dot {i}\!}\)) | [ 0.5,2] |
x _{3} | L2 weight decay | [ 0.001,0.01] |
x _{4} | Dropout 1 | [ 0.4,0.6] |
x _{5} | Dropout 2 | [ 0.4,0.6] |
x _{6} | Conv 1 initialization deviation | [ 0.01,0.05] |
x _{7} | Conv 2 initialization deviation | [ 0.01,0.05] |
x _{8} | Conv 3 initialization deviation | [ 0.01,0.05] |
x _{9} | MMLP 1-1 initialization deviation | [ 0.01,0.05] |
x _{10} | MMLP 1-2 initialization deviation | [ 0.01,0.05] |
x _{11} | MMLP 2-1 initialization deviation | [ 0.01,0.05] |
x _{12} | MMLP 2-2 initialization deviation | [ 0.01,0.05] |
x _{13} | MMLP 3-1 initialization deviation | [ 0.01,0.05] |
x _{14} | MMLP 3-2 initialization deviation | [ 0.01,0.05] |
5.2 Age and gender classification
Network architecture of the age/gender classification CNN [26]
Conv 1 | Kernel size: 7, stride: 4, pad: 0 |
Pool 1 (MAX pooling) | Kernel size: 3, stride: 2, pad: 0 |
Conv 2 | Kernel size: 5, stride: 1, pad: 2 |
Pool 2 (MAX pooling) | Kernel size: 3, stride: 2, pad: 0 |
Conv 3 | Kernel size: 3, stride: 1, pad: 1 |
Pool 3 (MAX pooling) | Kernel size: 3, stride: 2, pad: 0 |
FC 1 | Dropout |
FC 2 | Dropout |
FC 3 |
Fixed parameters of the age/gender classification CNN
Name | Description |
---|---|
Iteration | 20,000 |
Batch size | 50 |
Learning rate decay policy | Step (gamma=0.1, step size=10,000) [29] |
Hyperparameters of the age/gender classification CNN
Name | Description | Range |
---|---|---|
x _{1} | Learning rate (\(= 0.1^{x_{1}}\phantom {\dot {i}\!}\)) | [ 1,4] |
x _{2} | Momentum (\(= 1 - 0.1^{x_{2}}\phantom {\dot {i}\!}\)) | [ 0.5,2] |
x _{3} | L2 weight decay | [ 0.001,0.01] |
x _{4} | Dropout 1 | [ 0.4,0.6] |
x _{5} | Dropout 2 | [ 0.4,0.6] |
\(x_{6}^{*}\) | FC 1 units | [ 512,1024] |
\(x_{7}^{*}\) | FC 2 units | [ 256,512] |
x _{8} | Conv 1 initialization deviation | [ 0.01,0.05] |
x _{9} | Conv 2 initialization deviation | [ 0.01,0.05] |
x _{10} | Conv 3 initialization deviation | [ 0.01,0.05] |
x _{11} | FC 1 initialization deviation | [ 0.001,0.01] |
x _{12} | FC 2 initialization deviation | [ 0.001,0.01] |
x _{13} | FC 3 initialization deviation | [ 0.001,0.01] |
x _{14} | Conv 1 bias | [ 0,1] |
x _{15} | Conv 2 bias | [ 0,1] |
x _{16} | Conv 3 bias | [ 0,1] |
x _{17} | FC 1 bias | [ 0,1] |
x _{18} | FC 2 bias | [ 0,1] |
\(x_{19}^{*}\) | Normalization 1 localsize (=2x _{19}+3) | [ 0,2] |
\(x_{20}^{*}\) | Normalization 2 localsize (=2x _{20}+3) | [ 0,2] |
x _{21} | Normalization 1 alpha | [ 0.0001,0.0002] |
x _{22} | Normalization 2 alpha | [ 0.0001,0.0002] |
x _{23} | Normalization 1 beta | [ 0.5,0.95] |
x _{24} | Normalization 2 beta | [ 0.5,0.95] |
5.3 Results
MNIST results (LeNet)
Method | Mean loss | Min loss |
---|---|---|
Random search | 0.005411 (±0.001413) | 0.002781 |
Bayesian optimization | 0.004217 (±0.002242) | 0.000089 |
CMA-ES | 0.000926 (±0.001420) | 0.000047 |
Coordinate-search method | 0.000052 (±0.000094) | 0.000002 |
Nelder-Mead method | 0.000029 (±0.000029) | 0.000004 |
Method | Mean accuracy (%) | Accuracy with min loss (%) |
---|---|---|
Random search | 98.98 (±0.08) | 99.06 |
Bayesian optimization | 99.07 (±0.02) | 99.25 |
CMA-ES | 99.20 (±0.08) | 99.30 |
Coordinate-search method | 99.26 (±0.05) | 99.35 |
Nelder-Mead method | 99.24 (±0.04) | 99.28 |
MNIST Results (Batch-Normalized Maxout Network in Network)
Method | Mean loss | Min loss |
---|---|---|
Random search | 0.045438 (±0.002142) | 0.042694 |
Bayesian optimization | 0.045636 (±0.001197) | 0.044447 |
CMA-ES | 0.045248 (±0.002537) | 0.042250 |
Coordinate-search method | 0.045131 (±0.001088) | 0.043639 |
Nelder-Mead method | 0.044549 (±0.001079) | 0.043238 |
Method | Mean accuracy (%) | Accuracy with min loss (%) |
---|---|---|
Random search | 99.56 (±0.02) | 99.58 |
Bayesian optimization | 99.47 (±0.05) | 99.59 |
CMA-ES | 99.49 (±0.14) | 99.59 |
Coordinate-search method | 99.48 (±0.04) | 99.53 |
Nelder-Mead method | 99.53 (±0.00) | 99.54 |
Gender classification results
Method | Mean loss | Min loss |
---|---|---|
Random search | 0.001732 (±0.000540) | 0.000984 |
Bayesian optimization | 0.00183 (±0.000547) | 0.001097 |
CMA-ES | 0.001804 (±0.000480) | 0.001249 |
Coordinate-search method | 0.002240 (±0.001448) | 0.000378 |
Nelder-Mead method | 0.000395 (±0.000129) | 0.000245 |
Method | Mean accuracy (%) | Accuracy with min loss (%) |
---|---|---|
Random search | 87.93 (±0.24) | 88.21 |
Bayesian optimization | 88.07 (±0.27) | 87.85 |
CMA-ES | 88.20 (±0.38) | 88.55 |
Coordinate-search method | 87.04 (±0.52) | 87.72 |
Nelder-Mead method | 88.38 (±0.47) | 88.83 |
Age classification results
Method | Mean loss | Min loss |
---|---|---|
Random search | 0.035694 (±0.006958) | 0.026563 |
Bayesian optimization | 0.024792 (±0.003076) | 0.020466 |
CMA-ES | 0.031244 (±0.010834) | 0.016952 |
Coordinate-search method | 0.032244 (±0.006109) | 0.024637 |
Nelder-Mead method | 0.015492 (±0.002276) | 0.013556 |
Method | Mean accuracy (%) | Accuracy with min loss (%) |
---|---|---|
Random search | 57.18 (±0.96) | 57.90 |
Bayesian optimization | 56.28 (±1.68) | 57.19 |
CMA-ES | 57.17 (±0.80) | 58.19 |
Coordinate-search method | 55.06 (±2.31) | 56.98 |
Nelder-Mead method | 56.72 (±0.50) | 57.42 |
The coordinate-search method also achieves good results with LeNet and Batch-Normalized Maxout Network in Network. However, the coordinate-search method searches points using each vector of the positive basis; thus, convergence speed is reduced as the number of dimensions increases. This appears to be the reason why the coordinate-search method does not work for the age/gender classification CNN, which has more hyperparameters. Thus, we should use the Nelder-Mead method rather than the coordinate-search method. As demonstrated in the literature [9], CMA-ES is superior to random search because it finds better parameters earlier. Despite using the same hyperparameters for Bayesian optimization, the method works well for age estimation but not for other tasks. This indicates that, for Bayesian optimization, hyperparameters should be adjusted carefully depending on the given task.
Experimental results showed that the Nelder-Mead method converged to a different point in every run, which suggests that the objective function is multimodal. Different hyperparameter settings achieved similar losses. From Table 13 and Fig. 13, we deduce that many local optima that achieve similar results exist. In such cases, the Nelder-Mead method tends to converge directly to a nearby local optimum without being influenced by the objective function values of distant points. In contrast, the other methods perform a global search, e.g., Bayesian optimization and CMA-ES try to find potential candidates for global optima and require more iterations to find a local optimum than the Nelder-Mead method.
Poor hyperparameter setting detection rate for each method (LeNet)
Method | Detection rate |
---|---|
Random search | 0.144722 (±0.010690) |
Bayesian optimization | 0.066041 (±0.013354) |
Coordinate-search method | 0.080674 (±0.041299) |
CMA-ES | 0.111304 (±0.098500) |
Nelder-Mead method | 0.003151 (±0.003056) |
Poor hyperparameter setting detection rate for each method (Batch-Normalized Maxout Network in Network)
Method | Detection rate |
---|---|
Random search | 0 (±0) |
Bayesian optimization | 0.004943 (±0.001991) |
Coordinate-search method | 0.048941 (±0.054884) |
CMA-ES | 0 (±0) |
Nelder-Mead method | 0 (±0) |
Poor hyperparameter setting detection rate for each method (gender classification CNN)
Method | Detection rate |
---|---|
Random search | 0.506177 (±0.015931) |
Bayesian optimization | 0.445244 (±0.005658) |
Coordinate-search method | 0.274671 (±0.198506) |
CMA-ES | 0.360734 (±0.091648) |
Nelder-Mead method | 0.051413 (±0.006113) |
Poor hyperparameter setting detection rate for each method (age classification CNN)
Method | Detection rate |
---|---|
Random search | 0.444667 (±0.039355) |
Bayesian optimization | 0.355933 (±0.008577) |
Coordinate-search method | 0.147418 (±0.019866) |
CMA-ES | 0.317533 (±0.207479) |
Nelder-Mead method | 0.040334 (±0.004082) |
Note that the Nelder-Mead method rarely generates poor hyperparameter settings because of its strategy, e.g., reflection moves the simplex in a direction away from the point of poor hyperparameter settings.
From the above results, we conclude that the Nelder-Mead method is the best choice for DNN hyperparameter optimization.
6 Conclusions
In this study, we tested methods for DNN hyperparameter optimization. We showed that the Nelder-Mead method achieved good results in all experiments. Moreover, we achieved state-of-the-art accuracy with age/gender classification using the Adience DB by optimizing the CNN hyperparameters proposed in [26].
Complicated hyperparameter optimization methods are difficult to implement and have sensitive hyperparameters, which affects their performance. Therefore, it is difficult for non-experts to use these methods. In contrast, the Nelder-Mead method is easy to use and outperforms such complicated methods in many cases.
In our experiments, we optimized the hyperparameters of DNNs for character recognition and age/gender classification. These tasks are important and have been well known for a long time. However, it is desirable to evaluate the proposed method using a generic object recognition data set. Therefore, in the future, we plan to evaluate the proposed method using other data sets. A detailed analysis of the dependency on initial parameters and the optimization of categorical variables will also be the focus of future work.
Declarations
Acknowledgements
This paper is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).
Authors’ contributions
YO implemented the hyperparameter optimization methods for DNNs, performed the experiments, and drafted the manuscript. MY implemented the hyperparameter optimization methods for DNNs and helped perform the experiments and draft the manuscript. MO guided the work, supervised the experimental design, and helped draft the manuscript. All authors have read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
- LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324.
- Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst (NIPS) 25:1097–1105.
- Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. Comput Vis Pattern Recognit (CVPR):1–9. http://ieeexplore.ieee.org/document/7298594/.
- Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13:281–305.
- Mockus J (1974) On Bayesian Methods for Seeking the Extremum. In: Optimization Techniques IFIP Technical Conference, 400–404.
- Hansen N, Ostermeier A (2001) Completely derandomized self-adaptation in evolution strategies. Evol Comput 9:159–195.
- Bergstra J, Bardenet R, Bengio Y, Kégl B (2011) Algorithms for hyper-parameter optimization. Adv Neural Inf Process Syst (NIPS) 24:2546–2554.
- Snoek J, Larochelle H, Adams RP (2012) Practical Bayesian optimization of machine learning algorithms. Adv Neural Inf Process Syst (NIPS) 25:2951–2959.
- Watanabe S, Le Roux J (2014) A black box optimization for automatic speech recognition. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 3256–3260.
- Loshchilov I, Hutter F (2016) CMA-ES for hyperparameter optimization of deep neural networks. https://arxiv.org/abs/1604.07269. Accessed 20 Sept 2017.
- Brochu E, Cora VM, De Freitas N (2010) A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. https://arxiv.org/abs/1012.2599. Accessed 20 Sept 2017.
- Snoek J, Rippel O, Swersky K, Kiros R, Satish N, Sundaram N, Patwary M, Prabhat M, Adams R (2015) Scalable Bayesian optimization using deep neural networks. In: International Conference on Machine Learning (ICML), 2171–2180.
- Hansen N, Auger A, Ros R, Finck S, Pošík P (2010) Comparing results of 31 algorithms from the black-box optimization benchmarking BBOB-2009. In: Proceedings of the 12th Annual Conference Companion on Genetic and Evolutionary Computation, 1689–1696.
- Conn AR, Scheinberg K, Vicente LN (2009) Introduction to derivative-free optimization. MPS-SIAM Ser Optim. http://epubs.siam.org/doi/book/10.1137/1.9780898718768.
- Nelder JA, Mead R (1965) A simplex method for function minimization. Comput J 7:308–313.
- Cohen G, Ruch P, Hilario M (2005) Model selection for support vector classifiers via direct simplex search. The Florida Artificial Intelligence Research Society (FLAIRS) Conference:431–435.
- Domhan T, Springenberg JT, Hutter F (2015) Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In: Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 3460–3468.
- Klein A, Falkner S, Springenberg JT, Hutter F (2016) Bayesian neural networks for predicting learning curves. Workshop on Bayesian Deep Learning, NIPS. http://bayesiandeeplearning.org/2016/papers/BDL_38.pdf.
- Klein A, Falkner S, Springenberg JT, Hutter F (2017) Learning curve prediction with bayesian neural networks. International Conference on Learning Representations (ICLR). http://www.iclr.cc/doku.php?id=iclr2017:conference_posters#tuesday_morning.
- De Rainville FM, Fortin FA, Gardner MA, Parizeau M, Gagné C (2012) “DEAP: A Python Framework for Evolutionary Algorithms” In: EvoSoft Workshop, Companion proc. of the Genetic and Evolutionary Computation Conference (GECCO). https://dl.acm.org/citation.cfm?id=2330799.
- LeCun Y, Cortes C (2010) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/. Accessed 20 Sept 2017.
- Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embedding. https://arxiv.org/abs/1408.5093. Accessed 20 Sept 2017.
- Evan S (2014) Training LeNet on MNIST with Caffe. http://caffe.berkeleyvision.org/gathered/examples/mnist.html. Accessed 20 Sept 2017.
- Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. International Conference on Machine Learning. https://dl.acm.org/citation.cfm?id=3104425.
- Chang JR, Chen YS (2015) Batch-Normalized Maxout Network in Network In: Proceedings of the 33rd International Conference on Machine Learning. https://arxiv.org/abs/1511.02583. Accessed 20 Sept 2017.
- Levi G, Hassner T (2015) Age and gender classification using convolutional neural networks. Computer Vision and Pattern Recognition Workshops (CVPRW). http://ieeexplore.ieee.org/document/7301352/.
- Levi G, Hassner T (2015) Age and gender classification using convolutional neural networks. http://www.openu.ac.il/home/hassner/projects/cnn_agegender. Accessed 20 Sept 2017.
- Eidinger E, Enbar R, Hassner T (2014) Age and gender estimation of unfiltered faces. IEEE Trans Inf Forensic Secur 9(12):2170–2179.
- Yangjin J (2013) The learning rate decay policy. https://github.com/BVLC/caffe/blob/master/src/caffe/proto/caffe.proto. Accessed 20 Sept 2017.