Effective hyperparameter optimization using Nelder-Mead method in deep learning

Ozaki, Yoshihiko; Yano, Masaki; Onishi, Masaki

doi:10.1186/s41074-017-0030-7

Research Paper
Open access
Published: 10 November 2017


IPSJ Transactions on Computer Vision and Applications volume 9, Article number: 20 (2017)

Abstract

In deep learning, deep neural network (DNN) hyperparameters can severely affect network performance. Currently, such hyperparameters are frequently optimized by several methods, such as Bayesian optimization and the covariance matrix adaptation evolution strategy. However, it is difficult for non-experts to employ these methods. In this paper, we adapted the simpler coordinate-search and Nelder-Mead methods to optimize hyperparameters. Several hyperparameter optimization methods were compared by configuring DNNs for character recognition and age/gender classification. Numerical results demonstrated that the Nelder-Mead method outperforms the other methods and achieves state-of-the-art accuracy for age/gender classification.

1 Introduction

The evolution of deep neural networks (DNNs) has dramatically improved the accuracy of character recognition [1], object recognition [2, 3], and other tasks. However, their increasing complexity increases the number of hyperparameters, which makes hyperparameter tuning an intractable task.

Traditionally, DNN hyperparameters are adjusted using manual search, grid search, or random search [4]. However, the search space expands exponentially with the number of hyperparameters; thus, such naive methods no longer work well. Therefore, more sophisticated hyperparameter optimization methods are required.

In deep learning, a hyperparameter optimization problem can be formulated as a stochastic black box optimization problem to minimize a noisy black box objective function f(x):

$$ \text{Minimize} \; f(\mathbf{x}) \quad (\mathbf{x} \in \chi). $$

(1)

Here, the only information available about the objective function is its value at a point x, observed with noise ε as follows:

$$ y = f(\mathbf{x}) + \epsilon, \quad \epsilon \stackrel{iid}{\sim} \mathcal{N}\left(0, \sigma_{\text{noise}}^{2}\right). $$

(2)

This means that no analytical properties of the objective function, e.g., its derivatives, are available to the optimizer. In addition, a loss function of the target DNN is typically chosen as f(x), and each evaluation is expensive because it requires training and testing the DNN. The search space χ comprises combinations of several kinds of variables, such as real numbers, integers, and categories.

Currently, Bayesian optimization [5] and the covariance matrix adaptation evolution strategy (CMA-ES) [6] are considered the most promising methods for DNN hyperparameter optimization, and their optimization ability has been proven experimentally [7–10]. However, Bayesian optimization has some hyperparameters that significantly affect its optimization performance, e.g., choices of its kernel and acquisition function. Moreover, maximizing a non-convex acquisition function is required for each iteration of the optimization process. On the other hand, CMA-ES requires several populations and generations for sufficient performance. Although such calculations can be parallelized easily, significant computing resources are required.

Nevertheless, the simple classical methods, i.e., manual search, grid search, and random search, remain common. We therefore consider that most practitioners are unwilling to implement a difficult optimization method or to adjust its hyperparameters, and that they do not have sufficient computing resources to optimize DNN hyperparameters with such methods.

In this paper, we describe simple substitutional methods, i.e., the coordinate-search and Nelder-Mead methods, for hyperparameter optimization in deep learning. To the best of our knowledge, no report has examined the application of these methods to hyperparameter optimization.

Our numerical results indicate that these methods are more efficient than other well-known methods. In particular, the Nelder-Mead method is the most effective for deep learning.

2 Related work

2.1 Random search

Random search is one of the simplest ways to optimize DNN hyperparameters. This method iteratively generates hyperparameter settings and evaluates the objective function. Random search has excellent parallelization and can handle integer and categorical hyperparameters naturally. Bergstra and Bengio demonstrated that random search outperforms a manual search by a human expert and grid search [4].
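As a minimal sketch of the procedure just described, the following Python code samples settings uniformly from a box and keeps the best one. The function `toy_loss` is a hypothetical stand-in for the expensive DNN loss, and the uniform sampling distribution and the names used here are assumptions made for illustration.

```python
import random

def random_search(objective, bounds, n_evals, seed=0):
    """Sample n_evals points uniformly inside `bounds` and keep the best one."""
    rng = random.Random(seed)
    best_x, best_y = None, float("inf")
    for _ in range(n_evals):
        # Each hyperparameter is drawn independently from its own range.
        x = [rng.uniform(low, high) for low, high in bounds]
        y = objective(x)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y

def toy_loss(x):
    """Hypothetical cheap objective standing in for a DNN validation loss."""
    return (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2

best_x, best_y = random_search(toy_loss, [(0.0, 1.0), (0.0, 1.0)], n_evals=200)
```

Because the samples are independent of one another, all `n_evals` evaluations could in principle be run in parallel.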

2.2 Bayesian optimization

Bayesian optimization is one of the most remarkable hyperparameter optimization methods in recent years. Its base concept was proposed in the 1970s; however, it has been significantly improved since then due to the attention paid to DNN hyperparameter optimization.

There are several variations of Bayesian optimization, e.g., Gaussian process (GP)-based variation [11], Tree-structured Parzen Estimator (TPE) [7], and Deep Networks for Global Optimization (DNGO) [12]. The most standard one is the GP-based variation.

GP-based Bayesian optimization is shown in Algorithm 1.

In this method, we assume that an objective function follows the GP specified by its mean function m and kernel k:

$$ f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')). $$

(3)

For simplicity, we assume $m(\mathbf{x}) = 0$. Then, we must consider the kernel $k(\mathbf{x}, \mathbf{x}')$. For the kernel, an automatic relevance determination (ARD) squared exponential (SE) kernel

$$ \begin{aligned} &K_{\text{SE}}(\mathbf{x}, \mathbf{x}') = \theta_{0} \exp \left(-\frac{1}{2}r^{2}(\mathbf{x}, \mathbf{x}') \right),\\ &\text{where}\ r^{2}(\mathbf{x}, \mathbf{x}') = \sum_{d = 1}^{D}\left(x_{d} - x_{d}'\right)^{2} / \theta_{d}^{2}, \end{aligned} $$

(4)

or ARD Matérn 5/2 kernel

$$ K_{\text{M52}}(\mathbf{x}, \mathbf{x}') = \theta_{0} \left(1 + \sqrt{5r^{2}(\mathbf{x}, \mathbf{x}')} + \frac{5}{3}r^{2}(\mathbf{x}, \mathbf{x}')\right) \exp \left(-\sqrt{5r^{2}(\mathbf{x}, \mathbf{x}')}\right), $$

(5)

is commonly used in Bayesian optimization [8]. Here, $\theta_{0}, \theta_{1}, \ldots, \theta_{D}$ are the kernel's hyperparameters.

Once $k(\mathbf{x}, \mathbf{x}')$ is determined, we can predict information about a new sample point $\mathbf{x}_{t+1}$ from previous observations $\mathcal{D}_{1:t} = \{\mathbf{x}_{1:t}, y_{1:t}\}$:
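The two kernels can be sketched in a few lines of Python. The Matérn 5/2 function below uses the standard form $\theta_{0}\,(1 + \sqrt{5r^{2}} + \tfrac{5}{3}r^{2})\exp(-\sqrt{5r^{2}})$; the function names and the example values are assumptions made for illustration.

```python
import math

def scaled_sq_dist(x, x2, lengthscales):
    """r^2(x, x') from Eq. (4): squared distance with one length scale per dimension."""
    return sum(((a - b) / l) ** 2 for a, b, l in zip(x, x2, lengthscales))

def k_se(x, x2, theta0, lengthscales):
    """ARD squared exponential kernel."""
    return theta0 * math.exp(-0.5 * scaled_sq_dist(x, x2, lengthscales))

def k_m52(x, x2, theta0, lengthscales):
    """ARD Matern 5/2 kernel."""
    r2 = scaled_sq_dist(x, x2, lengthscales)
    s = math.sqrt(5.0 * r2)
    return theta0 * (1.0 + s + 5.0 / 3.0 * r2) * math.exp(-s)

# Example values (hypothetical): two points in 2-D with different length scales.
x, x2, ls = [0.2, 0.5], [0.4, 0.1], [0.5, 1.0]
se_value = k_se(x, x2, 1.0, ls)
m52_value = k_m52(x, x2, 1.0, ls)
```

Both kernels equal $\theta_{0}$ when the two points coincide and decay as the scaled distance grows; a small length scale $\theta_{d}$ makes the kernel more sensitive to dimension $d$.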

$$\begin{array}{@{}rcl@{}} &&P(y_{t+1} | \mathcal{D}_{1:t}, \mathbf{x}_{t+1}) \,=\, \mathcal{N}\!\left(\mu_{t}(\mathbf{x}_{t+1}), \sigma_{t}^{2}(\mathbf{x}_{t+1}) \,+\, \sigma^{2}_{\text{noise}}\right)\!, \end{array} $$

(6)

$$ \mu_{t}(\mathbf{x}_{t+1}) = \mathbf{k}^{T}\left[\mathbf{K} + \sigma^{2}_{\text{noise}}\mathbf{I}\right]^{-1}\mathbf{y}_{1:t}, $$

(7)

$$\begin{array}{@{}rcl@{}} &&\sigma^{2}_{t}(\mathbf{x}_{t+1}) = k(\mathbf{x}_{t+1}, \mathbf{x}_{t+1}) - \mathbf{k}^{T}\left[\mathbf{K} + \sigma^{2}_{\text{noise}}\mathbf{I}\right]^{-1}\mathbf{k}\\ &&\text{where\ } \begin{aligned} \mathbf{K} = \left[ \begin{array}{ccc} k(\mathbf{x}_{1}, \mathbf{x}_{1}) & \cdots & k(\mathbf{x}_{1}, \mathbf{x}_{t})\\ \vdots & \ddots & \vdots \\ k(\mathbf{x}_{t}, \mathbf{x}_{1}) & \cdots & k(\mathbf{x}_{t}, \mathbf{x}_{t}) \end{array} \right] + \sigma^{2}_{\text{noise}}\mathbf{I},\\ \mathbf{k} = [k(\mathbf{x}_{t+1}, \mathbf{x}_{1}) \; k(\mathbf{x}_{t+1}, \mathbf{x}_{2}) \; \cdots \; k(\mathbf{x}_{t+1}, \mathbf{x}_{t})]. \end{aligned} \end{array} $$

(8)
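The predictive mean and variance can be computed as in the following sketch. It is written under stated assumptions: an isotropic SE kernel with hypothetical values for its parameters, a kernel matrix to which the noise variance is added exactly once on the diagonal, and a small hand-written linear solver so that the example needs only the standard library.

```python
import math

def se_kernel(a, b, theta0=1.0, length=0.3):
    """Isotropic squared exponential kernel (a simplification of Eq. (4))."""
    r2 = sum((u - v) ** 2 for u, v in zip(a, b)) / length ** 2
    return theta0 * math.exp(-0.5 * r2)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def gp_posterior(xs, ys, x_new, noise_var=1e-6, kernel=se_kernel):
    """Posterior mean and variance of f at x_new given observations (xs, ys)."""
    t = len(xs)
    # Kernel matrix of the observed points plus the noise variance on the diagonal.
    A = [[kernel(xs[i], xs[j]) + (noise_var if i == j else 0.0) for j in range(t)]
         for i in range(t)]
    k = [kernel(x_new, xi) for xi in xs]
    alpha = solve(A, ys)  # A^{-1} y
    v = solve(A, k)       # A^{-1} k
    mean = sum(ki * ai for ki, ai in zip(k, alpha))
    var = kernel(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, v))
    return mean, max(var, 0.0)  # clamp tiny negative values caused by rounding

xs = [[0.1], [0.5], [0.9]]
ys = [0.8, 0.2, 0.6]
mean_at_obs, var_at_obs = gp_posterior(xs, ys, [0.5])
mean_far, var_far = gp_posterior(xs, ys, [5.0])
```

At an observed point the mean reproduces the observation and the variance is close to zero; far from all observations the prediction falls back to the prior, with mean 0 and variance $\theta_{0}$.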

The remaining problem is how to determine new sample points iteratively. To determine new candidates for a sample, we generally employ an acquisition function. Here, it is necessary to select an acquisition function that achieves a good balance between exploration and exploitation based on past observations. One well-known acquisition function is expected improvement (EI):

$$\begin{array}{@{}rcl@{}} &&\text{EI}(\mathbf{x}) = \left\{\begin{array}{ll} (\mu(\mathbf{x}) - f(\mathbf{x}^{+})) \Phi(Z) + \sigma(\mathbf{x})\phi(Z) & (\sigma(\mathbf{x}) > 0)\\ 0 & (\sigma(\mathbf{x}) = 0) \end{array}\right.\\ &&\text{where}\ Z = \frac{\mu(\mathbf{x}) - f(\mathbf{x}^{+})}{\sigma(\mathbf{x})}, \; \mathbf{x}^{+} = \text{argmax}_{\mathbf{x}_{i} \in \mathbf{x}_{1:t}} f(\mathbf{x}_{i}),\\ &&\text{and}\ \Phi\ \text{and}\ \phi\ \text{are the standard normal CDF and PDF.} \end{array} $$

(9)

The point that maximizes the acquisition function becomes a new sample point. Although maximizing the non-convex acquisition function is difficult, the evaluation cost of the function is considerably less than that of the original objective function. Therefore, it is easier to handle than the original problem.
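Expected improvement is cheap to evaluate once the posterior mean and standard deviation are known, as the following sketch illustrates. It uses the standard definition, in which the first term is weighted by the standard normal CDF and the second by the standard normal PDF, written in maximization form to match Eq. (9). The function names are assumptions; for a loss that is to be minimized, one common convention is to apply the same formula to the negated objective.

```python
import math

def norm_pdf(z):
    """Standard normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_best):
    """EI at a point with posterior mean mu and standard deviation sigma."""
    if sigma <= 0.0:
        return 0.0  # no uncertainty left, hence no expected improvement
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm_cdf(z) + sigma * norm_pdf(z)

ei_equal_mean = expected_improvement(0.5, 1.0, 0.5)
ei_higher_mean = expected_improvement(1.0, 1.0, 0.5)
ei_no_uncertainty = expected_improvement(1.0, 0.0, 0.5)
```

The first term rewards points whose predicted value already exceeds the incumbent (exploitation), while the second rewards points with large predictive uncertainty (exploration).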

Practically, Bayesian optimization is combined with random search to collect initial observation data.

Bergstra et al. and Snoek et al. performed several computational experiments. The results demonstrated that Bayesian optimization outperforms manual search by a human expert and random search [7, 8].

2.3 Covariance matrix adaptation evolution strategy

While Bayesian optimization has been developed in the machine learning community, CMA-ES has been developed in the optimization community. CMA-ES is a type of evolutionary computation that demonstrates outstanding performance in benchmarks as a state-of-the-art black box optimization method [13].

The $(\mu_{\mathrm{W}}, \lambda)$-CMA-ES [6] is shown in Algorithm 2. It conducts a weighted recombination from the $\mu$ best out of $\lambda$ individuals. The procedure is explained as follows.

(i)
Initialize the mean $\langle \mathbf{x} \rangle_{\mathrm{w}}^{(0)}$ and the standard deviation $\sigma^{(0)}$ of individuals. Set the evolution paths $\mathbf{p}_{\mathrm{c}}^{(0)} = \mathbf{p}_{\sigma}^{(0)} = \mathbf{0}$ and the covariance matrix $\mathbf{C}^{(0)} = \mathbf{I}$.
(ii)
Generate the individuals of generation $g+1$, $\mathbf{x}_{k}^{(g+1)} \; (k = 1,\dots,\lambda)$:
$$\begin{array}{@{}rcl@{}} &&\mathbf{x}^{(g + 1)}_{k} = {\langle \mathbf{x} \rangle}_{\mathrm{w}}^{(g)} + \sigma^{(g)}\mathbf{B}^{(g)}\mathbf{D}^{(g)}\mathbf{z}_{k}^{(g + 1)},\\ &&\text{where\ } \begin{aligned} &{\langle \mathbf{x} \rangle}_{\mathrm{w}}^{(g)} := \frac{1}{\sum_{i = 1}^{\mu} w_{i}}\sum_{i = 1}^{\mu} w_{i}\mathbf{x}_{i:\lambda}^{(g)},\\ &\mathbf{B}^{(g)}\mathbf{D}^{(g)}\left(\mathbf{B}^{(g)}\mathbf{D}^{(g)}\right)^{T} = \mathbf{C}^{(g)},\\ &\mathbf{z}_{k}^{(g + 1)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \end{aligned} \end{array} $$
(10)

Here, $w_{1}, \ldots, w_{\mu}$ are weights, and $i:\lambda$ denotes the $i$th best individual.
(iii)
Update the evolution path $\mathbf{p}^{(g)}_{\mathrm{c}}$ and the covariance matrix $\mathbf{C}^{(g)}$:
$$\begin{array}{@{}rcl@{}} {}&&\mathbf{p}^{(g+1)}_{\mathrm{c}} = (1\! -\! c_{\mathrm{c}}) \mathbf{p}_{\mathrm{c}}^{(g)} \!+ c_{\mathrm{c}}^{\mathrm{u}} c_{\mathrm{w}} \mathbf{B}^{(g)}\mathbf{D}^{(g)} {\langle \mathbf{z} \rangle}_{\mathrm{w}}^{(g + 1)}\!, \end{array} $$
(11)

$$\begin{array}{@{}rcl@{}} {}&&\mathbf{C}^{(g+1)} = (1\! -\! c_{\text{cov}}) \mathbf{C}^{(g)} \!+ c_{\text{cov}} \mathbf{p}^{(g + 1)}_{\mathrm{c}}\left(\mathbf{p}_{\mathrm{c}}^{(g + 1)}\right)^{T}\!\!,\\ {}&&\text{where}\ \begin{aligned} &c_{\mathrm{c}}^{\mathrm{u}} := \sqrt{c_{\mathrm{c}}(2 - c_{\mathrm{c}})},\\ &c_{\mathrm{w}} := \frac{\sum_{i = 1}^{\mu} w_{i}}{\sqrt{\sum_{i = 1}^{\mu} w_{i}^{2}}},\\ &{\langle \mathbf{z} \rangle}_{\mathrm{w}}^{(g + 1)} := \frac{1}{\sum_{i = 1}^{\mu} w_{i}}\sum_{i = 1}^{\mu} w_{i} \mathbf{z}_{i : \lambda}^{(g+1)}. \end{aligned} \end{array} $$
(12)

Here, $c_{\mathrm{c}}$ and $c_{\text{cov}}$ are hyperparameters.
(iv)
Update the evolution path $\mathbf{p}^{(g)}_{\sigma}$ and the step size $\sigma^{(g)}$:
$$\begin{array}{@{}rcl@{}} &&\mathbf{p}_{\sigma}^{(g + 1)} = (1 - c_{\mathrm{\sigma}}) \mathbf{p}_{\sigma}^{(g)} + c_{\mathrm{\sigma}}^{\mathrm{u}} c_{\mathrm{w}} \mathbf{B}^{(g)} {\langle \mathbf{z} \rangle}_{\mathrm{w}}^{(g+1)}, \end{array} $$
(13)

$$\begin{array}{@{}rcl@{}} &&\sigma^{(g + 1)} = \sigma^{(g)} \exp\left(\frac{1}{d_{\sigma}} \frac{\|\mathbf{p}_{\sigma}^{(g + 1)}\| - \hat{\chi}_{n}}{\hat{\chi}_{n}} \right),\\ &&\text{where\ } \begin{aligned} &c_{\sigma}^{\mathrm{u}} := \sqrt{c_{\sigma} (2 - c_{\sigma})},\\ &\hat{\chi}_{n} = \mathbb{E}[\|\mathcal{N}(\mathbf{0}, \mathbf{I})\|]. \end{aligned} \end{array} $$
(14)

Here, $c_{\sigma}$ and $d_{\sigma}$ are hyperparameters.

Details about the hyperparameters used to update CMA-ES parameters are provided in the literature [6].
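To make steps (ii) and (iv) concrete, the sketch below implements one generation of a deliberately simplified variant. It is not the full algorithm: the covariance matrix is held at $\mathbf{C} = \mathbf{I}$ (so that $\mathbf{B}\mathbf{D} = \mathbf{I}$ and step (iii) is omitted), and only the weighted recombination and the step-size update are shown. The logarithmic weights, the values of `c_sigma` and `d_sigma`, the closed-form approximation of $\hat{\chi}_{n}$, and the test function `sphere` are all assumptions chosen for illustration.

```python
import math
import random

def one_generation(f, mean, sigma, p_sigma, weights, lam, c_sigma, d_sigma, rng):
    """One generation with C fixed to I: sampling, recombination, step-size update."""
    n = len(mean)
    w_sum = sum(weights)
    c_w = w_sum / math.sqrt(sum(w * w for w in weights))
    # Common approximation of E||N(0, I)|| in n dimensions.
    chi_n = math.sqrt(n) * (1.0 - 1.0 / (4.0 * n) + 1.0 / (21.0 * n * n))
    # Step (ii): sample lambda individuals around the current mean.
    zs = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(lam)]
    xs = [[m + sigma * z for m, z in zip(mean, zk)] for zk in zs]
    # Rank by objective value and keep the indices of the mu best individuals.
    best = sorted(range(lam), key=lambda k: f(xs[k]))[:len(weights)]
    z_w = [sum(w * zs[k][d] for w, k in zip(weights, best)) / w_sum for d in range(n)]
    # Weighted mean of the selected individuals (equals mean + sigma * z_w here).
    new_mean = [m + sigma * z for m, z in zip(mean, z_w)]
    # Step (iv): accumulate the evolution path, then rescale the step size.
    c_u = math.sqrt(c_sigma * (2.0 - c_sigma))
    new_path = [(1.0 - c_sigma) * p + c_u * c_w * z for p, z in zip(p_sigma, z_w)]
    path_norm = math.sqrt(sum(p * p for p in new_path))
    new_sigma = sigma * math.exp((path_norm - chi_n) / (d_sigma * chi_n))
    return new_mean, new_sigma, new_path

def sphere(x):
    """Hypothetical smooth test objective with its minimum at (0.3, 0.7)."""
    return (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2

rng = random.Random(0)
lam, mu = 10, 5
weights = [math.log((lam + 1) / 2) - math.log(i) for i in range(1, mu + 1)]
start = [0.9, 0.1]
mean, sigma, p_sigma = list(start), 0.3, [0.0, 0.0]
for _ in range(60):
    mean, sigma, p_sigma = one_generation(
        sphere, mean, sigma, p_sigma, weights, lam, 0.5, 1.5, rng)
```

The step size grows when the accumulated path is longer than expected under random selection and shrinks when it is shorter. The $\lambda$ evaluations inside one generation are independent and could be run in parallel.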

Since evaluations of each individual for each generation can be calculated simultaneously, CMA-ES can be parallelized easily. Watanabe and Le Roux and Loshchilov et al. demonstrated that CMA-ES outperforms manual search by a human expert and Bayesian optimization in certain cases [9, 10].

3 Coordinate-search and Nelder-Mead methods

In the previous section, we introduced the random search, Bayesian optimization, and CMA-ES methods. Note that the achievements of these methods have already been proven experimentally, and the results indicate that the latter two methods are very promising and considered superior to random search. However, both Bayesian optimization and CMA-ES have many hyperparameters related to their optimization performance. To set these hyperparameters appropriately, it is necessary to have sufficient knowledge about the given method. In addition, Bayesian optimization must maximize its non-convex acquisition function, and CMA-ES requires significant computing resources to exploit its advantages. These factors make it difficult for non-experts to utilize these methods.

In this section, we introduce two optimization methods, coordinate-search and Nelder-Mead, that are easy to implement. These methods have fewer hyperparameters to adjust and are practically usable with fewer computing resources.

3.1 Mathematical concepts

Before introducing the methods, we define some required mathematical concepts here.

Definition 1

The positive span of a set of vectors $[\mathbf {v}_{1} \cdots \mathbf {v}_{r}] \in \mathbb {R}^{n}$ is the convex cone:

$$ \{\mathbf{v} \in \mathbb{R}^{n} : \mathbf{v} = \alpha_{1} \mathbf{v}_{1} + \cdots + \alpha_{r} \mathbf{v}_{r}, \; \alpha_{i} \geq 0, \; i = 1, \ldots, r\}. $$

Definition 2

A positive spanning set in $\mathbb {R}^{n}$ is a set of vectors whose positive span is $\mathbb {R}^{n}$.

Definition 3

The set $[\!\mathbf {v}_{1} \cdots \mathbf {v}_{r}] \in \mathbb {R}^{n}$ is considered positively dependent if one of the vectors is in the positive span of the remaining vectors; otherwise, the set is considered positively independent.

Definition 4

A positive basis for $\mathbb {R}^{n}$ is a positively independent set whose positive span is $\mathbb {R}^{n}$. A positive basis for $\mathbb {R}^{n}$ that has $n+1$ vectors is considered a minimal positive basis, and a positive basis that has $2n$ vectors is considered a maximal positive basis (Fig. 1). Here, a maximal positive basis is denoted as $D_{\oplus}$.

Definition 5

A simplex of dimension $m$ is the convex hull of an affinely independent set of points $Y = \{\mathbf{y}^{0}, \mathbf{y}^{1}, \ldots, \mathbf{y}^{m}\}$.

3.2 Coordinate-search method

The coordinate-search method [14] (Algorithm 3, Fig. 2) is one of the simplest direct search methods. It minimizes its objective function iteratively using the maximal positive basis $D_{\oplus} = [\mathbf{I} \; -\mathbf{I}] = [\mathbf{e}_{1} \cdots \mathbf{e}_{n} \; -\mathbf{e}_{1} \cdots -\mathbf{e}_{n}]$.

This method iteratively performs a poll step to search for a better solution and then updates its step size $\alpha_{k}$.

(i)
Poll step: Order the poll set $P_{k} = \{\mathbf{x}_{k} + \alpha_{k}\mathbf{d} : \mathbf{d} \in D_{\oplus}\}$. Evaluate $f$ at the poll points in order. If a poll point $\mathbf{x}_{k} + \alpha_{k}\mathbf{d}_{k}$ that satisfies the condition $f(\mathbf{x}_{k} + \alpha_{k}\mathbf{d}_{k}) < f(\mathbf{x}_{k})$ is found, then stop polling, set $\mathbf{x}_{k+1} = \mathbf{x}_{k} + \alpha_{k}\mathbf{d}_{k}$, and declare the iteration and poll step successful. If no such point is found, set $\mathbf{x}_{k+1} = \mathbf{x}_{k}$ and declare the iteration unsuccessful.
(ii)
Parameter update: If iteration $k$ succeeds, then set $\alpha_{k+1} = \alpha_{k}$ (or $\alpha_{k+1} = 2\alpha_{k}$). Otherwise, set $\alpha_{k+1} = \alpha_{k}/2$.

When the step size becomes sufficiently small, the search is terminated. Note that the evaluation of functions in the poll step can be parallelized.

The performance of this method deteriorates when the search ranges have different scales; thus, in this study, we normalize parameters to $[0, 1]$ in our computational experiments. In addition, we adopt the updating rule $\alpha_{k+1} = 2\alpha_{k}$ on iteration success and order the vectors of the poll set randomly for each iteration.
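The poll step and the parameter update can be sketched as follows, using the doubling rule on success and a random poll order as described above. Clipping candidates to $[0, 1]$, the initial step size, the stopping threshold `alpha_min`, the evaluation budget, and the objective `toy_loss` are assumptions made for this illustration.

```python
import random

def coordinate_search(f, x0, alpha0=0.25, alpha_min=1e-3, max_evals=500, seed=0):
    """Coordinate search on [0, 1]^n with an opportunistic, randomly ordered poll."""
    rng = random.Random(seed)
    n = len(x0)
    # Maximal positive basis [I, -I]: one step up and one step down per coordinate.
    directions = [(i, s) for i in range(n) for s in (1.0, -1.0)]
    x, fx, alpha, evals = list(x0), f(x0), alpha0, 1
    while alpha >= alpha_min and evals < max_evals:
        rng.shuffle(directions)
        success = False
        for i, s in directions:
            cand = list(x)
            # Assumption: keep the candidate inside the normalized box.
            cand[i] = min(1.0, max(0.0, cand[i] + s * alpha))
            fc = f(cand)
            evals += 1
            if fc < fx:
                # Stop polling at the first improving point.
                x, fx, success = cand, fc, True
                break
        alpha = 2.0 * alpha if success else alpha / 2.0
    return x, fx

def toy_loss(x):
    """Hypothetical cheap objective standing in for a DNN validation loss."""
    return (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2

best_x, best_f = coordinate_search(toy_loss, [0.5, 0.5])
```

Each iteration needs at most $2n$ evaluations, and the only settings to choose are the initial point and the initial step size.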

3.3 Nelder-Mead method

The Nelder-Mead method [14, 15] (Algorithm 4, Fig. 3) is a simplex-based optimization method proposed by Nelder and Mead. Gilles et al. applied this method to the hyperparameter tuning problem in support vector machine (SVM) modeling and demonstrated that it can reliably find very good hyperparameter settings for SVMs [16]. Currently, the Nelder-Mead method is not considered in DNN research; however, it has a long history and many achievements in other research areas [14]. Thus, we think it is worth considering the Nelder-Mead method for DNN hyperparameter optimization. In the study by Gilles et al., the SVM has only two hyperparameters, whereas DNNs often have more than 10 times as many. Our task is therefore more challenging.

The Nelder-Mead method minimizes the objective function by repeating its evaluation at each vertex of the simplex and by replacing points according to the following procedure (Figs. 4 and 5).

(i)
Order: Order the $n+1$ vertices $Y = \{\mathbf{y}^{0}, \mathbf{y}^{1}, \ldots, \mathbf{y}^{n}\}$ as follows:
$$ f^{0} = f(\mathbf{y}^{0}) \leq f^{1} = f(\mathbf{y}^{1}) \leq \cdots \leq f^{n} = f(\mathbf{y}^{n}). $$
(ii)
Reflect: Reflect the worst vertex $\mathbf{y}^{n}$ over the centroid $\mathbf{y}^{c} = \sum_{i = 0}^{n - 1}\mathbf{y}^{i} / n$ of the remaining $n$ vertices:
$$ \mathbf{y}^{r} = \mathbf{y}^{c} + \delta^{r}(\mathbf{y}^{c} - \mathbf{y}^{n}). $$
Evaluate $f^{r} = f(\mathbf{y}^{r})$. If $f^{0} \leq f^{r} < f^{n-1}$, then replace $\mathbf{y}^{n}$ with the reflected point $\mathbf{y}^{r}$ and terminate iteration $k$: $Y_{k+1} = \{\mathbf{y}^{0}, \mathbf{y}^{1}, \ldots, \mathbf{y}^{n-1}, \mathbf{y}^{r}\}$.
(iii)
Expand: If $f^{r} < f^{0}$, calculate:
$$ \mathbf{y}^{e} = \mathbf{y}^{c} + \delta^{e} (\mathbf{y}^{c} - \mathbf{y}^{n}) $$
and evaluate $f^{e} = f(\mathbf{y}^{e})$. If $f^{e} \leq f^{r}$, then replace $\mathbf{y}^{n}$ with the expansion point $\mathbf{y}^{e}$ and terminate iteration $k$: $Y_{k+1} = \{\mathbf{y}^{0}, \mathbf{y}^{1}, \ldots, \mathbf{y}^{n-1}, \mathbf{y}^{e}\}$. Otherwise, replace $\mathbf{y}^{n}$ with the reflected point $\mathbf{y}^{r}$ and terminate iteration $k$: $Y_{k+1} = \{\mathbf{y}^{0}, \mathbf{y}^{1}, \ldots, \mathbf{y}^{n-1}, \mathbf{y}^{r}\}$.
(iv)
Contract: If $f^{r} \geq f^{n-1}$, then a contraction is performed between $\mathbf{y}^{c}$ and the better of $\mathbf{y}^{r}$ and $\mathbf{y}^{n}$.
(a)
  Outside contraction: If $f^{r} < f^{n}$, perform an outside contraction:
  $$ \mathbf{y}^{oc} = \mathbf{y}^{c} + \delta^{oc}(\mathbf{y}^{c} - \mathbf{y}^{n}) $$
  and evaluate $f^{oc} = f(\mathbf{y}^{oc})$. If $f^{oc} \leq f^{r}$, then replace $\mathbf{y}^{n}$ with the outside contraction point $\mathbf{y}^{oc}$ and terminate iteration $k$: $Y_{k+1} = \{\mathbf{y}^{0}, \mathbf{y}^{1}, \ldots, \mathbf{y}^{n-1}, \mathbf{y}^{oc}\}$. Otherwise, perform a shrink.
(b)
  Inside contraction: If $f^{r} \geq f^{n}$, perform an inside contraction:
  $$ \mathbf{y}^{ic} = \mathbf{y}^{c} + \delta^{ic}(\mathbf{y}^{c} - \mathbf{y}^{n}) $$
  and evaluate $f^{ic} = f(\mathbf{y}^{ic})$. If $f^{ic} < f^{n}$, then replace $\mathbf{y}^{n}$ with the inside contraction point $\mathbf{y}^{ic}$ and terminate iteration $k$: $Y_{k+1} = \{\mathbf{y}^{0}, \mathbf{y}^{1}, \ldots, \mathbf{y}^{n-1}, \mathbf{y}^{ic}\}$. Otherwise, perform a shrink.
(v)
Shrink: Evaluate $f$ at the $n$ points $\mathbf{y}^{0} + \gamma^{s}(\mathbf{y}^{i} - \mathbf{y}^{0})$, where $i = 1, \ldots, n$, replace $\mathbf{y}^{1}, \ldots, \mathbf{y}^{n}$ with these points, and terminate iteration $k$: $Y_{k+1} = \{\mathbf{y}^{0} + \gamma^{s}(\mathbf{y}^{i} - \mathbf{y}^{0}), \; i = 0, \ldots, n\}$.

Here, $\gamma^{s}$, $\delta^{ic}$, $\delta^{oc}$, $\delta^{r}$, and $\delta^{e}$ are constant hyperparameters that usually take the following values:

$$ \gamma^{s} = \frac{1}{2}, \delta^{ic} = -\frac{1}{2}, \delta^{oc} = \frac{1}{2}, \delta^{r} = 1\ \text{and}\ \delta^{e} = 2. $$

(15)

Note that the operations that evaluate several points independently, e.g., the initialization and shrink operations, can be parallelized easily.
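The five steps translate almost line by line into code. The following sketch uses the coefficient values of Eq. (15) as defaults; the fixed iteration budget (in place of a convergence test), the initial simplex, and the objective `toy_loss` are assumptions made for illustration.

```python
def nelder_mead(f, simplex, n_iters=200, delta_r=1.0, delta_e=2.0,
                delta_oc=0.5, delta_ic=-0.5, gamma_s=0.5):
    """Run n_iters Nelder-Mead iterations; return (best value, best vertex)."""
    n = len(simplex) - 1
    pts = [(f(y), list(y)) for y in simplex]

    def along(yc, yn, delta):
        # The point y^c + delta * (y^c - y^n) shared by reflect, expand, contract.
        return [c + delta * (c - w) for c, w in zip(yc, yn)]

    for _ in range(n_iters):
        pts.sort(key=lambda p: p[0])                       # (i) order
        f0, fn1 = pts[0][0], pts[-2][0]
        fn, yn = pts[-1]
        yc = [sum(p[1][d] for p in pts[:-1]) / n for d in range(n)]
        yr = along(yc, yn, delta_r)                        # (ii) reflect
        fr = f(yr)
        if f0 <= fr < fn1:
            pts[-1] = (fr, yr)
        elif fr < f0:                                      # (iii) expand
            ye = along(yc, yn, delta_e)
            fe = f(ye)
            pts[-1] = (fe, ye) if fe <= fr else (fr, yr)
        else:                                              # (iv) contract
            if fr < fn:                                    # outside contraction
                y_new = along(yc, yn, delta_oc)
                f_new = f(y_new)
                accept = f_new <= fr
            else:                                          # inside contraction
                y_new = along(yc, yn, delta_ic)
                f_new = f(y_new)
                accept = f_new < fn
            if accept:
                pts[-1] = (f_new, y_new)
            else:                                          # (v) shrink toward y^0
                y0 = pts[0][1]
                shrunk = [[a + gamma_s * (b - a) for a, b in zip(y0, p[1])]
                          for p in pts[1:]]
                pts = [pts[0]] + [(f(y), y) for y in shrunk]
    return min(pts, key=lambda p: p[0])

def toy_loss(x):
    """Hypothetical cheap objective standing in for a DNN validation loss."""
    return (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2

best_f, best_y = nelder_mead(toy_loss, [[0.1, 0.1], [0.9, 0.2], [0.3, 0.9]])
```

Apart from a shrink, each iteration costs only one or two evaluations of the objective, which is what makes the method attractive when every evaluation means training a network.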

4 Poor hyperparameter setting detection

DNNs are very sensitive to hyperparameter settings. As a result, training can fail simply because some hyperparameters, e.g., the learning rate, are slightly inappropriate. When appropriate hyperparameter values are given, training loss is reduced in each iteration (Fig. 6, top graph); otherwise, regardless of how many iterations have been executed, training loss is not reduced (Fig. 6, bottom graph).

The advantage of human experts is that they can detect training failures and terminate them at an early stage. Domhan et al. proposed a method that accelerates hyperparameter optimization methods by detecting and terminating such training failures using learning curve prediction [17]. In addition, Klein et al. proposed a specialized Bayesian neural network to model DNN learning curves [18, 19]. We apply Algorithm 5 to detect training failures at an early stage.

Note that this method does not optimize hyperparameters directly, but accelerates a hyperparameter optimization method. If a large number of training iterations with poor hyperparameter settings appear in the optimization process, this detection process improves the execution time of the optimization method.

In our experiments, we apply this method to all hyperparameter optimization methods with n equaling 10% of the training iterations and t equaling 0.8. These values are chosen based on experience. As can be seen in Fig. 6, the learning curve of a poor hyperparameter setting is distinctive and easy to detect; thus, n and t do not need to be tuned carefully.
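Algorithm 5 is not reproduced in this text, so the following sketch shows only one plausible early-stopping check of this kind and should not be read as the exact criterion used in the experiments. The assumed rule flags a run when, after the first n iterations, the training loss is still above t times its initial value; the function name and the example loss curves are hypothetical.

```python
def is_poor_setting(losses, n, t):
    """Return True if training should be stopped after the first n iterations.

    Assumed rule (not taken from Algorithm 5): the run is flagged when the
    loss at iteration n is still above t times the initial loss.
    """
    if n < 1 or len(losses) < n:
        return False  # not enough iterations observed yet
    return losses[n - 1] > t * losses[0]

# Hypothetical loss curves: one that decreases and one that stays flat.
healthy_curve = [2.30, 1.50, 1.00, 0.70]
stuck_curve = [2.30, 2.30, 2.31, 2.30]
stop_healthy = is_poor_setting(healthy_curve, n=4, t=0.8)
stop_stuck = is_poor_setting(stuck_curve, n=4, t=0.8)
```

A check like this costs nothing beyond the training iterations already performed, so it can be added to any of the optimization methods without changing them.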

5 Numerical results

We perform computational experiments to optimize real and integer hyperparameters in combination with various datasets, tasks, and convolutional neural networks (CNNs) to compare the performance of the random search, Bayesian optimization, CMA-ES, coordinate-search, and Nelder-Mead methods.

The experimental settings for each method are given in Table 1. We use the first 100 random search evaluations to initialize the Bayesian optimization and coordinate-search methods. The number of evaluations and initialization parameters of CMA-ES and Bayesian optimization are determined with reference to the literature [10]. We implement CMA-ES using Distributed Evolutionary Algorithms in Python (DEAP) [20], which is an evolutionary computation framework. In addition, for optimization methods that cannot handle integer values directly, integer hyperparameters are handled as continuous values and rounding is performed when evaluating the objective function.
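The treatment of integer hyperparameters can be illustrated with a short sketch that maps a point of the normalized search space $[0, 1]^{n}$ to concrete values and rounds the integer ones just before the objective is evaluated. The hyperparameter names and ranges are hypothetical and are not those of Table 1.

```python
def decode(unit_vector, space):
    """Map a point in [0, 1]^n to concrete hyperparameter values."""
    values = {}
    for u, (name, low, high, is_int) in zip(unit_vector, space):
        u = min(1.0, max(0.0, u))   # keep the point inside the box
        v = low + u * (high - low)  # undo the normalization
        # Integer hyperparameters are searched as continuous values and
        # rounded only when the objective is evaluated.
        values[name] = int(round(v)) if is_int else v
    return values

# Hypothetical search space: (name, lower bound, upper bound, is integer).
space = [("learning_rate", 0.001, 0.1, False), ("num_filters", 16, 64, True)]
decoded = decode([0.5, 0.26], space)
```

The optimizer itself only ever sees the continuous vector, so the same decoding step can be shared by all the methods that cannot handle integers directly.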

Table 1 Experimental setting for each method
