Hyperparameter Tuning
Hyperparameter tuning refers to the process of choosing optimal values for model parameters that are not updated during learning. Hyperparameters configure either the model itself (e.g., L1/L2 regularization strength, the number of decision trees, or the dropout rate) or the optimization algorithm and cost function (e.g., the learning rate in gradient descent or the batch size in stochastic gradient descent). One can find optimal hyperparameter values manually by trying different values and selecting the best one (known as hand-tuning), or one can search the parameter space systematically. There are several strategies for optimizing hyperparameters; here we briefly discuss the most well-known options:
Grid search works by searching exhaustively through a specified subset of hyperparameters. The search space of each hyperparameter is discretized, and the model is trained for every possible combination of parameter values. The combination that yields the best performance is selected.
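As an illustration, here is a minimal sketch of grid search using scikit-learn's GridSearchCV; the model, parameter names, and value grids are arbitrary choices for the example, not prescriptions:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# discretize each hyperparameter's range; every combination (3 x 3 = 9) is trained
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```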
Random search is a variation of grid search: instead of trying all possible combinations, it randomly samples a subset of them. This reduces processing time but does not fully cover the parameter space, particularly when there are too few sample points or when the selected points fall close to one another. The latter problem can be mitigated by drawing the random sample from a quasi-random sequence, such as the Halton sequence.
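The sketch below draws quasi-random samples of a two-dimensional hyperparameter space with SciPy's Halton sequence; the specific hyperparameters and ranges (learning rate and dropout rate) are illustrative assumptions:

```python
from scipy.stats import qmc  # the qmc module is available in SciPy >= 1.7

# Halton (quasi-random) points cover the unit square more evenly than
# independent uniform draws, avoiding clumps of nearby configurations.
sampler = qmc.Halton(d=2, seed=0)
unit = sampler.random(n=20)                        # 20 points in [0, 1)^2
scaled = qmc.scale(unit, [-4.0, 0.0], [0.0, 0.8])  # per-dimension bounds
configs = [{"lr": 10 ** s[0], "dropout": s[1]} for s in scaled]
for cfg in configs[:3]:
    print(cfg)  # in practice, each configuration would be trained and scored
```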
Neither of the above approaches is guaranteed to converge to an optimal solution within a given precision unless the search space is sampled thoroughly.
Sequential model-based optimization (SMBO) is a probabilistic, model-based approach: it builds a probability model of the objective function that maps hyperparameter values to a probability of a loss. The algorithm selects the next values to evaluate by optimizing this probability model (known as the surrogate function) for the greatest expected improvement. The surrogate is much easier to optimize than the true objective function. After a suitable surrogate is defined, Bayesian optimization is used to find the best hyperparameters. The approach is sequential because it runs trials one after another, each time proposing better hyperparameters by applying Bayesian reasoning to update the surrogate. Common choices of surrogate model are Gaussian processes, random forest regressions, and Tree-structured Parzen Estimators (TPE). Sequential model-based optimization methods are faster and more efficient than manual, random, or grid search (Bergstra, Yamins, and Cox, 2013; Snoek, Larochelle, and Adams, 2012).
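To make the idea concrete, below is a minimal SMBO loop with a Gaussian-process surrogate and an expected-improvement acquisition function, written with scikit-learn; the one-dimensional objective is a purely hypothetical stand-in for a real validation loss:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(lr):
    # hypothetical validation loss as a function of the learning rate
    return np.sin(5 * lr) + 0.5 * lr ** 2

candidates = np.linspace(0.0, 2.0, 500).reshape(-1, 1)
X = np.array([[0.1], [1.0], [1.9]])            # initial design points
y = np.array([objective(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                              normalize_y=True)
for _ in range(20):
    gp.fit(X, y)                               # update the surrogate
    mu, sigma = gp.predict(candidates, return_std=True)
    best = y.min()
    # expected improvement (minimization form)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (best - mu) / sigma
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    ei[sigma < 1e-12] = 0.0                    # no improvement where certain
    x_next = candidates[np.argmax(ei)]         # next trial: maximize EI
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

print("best lr:", X[np.argmin(y)][0], "loss:", y.min())
```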
Population-based methods, such as genetic and evolutionary algorithms, are inspired by the theory of evolution by natural selection. They maintain a population (i.e., a set of configurations) and improve it by applying local perturbations (i.e., mutations) and combinations of different members to obtain a new generation of better configurations. This approach is commonly applied to hyperparameter optimization of probabilistic machine learning models, AutoML, and deep neural network architecture search.
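A toy genetic-algorithm sketch follows; the fitness function is a hypothetical stand-in for a model's validation loss, and the selection, crossover, and mutation rules are simple illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(cfg):
    # hypothetical validation loss for a (learning_rate, dropout) configuration
    lr, dropout = cfg
    return (np.log10(lr) + 2.0) ** 2 + (dropout - 0.3) ** 2

# initial population of random configurations
pop = [(10 ** rng.uniform(-4, 0), rng.uniform(0.0, 0.8)) for _ in range(20)]
for _ in range(30):
    pop.sort(key=fitness)
    survivors = pop[:10]                       # selection: keep the fittest half
    children = []
    for _ in range(10):
        i, j = rng.choice(len(survivors), size=2, replace=False)
        lr, dropout = survivors[i][0], survivors[j][1]    # crossover
        lr *= 10 ** rng.normal(0, 0.1)                    # mutation
        dropout = float(np.clip(dropout + rng.normal(0, 0.05), 0.0, 0.95))
        children.append((lr, dropout))
    pop = survivors + children                 # next generation

print("best configuration:", min(pop, key=fitness))
```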
Gradient-based optimization is another approach, applicable to certain families of machine learning algorithms such as least-squares support vector machines and neural networks. It finds the optimal solution by following the gradient of the model selection criterion with respect to the hyperparameters.
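As a concrete example, the sketch below tunes the ridge penalty by gradient descent on a validation loss, using implicit differentiation of the closed-form ridge solution; the data are synthetic and the step size is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X_tr = rng.normal(size=(80, 5)); y_tr = X_tr @ w_true + rng.normal(0, 0.5, 80)
X_va = rng.normal(size=(40, 5)); y_va = X_va @ w_true + rng.normal(0, 0.5, 40)

lam = 1.0
for _ in range(200):
    A = X_tr.T @ X_tr + lam * np.eye(5)
    w = np.linalg.solve(A, X_tr.T @ y_tr)      # closed-form ridge solution
    resid = X_va @ w - y_va
    dL_dw = X_va.T @ resid / len(y_va)         # validation-loss gradient w.r.t. w
    dw_dlam = -np.linalg.solve(A, w)           # implicit derivative dw/dlambda
    grad = dL_dw @ dw_dlam                     # chain rule: dL/dlambda
    lam = max(lam - 10.0 * grad, 1e-6)         # gradient step (arbitrary size)

print("tuned ridge penalty:", lam)
```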
To take the human out of the loop of ML design, researchers have developed AutoML systems, in which the type of ML model, as well as the features fed into the algorithm, is determined automatically by the machine. This is particularly important as the range of hyperparameter choices has expanded significantly, especially in the estimation of deep neural network models.
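As one illustration, the sketch below assumes the auto-sklearn library's AutoSklearnClassifier interface, which searches over model families, preprocessing steps, and hyperparameters within a time budget; treat the argument names as indicative rather than definitive:

```python
from autosklearn.classification import AutoSklearnClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the search over models and hyperparameters is bounded by an overall
# time budget and a per-model time limit (both in seconds)
automl = AutoSklearnClassifier(time_left_for_this_task=300,
                               per_run_time_limit=30)
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))
```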