How to automate model tuning with HyperOpt
Do you love tuning models? If your answer is “yes”, this post is not for you.
In this blog we will cover the extremely popular automated hyperparameter tuning algorithm called Tree-based Parzen Estimators (TPE). TPE is supported by the open-source package, HyperOpt. By leveraging HyperOpt and TPE, machine learning engineers can quickly develop highly-optimized models without any manual tuning.
Without further ado, let’s dive in!
HyperOpt is an open-source python package that uses an algorithm called Tree-based Parzen Esimtors (TPE) to select model hyperparameters which optimize a user-defined objective function. By simply defining the functional form and bounds of each hyperparameter, TPE thoroughly yet efficiently searches through complex hyperspace to reach optimums.
TPE is a sequential algorithm that leverages bayesian updating and follows the below sequence.
- Train a model with several sets of randomly-selected hyperparameters, returning objective function values.
- Split our observed objective function values into “good” and “bad” groups, according to some threshold gamma (γ).
- Calculate the “promisingness” score, which is just P(x|good) / P(x|bad).
- Determine the hyperparameters that maximize promisingness via mixture models.
- Fit our model using the hyperparameters from step 4.
- Repeat steps 2–5 until a stopping criteria.
Here’s a quick code example.
Ok that was a lot of big words. Let’s slow down and really understand what’s going on.
1.1 — Our Goal
Data scientists are busy. We want to produce really good models, but do so in an efficient and ideally hands-off manner.
However, certain steps in the ML modeling lifecycle are very hard to automate. Exploratory data analysis (EDA) and feature engineering, for instance, are usually subject-specific and require human intuition. Model tuning, on the other hand, is an iterative process where computers can excel.
Our goal throughout this post is to understand how to leverage algorithms to automate the model tuning process.
To help us think about that goal, let’s use an analogy: we’re pirates looking for buried treasure. It’s also important to note that we’re very efficient pirates who are looking to minimize our time searching for the buried treasure. So, how should we minimize time spent searching? The answer is use a map!
In figure 1, we have a fictitious map that shows where our treasure is located. After lots of climbing and digging, it wouldn’t be too hard to reach that treasure because we know exactly where it’s located.
But what happens when we don’t have a map?
When tasked with tuning a model, we unfortunately are not given a map. Our terrain, which corresponds to the hyperparmeter search space, is unknown. Furthermore, the location of our treasure, which corresponds to the optimal set of hyperparameters, is also unknown.
With that setup, let’s talk about some potential ways to efficiently explore this space and find some treasure!
1.2 — Potential Solutions
The original method for model tuning is “manual” — the engineer will actually manually test many different configurations and see which hyperparameter combination produces the best model. While informative, this process is inefficient. There must be a better way…
1.2.1 — Grid Search (worst)
Our first optimization algorithm is grid search. Grid search iteratively tests all possible combinations of hyperparameters within a user-specified grid.
For instance, in figure 2, wherever you see a red dot is where we will retrain and evaluate our model. This framework is inefficient because it reuses bad hyperparameters. For instance, if hyperparameter 2 has little impact on our objective function, we will still test all combinations of its values, thereby increasing the required number of iterations by 10x (in this example).
But before moving on, it’s important to note that grid search is still fairly popular because it’s guaranteed to find an optimum given a correctly specified grid. If you do decide to use the method, make sure you transform your grid to reflect the functional form of your hyperparameters. For instance, max_depth for a random forest classifier is an integer — don’t let it search over a continuous space. It’s also unlikely that it has a uniform distribution — if you know the functional form of your hyperparameter, transform the grid to reflect it.
In summary, grid search is subject to the curse of dimensionality and recalculates information between evaluations, but is still used widely.
1.2.2 — Random Search (good)
Our second algorithm is random search. Random search tries random values within a user-specified grid. Unlike with grid search, we aren’t relegated to testing every possible combination of hyperparameters, which increases efficiency.
Here’s a cool fact: random search will find (on average) a top 5% hyperparameter configuration within 60 iterations. That said, as with grid search you must transform your search space to reflect the functional form of each hyperparam.
Random search is a good baseline for hyperparameter optimization.
1.2.3 — Bayesian Optimization (better)
Our third candidate is our first Sequential Model-Based Optimization (SMBO) algorithm. The key conceptual difference from the prior techniques is we iteratively use prior runs to determine future points of exploration.
Bayesian hyperparameter optimization looks to develop a probabilistic distribution of our hyperparameter search space. From there, it uses an acquisition function, such as expected expected improvement, to transform our hyperspace to be more “searchable.” Finally, it uses an optimization algorithm, such as stochastic gradient descent, to find a the hyperparameters that maximize our acquisition function. Those hyperparameters are used to fit our model and the process is repeated until convergence.
Bayesian optimization typically outperforms random search, however it has some core limitations such as requiring numeric hyperparameters.
1.2.4 — Tree-Based Parzen Estimators (best)
Finally, let’s talk about the star of the show: Tree-Based Parzen Estimators (TPE). TPE is another SMBO algorithm that typically outperforms basic bayesian optimization, but the main selling point is it handles complex hyperparameter relationships via a tree structure.
Let’s use figure 5 to understand this tree structure. Here we’re training a Support Vector Machine (SVM) classifier. We will test two kernels:
linear kernel does not take a width parameter but
RBF does, so by using a nested dictionary we’re able to encode this structure and thereby limit the search space.
TPE also support categorical variables which traditional Bayesian optimization does not.
Quick disclaimer before moving on, there are many other hyperparameter tuning packages. Each support a variety of algorithms, some of which include random forest, gaussian processes, and genetic algorithms. TPE is a very popular and general-purpose algorithm, but is not necessarily the best.
In general, TPE is a really robust and efficient hyperparameter optimization solution.
Now that we have a general understanding of some popular hyperparameter optimization algorithms, let’s do a deep dive into how TPE works.
Returning to our analogy, we’re pirates looking for buried treasure but don’t have a map. Our captain needs the treasure ASAP, so we need to dig in strategic locations that have a high probability of having treasure, using prior digs to determine the location of future digs.
2.1 — Initialization
To start, we define the constraints on our space. As noted above, our hyperparameters often have a functional form, max/min values, and hierarchical relation to other hyperparameters. Using our knowledge about our ML algorithms and our data, we can define our search space.
Next, we must define our objective function, which is used to evaluate how “good” our hyperparameter combination is. Some examples include classic ML loss functions, such as RMSE or AUC.
Great! Now that we have a bounded search space and a way to measure success, we‘re ready to start searching…
2.2 — Iterative Bayesian Optimization
Bayesian optimization is a sequential algorithm that finds points in hyperspace with a high probability of being “successful” according to an objective function. TPE leverages bayesian optimization but uses some clever tricks to improve performance and handle search space complexity…
2.2.0 — The Conceptual Setup
The first trick is modeling P(x|y) instead of P(y|x)…
Bayesian optimization typically looks to model P(y|x), which is the probability of an objective function value (y), given hyperparameters (x). TPE does the opposite — it looks to model P(x|y), which is the probability of the hyperparameters (x), given the objective function value (y).
In short, TPE tries to find the best objective function values, then determine the associated hyperparameters.
With that very important setup, let’s get into the actual algorithm.
2.2.1 — Split our Data into “Good” and “Bad” Groups
Remember, our goal is to find the best hyperparameter values according to some objective function. So, how can we leverage P(x|y) to do that?
First, TPE splits our observed data points into two groups: good, denoted g(x), and bad, denoted l(x). The cutoff between good and bad is determined by a user-defined parameter gamma (γ), which corresponds to the objective function percentile that splits our observations (y*).
So, with γ = 0.5, our objective function value that splits our observations (y*) will be the median of our observed points.
As shown in figure 7, we can formalize p(x|y) using the above framework. And, to roll with the pirate analogy…
Pirate Perspective: looking at the places we’ve already explored, l(x) lists places with very little treasure and g(x) lists places with lots of treasure.
2.2.32— Calculate the “Promisingness” Score
Second, TPE defines how we should evaluate an unobserved hyperparameter combination via the “promisingness” score.
Figure 8 defines our promisingness score (P), which is just a ratio with the following components…
- Numerator: the probability of observing a set of hyperparameters (x), given the corresponding objective function value is “good.”
- Denominator: the probability of observing a set of hyperparameters (x), given the corresponding objective function value is “bad.”
The bigger the “promisingness” value, the more likely that our hyperparameters x will produce a “good” objective function.
Pirate Perspective: promisingness shows how likely a given location in our terrain will be to have lots of treasure.
Quick aside before moving on, if you’re familiar with Bayesian optimization, this equation acts as an acquisition function and is proportional to the Expected Improvement (EI).
2.2.3— Create a Probability Density Estimates
Third, TPE looks to evaluate the “promisingness” score via mixture models. The idea of mixture models is to take multiple probability distributions and put them together using a linear combination — src. These combined probability distributions are then used to develop probability density estimates.
Generally, the mixture modeling process is…
- Define the distribution type of our points. In our case, if our variable is categorical we use a re-weighted categorical distribution and if its numeric we use a gaussian (i.e. normal) or uniform distribution.
- Iterate over each point and insert a distribution at that point.
- Sum the mass of all distributions to get a probability density estimate.
Note that this process is run individually for both sets l(x) and g(x).
Let’s walk through an example in figure 9…
For each observation (blue dots on the x-axis), we create a normal distribution ~N(μ, σ), where…
- μ (mu) is the mean of our normal distribution. It’s value is the location of our point along the x-axis.
- σ (sigma) is the standard deviation of our normal distribution. Its value is the distance to the closest neighboring point.
If points are close together, the standard deviation will be small and thereby the distribution will be very tall and conversely, if points are spread apart, the distribution will be flat (figure 10)…
Pirate Perspective: NA — pirates aren’t great with mixture models.
Another quick aside before moving on: if you’re reading the literature you’ll notice that TPE uses “truncated” gaussians, which simply means the gaussians are bounded by the range we specify in our hyperparameter configuration instead of extending to +/- infinity.
2.2.4 — Determining the Next Point to Explore!
Let’s bring these pieces together. So far, we’ve 1) acquired objective function observations, 2) defined our “promisingness” formula, and 3) created a probability density estimate via mixture models based on prior values. We have all the pieces to evaluate a given point!
Our first step is to create an average probability density function (PDF) for both g(x) and l(x).
An example process is shown in figure 11 — the red line is our average PDF and is simply the sum of all PDFs divided by the number of PDFs.
Using the average PDF, we can get the probability of any hyperparameter value (x) being in g(x) or l(x).
For instance, let’s say observed values in figure 11 belong to the “good” set, g(x). Based on our average PDF, it’s unlikely that a hyperparameter value of 3.9 or 0.05 belong to the “good” set. Conversely, a hyperparameter value of ~1.2 seems to be very likely to belong to the “good” set.
Now this is just one half of the picture. We apply the same methodology for the “bad” set, l(x). Since we’re looking to maximize g(x) / l(x), promising points should be located where g(x) is high and l(x) is low.
Pretty cool, right?
With these probability distributions, we can sample from our tree-structured hyperparameters and find the set of hyperparameters that maximize “promisingness” and are thereby worth exploring.
Pirate Perspective: the next location we dig is the location that maximizes the (probability of having lots of treasure) / (probability of having little treasure).
Now that you know how it works, here are some practical tips for implementing TPE via the open-source package, HyperOpt.
3.1 — Structure of a HyperOpt App
Generally, there are three main steps when leveraging HyperOpt…
- Define the search space, which is just the ranges and functional forms of the hyperparameters you’re looking to optimize.
- Define the fitting function, which calls your
model.fit()function on a given train/test split.
- Define the objective function, which is any of the classic accuracy metrics, such as RMSE or AUC.
Unfortunately, these automated tuning methods still require design input from the data scientist — it’s not a completely free lunch. However, anecdotally TPE is pretty robust to hyperparameter misspecification (within reason).
3.2— Tips and Tricks
- HyperOpt is parallelizable via both Apache Spark and MongoDB. If you’re working with multiple cores, wether it be in the cloud or on your local machine, this can dramatically reduce runtime.
- If you’re parallelizing the tuning process via Apache Spark, use a
SparkTrialsobject for single node ML models (sklearn) and a
Trailsobject for parallelized ML models (MLlib). Code is below.
- MLflow is an open source method for tracking your model runs. It easily integrates with HyperOpt.
- Don’t narrow down the search space too early. Some combinations of hyperparameters may be surprisingly effective.
- Defining the search space can be tricky, especially if you don’t know the functional form of your hyperparameters. However, from personal experience TPE is pretty robust to misspecification of those functional forms.
- Choosing a good objective function goes a long way. In most cases, error is not created equal. If a certain type of error is more problematic, make sure to build that logic into to your function.
3.3— A Code Example
Some nice features of this snippet include parallelization via Apache Spark and model logging via MLflow. Also note that this snippet optimizes an sklearn RandomForestRegressor — you’ll have to change the model and fitting function to suite your needs.
And there you have it — HyperOpt in all it’s glory!
To hammer hope the key points, let’s quickly recap.
Hyperparameter tuning is a necessary part of the ML model lifecycle, but is time consuming. Sequential Model-Based Optimization (SMBO) algorithms excel at searching complex hyperspaces for optimums, and they can be applied to hyperparameter tuning. Tree-based Parzen Estimators (TPE) is a very efficient SMBO and outperforms both Bayesian Optimization and Random Search.
TPE repeats the below steps until a stopping criteria:
- Divide observed points into “good” and “bad” sets, according to some hyperparameter, gamma.
- Fit a mixture model to both the “good” and “bad” set to develop an average probability density estimate.
- Select the point that optimizes the “promisingness” score, which leverages step 2 to estimate the probability of being in the “good” and “bad” sets.
Finally, we have a really cool code snippet that shows how to parallelize HyperOpt via SparkTrials. It also logs all of our iterations to MLflow.
HyperOpt Demystified Republished from Source https://towardsdatascience.com/hyperopt-demystified-3e14006eb6fa?source=rss—-7f60cf5620c9—4 via https://towardsdatascience.com/feed