In Data Science, there are many algorithms available these days. One useful technique, therefore, is to combine them in a single model to get the best out of each, resulting in a more accurate model.

In Scikit-Learn, you will find the Random Forest algorithm, which is the bagging kind of ensemble model. You will also find Boosting models, which train their estimators in sequence: the result of one model is passed to the next one, which tries to improve on the predictions, until they reach an optimal result.

When creating a Gradient Boosting estimator, you will find the hyperparameter `n_estimators=100`, with a default value of 100 trees to be created to get to a result. Many times, we just leave this at the default or increase it as needed, sometimes using Grid Search techniques.

In this post, we will see a simple way to arrive at a single number of estimators to use when training our model.

Gradient Boosting can be loaded from Scikit-Learn using this class: `from sklearn.ensemble import GradientBoostingRegressor`. The Gradient Boosting algorithm can be used either for classification or for regression models. It is a tree-based estimator, meaning that it is composed of many decision trees.

The result of *Tree 1* will contain errors. Those errors are used as the input for *Tree 2*. Once again, the errors of the last model are used as the input of the next one, until the number of trees reaches the `n_estimators` value.

Since each estimator fits the error of the previous one, the expectation is that the combination of the predictions will be better than any single estimator's alone. On the flip side, each iteration makes the model more complex, reducing bias but increasing variance. So we must know when to stop.
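The residual-fitting idea described above can be sketched by hand with three small decision trees. This is only a minimal illustration on synthetic data, not how `GradientBoostingRegressor` is implemented internally; the data, the tree depth, and the helper name `boosted_predict` are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumed data, for illustration only)
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + rng.normal(0, 0.05, size=200)

trees = []
residual = y.copy()
for _ in range(3):
    # Each tree is trained on what the previous trees failed to explain
    tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residual)
    trees.append(tree)
    residual = residual - tree.predict(X)

def boosted_predict(X_new, n_trees):
    """Sum the predictions of the first `n_trees` trees (hypothetical helper)."""
    return sum(tree.predict(X_new) for tree in trees[:n_trees])

# Training error of the ensemble as trees are added
train_mse = [np.mean((y - boosted_predict(X, k)) ** 2) for k in range(1, 4)]
print(train_mse)
```

The training error can only go down as trees are added, which is exactly why it cannot tell us when to stop; for that we need data the model has not seen.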

Let’s see how to do that now.

The code for this exercise is simple. All we must do is loop over the iterations and check at which one we had the lowest error.

Let’s begin by choosing a dataset. We will use the *car_crashes* dataset, which ships with the seaborn library (so it is open data under the BSD license).

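`# Imports`

import numpy as np

import seaborn as sns

from sklearn.ensemble import GradientBoostingRegressor

from sklearn.metrics import mean_squared_error

from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline

from sklearn.preprocessing import StandardScaler

`# Dataset`

df = sns.load_dataset('car_crashes')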

Here’s a quick look at the data. We will try to estimate the `total` amount using the other features as predictors. Since it’s a real number output, we’re talking about a regression model.

Quickly looking at the correlations (only the numeric columns, since `abbrev` holds text).

`# Correlations`

df.corr(numeric_only=True).style.background_gradient(cmap='coolwarm')

Ok, no major multicollinearity. We can see that `ins_premium` and `ins_losses` don’t correlate very well with the `total`, so we will not consider them in the model.
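The kind of correlation-based filtering described here can be sketched on a toy frame. The column names `a` and `b`, the values, and the 0.5 cutoff are all hypothetical; the article does not state a numeric threshold.

```python
import pandas as pd

# Toy data (assumed values): `a` tracks `total` closely, `b` barely does
toy = pd.DataFrame({
    "total": [1, 2, 3, 4, 5, 6],
    "a": [2, 4, 6, 8, 10, 12],
    "b": [1, -1, 1, -1, 1, -1],
})

threshold = 0.5  # hypothetical cutoff, not taken from the article

# Absolute correlation of every feature with the target
corr_with_target = toy.corr()["total"].drop("total").abs()

# Keep only the features whose correlation reaches the cutoff
selected = corr_with_target[corr_with_target >= threshold].index.tolist()
print(selected)
```

Here only `a` survives the filter, because `b` has an absolute correlation of roughly 0.29 with `total`.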

If we check for missing data, there is none.

`# Missing`

df.isnull().sum()

0

Nice, so let’s split the data now.

# X and y

X = df.drop(['ins_premium', 'ins_losses', 'abbrev', 'total'], axis=1)

y = df['total']

# Train test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=22)

We can create a pipeline to scale the data and model it (*scaling is not really necessary here: the features are already on a similar scale, in the tens, and tree-based models are insensitive to feature scaling*). Next, we fit the data to the model and predict the results.

I am using 500 estimators with a `learning_rate` of 0.3.

The learning rate is the size of the step we take to get to the minimum error. If we use a value that is too high, we may pass the minimum. If we use a number that is too small, we may not even get close to it. So, a rule of thumb you can consider is: if you have a large number of estimators, you can use lower values of learning rate. If you have just a few estimators, prefer using higher values of learning rate.
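The rule of thumb can be illustrated with a small experiment. This is a sketch on synthetic data with an arbitrarily small budget of 20 trees; the data and the two learning rates compared are assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data (assumed, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(300, 1))
y = 3 * X[:, 0] ** 2 + rng.normal(0, 0.05, size=300)

n_trees = 20  # deliberately small budget of estimators

train_mse = {}
for lr in (0.03, 0.3):
    model = GradientBoostingRegressor(
        n_estimators=n_trees, learning_rate=lr, random_state=0
    ).fit(X, y)
    train_mse[lr] = mean_squared_error(y, model.predict(X))

# With only a few trees, the small learning rate has not taken enough
# steps to get close to the minimum training error
print(train_mse)
```

With so few trees, the model with the lower learning rate is still far from the minimum, which is why a small learning rate is usually paired with a larger number of estimators.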

# Pipeline steps: scale, then model

steps = [('scale', StandardScaler()),

('GBR', GradientBoostingRegressor(n_estimators=500, learning_rate=0.3))]

# Instance Pipeline and fit

pipe = Pipeline(steps).fit(X_train, y_train)

# Predict

preds = pipe.predict(X_test)

Now, evaluating.

# RMSE of the predictions

print(f'RMSE: {round(np.sqrt(mean_squared_error(y_test, preds)), 1)}')

[OUT]: RMSE: 1.1

# Mean of the true Y values

print(f'Data y mean: {round(y.mean(), 1)}')

[OUT]: Data y mean: 15.8

Good. Our RMSE is about 6.9% of the mean. So we’re off by this much, on average.
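To make the calculation behind that percentage explicit, here is a minimal sketch in plain Python. The toy values are assumptions and are not taken from the car_crashes data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy values (assumed, not from the car_crashes data)
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 31.0]

error = rmse(y_true, y_pred)            # sqrt((4 + 4 + 1) / 3) = sqrt(3)
mean_y = sum(y_true) / len(y_true)      # 20.0
relative_error = 100 * error / mean_y   # RMSE as a percentage of the mean

print(f"RMSE: {error:.2f}, relative: {relative_error:.1f}%")
```

Applied to the rounded figures printed above, 1.1 / 15.8 comes to roughly 7.0%, so the 6.9% quoted in the text presumably reflects the unrounded values.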

Now let’s check a way to tune our model by choosing the optimal number of estimators to train that will give us the lowest error rate.

Like I said, we don’t really have to scale this data because it is in the same proportion already. So let’s fit the model.

`#Model`

gbr = GradientBoostingRegressor(n_estimators=500, learning_rate=0.3).fit(X_train, y_train)

Now comes the good stuff. Gradient Boosting has a method that lets us iterate over the predictions the ensemble makes at each stage of training, from 1 to 500 estimators. So, we will create a loop that goes through the 500 stages of the `gbr` model, predicts results using the method `staged_predict()`, calculates the mean squared error, and stores the result in the list `errors`.

# Loop for the best number

errors = [mean_squared_error(y_test, preds) for preds in gbr.staged_predict(X_test)]

# Optimal number of estimators

optimal_num_estimators = np.argmin(errors) + 1
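The same loop can be tried end to end without the dataset. The sketch below runs on synthetic data and, as a variation on the article's approach, measures the error on a separate validation split rather than on the test set; the data, the split, and the names `val_errors` and `best_n` are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumed, for illustration only)
rng = np.random.default_rng(22)
X = rng.uniform(-0.5, 0.5, size=(400, 1))
y = 3 * X[:, 0] ** 2 + rng.normal(0, 0.05, size=400)

# Hold out a validation set that is separate from any final test set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=22
)

gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.3, random_state=22)
gbr.fit(X_train, y_train)

# staged_predict yields the ensemble prediction after 1, 2, ..., 200 trees
val_errors = [mean_squared_error(y_val, p) for p in gbr.staged_predict(X_val)]

# argmin is zero-based, so add 1 to turn the index into a number of trees
best_n = int(np.argmin(val_errors)) + 1
print(best_n, val_errors[best_n - 1])
```

Choosing the number of trees on a validation split keeps the test set untouched for the final evaluation, at the cost of training on slightly less data.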

Next, we can plot the result.

`#Plot`

g = sns.lineplot(x=np.arange(1, len(errors) + 1), y=errors)

g.set_title(f'Best number of estimators at {optimal_num_estimators}', size=15);

We see that the lowest error rate is with 34 estimators. So, let’s retrain our model with 34 estimators and compare with the result from the model trained with the pipeline.

# Retrain

gbr = GradientBoostingRegressor(n_estimators=34, learning_rate=0.3).fit(X_train, y_train)

# Predictions

preds2 = gbr.predict(X_test)

Evaluating…

# RMSE of the predictions

print(f'RMSE: {round(np.sqrt(mean_squared_error(y_test, preds2)), 1)}')

[OUT]: RMSE: 1.0

# Data Y mean

print(f'Data y mean: {round(y.mean(), 1)}')

[OUT]: Data y mean: 15.8

We went from an RMSE of 6.9% of the mean down to 6.3%, which is approximately 9% better. Let’s look at a few predictions.
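The "approximately 9%" figure is a relative improvement. A quick check of the arithmetic, using only the two percentages quoted in the text:

```python
# Relative RMSE values quoted in the text (percent of the mean)
before = 6.9
after = 6.3

# Relative improvement: share of the original error that was removed
improvement = 100 * (before - after) / before
print(f"{improvement:.1f}% lower error")
```

The result is about 8.7%, which rounds to the 9% mentioned above.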

Interesting results. Some of the predictions of the second model are better than the first one.

We learned how to determine the best number of estimators to tweak a `GradientBoostingRegressor` from Scikit-Learn. This is a hyperparameter that can make a difference in this kind of ensemble model, which trains estimators in sequence.

Sometimes, after a few iterations, the model starts to overfit: its variance increases too much, which hurts the predictions.

We saw that a simple loop can help us find the optimal solution in this case. But, certainly, for large datasets it can be expensive to calculate, so one idea is to try a lower `n_estimators` at first and see if you can reach the minimum error soon enough.
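A different way to limit that cost, not used in this article, is the early stopping built into `GradientBoostingRegressor`: with `n_iter_no_change` set, training holds out a share of the training data and stops once the validation score has not improved for that many iterations. The sketch below runs on synthetic data, and the specific values for `validation_fraction` and `n_iter_no_change` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (assumed, for illustration only)
rng = np.random.default_rng(7)
X = rng.uniform(-0.5, 0.5, size=(500, 1))
y = 3 * X[:, 0] ** 2 + rng.normal(0, 0.05, size=500)

gbr = GradientBoostingRegressor(
    n_estimators=500,          # upper bound on the number of trees
    learning_rate=0.3,
    validation_fraction=0.2,   # share of the training data held out internally
    n_iter_no_change=10,       # stop after 10 rounds without improvement
    random_state=7,
).fit(X, y)

# n_estimators_ reports how many trees were actually fitted
print(gbr.n_estimators_)
```

This avoids fitting all 500 trees just to discover that far fewer were enough, although the stopping point it picks will not necessarily match the one found with the `staged_predict()` loop.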

Here’s the complete code in GitHub.


This exercise was based on the excellent textbook by Aurélien Géron, in the reference.

How to choose the number of estimators for Gradient Boosting Republished from Source https://towardsdatascience.com/how-to-choose-the-number-of-estimators-for-gradient-boosting-8d06920ab891?source=rss—-7f60cf5620c9—4 via https://towardsdatascience.com/feed
