Extreme Gradient Boosting (xgboost) is a very fast, scalable implementation of gradient boosting that has taken the data science world by storm, with xgboost regularly online data science competitions and use at scale across different industries. Xgboost was originally developed by Tiangi Chen and is renowned for execution speed and model performance. I have recently been conducting some experiments with xgboost for the Renewables AI products. Below I show how to run a simple regression type tree and linear based model. Thereafter we go on to explore grid search and random search with xgboost.
But first a little background information…
Boosting is what gives xgboost it’s state of the art performance. Boosting is not a specific machine learning algorithm, but a concept that can be applied to set of machine learning algorithms, hence boosting is known as a meta algorithm. Essentially, xgboost is an ensemble method, used to convert many weak learners (models performing slightly better than chance) to a strong learner. This is achieved via boosting, where a set of weak learners on subsets of the data is iteratively learnt. Each weak learner is weighted according to performance. Thereafter, each weak learner’s predictions are combined and multiplied by their weight to obtain a final weighted prediction, which is better than any of the individual predictions themselves.
The Python API is capable of running the xgboost on regression and classification problems, using decision tree and linear learners. Below we apply xgboost to regression type problem using a tree-based learner. Decision trees are an iterative contruction of binary decisions (one decision at a time) until a stopping criterion is met (ie. The majority of one decision split consists of one category/value or another). Individual trees tend to overfit (low bias, high variance), hence perform well on training data but don’t generalise as well, hence ensemble methods are useful in this scenario. Notice as this is a regression type problem we use the loss function "reg:linear", whereas for a classification problem we would use "reg:logistic" or "binary: logistic" depending on whether you are interested in the class or the probability of the class. A loss function maps the difference between the actual and predicted values - we aim to find the model with the lowest loss function.
Like many other algorithms, performance can be enhanced by tuning the hyperparameters. Below shows an example of a xgboost grid search. Grid search can be quite computationally expensive as we exhaustively search over a given set of hyperparameters, and pick the best performing hyperparameters. For example, if we have 2 hyperparameters to tune and 4 possible values for each parameter, that’s 16 possible parameter configurations. An alterative to grid search is random search, where you can define how many models/iterations to try before stopping. During each iteration, the algorithm randomly selects a value in the range specified for each hyperparameter.