    darshna · Jun 06, 2018 (edited Jun 07, 2018)

    A peek into xgboost with Python

    Extreme Gradient Boosting (xgboost) is a very fast, scalable implementation of gradient boosting that has taken the data science world by storm, with xgboost regularly winning online data science competitions and being used at scale across different industries. Xgboost was originally developed by Tianqi Chen and is renowned for its execution speed and model performance. I have recently been conducting some experiments with xgboost for the Renewables AI products. Below I show how to run a simple regression-type model with both tree and linear base learners. Thereafter we go on to explore grid search and random search with xgboost.


    But first, a little background information…


    Boosting is what gives xgboost its state-of-the-art performance. Boosting is not a specific machine learning algorithm, but a concept that can be applied to a set of machine learning algorithms, hence boosting is known as a meta-algorithm. Essentially, xgboost is an ensemble method, used to convert many weak learners (models performing only slightly better than chance) into a strong learner. This is achieved via boosting, where a set of weak learners is iteratively trained on subsets of the data and each weak learner is weighted according to its performance. The weak learners' predictions are then combined, multiplied by their weights, to obtain a final weighted prediction that is better than any of the individual predictions themselves.
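    As a rough illustration of that idea (not xgboost's actual training procedure), a weighted combination of weak learners might look like the sketch below; the toy data, the choice of depth-1 trees and the R²-based weights are all made up for the example:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    
    # Toy data purely for illustration
    rng = np.random.RandomState(0)
    X = rng.uniform(0, 10, size=(200, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)
    
    # Train a few shallow ("weak") trees, each on a random subset of the data
    weak_learners, weights = [], []
    for i in range(5):
        idx = rng.choice(len(X), size=100, replace=False)
        stump = DecisionTreeRegressor(max_depth=1).fit(X[idx], y[idx])
        weak_learners.append(stump)
        # Weight each learner by its (crude) performance on the full data
        weights.append(stump.score(X, y))
    
    weights = np.array(weights) / np.sum(weights)
    
    # Final prediction = weighted combination of the weak learners' predictions
    ensemble_pred = sum(w * m.predict(X) for w, m in zip(weights, weak_learners))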


    The Python API is capable of running xgboost on regression and classification problems, using decision tree and linear learners. Below we apply xgboost to a regression-type problem using a tree-based learner. Decision trees are an iterative construction of binary decisions (one decision at a time) until a stopping criterion is met (i.e. the majority of one side of a split consists of one category/value or another). Individual trees tend to overfit (low bias, high variance), so they perform well on training data but don't generalise as well; ensemble methods are useful in this scenario. Notice that as this is a regression-type problem we use the loss function "reg:linear", whereas for a classification problem we would use "reg:logistic" or "binary:logistic", depending on whether you are interested in the class or the probability of the class. A loss function maps the difference between the actual and predicted values; we aim to find the model with the lowest loss.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.metrics import r2_score
    from sklearn.metrics import mean_squared_error
    import xgboost as xgb
    from xgboost import plot_tree
    
    # Load the training and test sets
    X_train = pd.read_csv("X_train.csv")
    Y_train = pd.read_csv("Y_train.csv")
    X_test = pd.read_csv("X_test.csv")
    Y_test = pd.read_csv("Y_test.csv")
    
    # Check the feature names
    list(X_train)
    list(X_test)
    
    ####################################
    # XGBoost Decision Tree
    ####################################
    xg_reg = xgb.XGBRegressor(objective='reg:linear', n_estimators=10, seed=123)
    xg_reg.fit(X_train, Y_train)
    
    # The fitted estimator and its default parameters:
    # XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
    #        colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
    #        max_depth=3, min_child_weight=1, missing=None, n_estimators=10,
    #        n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
    #        reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=123,
    #        silent=True, subsample=1)
    
    # Evaluate on the test set
    preds = xg_reg.predict(X_test)
    rmse = np.sqrt(mean_squared_error(Y_test, preds))
    print("RMSE: %f" % (rmse))  # RMSE: 164.866642
    
    r2 = r2_score(Y_test, preds)
    
    # Plot the first tree
    xgb.plot_tree(xg_reg, num_trees=0)
    plt.show()
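    For completeness, the classification case mentioned above uses the same API. A minimal sketch is shown below; the file names and the "label" column are assumptions for illustration, not from the original experiment:

    # Sketch only: assumes a binary target column named "label" in train.csv / test.csv
    import pandas as pd
    import xgboost as xgb
    from sklearn.metrics import accuracy_score
    
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")
    X_tr, y_tr = train.drop("label", axis=1), train["label"]
    X_te, y_te = test.drop("label", axis=1), test["label"]
    
    xg_cls = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
    xg_cls.fit(X_tr, y_tr)
    
    pred_class = xg_cls.predict(X_te)        # predicted class labels
    pred_proba = xg_cls.predict_proba(X_te)  # class probabilities
    print("Accuracy: %f" % accuracy_score(y_te, pred_class))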

    In the following code we apply xgboost to a regression-type problem using linear base learners.


    #####################################
    # XGBoost Linear Regression
    #####################################
    # The learning API works on DMatrix objects, xgboost's internal data structure
    DM_train = xgb.DMatrix(data=X_train, label=Y_train)
    DM_test = xgb.DMatrix(data=X_test, label=Y_test)
    
    # Use a linear base learner instead of the default tree booster
    params = {"booster": "gblinear", "objective": "reg:linear"}
    xg_reg = xgb.train(params=params, dtrain=DM_train, num_boost_round=10)
    
    preds = xg_reg.predict(DM_test)
    rmse = np.sqrt(mean_squared_error(Y_test, preds))
    print("RMSE: %f" % (rmse))  # RMSE: 169.731848
    
    r2 = r2_score(Y_test, preds)

    Like many other algorithms, performance can be enhanced by tuning the hyperparameters. Below is an example of an xgboost grid search. Grid search can be quite computationally expensive, as we exhaustively search over a given set of hyperparameters and pick the best performing combination. For example, if we have 2 hyperparameters to tune and 4 possible values for each, that's 16 possible parameter configurations. An alternative to grid search is random search, where you define how many models/iterations to try before stopping; during each iteration the algorithm randomly selects a value within the range specified for each hyperparameter.

    import pandas as pd
    import xgboost as xgb
    import numpy as np
    from sklearn.model_selection import GridSearchCV
    
    housing_data = pd.read_csv("ames_housing_trimmed_processed.csv")
    # All columns except the last are features; the last column is the target
    X, y = housing_data[housing_data.columns.tolist()[:-1]], \
           housing_data[housing_data.columns.tolist()[-1]]
    housing_dmatrix = xgb.DMatrix(data=X, label=y)
    
    gbm_param_grid = {'learning_rate': [0.01, 0.1, 0.5, 0.9],
                      'n_estimators': [200],
                      'subsample': [0.3, 0.5, 0.9]}
    gbm = xgb.XGBRegressor()
    grid_mse = GridSearchCV(estimator=gbm,
                            param_grid=gbm_param_grid,
                            scoring='neg_mean_squared_error', cv=4, verbose=1)
    grid_mse.fit(X, y)
    print("Best parameters found: ", grid_mse.best_params_)
    print("Lowest RMSE: ", np.sqrt(np.abs(grid_mse.best_score_)))

    A quick overview of the hyperparameters that can be tuned for tree-based models (an example parameter dictionary is shown after the lists below):

    - eta/learning_rate: how quickly the model fits the residual error using additional base learners

    - gamma: minimum loss reduction required to create a new tree split

    - lambda: L2 regularisation on leaf weights

    - alpha: L1 regularisation on leaf weights

    - max_depth: how big a tree can grow

    - subsample: the fraction of samples that can be used in any given boosting round

    - colsample_bytree: the fraction of features that can be used in any given boosting round (ranges from 0-1)
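
    As a rough illustration, these tree-booster hyperparameters can be passed to xgb.train via the params dictionary; the values below are arbitrary placeholders rather than recommendations:

    tree_params = {"booster": "gbtree",
                   "objective": "reg:linear",
                   "eta": 0.1,             # learning rate
                   "gamma": 1,             # minimum loss reduction for a split
                   "lambda": 1,            # L2 regularisation on leaf weights
                   "alpha": 0,             # L1 regularisation on leaf weights
                   "max_depth": 4,
                   "subsample": 0.8,
                   "colsample_bytree": 0.8}
    xg_tree = xgb.train(params=tree_params, dtrain=DM_train, num_boost_round=10)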


    An overview of the hyperparameters that can be tuned for linear learners (again, an example follows the list):

    - lambda: L2 regularisation on weights

    - alpha: L1 regularisation on weights

    - lambda_bias: L2 regularisation term on the bias
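
    A corresponding sketch for the linear booster, again with arbitrary placeholder values:

    linear_params = {"booster": "gblinear",
                     "objective": "reg:linear",
                     "lambda": 1,         # L2 regularisation on weights
                     "alpha": 0,          # L1 regularisation on weights
                     "lambda_bias": 0}    # L2 regularisation on the bias
    xg_lin = xgb.train(params=linear_params, dtrain=DM_train, num_boost_round=10)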


    Another useful blog post related to xgboost can be found here.


    Happy experimenting!
