Boosting Algorithm (AdaBoost and XGBoost)
Ajeng Prastiwi, David Tahi Ulubalang

22 minute read

Introduction

What is boosting?

Boosting is an ensemble method that converts weak learners into strong learners. Weak and strong refer to a measure of how correlated the learners are with the actual target variable[^1]. In boosting, each training sample is used to train one decision tree, and samples are drawn with replacement, with higher weight given to previously misclassified data. Each tree learns from its predecessors and updates the residual errors.

Learning Objectives

The goal of this article is to help you:

  • Understand the concept of boosting
  • Compare boosting and bagging method
  • Understand how AdaBoost algorithm works
  • Understand how XGBoost algorithm works
  • Implement AdaBoost and XGBoost in a business case

Library and setup

library(tidyverse)
library(rsample)
library(xgboost)
library(ggthemes)
library(tictoc)
library(fastAdaboost)
library(tidymodels)
library(inspectdf)
library(caret)
theme_set(theme_pander())

Bagging vs Boosting

The idea of bagging is to create many subsets of the training sample with replacement, where each observation has the same probability of being picked. Each subset is then used to train one decision tree, and the final prediction is the average of all the trees' predictions. In boosting, each training sample is also drawn with replacement, but observations are weighted so that previously misclassified data are more likely to be picked. The trees learn from their predecessors and update the residual errors. After these weak learners are trained, a weighted average of their estimates is taken for the final prediction[^2].
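
As a rough illustration of this difference, here is a toy sketch (not the internals of any of the packages loaded above): bagging draws every bootstrap sample uniformly with replacement, while boosting re-samples or re-weights according to weights that grow for previously misclassified observations.

# Toy sketch of the sampling difference (illustration only)
set.seed(100)
n <- 10
obs_weights <- rep(1 / n, n)                  # boosting starts with equal weights

bag_sample   <- sample(1:n, size = n, replace = TRUE)                      # bagging: uniform
boost_sample <- sample(1:n, size = n, replace = TRUE, prob = obs_weights)  # boosting: weighted

# suppose observations 3 and 7 were misclassified in the first round:
obs_weights[c(3, 7)] <- obs_weights[c(3, 7)] * 2   # up-weight the hard cases
obs_weights <- obs_weights / sum(obs_weights)      # re-normalize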

Boosting Method

Boosting algorithms differ mainly in how they create the weak learners during the iterative process:

AdaBoost

Adaptive boosting (AdaBoost) was formulated by Yoav Freund and Robert Schapire. AdaBoost was the first practical boosting algorithm, and it remains one of the most widely used and studied, with applications in numerous fields. The AdaBoost algorithm works by changing the sample distribution, modifying the weights of the data points at each iteration.

How AdaBoost Works

We can split the idea of AdaBoost into three big concepts:

1. Use Stumps as Weak Learners

A weak learner is any model with an accuracy better than random guessing, even if only slightly better (e.g. 0.51). In ensemble methods, we combine multiple weak learners to build a strong learner. In AdaBoost, the weak learner is a 1-level decision tree (a stump). The main idea when creating a weak classifier is to find the best stump that separates the data by minimizing the overall error.
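
To make the idea of a stump concrete, here is a minimal sketch using the rpart package (not part of the setup libraries above) on the built-in iris data; a depth-1 tree is forced with maxdepth = 1:

library(rpart)
# a stump: a decision tree allowed to make only one split
stump <- rpart(Species ~ ., data = iris,
               control = rpart.control(maxdepth = 1))
stump   # prints the single "best" split found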

2. Influence the Next Stump

Unlike bagging, which builds models in parallel, boosting trains sequentially, which means that each stump (weak learner) is affected by the previous stump. A stump influences the next stump by giving different weights to the data that will be used in the next stump-making process. This weighting is based on error calculations: if an observation is incorrectly predicted by the first stump, it will be given a greater weight in the next stump-making process.

3. Weighted Vote

In the AdaBoost algorithm, each stump has a different weight, and the weight of each stump is based on its error rate. The smaller the error generated by a stump, the greater its weight. The weights of the stumps are then used in the voting process: the class that obtains the greatest total weight becomes the final predicted class.
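
The formulas behind concepts 2 and 3 are the standard AdaBoost updates. The numeric sketch below is an illustration of those formulas only, not necessarily the exact internals of fastAdaboost::adaboost(); all numbers are made up.

# stump weight ("amount of say") from its weighted error rate
err   <- 0.30
alpha <- 0.5 * log((1 - err) / err)   # ~0.42; smaller error -> larger alpha

# data weights: up-weight misclassified points, down-weight correct ones
w <- rep(1 / 5, 5)                                    # 5 observations, equal start
misclassified <- c(FALSE, TRUE, FALSE, FALSE, TRUE)
w <- ifelse(misclassified, w * exp(alpha), w * exp(-alpha))
w <- w / sum(w)                                       # re-normalize to sum to 1

# weighted vote: each stump votes with weight alpha, the largest total wins
stump_alphas  <- c(0.9, 0.4, 0.2)
stump_classes <- c("1", "0", "0")
tapply(stump_alphas, stump_classes, sum)              # "0": 0.6, "1": 0.9 -> predict "1"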

Case Example using AdaBoost

A hotel is one of the most common types of lodging used when traveling. With limited hotel capacity, a canceled reservation can be detrimental to the hotel. In this case, we will predict hotel booking cancellations using the Hotel Reservation Requests data taken from Kaggle.

booking <- read.csv("data_input/xgboost/hotel_bookings.csv", stringsAsFactors = T) 
head(booking)
#>          hotel is_canceled lead_time arrival_date_year arrival_date_month
#> 1 Resort Hotel           0       342              2015               July
#> 2 Resort Hotel           0       737              2015               July
#> 3 Resort Hotel           0         7              2015               July
#> 4 Resort Hotel           0        13              2015               July
#> 5 Resort Hotel           0        14              2015               July
#> 6 Resort Hotel           0        14              2015               July
#>   arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
#> 1                       27                         1                       0
#> 2                       27                         1                       0
#> 3                       27                         1                       0
#> 4                       27                         1                       0
#> 5                       27                         1                       0
#> 6                       27                         1                       0
#>   stays_in_week_nights adults children babies meal country market_segment
#> 1                    0      2        0      0   BB     PRT         Direct
#> 2                    0      2        0      0   BB     PRT         Direct
#> 3                    1      1        0      0   BB     GBR         Direct
#> 4                    1      1        0      0   BB     GBR      Corporate
#> 5                    2      2        0      0   BB     GBR      Online TA
#> 6                    2      2        0      0   BB     GBR      Online TA
#>   distribution_channel is_repeated_guest previous_cancellations
#> 1               Direct                 0                      0
#> 2               Direct                 0                      0
#> 3               Direct                 0                      0
#> 4            Corporate                 0                      0
#> 5                TA/TO                 0                      0
#> 6                TA/TO                 0                      0
#>   previous_bookings_not_canceled reserved_room_type assigned_room_type
#> 1                              0                  C                  C
#> 2                              0                  C                  C
#> 3                              0                  A                  C
#> 4                              0                  A                  A
#> 5                              0                  A                  A
#> 6                              0                  A                  A
#>   booking_changes deposit_type agent company days_in_waiting_list customer_type
#> 1               3   No Deposit  NULL    NULL                    0     Transient
#> 2               4   No Deposit  NULL    NULL                    0     Transient
#> 3               0   No Deposit  NULL    NULL                    0     Transient
#> 4               0   No Deposit   304    NULL                    0     Transient
#> 5               0   No Deposit   240    NULL                    0     Transient
#> 6               0   No Deposit   240    NULL                    0     Transient
#>   adr required_car_parking_spaces total_of_special_requests reservation_status
#> 1   0                           0                         0          Check-Out
#> 2   0                           0                         0          Check-Out
#> 3  75                           0                         0          Check-Out
#> 4  75                           0                         0          Check-Out
#> 5  98                           0                         1          Check-Out
#> 6  98                           0                         1          Check-Out
#>   reservation_status_date
#> 1              2015-07-01
#> 2              2015-07-01
#> 3              2015-07-02
#> 4              2015-07-02
#> 5              2015-07-03
#> 6              2015-07-03

The data contains 119390 observations and 32 variables. Here is a description of each feature:

  • hotel: Hotel (H1 = Resort Hotel or H2 = City Hotel)

  • is_canceled: Value indicating if the booking was canceled (1) or not (0)

  • lead_time: Number of days that elapsed between the entering date of the booking into the PMS and the arrival date

  • arrival_date_year: Year of arrival date

  • arrival_date_month: Month of arrival date

  • arrival_date_week_number: Week number of year for arrival date

  • arrival_date_day_of_month: Day of arrival date

  • stays_in_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

  • stays_in_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel

  • adults: Number of adults

  • children: Number of children

  • babies: Number of babies

  • meal: Type of meal booked. Categories are presented in standard hospitality meal packages:
    • Undefined/SC : no meal package;
    • BB : Bed & Breakfast;
    • HB : Half board (breakfast and one other meal-usually dinner);
    • FB : Full board (breakfast, lunch, and dinner)
  • country: Country of origin. Categories are represented in the ISO 3155-3:2013 format

  • market_segment: Market segment designation. In categories, the term “TA” means “Travel agents” and “TO” means “Tour Operators”

  • distribution_channel: Booking distribution channel. The term “TA” means “Travel Agents” and “TO” means “Tour Operators”

  • is_repeated_guest: Value indicating if the booking name was from a repeated guest (1) or not (0)

  • previous_cancellations: Number of previous bookings that were cancelled by the customer prior to the current booking

  • previous_bookings_not_canceled: Number of previous bookings not cancelled by the customer prior to the current booking

  • reserved_room_type: Code of room type reserved. Code is represented instead of designation for anonymity reasons

  • assigned_room_type: Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons

  • booking_changes: Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

  • deposit_type: Indication on if the customer made a deposit to guarantee the booking. This variable can assume three categories:
    • No deposit - no deposit was made;
    • Non refund - a deposit was made in the value of the total stay cost;
    • Refundable - a deposit was made with a value under the total cost of stay
  • agent: ID of the travel agency that made the booking

  • company: ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons

  • days_in_waiting_list: Number of days the booking was in the waiting list before it was confirmed to the customer

  • customer_type: Type of booking, assuming one of four categories:
    • Contract - when the booking has an allotment or other type of contract associated to it;
    • Group - when the booking is associated to a group;
    • Transient - when the booking is not part of a group or contract, and is not associated to other transient booking
    • Transient-party - when the booking is transient, but is associated to at least other transient booking
  • adr: Average daily rate as defined by dividing the sum of all lodging transactions by the total number of staying nights

  • required_car_parking_spaces: Number of car parking spaces required by the customer

  • total_of_special_requests: Number of special requests made by the customer (e.g. twin bed or high floor)

  • reservation_status: Reservation last status, assuming one of three categories:
    • Canceled - booking was canceled by the customer;
    • Check-out - customer has checked in but already departed;
    • No-Show - customer did not check in and did not inform the hotel of the reason why
  • reservation_status_date: Date at which the last status was set. This variable can be used in conjunction with the reservation status to understand when the booking was canceled or when the customer checked out of the hotel.

The prediction model will help the hotel predict whether a guest will cancel the booking or not. We will remove the variables agent and company because they have a lot of levels, and we will also remove reservation_status and reservation_status_date.

booking <- booking %>% 
          select(-c(reservation_status_date, agent, company,
                    reservation_status)) %>% 
          mutate(is_canceled = as.factor(is_canceled))

Exploratory Data Analysis

Before we go further, we need to check whether there are any missing values in the data. We can use the inspect_na() function from the inspectdf package to check for missing values.

booking %>% 
  inspect_na()
#> # A tibble: 28 x 3
#>    col_name                    cnt    pcnt
#>    <chr>                     <int>   <dbl>
#>  1 children                      4 0.00335
#>  2 hotel                         0 0      
#>  3 is_canceled                   0 0      
#>  4 lead_time                     0 0      
#>  5 arrival_date_year             0 0      
#>  6 arrival_date_month            0 0      
#>  7 arrival_date_week_number      0 0      
#>  8 arrival_date_day_of_month     0 0      
#>  9 stays_in_weekend_nights       0 0      
#> 10 stays_in_week_nights          0 0      
#> # ... with 18 more rows

From the result above, the children variable has missing values in 4 observations; let’s fill the missing values with 0.

booking <- booking %>% 
           mutate(children = replace_na(children,0))

Now let’s check the condition of the categorical variables using the inspect_cat() function.

booking %>% 
  inspect_cat()
#> # A tibble: 11 x 5
#>    col_name               cnt common     common_pcnt levels            
#>    <chr>                <int> <chr>            <dbl> <named list>      
#>  1 arrival_date_month      12 August            11.6 <tibble [12 x 3]> 
#>  2 assigned_room_type      12 A                 62.0 <tibble [12 x 3]> 
#>  3 country                178 PRT               40.7 <tibble [178 x 3]>
#>  4 customer_type            4 Transient         75.1 <tibble [4 x 3]>  
#>  5 deposit_type             3 No Deposit        87.6 <tibble [3 x 3]>  
#>  6 distribution_channel     5 TA/TO             82.0 <tibble [5 x 3]>  
#>  7 hotel                    2 City Hotel        66.4 <tibble [2 x 3]>  
#>  8 is_canceled              2 0                 63.0 <tibble [2 x 3]>  
#>  9 market_segment           8 Online TA         47.3 <tibble [8 x 3]>  
#> 10 meal                     5 BB                77.3 <tibble [5 x 3]>  
#> 11 reserved_room_type      10 A                 72.0 <tibble [10 x 3]>

From the result above, the country column has 178 unique values. We will reduce the number of unique values of country to 11 by keeping the 10 countries that appear most frequently and lumping the other countries into Other.

booking <- booking %>% 
  mutate(country = fct_lump_n(country, n = 10)) 

booking %>% 
  inspect_cat() %>% 
  show_plot()

Before we do the modeling, let’s first check the proportions of the target class to find out how balanced the target classes are.

booking %>% 
  pull(is_canceled) %>% 
  table() %>% 
  prop.table()
#> .
#>         0         1 
#> 0.6295837 0.3704163

The class labeled 0 makes up about 63% of the data while the class labeled 1 makes up 37%, which shows that the class labeled 0 is more dominant.

Modelling

We’ll create our training and testing data using the initial_split() function.

set.seed(100)
splitted <- initial_split(booking, prop = 0.8,strata = is_canceled)
data_train <- training(splitted)
data_test <- testing(splitted)

The function used to create the AdaBoost model is adaboost() from the fastAdaboost package. There are 3 parameters that can be filled in this function:
- formula: Formula for models
- data: Data used in the modeling process
- nIter: Number of stumps used on the model

model_ada <- adaboost(formula = is_canceled~.,
                      data = data_train, 
                      nIter = 100)

As we know, each stump in the model has a different weight; the weight of each stump can be seen in model_ada$weights. When the weights are visualized, we can see that stumps formed at the end of the iterations have smaller weights compared to stumps formed at the beginning.

plot_weights <- data.frame(stump_id = c(1:100), 
           weight = model_ada$weights) %>% 
  ggplot(aes(y = weight, x = stump_id)) +
  geom_col(fill = "dodgerblue3")
plot_weights

Now let’s predict the test dataset

pred_hotel <- predict(object = model_ada, newdata = data_test)
str(pred_hotel)
#> List of 5
#>  $ formula:Class 'formula'  language is_canceled ~ .
#>   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>  $ votes  : num [1:23877, 1:2] 22.2 23.6 14.2 22.5 17.4 ...
#>  $ class  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 2 ...
#>  $ prob   : num [1:23877, 1:2] 0.9 0.956 0.576 0.912 0.704 ...
#>  $ error  : num 0.113

The prediction object has several components:
- $votes : Total weighted votes achieved by each class
- $class : The class predicted by the classifier
- $prob : A matrix with the predicted probability of each class for each observation
- $error : The error on the test data, if labeled (1 - accuracy); a quick cross-check is shown below
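
Since $error is defined as 1 - accuracy on the labeled test data, we can cross-check it directly; the value should be close to the accuracy reported by the confusion matrix below.

1 - pred_hotel$error   # roughly 0.887, matching the accuracy below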

Now let’s check how good our model is using a confusion matrix:

confusionMatrix(data = pred_hotel$class, reference = data_test$is_canceled, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     0     1
#>          0 13725  1400
#>          1  1308  7444
#>                                               
#>                Accuracy : 0.8866              
#>                  95% CI : (0.8825, 0.8906)    
#>     No Information Rate : 0.6296              
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.7563              
#>                                               
#>  Mcnemar's Test P-Value : 0.08034             
#>                                               
#>             Sensitivity : 0.8417              
#>             Specificity : 0.9130              
#>          Pos Pred Value : 0.8505              
#>          Neg Pred Value : 0.9074              
#>              Prevalence : 0.3704              
#>          Detection Rate : 0.3118              
#>    Detection Prevalence : 0.3665              
#>       Balanced Accuracy : 0.8773              
#>                                               
#>        'Positive' Class : 1                   
#> 

Based on the confusion matrix above, we know that the accuracy of the model is 0.88. Since our data is dominated by the class labeled 0 (63%), we have to use another metric to find out how well our model predicts the two classes. We’re going to use the AUC.

pred_df <- pred_hotel$prob %>% 
  as.data.frame() %>% 
  rename(class0 = V1, 
         class1 = V2) %>% 
  mutate(predicted = pred_hotel$class, 
         actual = data_test$is_canceled)

auc_ada <- roc_auc(data = pred_df, truth = actual,class1) 
auc_ada
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.955

The AUC results show that the model is good at predicting the target class, as indicated by an AUC value of 0.95 (the closer to 1, the better).
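
If we also want to see the full ROC curve behind this AUC value, a short sketch using yardstick’s roc_curve() (mirroring the roc_auc() call above) would be:

roc_curve(data = pred_df, truth = actual, class1) %>% 
  autoplot()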

AdaBoost has several advantages: mainly, it is easier to use, with less need for parameter tweaking than algorithms like XGBoost. AdaBoost can also reduce variance on testing data.

XGBoost

XGBoost was formulated by Tianqi Chen and started as a research project within the Distributed (Deep) Machine Learning Community (DMLC) group. XGBoost is a popular algorithm because it has been the winning algorithm in a number of recent Kaggle competitions. XGBoost is a specific implementation of the Gradient Boosting Model which uses more accurate approximations to find the best tree model[^2]. In particular, XGBoost uses a more regularized model formalization to control overfitting, which gives it better performance.

How XGBoost Works

System Optimization: [^5]

  1. Parallelized tree building

XGBoost approaches the process of sequential tree building using a parallelized implementation.

  2. Tree pruning

Unlike GBM, where tree pruning stops once a negative loss is encountered, XGBoost grows the tree up to max_depth and then prunes backward until the improvement in the loss function is below a threshold.

  3. Cache awareness and out-of-core computing

XGBoost has been designed to efficiently reduce computing time and allocate an optimal usage of memory resources. This is accomplished through cache awareness, by allocating internal buffers in each thread to store gradient statistics. Further enhancements such as ‘out-of-core’ computing optimize available disk space while handling big data frames that do not fit into memory.

  4. Regularization

One of the biggest advantages of XGBoost is regularization. Regularization is a technique used to avoid overfitting in linear and tree-based models by limiting, regulating, or shrinking the estimated coefficients towards zero.

  5. Handles missing values

The algorithm handles missing values by learning the best default direction for them at each split. It uses a sparsity-aware split finding algorithm to handle different types of sparsity patterns in the data (a small sketch is shown after this list).

  6. Built-in cross validation

The algorithm comes with a built-in cross-validation method at each iteration, taking away the need to explicitly program this search and to specify the exact number of boosting iterations required in a single run.
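
As a small sketch of point 5: when building an xgb.DMatrix from a dense matrix, values equal to the missing argument (NA by default) are treated as missing, and the trees later learn a default direction for those rows (toy data, for illustration only).

toy_x <- matrix(c(1, NA, 3,
                  4,  5, NA), ncol = 3, byrow = TRUE)
toy_y <- c(1, 0)
toy_dmat <- xgb.DMatrix(data = toy_x, label = toy_y, missing = NA)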

Regularization and training loss

XGBoost adds a regularization term that controls the complexity of the model, which helps us avoid overfitting. The objective function measures how well the model fits the training data. It consists of two parts: the training loss and the regularization term:

\(obj(\theta )= L(\theta )+\Omega (\theta )\)

Where \(L\) is the training loss function and \(\Omega\) is the regularization term. The training loss function measures how well the model fits the training data, while \(\Omega\) reduces the complexity of the tree functions.[^3]
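
For tree models, the XGBoost documentation writes this regularization term in terms of the number of leaves \(T\) and the leaf weights \(w_j\), where \(\gamma\) and \(\lambda\) correspond to the gamma and lambda parameters of the library:

\(\Omega (\theta ) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2\)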

For a regression case, the training loss function can be the Mean Squared Error:

\(L(\theta ) = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2\)

For a classification case, a common training loss is the logistic loss:

\(L(\theta ) = \sum_i\left[y_i\ln(1+e^{-\hat{y}_i})+(1-y_i)\ln(1+e^{\hat{y}_i})\right]\)
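
To make the logistic loss concrete, here is a minimal sketch that evaluates it on toy values; the y_hat values here are raw log-odds scores (not probabilities) and the numbers are made up for illustration.

y     <- c(1, 0, 1, 1, 0)              # actual labels
y_hat <- c(2.1, -1.3, 0.4, 3.0, 0.2)   # raw (log-odds) predictions
sum(y * log(1 + exp(-y_hat)) + (1 - y) * log(1 + exp(y_hat)))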

Case Example using XGBoost

Modelling

set.seed(100)
splitted <- initial_split(booking, prop = 0.8,strata = is_canceled)
data_train <- training(splitted)
data_test <- testing(splitted)

Split the target variable into label_train and label_test

label_train <- as.numeric(as.character(data_train$is_canceled))
label_test <- as.numeric(as.character(data_test$is_canceled))

The most important thing when working with XGBoost is converting the data to an xgb.DMatrix object, because XGBoost requires a numeric matrix input for the features.

# convert data to matrix
train_matrix <- data.matrix(data_train[,-2])
test_matrix <- data.matrix(data_test[,-2])
# convert data to Dmatrix
dtrain <- xgb.DMatrix(data = train_matrix, label = label_train)
dtest <- xgb.DMatrix(data = test_matrix, label = label_test)

Tuning Parameters

There is no universal benchmark for the ideal parameters because they depend on your data and the specific problem. XGBoost parameters can be divided into three categories:[^6]

General Parameters

Controls the booster type in the model, which eventually drives the overall functioning.

  1. booster

For classification problems, we can use the gbtree booster. In gbtree, trees are grown one after another, and the algorithm attempts to reduce the misclassification rate in subsequent iterations. The next tree is built by giving a higher weight to the points misclassified by the previous tree.

For regression problems, we can use gbtree or gblinear. gblinear builds a generalized linear model and optimizes it using regularization and gradient descent. The next model is built on the residuals generated by previous iterations.

  2. nthread

Enables parallel computing. The default is the maximum number of cores available.

  3. verbosity

Verbosity of printed messages. The default value is 1 (warning); 0 is silent, 2 is info, and 3 is debug.

Booster Parameters:

Controls the performance of the selected booster

  1. eta

eta is the learning rate (step-size shrinkage). The range of eta is 0 to 1 and the default value is 0.3. It shrinks the contribution of each new tree to make the boosting process more conservative; a lower eta generally needs more boosting rounds, so computation is slower.

  2. gamma

gamma is the minimum loss reduction required to make a further split. The range of gamma is 0 to infinity and the default value is 0 (no regularization). The higher the gamma, the stronger the regularization; here regularization means penalizing splits that do not sufficiently improve the model’s performance.

  3. nrounds

Controls the maximum number of boosting iterations (the number of trees to grow).

  4. nfold

The data is randomly partitioned into nfold equal-sized subsamples for cross-validation (used by xgb.cv).

  5. max_depth

Maximum depth of a tree. The range of max_depth is 0 to infinity and the default value is 6; increasing this value makes the model more complex and more likely to overfit.

  6. min_child_weight

The range of min_child_weight is 0 to infinity and the default value is 1. If a leaf node has a sum of instance weights lower than min_child_weight during the tree partitioning step, the tree stops splitting further on that branch.

  7. subsample

The range of subsample is 0 to 1 and the default value is 1. It controls the fraction of observations sampled for each tree. A value of 0.5 means that XGBoost randomly samples half of the training data prior to growing each tree, which helps prevent overfitting. Subsampling occurs once in every boosting iteration.

  8. colsample_bytree

The range of colsample_bytree is 0 to 1 and the default value is 1. It controls the subsample ratio of columns when constructing each tree.

Learning Task Parameters

Set and evaluate the learning process of the booster from the given data.

  1. objective
  • reg:squarederror for regression with squared loss
  • binary:logistic for binary classification
  2. eval_metric

Evaluation metric for validation data. The default is RMSE for a regression case and error for a classification case.

Next, we define the parameters that will be used:

params <- list(booster = "gbtree",
               objective = "binary:logistic",
               eta=0.1, 
               gamma=10, 
               max_depth=10, 
               min_child_weight=1, 
               subsample=1, 
               colsample_bytree=1)

One of the simplest ways to see the training progress is to set the verbose option to TRUE.

tic()
xgbcv <- xgb.cv( params = params, 
                 data = dtrain,
                 nrounds = 1000, 
                 showsd = T, 
                 nfold = 5,
                 stratified = T, 
                 print_every_n = 50, 
                 early_stopping_rounds = 20, 
                 maximize = F)
#> [1]  train-error:0.159371+0.000575   test-error:0.162041+0.002767 
#> Multiple eval metrics are present. Will use test_error for early stopping.
#> Will train until test_error hasn't improved in 20 rounds.
#> 
#> [51] train-error:0.134139+0.000721   test-error:0.140127+0.002438 
#> [101]    train-error:0.122489+0.000533   test-error:0.132233+0.001973 
#> Stopping. Best iteration:
#> [114]    train-error:0.121790+0.000932   test-error:0.131793+0.002073
print(xgbcv)
#> ##### xgb.cv 5-folds
#>     iter train_error_mean train_error_std test_error_mean test_error_std
#>        1        0.1593710    0.0005746874       0.1620406    0.002766615
#>        2        0.1589574    0.0008488257       0.1616114    0.002560948
#>        3        0.1582874    0.0008635417       0.1612242    0.002240224
#>        4        0.1574078    0.0012001357       0.1600936    0.002275888
#>        5        0.1563270    0.0017212650       0.1593920    0.001805679
#> ---                                                                     
#>      130        0.1217530    0.0009838978       0.1318562    0.002019269
#>      131        0.1217504    0.0009883673       0.1318562    0.002019269
#>      132        0.1217504    0.0009883673       0.1318562    0.002019269
#>      133        0.1217504    0.0009883673       0.1318562    0.002019269
#>      134        0.1217504    0.0009883673       0.1318562    0.002019269
#> Best iteration:
#>  iter train_error_mean train_error_std test_error_mean test_error_std
#>   114        0.1217898    0.0009316736       0.1317934    0.002072701
toc()
#> 72.96 sec elapsed
tic()
xgb1 <- xgb.train (params = params, 
                   data = dtrain, 
                   nrounds = xgbcv$best_iteration, 
                   watchlist = list(val=dtest,train=dtrain),
                   print_every_n = 100, 
                   early_stopping_rounds = 10, 
                   maximize = F , 
                   eval_metric = "error",
                   verbosity = 0)
#> [1]  val-error:0.157641  train-error:0.158900 
#> [101]    val-error:0.125644  train-error:0.121533 
#> [114]    val-error:0.124011  train-error:0.119858
toc()
#> 15.86 sec elapsed
xgbpred_prob <- predict(object = xgb1, newdata = dtest)
xgbpred <- ifelse(xgbpred_prob > 0.5, 1, 0)

In this section, we evaluate the performance of the XGBoost model.

confusionMatrix(as.factor(xgbpred), as.factor(label_test))
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     0     1
#>          0 13902  1830
#>          1  1131  7014
#>                                                
#>                Accuracy : 0.876                
#>                  95% CI : (0.8717, 0.8801)     
#>     No Information Rate : 0.6296               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.7297               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.9248               
#>             Specificity : 0.7931               
#>          Pos Pred Value : 0.8837               
#>          Neg Pred Value : 0.8611               
#>              Prevalence : 0.6296               
#>          Detection Rate : 0.5822               
#>    Detection Prevalence : 0.6589               
#>       Balanced Accuracy : 0.8589               
#>                                                
#>        'Positive' Class : 0                    
#> 

Let’s check the variable importance from the model:

var_imp <- xgb.importance(model = xgb1,
                          feature_names = dimnames(dtrain)[[2]])
var_imp %>% 
  mutate_if(is.numeric, round, digits = 2)
#>                            Feature Gain Cover Frequency
#>  1:                   deposit_type 0.35  0.07      0.01
#>  2:                        country 0.13  0.12      0.10
#>  3:                      lead_time 0.09  0.11      0.13
#>  4:                 market_segment 0.07  0.06      0.04
#>  5:      total_of_special_requests 0.06  0.06      0.03
#>  6:    required_car_parking_spaces 0.05  0.08      0.02
#>  7:         previous_cancellations 0.04  0.05      0.03
#>  8:              arrival_date_year 0.04  0.04      0.06
#>  9:                            adr 0.03  0.10      0.14
#> 10:                  customer_type 0.02  0.03      0.04
#> 11:       arrival_date_week_number 0.02  0.04      0.09
#> 12:             reserved_room_type 0.02  0.03      0.03
#> 13:                booking_changes 0.01  0.03      0.03
#> 14:             assigned_room_type 0.01  0.05      0.03
#> 15: previous_bookings_not_canceled 0.01  0.02      0.02
#> 16:                          hotel 0.01  0.02      0.02
#> 17:           stays_in_week_nights 0.01  0.02      0.03
#> 18:             arrival_date_month 0.01  0.01      0.02
#> 19:      arrival_date_day_of_month 0.01  0.01      0.03
#> 20:        stays_in_weekend_nights 0.01  0.01      0.02
#> 21:                           meal 0.01  0.02      0.02
#> 22:           distribution_channel 0.00  0.01      0.01
#> 23:                         adults 0.00  0.01      0.02
#> 24:                       children 0.00  0.01      0.01
#> 25:              is_repeated_guest 0.00  0.01      0.01
#> 26:           days_in_waiting_list 0.00  0.01      0.01
#> 27:                         babies 0.00  0.00      0.00
#>                            Feature Gain Cover Frequency

The xgb.importance() function displays importance values calculated with different importance metrics:

  • The gain value represents the percentage contribution of the feature across all trees in the model

  • The cover value represents the percentage of observations covered by each feature across all trees. For example, if we have 100 observations and 3 trees, and the trees have 5, 8, and 10 observations in nodes that use feature “A”, then feature “A” covers 5+8+10 = 23 observations across all trees. In this case, feature “A” has a cover value of 0.23 (see the short sketch after this list).

  • The frequency value represents the percentage of times a feature is used to split in the trees of the model. For example, if feature “A” occurred in 3 splits, 2 splits, and 2 splits across the trees, the frequency of feature “A” is 3+2+2 = 7 splits divided by the total number of splits over all features.
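
A quick arithmetic check of the toy numbers used in the two examples above:

cover_A  <- (5 + 8 + 10) / 100   # feature "A" covers 23 of 100 observations -> 0.23
splits_A <- 3 + 2 + 2            # 7 splits; frequency = splits_A / total splits over all features
cover_A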

xgb.ggplot.importance(var_imp,top_n = 10) + theme_minimal()

The graph shows the variable importance using the gain value by default, and it also displays clusters of features that have similar importance values. The 10 features above have the most significant impact on the prediction results.

Next, we evaluate the model performance using the AUC of the ROC curve:

xgb_result <- data.frame(class1 = xgbpred_prob, actual = as.factor(label_test))

auc_xgb <- roc_auc(data = xgb_result, truth = actual,class1) 

result <- rbind(auc_ada, auc_xgb) %>% 
          mutate(model = c("AdaBoost", "XGBoost")) %>% 
          select(model, everything())
result
#> # A tibble: 2 x 4
#>   model    .metric .estimator .estimate
#>   <chr>    <chr>   <chr>          <dbl>
#> 1 AdaBoost roc_auc binary         0.955
#> 2 XGBoost  roc_auc binary         0.948

The AUC results show that the AdaBoost and XGBoost models have similar values, 0.955 and 0.948. However, to obtain the AdaBoost model we needed to train for about 60 minutes, while the XGBoost model only needed around 60 seconds. We can say that XGBoost works better than AdaBoost in terms of speed.

Conclusion

In this article, we described how to build AdaBoost and XGBoost models and how they work. We can conclude several points:

  • Both algorithms are built on the idea of converting weak learners into a strong learner

  • AdaBoost has only a few hyperparameters to tune, and the model is easy to understand and to visualize

  • The choice of algorithm depends on our data set; for low-noise data where timeliness of results is not the main concern, we can use the AdaBoost model

  • For complex and high-dimensional data, XGBoost performs better than AdaBoost because of its system optimizations.
