15  Boosting Methods

Abstract
TODO (150-200 WORDS)
Major changes expected!

This page is a work in progress and major changes will be made over time.

15.1 Gradient Boosting Machines

15.1.1 Gradient Boosting Machines for Regression

Boosting is a machine learning strategy that can be applied to any model class. Similarly to random forests, boosting is an ‘ensemble’ method that creates a model from a ‘committee’ of learners. The committee is formed of ‘weak’ learners that make poor predictions individually, which creates a ‘slow learning’ approach (as opposed to ‘greedy’) that requires many iterations for a model to be a good fit to the data. Boosting models are similar to random forests in that both make predictions from a large committee of learners. However the two differ in how this committee is combined to a prediction. In random forest algorithms, each decision tree is grown independently and their predictions are combined by a simple mean calculation. In contrast, weak learners in a boosting model are fit sequentially and predictions are made by a linear combination of predictions from each learner. With respect to transparency, it is simpler to inspect 100 trees in a random forest, than it is to inspect 100 weak learners in a boosted model, though both are considered black-box models.

The best known boosting algorithm is likely AdaBoost (Freund and Schapire 1996), which is more generally a Forward Stagewise Additive Model (FSAM) with an exponential loss (Hastie, Tibshirani, and Friedman 2001). Today, the most widely used boosting model is the Gradient Boosting Machine (GBM) (J. H. Friedman 2001).

Training a GBM

Pseudo-code for training a componentwise GBM is presented in (4). The term ‘componentwise’ is explained fully below, only this variation of GBM is presented as it is the most common in implementation (Greenwell et al. 2019; Hothorn et al. 2020). Line 1: the initial function is initialized as \(g_0 = 0\); Line 2: iterate over boosting steps \(m = 1,...,M\) and; Line 3: randomly sample the training data, \(\mathcal{D}_{train}\), to a smaller sample, \(\mathcal{D}_{train}^*\), this may be ignored if \(\phi = 1\); Line 4: for all training observations in the reduced dataset, \(i \in \{i:X_i \in \mathcal{D}_{train}^*\}\), compute the negative gradient, \(r_{im}\), of the differentiable loss function, \(L\), with respect to predictions from the previous iteration, \(g_{m-1}(X_i)\); Line 5: fit one weak learner for each feature, \(j = 1,...,p\), in the training data, where the feature, \(X_{;j}\), is the single covariate and \(r_{im}\) are the labels; Line 6: select the optimal weak learner as the one that minimises the squared error between the prediction and the true gradient; Line 7: update the fitted model by adding the optimal weak learner with a shrinkage penalty, \(\nu\); Line 9: return the model updated in the final iteration as the fitted GBM.

Predicting with a GBM

In general, predictions from a trained GBM are simple to compute as the fitted model (and all individual weak learners) take the same inputs, which are passed sequentially to each of the weak learners. In (4), the fitted GBM is a single model, which is a linear combination of weak learners. Instead one could think of the returned model as a collection of the optimal weak learners, i.e. let \(w_{m;j^*}\) be the optimal weak learner from iteration \(m\) and let the fitted GBM (Line 9 (4)) be \(\hat{g}:= \{w_{m;j^*}\}^M_{m=1}\). With this formulation, making predictions from the GBM can be demonstrated simply in ((alg-surv-gbm-pred?)).

The biggest advantages of boosting are firstly relatively few hyper-parameters, which all have a meaningful and intuitive interpretation, and secondly its modular nature means that, like random forests, relatively few parts need to be updated to derive a novel model. First the model components will be discussed and then the hyper-parameters. Once this has been established, deriving survival variants can be simply presented.

15.1.1.1 Losses and Learners

Losses

Building a GBM requires selection of the loss to minimise, \(L\), selection of weak learners, \(w_j\), and a method to compare the weak learners to the loss gradient. The only constraint in selecting a loss, \(L\), is that it must be differentiable with respect to \(g(X)\) (Hastie, Tibshirani, and Friedman 2001). Of course a sensible loss should be chosen (a classification loss should not be used for regression) and different choices of losses will optimise different tasks. \(L_2\)-losses have been demonstrated to be effective for regression boosting, especially with high-dimensional data (Bühlmann and Yu 2003); this is referred to as \(L_2\)-boosting.

Weak Learners

  1. is specifically a componentwise GBM (Bühlmann and Yu 2003), which means that each of the \(p\) weak learners is fit on a single covariate from the data. This method simplifies selecting the possible choices for the weak learners to selecting the class of weak learner (below). Additionally, componentwise GBMs provide a natural and interpretable feature selection method as selecting the optimal learner ((4), line 6) corresponds to selecting the feature that minimises the chosen loss in iteration \(m\).

Only three weak, or ‘base’, learner classes are commonly used in componentwise GBMs (Hothorn et al. 2020; Wang and Wang 2010). These are linear least squares (J. H. Friedman 2001), smoothing splines (Bühlmann and Yu 2003), and decision stumps (Bühlmann and Yu 2003; J. H. Friedman 2001). Let \(L\) be a loss with negative gradient for observation \(i\) in the \(m\)th iteration, \(r_{im}\), and let \(\mathcal{D}_{train}\) be the usual training data. For linear least squares, an individual weak learner is fit by (J. H. Friedman 2001; Wang and Wang 2010), \[ w_j(\mathcal{D}_{train}) = X_{;j}\frac{\sum^n_{i=1} X_{ij}r_{im}}{\sum^n_{i=1} (X_{ij})^2} \] For smoothing splines, usually cubic splines are implemented, these fit weak learners as the minimisers of the equation (Bühlmann and Yu 2003), \[ w_j := \operatornamewithlimits{argmin}_{g \in \mathcal{G}} \frac{1}{n} \sum^{n}_{i = 1} (r_{im} - g(X_{ij}))^2 + \lambda \int (g''(u))^2 du \] where \(g''\) is the second derivative of \(g\), \(\mathcal{G}\) is the set of functions, \ \(\mathcal{G}:= \{g: g \text{ is twice continuously differentiable and } \int (g''(x))^2 dx < \infty\}\), and \(\lambda\) is a hyper-parameter usually chosen so that the number of degrees of freedom, df, is small, with df \(\approx 4\) suggested (Bühlmann and Yu 2003; Schmid and Hothorn 2008a; Wang and Wang 2010).

Finally for decision stumps ((?fig-surv-stump)), a decision tree, \(w_j\), is grown ((alg-dt-fit?)) on \((X_{;j}, r_m)\) to depth one (equivalently to two terminal nodes) for each of the \(j = 1,...,p\) covariates (J. H. Friedman 2001).

15.1.1.2 Hyper-Parameters

The hyper-parameters in (4) are the ‘step-size’, \(\nu\), the sampling fraction, \(\phi\), and the number of iterations, \(M\).

Number of iterations, \(M\)

The number of iterations is often claimed to be the most important hyper-parameter in GBMs and it has been demonstrated that as the number of iterations increases, so too does the model performance (with respect to a given loss on test data) up to a certain point of overfitting (Buhlmann 2006; Hastie, Tibshirani, and Friedman 2001; Schmid and Hothorn 2008a). This is an intuitive result as the foundation of boosting rests on the idea that weak learners can slowly be combined to form a single powerful model. This is especially true in componentwise GBMs as time is required to learn which features are important. Finding the optimal value of \(M\) is critical as a value too small will result in poor predictions, whilst a value too large will result in model overfitting. Two primary methods have been suggested for finding the optimal value of \(M\). The first is to find the \(M \in \mathbb{N}_{> 0}\) that minimises a given measure based on the AIC (Akaike 1974), the second is the ‘usual’ empirical selection by nested cross-validation. In practice the latter method is usually employed.

Step-size, \(\nu\)

The step-size parameter ((4), line 7), \(\nu\), is a shrinkage parameter that controls the contribution of each weak learner at each iteration. Several studies have demonstrated that GBMs perform better when shrinkage is applied and a value of \(\nu = 0.1\) is often suggested (Buhlmann and Hothorn 2007; Hastie, Tibshirani, and Friedman 2001; J. H. Friedman 2001; Lee, Chen, and Ishwaran 2019; Schmid and Hothorn 2008a). The optimal values of \(\nu\) and \(M\) depend on each other, such that smaller values of \(\nu\) require larger values of \(M\), and vice versa. This is intuitive as smaller \(\nu\) results in a slower learning algorithm and therefore more iterations are required to fit the model. Accurately selecting the \(M\) parameter is generally considered to be of more importance, and therefore a value of \(\nu\) is often chosen heuristically (e.g. the common value of \(0.1\)) and then \(M\) is tuned by cross-validation and/or early-stopping.

Sampling Fraction, \(\phi\)

Motivated by the success of bagging in random forests, stochastic gradient boosting (J. Friedman 1999) randomly samples the data in each iteration. It appears that subsampling performs best when also combined with shrinkage (Hastie, Tibshirani, and Friedman 2001) and as with the other hyper-parameters, selection of \(\phi\) is usually performed by nested cross-validation.

15.1.2 Gradient Boosting Machines for Survival Analysis

In a componentwise GBM framework, adapting boosting to survival analysis requires only selecting a sensible choice of loss function \(L\). Therefore fitting and predicting algorithms for componentwise survival GBMs are not discussed as these are fully described in algorithms (4) and ((alg-surv-gbm-pred?)) respectively. However, some GBMs in this section are not componentwise and therefore require some more detailed consideration. Interestingly, unlike other machine learning algorithms that historically ignored survival analysis, early GBM papers considered boosting in a survival context (Ridgeway 1999); though there appears to be a decade gap before further considerations were made in the survival setting. After that period, several developments by Binder, Schmid, and Hothorn, adapted componentwise GBMs to a framework suitable for survival analysis. Their developments are covered exhaustively in the R packages \(\textbf{gbm}\) (Greenwell et al. 2019) and \(\textbf{mboost}\) (Hothorn et al. 2020). This survey continues with the predict type taxonomy.

15.1.2.1 Cox Survival Models

All survival GBMs make ranking predictions and none are able to directly predict survival distributions. However, the GBMs discussed in this section all have natural compositions to distributions as they are modelled in the semi-parametric proportional hazards framework (Chapter 19). The models discussed in the next section can also be composed to distributions though the choice of composition is less clear and therefore they are listed as pure ‘ranking’ models.

GBM-COX {#mod-gdcox} {#mod-gbmcox}\ The ‘GBM-COX’ aims to predict the distribution of data following the PH assumption by estimating the coefficients of a Cox model in a boosting framework (Ridgeway 1999). The model attempts to predict \(\hat{g}(X^*) = \hat{\eta} := X^*\hat{\beta}\), by minimising a suitable loss function. As the model assumes a PH specification, the natural loss to optimise is the Cox partial likelihood (Cox 1972, 1975), more specifically to minimise the negative partial log-likelihood, \(-l\), where

\[ l(\beta) = \sum^n_{i=1} \Delta_i \Big[\eta_i \ - \ \log\Big(\sum^n_{j \in \mathcal{R}_{t_i}} \exp(\eta_i)\Big)\Big] \tag{15.1}\] where \(\mathcal{R}_{t_i}\) is the set of patients at risk at time \(t_i\) and \(\eta_i = X_i\beta\). The gradient of \(-l(\beta)\) at iteration \(m\) is \[ r_{im} := \Delta_i - \sum^n_{j=1} \Delta_j \frac{\mathbb{I}(T_i \geq T_j) \exp(g_{m-1}(X_i))}{\sum_{k \in \mathcal{R}_{t_j}} \exp(g_{m-1}(X_k))} \tag{15.2}\] where \(g_{m-1}(X_i) = X_i\beta_{m-1}\).

  1. now follows with the loss \(L := -l(\beta)\).

The GBM-COX is implemented in \(\textbf{mboost}\) (Hothorn et al. 2020) and has been demonstrated to perform well even when the data violates the PH assumption (Johnson and Long 2011). Despite being a black-box, GBMs are well-understood and individual weak learners are highly interpretable, thus making GBMs highly transparent. Several well-established software packages implement GBM-COX and those that do not tend to be very flexible with respect to custom implementations.

*CoxBoost** {#mod-coxboost}\ The CoxBoost algorithm boosts the Cox PH by optimising the penalized partial-log likelihood; additionally the algorithm allows for mandatory (or ‘forced’) covariates (Harald Binder and Schumacher 2008). In medical domains the inclusion of mandatory covariates may be essential, either for model interpretability, or due to prior expert knowledge. This is not a feature usually supported by boosting. CoxBoost deviates from (4) by instead using an offset-based approach for generalized linear models (Tutz and Binder 2007). This model has a non-componentwise and componentwise framework but only the latter is implemented by the authors (Harold Binder 2013) and discussed here. Let \(\mathcal{I}_{mand}\) be the indices of the mandatory covariates to be included in all iterations, \(m = 1,...,M\), then for an iteration \(m\) the indices to consider for fitting are the set \[ I_m = \{\{1\} \cup \mathcal{I}_{mand},...,\{p\} \cup \mathcal{I}_{mand}\} / \{\{j\} \cup \mathcal{I}_{mand} : j \in \mathcal{I}_{mand}\} \] i.e. in each iteration the algorithm fits a weak learner on the mandatory covariates and one additional (non-mandatory) covariate (hence still being componentwise).

In addition, a penalty matrix \(\mathbf{P} \in \mathbb{R}^{p \times p}\) is considered such that \(P_{ii} > 0\) implies that covariate \(i\) is penalized and \(P_{ii} = 0\) means no penalization. In practice this is usually a diagonal matrix (Harald Binder and Schumacher 2008) and by setting \(P_{ii} = 0, i \in I_{mand}\) and \(P_{ii} > 0, i \not\in I_{mand}\), only optional (non-mandatory) covariates are penalized. The penalty matrix can be allowed to vary with each iteration, which allows for a highly flexible approach, however in implementation a simpler approach is to either select a single penalty to be applied in each iteration step or to have a single penalty matrix (Harold Binder 2013).

At the \(m\)th iteration and the \(k\)th set of indices to consider (\(k = 1,...,p\)), the loss to optimize is the penalized partial-log likelihood given by \[ \begin{split} &l_{pen}(\gamma_{mk}) = \sum^n_{i=1} \Delta_i \Big[\eta_{i,m-1} + X_{i,\mathcal{I}_{mk}}\gamma^T_{mk}\Big] - \\ &\quad\Delta_i\log\Big(\sum^n_{j = 1} \mathbb{I}(T_j \leq T_i) \exp(\eta_{i,{m-1}} + X_{i, \mathcal{I}_{mk}}\gamma^T_{mk}\Big) - \lambda\gamma_{mk}\mathbf{P}_{mk}\gamma^T_{mk} \end{split} \]

where \(\eta_{i,m} = X_i\beta_m\), \(\gamma_{mk}\) are the coefficients corresponding to the covariates in \(\mathcal{I}_{mk}\) which is the possible set of candidates for a subset of total candidates \(k = 1,...,p\), \(\mathbf{P}_{mk}\) is the penalty matrix, and \(\lambda\) is a penalty hyper-parameter to be tuned or selected.

In each iteration, all potential candidate sets (the union of mandatory covariates and one other covariate) are updated by \[ \hat{\gamma}_{mk} = \mathbf{I}^{-1}_{pen}(0)U(0) \] where \(U(\gamma) = \partial l / \partial \gamma (\gamma)\) and \(\mathbf{I}^{-1}_{pen} = \partial^2 l/\partial\gamma\partial\gamma^T (\gamma + \lambda\mathbf{P}_{mk})\) are the first and second derivatives of the unpenalized partial-log-likelihood. The optimal set is then found as \[ k^* := \operatornamewithlimits{argmax}_k l_{pen}(\gamma_{mk}) \] and the estimated coefficients are updated with \[ \hat{\beta}_m = \hat{\beta}_{m-1} + \gamma_{mk^*}, \quad k^* \in \mathcal{I}_{mk} \] The step size, \(\nu\), is then one, but this could potentially be altered.

The algorithm deviates from (4) as \(l_{pen}\) is directly optimised and not its gradient, additionally model coefficients are iteratively updated instead of a more general model form. The algorithm is implemented in \(\textbf{CoxBoost}\) (Harold Binder 2013). Experiments suggest that including the ‘correct’ mandatory covariates may increase predictive performance (Harald Binder and Schumacher 2008). CoxBoost is less accessible than other boosting methods as it requires a unique boosting algorithm, as such only one off-shelf implementation appears to exist and even this implementation has been removed from CRAN as of 2020-11-11. CoxBoost is also less transparent as the underlying algorithm is more complex, though is well-explained by the authors (Harald Binder and Schumacher 2008). There is good indication that CoxBoost is performant (Sonabend 2021). In a non-medical domain, where performance may be the most important metric, then perhaps CoxBoost can be recommended as a powerful model. However, when sensitive predictions are required, CoxBoost may not be recommended. Further papers studying the model and more off-shelf implementations could change this in the future.

15.1.2.2 Ranking Survival Models

The ranking survival models in this section are all unified as they make predictions of the linear predictor, \(\hat{g}(X^*) = X^*\hat{\beta}\).

GBM-AFT {#mod-gbmaft}\ Schmid and Hothorn (2008) (Schmid and Hothorn 2008b) published a GBM for accelerated failure time models in response to PH-boosted models that may not be suitable for non-PH data. Their model fits into the GBM framework by assuming a fully-parametric AFT and simultaneously estimating the linear predictor, \(\hat{g}(X_i) =\hat{\eta}\), and the scale parameter, \(\hat{\sigma}\), controlling the amount of noise in the distribution. The (fully-parametric) AFT is defined by \[ \log Y = \eta + \sigma W \] where \(W\) is a random variable independent of the covariates that follows a given distribution and controls the noise in the model. By assuming a distribution on \(W\), a distribution is assumed for the full parametric model. The full likelihood, \(\mathcal{L}\), is given by \[ \mathcal{L}(\mathcal{D}_{train}|\mu, \sigma, W) = \prod^n_{i=1} \Big[\frac{1}{\sigma} f_W\Big(\frac{\log(T_i) - \mu}{\sigma}\Big)\Big]^{\Delta_i}\Big[S_W\Big(\frac{\log(T_i) - \mu}{\sigma}\Big)\Big]^{(1-\Delta_i)} \tag{15.3}\] where \(f_W, S_W\) is the pdf and survival function of \(W\) for a given distribution. By setting \(\mu := g(X_i)\), \(\sigma\) is then rescaled according to known results depending on the distribution (Klein and Moeschberger 2003). The gradient of the negative log-likelihood, \(-l\), is minimised in the \(m\)th iteration where \[ \begin{split} l(\mathcal{D}_{train}|\hat{g}, \hat{\sigma},W) = \sum^n_{i=1} \Delta_i\Big[- \log\sigma + \log f_W\Big(\frac{\log(T_i) - \hat{g}_{m-1}(X_i)}{\hat{\sigma}_{m-1}}\Big)\Big] + \\ (1-\Delta_i)\Big[\log S_W\Big(\frac{\log(T_i) - \hat{g}_{m-1}(X_i)}{\hat{\sigma}_{m-1}}\Big)\Big] \end{split} \] where \(\hat{g}_{m-1}, \hat{\sigma}_{m-1}\) are the location-scale parameters estimated in the previous iteration. Note this key difference to other GBM methods in which two estimates are made in each iteration step. In order to allow for this, (4) is run as normal but in addition, after updating \(\hat{g}_m\), one then updates \(\hat{\sigma}_m\) as \[ \hat{\sigma}_m := \operatornamewithlimits{argmin}_\sigma -l(\mathcal{D}_{train}|g_m,\sigma, W) \] \(\sigma_0\) is initialized at the start of the algorithm with \(\sigma_0 = 1\) suggested (Schmid and Hothorn 2008b).

This algorithm provides a ranking prediction without enforcing an often-unrealistic PH assumption on the data. This model is implemented in \(\textbf{mboost}\) and \(\textbf{xgboost}\). Experiments indicate that this may outperform the Cox PH (Schmid and Hothorn 2008b). Moreover the model has the same transparency and accessibility as the GBM-COX.

GBM-GEH {#mod-gbmgeh}\ The concordance index is likely the most popular measure of discrimination, this in part due to the fact that it makes little-to-no assumptions about the data (Chapter 6). A less common measure is the Gehan loss, motivated by the semi-parametric AFT. Johnson and Long proposed the GBM with Gehan loss, here termed GBM-GEH, to optimise separation within an AFT framework (Johnson and Long 2011).

The semi-parametric AFT is defined by the linear model, \[ \log Y = \eta + \epsilon \] for some error term, \(\epsilon\).

The D-dimensional Gehan loss to minimise is given by, \[ G_D(\mathcal{D}_{train}, \hat{g}) = -\frac{1}{n^2} \sum^n_{i=1}\sum^n_{j=1} \Delta_i (\hat{e}_i - \hat{e}_j)\mathbb{I}(\hat{e}_i \leq \hat{e}_j) \] where \(\hat{e}_i = \log T_i - \hat{g}(X_i)\). The negative gradient of the loss is, \[ r_{im} := \frac{\sum^n_{j=1} \Delta_j \mathbb{I}(\hat{e}_{m-1,i} \geq \hat{e}_{m-1,j}) -\Delta_i\mathbb{I}(\hat{e}_{m-1,i} \leq \hat{e}_{m-1,j})}{n} \] where \(\hat{e}_{m-1,i} = \log T_i - \hat{g}_{m-1}(X_i)\).

  1. then follows naturally substituting the loss and gradient above. The algorithm is implemented in \(\textbf{mboost}\). Simulation studies on the performance of the model are inconclusive (Johnson and Long 2011) however the results in (Sonabend 2021) indicate strong predictive performance.

GBM-BUJAR {#mod-gbmbujar}\ GBM-BUJAR is another boosted semi-parametric AFT. However the algorithm introduced by Wang and Wang (2010) (Wang and Wang 2010) uses Buckley-James imputation and minimisation. This algorithm is almost identical to a regression GBM (i.e. using squared loss or similar for \(L\)), except with one additional step to iteratively impute censored survival times. Assuming a semi-parametric AFT model, the GBM-BUJAR algorithm iteratively updates imputed outcomes with the Buckley-James estimator (Buckley and James 1979), \[ T^*_{m, i} := \hat{g}_{m-1}(X_i) + e_{m-1, i}\Delta_i + (1-\Delta_i)\Big[\hat{S}_{KM}(e_{m-1, i})^{-1}\sum_{e_{m-1, j} > e_{m-1, i}} e_{m-1, j}\Delta_j \hat{p}_{KM}(e_{m-1, j})\Big] \] where \(\hat{g}_{m-1}(X_i) = \hat{\eta}_{m-1}\), and \(\hat{S}_{KM}, \hat{p}_{KM}\) are Kaplan-Meier estimates of the survival and probability mass functions respectively fit on some training data, and \(e_{m-1,i} := \log(T_i) - g_{m-1}(X_i)\). Once \(T^*_{m, i}\) has been updated, (4) continues from with least squares as with any regression model.

GBM-BUJAR is implemented in \(\textbf{bujar}\) (Wang 2019) though without a separated fit/predict interface, its accessibility is therefore limited. There is no evidence of wide usage of this algorithm nor simulation studies demonstrating its predictive ability. Finally, there are many known problems with semi-parametric AFT models and the Buckey-James procedure (Wei 1992), hence GBM-BUJAR is also not transparent.

GBM-UNO {#mod-gbmuno}\ Instead of optimising models based on a given model form, Chen \(\textit{et al.}\) (Y. Chen et al. 2013) studied direct optimisation of discrimination by Harrell’s C whereas Mayr and Schmid (Mayr and Schmid 2014) focused instead on Uno’s C. Only an implementation of the Uno’s C method could be found, this is therefore discussed here and termed ‘GBM-UNO’.

The GBM-UNO attempts to predict \(\hat{g}(X^*) := \hat{\eta}\) by optimising Uno’s C (Section 6.1), \[ C_U(\hat{g}, \mathcal{D}_{train}) = \frac{\sum_{i \neq j}\Delta_i\{\hat{G}_{KM}(T_i)\}^{-2}\mathbb{I}(T_i < T_j)\mathbb{I}(\hat{g}(X_i) >\hat{g}(X_j))}{\sum_{i \neq j}\Delta_i\{\hat{G}_{KM}(T_i)\}^{-2}\mathbb{I}(T_i < T_j)} \]

The GBM algorithm requires that the chosen loss, here \(C_U\), be differentiable with respect to \(\hat{g}(X)\), which is not the case here due to the indicator term, \(\mathbb{I}(\hat{g}(X_i) > \hat{g}(X_j))\). Therefore a smoothed version is instead considered where the indicator is approximated by the sigmoid function (Ma and Huang 2006),

\[ K(u|\sigma) = (1 + \exp(-u/\sigma))^{-1} \]

where \(\sigma\) is a hyper-parameter controlling the smoothness of the approximation. The measure to optimise is then,

\[ C_{USmooth}(\mathcal{D}_{train}|\sigma) = \sum_{i \neq j} \frac{k_{ij}}{1 + \exp\big[(\hat{g}(X_j) - \hat{g}(X_i))/\sigma)\big]} \tag{15.4}\]

with

\[ k_{ij} = \frac{\Delta_i (\hat{G}_{KM}(T_i))^{-2}\mathbb{I}(T_i < T_j)}{\sum^n_{i \neq j} \Delta_i(\hat{G}_{KM}(T_i))^{-2}\mathbb{I}(T_i < T_j)} \]

The negative gradient at iteration \(m\) for observation \(i\) can then be found,

\[ r_{im} := - \sum^n_{j = 1} k_{ij} \frac{-\exp(\frac{\hat{g}_{m-1}(X_j) - \hat{g}_{m-1}(X_i)}{\sigma})}{\sigma(1 + \exp(\frac{\hat{g}_{m-1}(X_j) - \hat{g}_{m-1}(X_i)}{\sigma}))} \tag{15.5}\]

  1. can then be followed exactly by substituting this loss and gradient; this is implemented in \(\textbf{mboost}\). One disadvantage of GBM-UNO is that C-index boosting is more insensitive to overfitting than other methods (Mayr, Hofner, and Schmid 2016), therefore stability selection (Meinshausen and Bühlmann 2010) can be considered for variable selection; this is possible with \(\textbf{mboost}\). Despite directly optimising discrimination, simulation studies do not indicate that this model has better separation than other boosted or lasso models (Mayr and Schmid 2014). GBM-UNO has the same accessibility, transparency, and performance (Sonabend 2021) as previous boosting models.

15.1.3 Conclusions

Componentwise gradient boosting machines are a highly flexible and powerful machine learning tool. They have proven particularly useful in survival analysis as minimal adjustments are required to make use of off-shelf software. The flexibility of the algorithm allows all the models above to be implemented in very few \(\textsf{R}\) (and other programming languages) packages.

Boosting is a method that often relies on intensive computing power and therefore dedicated packages, such as \(\textbf{xgboost}\) (T. Chen et al. 2020), exist to push CPU/GPUs to their limits in order to optimise predictive performance. This can be viewed as a strong advantage though one should be careful not to focus too much on predictive performance to the detriment of accessibility and transparency.

Boosting, especially with tree learners, is viewed as a black-box model that is increasingly difficult to interpret as the number of iterations increase. However, there are several methods for increasing interpretability, such as variable importance and SHAPs (Lundberg and Lee 2017). There is also evidence that boosting models can outperform the Cox PH (Schmid and Hothorn 2008b) (not something all ML models can claim).