14  Support Vector Machines

Abstract
TODO (150-200 WORDS)
Major changes expected!

This page is a work in progress and major changes will be made over time.

14.0.1 SVMs for Regression

In the simplest explanation, support vector machines (SVMs) (Cortes and Vapnik 1995) fit a hyperplane, \(g\), to given training data and make predictions for new values as \(\hat{g}(X^*)\) for some testing covariate \(X^*\). One might expect the hyperplane to be fit so that all training covariates map perfectly to the observed labels (a ‘hard’ boundary); however, this would result in overfitting, so instead an acceptable (‘soft’) boundary of error, the ‘\(\epsilon\)-tube’, dictates how incorrect predictions may be, i.e. how large an underestimate or overestimate is tolerated. (Figure 14.1) visualises a support vector machine for regression with a linear hyperplane \(g\) and an acceptable boundary of error within the dashed lines (the \(\epsilon\)-tube). SVMs are not limited to linear boundaries; kernel functions are used to specify more complex hyperplanes. Exact details of the optimisation procedure are not discussed here, but many off-the-shelf ‘solvers’ exist in different programming languages for fitting SVMs.

In the regression setting, the goal of SVMs is to estimate the function \[ g: \mathbb{R}^p \rightarrow \mathbb{R}; \quad (x) \mapsto x\beta + \beta_0 \tag{14.1}\] by estimation of the weights \(\beta \in \mathbb{R}^p, \beta_0 \in \mathbb{R}\) via the optimisation problem \[ \begin{aligned} & \min_{\beta,\beta_0, \xi, \xi^*} \frac{1}{2} \|\beta\|^2 + C \sum^n_{i=1}(\xi_i + \xi_i^*) \\ & \textrm{subject to} \begin{dcases} Y_i - g(X_i) & \leq \epsilon + \xi_i \\ g(X_i) - Y_i & \leq \epsilon + \xi_i^* \\ \xi_i, \xi_i^* & \geq 0, \ i = 1,...,n \end{dcases} \end{aligned} \tag{14.2}\] where \(C > 0\) is the regularisation/cost parameter, \(\xi_i,\xi_i^*\) are slack parameters, \(\epsilon\) is the half-width of the tube within which prediction errors are not penalised, and \(g\) is defined in (Equation 14.1). The effect of the slack parameters is seen in (Figure 14.1), in which the slack variables measure how far observations fall outside the \(\epsilon\)-tube.
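
To make the roles of \(C\), \(\epsilon\), and the kernel concrete, the following minimal sketch fits an \(\epsilon\)-SVR with the \(\textsf{R}\) package \(\textbf{e1071}\) (an off-the-shelf solver); the simulated data, kernel choice, and hyper-parameter values are purely illustrative and not taken from any experiment in this book.

```r
# Minimal illustrative sketch: fitting an epsilon-SVR with e1071.
# The simulated data, kernel, and hyper-parameter values are arbitrary.
library(e1071)

set.seed(1)
n <- 200
X <- matrix(rnorm(n * 3), ncol = 3)                       # three covariates
y <- as.vector(X %*% c(1, -0.5, 2)) + rnorm(n, sd = 0.3)  # noisy linear signal

# cost = C (regularisation), epsilon = half-width of the epsilon-tube,
# kernel = "radial" specifies a non-linear hyperplane via the RBF kernel
fit <- svm(X, y, type = "eps-regression", kernel = "radial",
           cost = 1, epsilon = 0.1)

X_new <- matrix(rnorm(3), nrow = 1)  # a single 'testing covariate' X*
predict(fit, X_new)                  # the prediction g-hat(X*)
```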

In fitting, the dual of the optimisation is solved instead, and substituting the optimised parameters into (Equation 14.1) gives the prediction function, \[ \hat{g}(X^*) = \sum^n_{i=1} (\alpha_i - \alpha_i^*)K(X^*,X_i) + \beta_0 \] where \(\alpha_i, \alpha_i^*\) are Lagrangian multipliers and \(K\) is some kernel function. The Karush-Kuhn-Tucker conditions required to solve the optimisation for \(\alpha\) yield the key property of SVMs: if \(\alpha_i = \alpha_i^* = 0\) then observation \(i\) lies ‘inside’ the \(\epsilon\)-tube, whereas if \(\alpha_i \neq 0\) or \(\alpha^*_i \neq 0\) then \(i\) lies outside the tube and is termed a support vector. It is only these ‘support vectors’ that influence the shape of the fitted function.
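
The support vector property can be illustrated directly: given the non-zero dual coefficients and their associated support vectors, the prediction is a weighted sum of kernel evaluations plus an intercept. The sketch below uses made-up coefficients, support vectors, and an RBF kernel purely to illustrate the form of \(\hat{g}\); in practice these quantities are returned by the solver.

```r
# Illustrative only: evaluating
# g-hat(X*) = sum_i (alpha_i - alpha_i^*) K(X*, X_i) + beta_0
# from (hypothetical) dual coefficients, support vectors, and an intercept.
rbf_kernel <- function(u, v, gamma = 1) exp(-gamma * sum((u - v)^2))

svr_predict <- function(x_new, support_vectors, dual_coef, beta0,
                        kernel = rbf_kernel) {
  k_vals <- apply(support_vectors, 1, kernel, v = x_new)  # K(X*, X_i) for each SV
  sum(dual_coef * k_vals) + beta0
}

# toy values standing in for a fitted model: three support vectors in R^2
sv    <- rbind(c(0, 1), c(1, 0), c(2, 2))
coefs <- c(0.5, -0.3, 0.1)  # alpha_i - alpha_i^*; zero coefficients contribute nothing
svr_predict(c(1, 1), sv, coefs, beta0 = 0.2)
```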

The choice of kernel and its parameters, the regularisation parameter \(C\), and the acceptable error \(\epsilon\), are all tunable hyper-parameters, which makes the support vector machine a highly adaptable and often well-performing machine learning method. However, the parameters \(C\) and \(\epsilon\) often have no clear a priori meaning (especially when predicting abstract rankings) and thus require extensive tuning over a wide range of values; without tuning, the model is likely to fit very poorly.
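
In practice this tuning is usually a grid search with cross-validation. Continuing the illustrative example above, the sketch below tunes \(C\) (cost), \(\epsilon\), and the RBF kernel parameter with \(\textbf{e1071}\)'s built-in tuner; the grids are arbitrary and chosen only for illustration.

```r
# Illustrative grid search over C (cost), epsilon, and the RBF parameter gamma;
# tune.svm performs cross-validation internally. Grid values are arbitrary.
tuned <- tune.svm(X, y, type = "eps-regression", kernel = "radial",
                  cost = 2^(-2:6), epsilon = c(0.01, 0.1, 0.5),
                  gamma = c(0.1, 0.5, 1))

tuned$best.parameters   # selected (gamma, cost, epsilon)
best_fit <- tuned$best.model
```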

TODO
Figure 14.1: Visualising a support vector machine with an \(\epsilon\)-tube and slack parameters \(\xi\) and \(\xi^*\). Red circles are values within the \(\epsilon\)-tube and blue diamonds are values outside the tube. The x-axis is a single covariate, \(x\), and the y-axis is \(g(x) = x\beta + \beta_0\).

14.0.2 SVMs for Survival Analysis

As with random forests, research into Survival Support Vector Machines (SSVMs) can be reduced to very few algorithms; in fact only one unique off-the-shelf algorithm is identified in this survey. No SSVM for distribution predictions exists; instead SSVMs predict survival times, rankings, or a hybrid of the two.

Other reviews and surveys of SSVMs include a short review by Wang \(\textit{et al.}\) (Wang, Li, and Reddy 2019) and some benchmark experiments and short surveys from Van Belle \(\textit{et al.}\) (2011) (Vanya Van Belle, Pelckmans, Van Huffel, et al. 2011), Goli \(\textit{et al.}\) (2016) (Goli, Mahjub, Faradmal, and Soltanian 2016), and Fouodo \(\textit{et al.}\) (2018) (Fouodo et al. 2018). All the benchmark experiments in these papers indicate that the Cox PH performs as well as, if not better than, the SSVMs.

Initial attempts at developing SSVMs by Shivaswamy \(\textit{et al.}\) (2007) (Shivaswamy, Chu, and Jansche 2007) took the most ‘natural’ course and attempted to treat the problem as a regression one with adjustments in the optimisation for censoring. These methods have a natural interpretation and are intuitive in their construction. Further developments by Khan and Zubek (2008) (Khan and Bayer Zubek 2008) and Land \(\textit{et al.}\) (2011) (Land et al. 2011) focused on different adjustments for censoring in order to best reflect a realistic survival data set-up. Simultaneously, ranking models were developed in order to directly optimise a model’s discriminatory power. Developments started with the work of Evers and Messow (2008) (Evers and Messow 2008) but were primarily made by Van Belle \(\textit{et al.}\) (2007)-(2011) (V. Van Belle et al. 2010; Vanya Van Belle et al. 2007, 2008; Vanya Van Belle, Pelckmans, Suykens, et al. 2011). These lack the survival time interpretation but are less restrictive in the optimisation constraints. Finally, a hybrid of the two followed naturally from Van Belle \(\textit{et al.}\) (2011) (Vanya Van Belle, Pelckmans, Van Huffel, et al. 2011) by combining the constraints from both the regression and ranking tasks. This hybrid method allows a survival time interpretation whilst still optimising discrimination. Hybrid models of this kind have become increasingly popular not only in SSVMs, but also in neural networks (Section 16.1). Instead of presenting these models chronologically, the final hybrid model is defined first and other developments are then presented more simply as components of this hybrid. One model with an entirely different formulation is considered after the hybrid.

For all SSVMs defined in this section let: \(\xi_i,\xi_i^*,\xi_i'\) be slack variables; \(\beta \in \mathbb{R}^p\) and \(\beta_0 \in \mathbb{R}\) be model weights; \(C, \mu\) be non-negative regularisation hyper-parameters; \((X_i, T_i, \Delta_i) \stackrel{i.i.d.}\sim(X,T,\Delta)\) be the usual training data; and \(g(x) = x\beta + \beta_0\).

14.0.2.1 SSVM-Hybrid

Van Belle \(\textit{et al.}\) published several papers developing SSVMs, which culminate in the hybrid model here termed ‘SSVM-Hybrid’ (Vanya Van Belle, Pelckmans, Van Huffel, et al. 2011). The model is defined by the optimisation problem,

SSVM-Hybrid\ \[ \begin{aligned} & \min_{\beta, \beta_0, \xi, \xi', \xi^*} \frac{1}{2}\|\beta\|^2 + C\sum_{i =1}^n \xi_i + \mu \sum^n_{i=1}(\xi_i' + \xi_i^*) \\ & \textrm{subject to} \begin{dcases} & g(X_i) - g(X_{j(i)}) \geq T_i - T_{j(i)} - \xi_i, \\ & \Delta_i(g(X_i) - T_i) \leq \xi^*_i \\ & T_i - g(X_i) \leq \xi'_i \\ & \xi_i, \xi_i', \xi_i^* \geq 0, \quad \forall i = 1,...,n \\ \end{dcases} \end{aligned} \label{eq:surv_ssvmvb2} \]

where \(j(i) := \operatornamewithlimits{argmax}_{j \in 1,...n} \{T_j : T_j < T_i\}\) is an index discussed further below. A prediction for test data is given by,

\[ \hat{g}(X^*) = \sum^n_{i=1} \alpha_i(K(X_i, X^*) - K(X_{j(i)}, X^*)) + \alpha^*_i K(X_i, X^*) - \Delta_i\alpha_i'K(X_i, X^*) + \beta_0 \]

where \(\alpha_i, \alpha_i^*, \alpha_i'\) are Lagrange multipliers and \(K\) is a chosen kernel function, which may have hyper-parameters to select or tune.

SVCR (Regression)

Examining the components of the SSVM-Hybrid model helps identify its relation to previously published SSVMs. First note the model’s connection to the regression setting: on setting \(C = 0\), removing the associated first constraint, and ignoring \(\Delta\) in the second constraint, the regression SVM is exactly recovered:

\[ \begin{aligned} & \min_{\beta, \beta_0, \xi, \xi'} \frac{1}{2}\|\beta\|^2 + \mu \sum^n_{i=1}(\xi_i + \xi_i') \\ & \textrm{subject to} \begin{dcases} & g(X_i) - T_i \leq \xi_i \\ & T_i - g(X_i) \leq \xi'_i \\ & \xi_i, \xi_i' \geq 0, \quad \forall i = 1,...,n \\ \end{dcases} \end{aligned} \]

Note a slight difference between this formulation and the original regression problem: here no error component \(\epsilon\) is directly included; instead the error is absorbed into the slack parameters \(\xi_i, \xi'_i\), which is effectively the same as setting \(\epsilon = 0\). This formulation removes the \(\epsilon\)-tube symmetry seen previously and therefore distinguishes more clearly between overestimates and underestimates, with each being penalised differently. Removing the \(\epsilon\) parameter can lead to overfitting as all points become support vectors; however, careful tuning of the other hyper-parameters can effectively control for this.

This formulation allows for clearer control over left-, right-, and un-censored observations. Clearly if an observation is uncensored then the true value is known and should be predicted exactly, hence under- and over-estimates are equally problematic and should be penalised the same. If an observation is right-censored then the true death time is greater than the observed time and therefore overestimates should not be heavily penalised but underestimates should be; conversely for left-censored observations.

This leads to the first SSVM for regression from Shivaswamy \(\textit{et al.}\) (2007) (Shivaswamy, Chu, and Jansche 2007).

SVCR \[ \begin{aligned} & \min_{\beta, \beta_0, \xi, \xi^*} \frac{1}{2}\|\beta\|^2 + \mu\Big(\sum_{i \in R} \xi_i + \sum_{i \in L} \xi_i^*\Big) \\ & \textrm{subject to} \begin{dcases} & T_i - g(X_i) \leq \xi_i, \quad \forall i \in R \\ & g(X_i) - T_i \leq \xi^*_i, \quad \forall i \in L \\ & \xi_i \geq 0, \forall i\in R; \xi^*_i \geq 0, \forall i \in L \end{dcases} \end{aligned} \]

where \(L\) is the set of observations that are either left- or un-censored, and \(R\) is the set of observations that are either right- or un-censored. Hence an uncensored observation is constrained on both sides as its true survival time is known, whereas a left-censored observation is constrained in the amount of ‘over-prediction’ and a right-censored observation in the amount of ‘under-prediction’. This is intuitive as the only known quantities for these censoring types are the upper and lower bounds of the actual survival time respectively.

Reducing this to the scope of this book, right-censoring only, results in the optimisation:

\[ \begin{aligned} & \min_{\beta, \beta_0, \xi, \xi^*} \frac{1}{2}\|\beta\|^2 + \mu\sum_{i = 1}^n (\xi_i + \xi_i^*) \\ & \textrm{subject to} \begin{dcases} & \Delta_i(g(X_i) - T_i) \leq \xi_i \\ & T_i - g(X_i) \leq \xi^*_i \\ & \xi_i, \xi_i^* \geq 0 \\ & \forall i\in 1,...,n \end{dcases} \end{aligned} \] which can be seen to be identical to SSVM-Hybrid when \(C=0\) and the first constraint is removed. Predictions are found by,

\[ \hat{g}(X^*) = \sum^n_{i=1} \alpha^*_i K(X_i, X^*) - \Delta_i\alpha_i'K(X_i, X^*) + \beta_0 \]

The advantage of this algorithm is its simplicity. Clearly, if no observations are censored then the optimisation is identical to the regression optimisation in (Equation 14.2) with \(\epsilon = 0\). As there is no \(\epsilon\) hyper-parameter, fitting is no slower than fitting a regression SVM. Both left- and right-censoring are handled and no assumptions are made about independent censoring. With respect to performance, benchmark experiments (Fouodo et al. 2018) indicate that the SVCR does not outperform a naïve SVR (i.e. one in which censoring is ignored). The SVCR is implemented in the \(\textsf{R}\) package \(\textbf{survivalsvm}\) (Fouodo et al. 2018) and is referred to as ‘regression’.
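
As a hedged illustration of this implementation, the sketch below fits the SVCR through \(\textbf{survivalsvm}\) with type ‘regression’ on the veteran data from the \(\textbf{survival}\) package; the regularisation value, optimiser, and kernel are arbitrary choices rather than recommendations, and argument names follow the package documentation as understood here.

```r
# Illustrative fit of the SVCR ('regression' type) via survivalsvm.
# gamma.mu is the regularisation parameter; its value and the kernel are arbitrary.
library(survival)
library(survivalsvm)

data(veteran, package = "survival")

fit_svcr <- survivalsvm(Surv(time, status) ~ ., data = veteran,
                        type = "regression", gamma.mu = 0.1,
                        opt.meth = "quadprog", kernel = "lin_kernel")

# Predicted values (on the survival-time scale) are stored in pred$predicted
pred <- predict(fit_svcr, newdata = veteran)
```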

As discussed, the error margins for left- and right-censoring need not be equal, and nor need the penalty for each. Hence a natural extension to the SVCR is to add further parameters to better separate the different censoring types, which gives rise to the SVRc (Khan and Bayer Zubek 2008). However, this model is only briefly discussed as left-censoring is outside the scope of this book and the model is patented and therefore not easily accessible. The model is given by the optimisation,

SVRc \[ \begin{aligned} & \min_{\beta, \beta_0, \xi, \xi^*} \frac{1}{2}\|\beta\|^2 + \sum^n_{i=1} (C_i\xi_i + C^*_i\xi^*_i) \\ & \textrm{subject to} \begin{dcases} & g(X_i) - T_i \leq \epsilon^*_i + \xi^*_i \\ & T_i - g(X_i) \leq \epsilon_i + \xi_i \\ & \xi_i, \xi_i^* \geq 0, \quad \forall i = 1,...,n \\ \end{dcases} \end{aligned} \]

where \(C_i = \Delta_iC_c + (1-\Delta_i)C_n\), \(\epsilon_i = \Delta_i\epsilon_c + (1-\Delta_i)\epsilon_n\), and analogously for \(C^*_i, C_c^*, \epsilon^*_i,...\). The new hyper-parameters \(C_c, C_n\) are the penalties for prediction errors on censored (c) and uncensored (n) observations, \(\epsilon_c, \epsilon_n\) are the corresponding acceptable error margins, and the starred counterparts play the same roles for errors in the opposite direction (the \(g(X_i) - T_i\) constraint). The rationale behind this algorithm is clear: by having asymmetric error margins the algorithm can penalise predictions that are clearly wrong whilst allowing predictions that may be correct (but ultimately unknown due to censoring). Experiments indicate the model may have better discrimination than the Cox PH (Khan and Bayer Zubek 2008) and the SVCR (Du and Dua 2011). However, these conclusions are weak as independent experiments cannot be run without access to the patented model.

The largest drawback of the algorithm is the need to tune eight hyper-parameters. As the number of hyper-parameters to tune increases, so too do model fitting time and the risk of overfitting. The burden of extra hyper-parameters is the most commonly cited disadvantage of the model in the literature (Fouodo et al. 2018; Land et al. 2011). Land \(\textit{et al.}\) (2011) (Land et al. 2011) present an adaptation of the SVRc to improve model fitting time, termed the EP-SVRc, which uses Evolutionary Programming to determine optimal values for the parameters. However, no specific model or algorithm is described, nor are any quantitative results presented, and no evidence can be found of this method being used since publication. The number of hyper-parameters in the SVRc, coupled with its lack of accessibility, outweighs the benefits of its claimed predictive performance.

14.0.2.2 SSVM-Rank

The regression components of SSVM-Hybrid (\(\ref{eq:surv_ssvmvb2}\)) have now been fully examined; turning to the ranking components, setting \(\mu = 0\) reduces the model to

SSVM-Rank \[ \begin{aligned} & \min_{\beta, \beta_0, \xi} \frac{1}{2}\|\beta\|^2 + C\sum_{i =1}^n \xi_i \\ & \textrm{subject to} \begin{dcases} & g(X_i) - g(X_{j(i)}) \geq T_i - T_{j(i)} - \xi_i, \\ & \xi_i \geq 0, \quad \forall i = 1,...,n \\ \end{dcases} \end{aligned} \]

with predictions

\[ \hat{g}(X^*) = \sum^n_{i=1} \alpha_i(K(X_i, X^*) - K(X_{j(i)}, X^*)) \]

This formulation, termed here ‘SSVM-Rank’, has been considered by numerous authors in different forms, including Evers and Messow (Evers and Messow 2008) and Van Belle \(\textit{et al.}\) (Vanya Van Belle et al. 2007, 2008; Vanya Van Belle, Pelckmans, Van Huffel, et al. 2011). The primary differences between the various models lie in which observations are compared in order to optimise discrimination; to motivate why this matters, first observe the intuitive nature of the optimisation constraints. By example, define \(k := T_i - T_{j(i)}\) and note that \(T_i > T_{j(i)}\), so \(k > 0\). The first constraint requires \(g(X_i) - g(X_{j(i)}) \geq k - \xi_i\), and since the slack \(\xi_i \geq 0\) is penalised in the objective, the constraint can only be satisfied without incurring any penalty when \(g(X_i) - g(X_{j(i)}) \geq k > 0\), i.e. when observation \(i\) is ranked above \(j(i)\). Discordant predictions therefore incur a penalty, which illustrates why this optimisation results in a ranking model.

This choice of comparing observations \(i\) and \(j(i)\) (defined below) stems from a few years of research in an attempt to optimise the algorithm with respect to both speed and predictive performance. In the original formulation, RANKSVMC (Vanya Van Belle et al. 2007), the model ranks all possible pairs of observations. This quickly becomes infeasible as it increases the problem to a \(\mathcal{O}(qn^2/2)\) runtime, where \(q\) is the proportion of non-censored observations out of a total sample size \(n\) (Vanya Van Belle et al. 2008). The problem was reduced by taking a nearest neighbours approach and only comparing each observation with its \(k\) nearest neighbours (Vanya Van Belle et al. 2008). Simulation experiments determined that the single nearest neighbour was sufficient, thus arriving at \(j(i)\), the index of the observation with the largest observed survival time smaller than \(T_i\), \[ j(i) := \operatornamewithlimits{argmax}_{j \in 1,...n} \{T_j : T_j < T_i\} \]
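
For illustration, the comparison index can be computed directly from the observed times; the helper below is a naive sketch (not taken from any package), using the assumed convention that the observation with the smallest time is paired with itself so that its constraint becomes vacuous.

```r
# Illustrative helper (not from survivalsvm): for each observation i, find j(i),
# the index of the largest observed time strictly smaller than T_i.
# Assumed convention: the smallest observed time is paired with itself, making
# its ranking constraint vacuous.
nearest_smaller <- function(times) {
  vapply(seq_along(times), function(i) {
    smaller <- which(times < times[i])
    if (length(smaller) == 0L) return(i)
    smaller[which.max(times[smaller])]
  }, integer(1))
}

nearest_smaller(c(5, 2, 9, 2, 7))  # for the time 9, j(i) is the index of time 7
```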

This nearest-neighbour approach requires that the observation with the smallest observed time is treated as an event, even if it is actually censored. In practice, sorting observations by survival time greatly speeds up model fitting (Fouodo et al. 2018). The RANKSVMC and SSVM-Rank are implemented in \(\textbf{survivalsvm}\) (Fouodo et al. 2018) and referred to as ‘vanbelle1’ and ‘vanbelle2’ respectively.
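
A hedged sketch of fitting both ranking variants through \(\textbf{survivalsvm}\) follows, continuing the veteran example above; the values of gamma.mu and the kernel are arbitrary, and the diff.meth argument (which, as understood here, controls how comparison pairs are constructed) follows the package examples, so the exact call should be treated as an assumption rather than a prescription.

```r
# Illustrative fits of the ranking SSVMs; gamma.mu, kernel, and diff.meth values
# are arbitrary/assumed, not recommendations.
fit_ranksvmc <- survivalsvm(Surv(time, status) ~ ., data = veteran,
                            type = "vanbelle1", diff.meth = "makediff3",
                            gamma.mu = 0.1, opt.meth = "quadprog",
                            kernel = "lin_kernel")

fit_ssvmrank <- survivalsvm(Surv(time, status) ~ ., data = veteran,
                            type = "vanbelle2", diff.meth = "makediff3",
                            gamma.mu = 0.1, opt.meth = "quadprog",
                            kernel = "lin_kernel")

# Predictions are ranking scores rather than survival times
rank_pred <- predict(fit_ssvmrank, newdata = veteran)
```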

The hybrid model is repeated below with the ranking components in blue, the regression components in red, and the common components in black, clearly highlighting the composite nature of the model.

\[ \begin{aligned} & \min_{\beta, \beta_0, \xi, \xi', \xi^*} \frac{1}{2}\|\beta\|^2 + \textcolor{blue}{C\sum_{i =1}^n \xi_i} + \textcolor{red}{\mu \sum^n_{i=1}(\xi_i' + \xi_i^*)} \\ & \textrm{subject to} \begin{dcases} & \textcolor{blue}{g(X_i) - g(X_{j(i)}) \geq T_i - T_{j(i)} - \xi_i} \\ & \textcolor{red}{\Delta_i(g(X_i) - T_i) \leq \xi^*_i} \\ & \textcolor{red}{T_i - g(X_i) \leq \xi'_i} \\ & \textcolor{blue}{\xi_i}, \textcolor{red}{\xi_i', \xi_i^*} \geq 0, \quad \forall i = 1,...,n \\ \end{dcases} \end{aligned} \]

and predictions are made with,

\[ \hat{g}(X^*) = \sum^n_{i=1} \textcolor{blue}{\alpha_i(K(X_i, X^*) - K(X_{j(i)}, X^*))} + \textcolor{red}{\alpha^*_i K(X_i, X^*) - \Delta_i\alpha_i'K(X_i, X^*)} + \beta_0 \]

The regularisation hyper-parameters \(C\) and \(\mu\) now have a clear interpretation: \(C\) is the penalty associated with the ranking method and \(\mu\) is the penalty associated with the regression method. By always fitting the hybrid model and tuning these two parameters, there is never a requirement to fit the regression or ranking methods separately, as the tuning procedure can recover either as a special case (by driving \(\mu\) or \(C\) to zero) whenever that is superior. Moreover, the hybrid model retains the interpretability of the regression method and predictions can be interpreted as survival times. The hybrid method is implemented in \(\textbf{survivalsvm}\) as ‘hybrid’. In Van Belle’s own simulation studies, these models do not outperform the Cox PH with respect to Harrell’s C.
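
A hedged sketch of the hybrid fit via \(\textbf{survivalsvm}\) is given below, continuing the veteran example; as understood here, gamma.mu takes two values for the hybrid type, standing in for the two regularisation penalties, and tuning would loop over a grid of such pairs. All values are arbitrary.

```r
# Illustrative hybrid fit: gamma.mu supplies the two regularisation penalties
# (ranking and regression); values, kernel, and diff.meth are arbitrary/assumed.
fit_hybrid <- survivalsvm(Surv(time, status) ~ ., data = veteran,
                          type = "hybrid", diff.meth = "makediff3",
                          gamma.mu = c(0.1, 0.1), opt.meth = "quadprog",
                          kernel = "lin_kernel")

# Predictions are on the survival-time scale (with the caveats discussed in the conclusions)
pred_hybrid <- predict(fit_hybrid, newdata = veteran)
```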

SSVR-MRL

Not all SSVMs can be considered a variant of the SSVM-Hybrid, though all prominent and commonly utilised proposals do seem to have this formulation. One other algorithm of note is termed here the ‘SSVR-MRL’ (Goli, Mahjub, Faradmal, and Soltanian 2016; Goli, Mahjub, Faradmal, Mashayekhi, et al. 2016), which is a regression SSVM. The algorithm is identical to the SVCR apart from one additional penalty term and constraint.

SSVR-MRL\ \[ \begin{aligned} & \min_{\beta, \beta_0, \xi, \xi^*,\xi'} \frac{1}{2}\|\beta\|^2 + C\sum^n_{i=1} (\xi_i + \xi_i^*) + C^*\sum^n_{i=1} \xi_i' \\ & \textrm{subject to} \begin{dcases} & T_i - g(X_i) \leq \xi_i \\ & \Delta_i(g(X_i) - T_i) \leq \xi_i^* \\ & (1 - \Delta_i)(g(X_i) - T_i - MRL(T_i|\hat{S})) \leq \xi_i' \\ & \xi_i, \xi_i^*, \xi_i' \geq 0 \\ & \forall i = 1,...,n \end{dcases} \end{aligned} \]

where \(MRL(T_i|\hat{S})\) is the ‘mean residual lifetime’ function (Klein and Moeschberger 2003) \[ MRL(\tau|\hat{S}) = \frac{\int^\infty_\tau \hat{S}(u) du}{\hat{S}(\tau)} \] which is the area under the estimated survival curve (say by Kaplan-Meier), \(\hat{S}\), from time \(\tau\), divided by the probability of being alive at time \(\tau\). This is interpreted as the expected remaining lifetime from time \(\tau\). On setting \(C^* = 0\) and removing the associated third constraint, this reduces exactly to the SVCR, and similarly if there is no censoring then the standard regression setting is recovered. Only one new hyper-parameter, \(C^*\), is introduced, and Kaplan-Meier estimation should not noticeably impact run-time. There is no evidence of this model being used in practice, nor of any off-the-shelf implementation. Theoretically, the hybrid model could be expanded to include this extra penalty term and constraint (discussed below).
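
For illustration, the mean residual lifetime can be computed from a Kaplan-Meier fit by integrating the estimated step function from \(\tau\) up to the largest observed time and dividing by \(\hat{S}(\tau)\); the sketch below is a naive implementation of this quantity (truncating the integral at the largest observed time and assuming \(\tau\) lies below it), not code from the cited papers.

```r
# Illustrative mean residual lifetime MRL(tau | S-hat) from a Kaplan-Meier fit:
# integrate the KM step function on [tau, max observed time], divide by S-hat(tau).
# Assumes tau is below the largest observed time.
library(survival)

data(veteran, package = "survival")
km <- survfit(Surv(time, status) ~ 1, data = veteran)

km_surv_at <- function(t, km) {
  idx <- which(km$time <= t)
  if (length(idx) == 0) 1 else km$surv[max(idx)]  # S-hat is right-continuous
}

mrl <- function(tau, km) {
  t_max   <- max(km$time)
  knots   <- c(tau, km$time[km$time > tau & km$time < t_max], t_max)
  heights <- vapply(knots[-length(knots)], km_surv_at, numeric(1), km = km)
  sum(heights * diff(knots)) / km_surv_at(tau, km)  # area / S-hat(tau)
}

mrl(100, km)  # expected remaining lifetime given survival to time 100
```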

14.0.3 Conclusions

Several SSVMs have been proposed for survival analysis. These can generally be categorised into ‘regression’ models that adapt SVMs to account for censoring and predict a survival time, ‘ranking’ models that predict a relative ranking in order to optimise measures of discrimination, and ‘hybrid’ models that optimise measures of discrimination whilst making survival time predictions. Other SSVMs that lie outside these groupings are not able to solve the survival task (e.g. (Shiao and Cherkassky 2013)). Other SVM-type approaches could be considered, including relevance vector machines and import vector machines; however, less work has been done in these areas and further consideration is beyond the scope of this book.

The models that have received the most attention are the SVCR, SSVM-Rank, and SSVM-Hybrid; the first two are special cases of SSVM-Hybrid. Judging whether SSVM-Hybrid (and by extension the SVCR and SSVM-Rank) is accessible and transparent is not straightforward. On the one hand, it could be considered transparent as SVMs have been studied for decades and the literature on SSVMs, especially from Van Belle, is extensive. On the other hand, the predictions from SSVM-Hybrid should be interpretable as survival times, but first-hand experience indicates that this is not the case (though this may be due to implementation), which calls into question whether the claimed interpretation is actually correct. For accessibility, there appears to be only one implementation of SSVMs in \(\textsf{R}\) (Fouodo et al. 2018) and only one in Python (Pölsterl 2020), which may be because SSVMs are difficult to implement, even when several optimisation solvers exist off-the-shelf. Finally, there is no evidence that SSVMs outperform the Cox PH or other baseline models, and moreover they often perform worse (Fouodo et al. 2018; Vanya Van Belle, Pelckmans, Van Huffel, et al. 2011), which is also seen in (Sonabend 2021). Yet one cannot dismiss SSVMs outright, as they often require extensive tuning to perform well (even in classification settings) and no benchmark experiment testing SSVMs with the required set-up has yet emerged.