15 Support Vector Machines
This page is a work in progress and minor changes will be made over time.
Support vector machines are a popular class of models in regression and classification settings due to their ability to make accurate predictions for complex, high-dimensional, non-linear data. Survival support vector machines (SSVMs) predict continuous responses that can be used as ranking predictions, with some formulations providing survival time interpretations. This chapter starts with SVMs in the regression setting before moving to adaptations for survival analysis.
15.1 SVMs for Regression
In simple linear regression, the aim is to estimate the line \(y = \alpha + x\beta_1\) by estimating the \(\alpha,\beta_1\) coefficients. As the number of coefficients increases, the goal is instead to estimate the hyperplane, which divides the higher-dimensional space into two separate parts. To visualize a hyperplane, imagine looking from a bird's-eye view at a room that has a dividing wall cutting it into two halves (Figure 15.1). In this view, the room appears to have two dimensions (x=left-right, y=top-bottom) and the divider is a simple line of the form \(y = \alpha + x\beta_1\). In reality, the room has a third dimension (z=up-down) and the divider is therefore a hyperplane of the form \(y = \alpha + x\beta_1 + z\beta_2\).
Continuing the linear regression example, consider a simple model where the objective is to find the \(\boldsymbol{\beta} = ({\beta}_1 \ {\beta}_2 \cdots {\beta}_{p})^\intercal\) coefficients that minimize \(\sum^n_{i=1} (g(\mathbf{x}_i) - y_i)^2\) where \(g(\mathbf{x}_i) = \alpha + \mathbf{x}_i^\intercal\boldsymbol{\beta}\) and \((\mathbf{X}, \mathbf{y})\) is training data such that \(\mathbf{X}\in \mathbb{R}^{n \times p}\) and \(\mathbf{y}\in \mathbb{R}^n\). In a higher-dimensional space, a penalty term can be added for variable selection to reduce model complexity, commonly of the form
\[ \frac{1}{2} \sum^n_{i=1} (g(\mathbf{x}_i) - y_i)^2 + \frac{\lambda}{2} \|\boldsymbol{\beta}\|^2 \]
for some regularization parameter \(\lambda \in \mathbb{R}_{\geq 0}\). Minimizing this error function minimizes the sum of squared differences between predictions and true outcomes while shrinking the coefficients, resulting in a hyperplane that represents the best (regularized) linear relationship between features and outcomes.
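To make this concrete, the penalized objective above has a closed-form minimizer (after centering the data to handle the intercept). The following sketch computes it with NumPy; the synthetic data and the value of \(\lambda\) are illustrative only.

```python
import numpy as np

# Minimal sketch of the penalized (ridge) least-squares objective above,
# using synthetic data purely for illustration.
rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 1.0  # regularization weight lambda (illustrative value)

# Closed-form minimizer of 1/2 * sum_i (x_i'beta - y_i)^2 + lambda/2 * ||beta||^2,
# with the intercept handled by centering.
Xc, yc = X - X.mean(axis=0), y - y.mean()
beta_hat = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
alpha_hat = y.mean() - X.mean(axis=0) @ beta_hat
print(beta_hat, alpha_hat)
```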
Similarly to linear regression, support vector machines (SVMs) (Cortes and Vapnik 1995) also fit a hyperplane, \(g\), to given training data, \(\mathbf{X}\). However, in SVMs, the goal is to fit a flexible (non-linear) hyperplane that minimizes the difference between predictions and the truth for individual observations. A core feature of SVMs is that one does not try to fit a hyperplane that makes perfect predictions, as this would overfit the training data and is unlikely to perform well on unseen data. Instead, SVMs use a regularized error function, which allows incorrect predictions (errors) for some observations, with the tolerated magnitude of error controlled by an \(\epsilon>0\) parameter as well as slack variables, \(\boldsymbol{\xi'} = ({\xi'}_1 \ {\xi'}_2 \cdots {\xi'}_{n})^\intercal\) and \(\boldsymbol{\xi^*} = ({\xi^*}_1 \ {\xi^*}_2 \cdots {\xi^*}_{n})^\intercal\):
\[ \begin{aligned} & \min_{\boldsymbol{\beta},\alpha, \boldsymbol{\xi}', \boldsymbol{\xi}^*} \frac{1}{2} \|\boldsymbol{\beta}\|^2 + C \sum^n_{i=1}(\xi'_i + \xi_i^*) \\ & \textrm{subject to} \begin{dcases} g(\mathbf{x}_i) & \geq y_i -\epsilon - \xi'_i \\ g(\mathbf{x}_i) & \leq y_i + \epsilon + \xi_i^* \\ \xi'_i, \xi_i^* & \geq 0 \\ \end{dcases} \end{aligned} \tag{15.1}\]
\(\forall i\in 1,...,n\) where \(g(\mathbf{x}_i) = \alpha + \mathbf{x}_i^\intercal\boldsymbol{\beta}\) for model weights \(\boldsymbol{\beta}\in \mathbb{R}^p\) and \(\alpha \in \mathbb{R}\) and the same training data \((\mathbf{X}, \mathbf{y})\) as above.
Figure 15.2 visualizes a support vector regression model in two dimensions. The red circles are values within the \(\epsilon\)-tube and are thus considered to have a negligible error. In fact, the red circles do not affect the fitting of the optimal line \(g\): even if they moved around, as long as they remain within the tube, the shape of \(g\) would not change. In contrast, the blue diamonds have an unacceptable margin of error – for example, the top blue diamond will have \(\xi'_i > 0\) but \(\xi_i^* = 0\), thus influencing the estimation of \(g\). Points on or outside the \(\epsilon\)-tube are referred to as support vectors as they affect the construction of the hyperplane. The \(C \in \mathbb{R}_{>0}\) hyperparameter controls the penalty on the slack variables: as \(C\) increases, errors are penalized more heavily and the model fits the training data more closely, giving low bias but high variance and a greater risk of overfitting; in contrast, a lower \(C\) tolerates more errors (and hence more support vectors), giving a smoother fit with higher bias but lower variance (Hastie, Tibshirani, and Friedman 2001). \(C\) should be tuned to control this trade-off.
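To see the roles of \(C\) and \(\epsilon\) in practice, the short sketch below fits two support vector regression models with scikit-learn on synthetic one-dimensional data; the dataset and the specific parameter values are illustrative only and are not taken from this chapter.

```python
import numpy as np
from sklearn.svm import SVR

# Illustrative one-dimensional data: a noisy non-linear signal.
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=80)

# A wide epsilon-tube with small C tolerates many deviations (more support
# vectors, smoother fit); a narrow tube with large C tracks the training data
# closely and risks overfitting.
smooth = SVR(kernel="rbf", C=0.1, epsilon=0.5).fit(X, y)
flexible = SVR(kernel="rbf", C=100.0, epsilon=0.01).fit(X, y)
print(len(smooth.support_), "vs", len(flexible.support_), "support vectors")
```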
The other core feature of SVMs is exploiting the kernel trick, which uses functions known as kernels to allow the model to learn a non-linear hyperplane whilst keeping the computations in the original, lower-dimensional input space. Once the model coefficients have been estimated using the optimization above, predictions for a new observation \(\mathbf{x}^*\) can be made using a function of the form
\[ \hat{g}(\mathbf{x}^*) = \sum^n_{i=1} \mu_iK(\mathbf{x}^*,\mathbf{x}_i) + \alpha \tag{15.2}\]
Details (including estimation) of the \(\mu_i\) Lagrange multipliers are beyond the scope of this book; references are given at the end of this chapter for the interested reader. \(K\) is a kernel function, with common choices including the linear kernel, \(K(\mathbf{x}^*,\mathbf{x}_i) = \sum^p_{j=1} x_{ij}x^*_j\), the radial kernel, \(K(\mathbf{x}^*,\mathbf{x}_i) = \exp(-\omega\sum^p_{j=1} (x_{ij} - x^*_j)^2)\) for some \(\omega \in \mathbb{R}_{>0}\), and the polynomial kernel, \(K(\mathbf{x}^*,\mathbf{x}_i) = (1 + \sum^p_{j=1} x_{ij}x^*_j)^d\) for some \(d \in \mathbb{N}_{> 0}\).
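The kernels above are simple to implement directly; the sketch below codes them with NumPy and evaluates the prediction form of Equation 15.2, where the multiplier values \(\mu_i\) and intercept \(\alpha\) are made up for illustration (in practice they come from solving the SVM optimization problem).

```python
import numpy as np

def linear_kernel(x_new, x_i):
    return x_new @ x_i

def radial_kernel(x_new, x_i, omega=1.0):
    return np.exp(-omega * np.sum((x_i - x_new) ** 2))

def polynomial_kernel(x_new, x_i, d=2):
    return (1 + x_new @ x_i) ** d

def predict(x_new, X_train, mu, alpha, kernel):
    # Equation 15.2: g_hat(x*) = sum_i mu_i * K(x*, x_i) + alpha
    return sum(m * kernel(x_new, x_i) for m, x_i in zip(mu, X_train)) + alpha

# Illustrative values only: mu and alpha would come from the fitted model.
X_train = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, -1.0]])
mu = np.array([0.4, -0.1, 0.2])
print(predict(np.array([1.0, 1.0]), X_train, mu, alpha=0.5, kernel=radial_kernel))
```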
The choice of kernel and its parameters, the regularization parameter \(C\), and the acceptable error \(\epsilon\), are all tunable hyperparameters, which makes the support vector machine a highly adaptable and often well-performing machine learning method. The parameters \(C\) and \(\epsilon\) often have no clear a priori meaning (especially in the survival setting, where predictions may be abstract rankings) and thus require tuning over a wide range of values; no tuning usually results in a poor model fit (Probst, Boulesteix, and Bischl 2019).
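In practice this tuning is usually performed by cross-validated grid or random search. A minimal sketch with scikit-learn follows; the synthetic data and candidate values are arbitrary and purely illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=60)

# C and epsilon have no clear a priori meaning, so candidate values span
# several orders of magnitude; this grid is purely illustrative.
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "epsilon": [0.01, 0.1, 0.5, 1.0],
    "gamma": ["scale", 0.1, 1.0],  # radial kernel parameter
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```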
15.2 SVMs for Survival Analysis
Extending SVMs to the survival domain (SSVMs) is a case of: i) identifying the quantity to predict; and ii) updating the optimization problem (Equation 15.1) and prediction function (Equation 15.2) to accommodate censoring. In the first case, SSVMs can be used to make either survival time or ranking predictions, which are discussed in turn. The notation above is reused below for SSVMs, with additional notation introduced when required, now using the survival training data \((\mathbf{X}, \mathbf{t}, \boldsymbol{\delta})\).
15.2.1 Survival time SSVMs
To begin, consider the objective for support vector regression with the \(y\) variable replaced with the usual survival time outcome \(t\), for all \(i\in 1,...,n\):
\[ \begin{aligned} & \min_{\boldsymbol{\beta}, \alpha, \boldsymbol{\xi}', \boldsymbol{\xi}^*} \frac{1}{2} \|\boldsymbol{\beta}\|^2 + C \sum^n_{i=1}(\xi'_i + \xi_i^*) \\ & \textrm{subject to} \begin{dcases} g(\mathbf{x}_i) & \geq t_i -\epsilon - \xi'_i \\ g(\mathbf{x}_i) & \leq t_i + \epsilon + \xi_i^* \\ \xi'_i, \xi_i^* & \geq 0 \\ \end{dcases} \end{aligned} \tag{15.3}\]
In survival analysis, this translates to fitting a hyperplane in order to predict the true survival time. However, as with all adaptations from regression to survival analysis, there needs to be a method for incorporating censoring.
Recall the \((t_l, t_u)\) notation to describe censoring as introduced in Chapter 4 such that the outcome occurs within the range \((t_l, t_u)\). Let \(\tau \in \mathbb{R}_{>0}\) be some known time-point, then an observation is:
- left-censored if the survival time is less than \(\tau\): \((t_l, t_u) = (-\infty, \tau)\);
- right-censored if the true survival time is greater than \(\tau\): \((t_l, t_u) = (\tau, \infty)\); or
- uncensored if the true survival time is known to be \(\tau\): \((t_l, t_u) = (\tau, \tau)\).
Define \(\mathcal{L}= \{i: t_{l,i} > -\infty\}\) as the set of observations with a finite lower-bounded time, which can be seen above to be those that are right-censored or uncensored. Define \(\mathcal{U}= \{i: t_{u,i} < \infty\}\) as the analogous set of observations with a finite upper-bounded time, which are those that are left-censored or uncensored.
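As a concrete illustration, the sets \(\mathcal{L}\) and \(\mathcal{U}\) can be built directly from outcomes stored as lower and upper bounds; the toy data below are made up.

```python
import numpy as np

# Toy outcomes as (lower, upper) bounds: uncensored -> (tau, tau),
# left-censored -> (-inf, tau), right-censored -> (tau, inf).
t_lower = np.array([2.0, -np.inf, 5.0, 1.0])
t_upper = np.array([2.0, 3.0, np.inf, 1.0])

L = np.where(np.isfinite(t_lower))[0]  # finite lower bound: uncensored or right-censored
U = np.where(np.isfinite(t_upper))[0]  # finite upper bound: uncensored or left-censored
print(L, U)
```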
Consider these definitions in the context of the constraints in Equation 15.3. The first constraint ensures the hyperplane is greater than some lower bound created by subtracting the slack variable from the true outcome – given the set definitions above, this constraint only has meaning for observations with a finite lower bound, \(i \in \mathcal{L}\), otherwise the constraint would include \(g(\mathbf{x}_i) \geq -\infty\), which is clearly not useful. Similarly, the second constraint ensures the hyperplane is less than some upper bound, which again can only be meaningful for observations \(i \in \mathcal{U}\). Restricting the constraints in this way leads to the optimization problem below (Shivaswamy, Chu, and Jansche 2007), visualized in Figure 15.3:
\[ \begin{aligned} & \min_{\boldsymbol{\beta}, \alpha, \boldsymbol{\xi}', \boldsymbol{\xi}^*} \frac{1}{2}\|\boldsymbol{\beta}\|^2 + C\Big(\sum_{i \in \mathcal{L}} \xi'_i + \sum_{i \in \mathcal{U}} \xi_i^*\Big) \\ & \textrm{subject to} \begin{dcases} g(\mathbf{x}_i) & \geq t_i -\xi'_i, i \in \mathcal{L}\\ g(\mathbf{x}_i) & \leq t_i + \xi^*_i, i \in \mathcal{U}\\ \xi'_i \geq 0, \forall i\in \mathcal{L}; \xi^*_i \geq 0, \forall i \in \mathcal{U} \end{dcases} \end{aligned} \]
If no observations are censored then this optimization reduces to the regression optimization in Equation 15.1 with \(\epsilon = 0\). Note that in SSVMs, the \(\epsilon\) parameters are typically removed to better accommodate censoring and so that over- and under-predictions need not be penalized identically. In contrast to this formulation, one could introduce additional \(\epsilon\) and \(C\) parameters to separate under- from over-predictions and right- from left-censoring, however this leads to eight tunable hyperparameters, which is inefficient and may increase overfitting (Fouodo et al. 2018; Land et al. 2011). The algorithm can be simplified to right-censoring only by removing the second constraint entirely for censored observations:
\[ \begin{aligned} & \min_{\boldsymbol{\beta}, \alpha, \boldsymbol{\xi}', \boldsymbol{\xi}^*} \frac{1}{2}\|\boldsymbol{\beta}\|^2 + C \sum_{i = 1}^n (\xi'_i + \xi_i^*) \\ & \textrm{subject to} \begin{dcases} g(\mathbf{x}_i) & \geq t_i - \xi'_i \\ g(\mathbf{x}_i) & \leq t_i + \xi^*_i, i:\delta_i=1 \\ \xi'_i, \xi_i^* & \geq 0 \\ \end{dcases} \end{aligned} \]
\(\forall i\in 1,...,n\). The prediction for a new observation \(\mathbf{x}^*\) is then calculated as
\[ \hat{g}(\mathbf{x}^*) = \sum^n_{i=1} \mu^*_i K(\mathbf{x}_i, \mathbf{x}^*) - \delta_i\mu_i'K(\mathbf{x}_i, \mathbf{x}^*) + \alpha \]
where again \(K\) is a kernel function and the calculation of the Lagrange multipliers is beyond the scope of this book.
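One way to read the right-censored formulation above is as an asymmetric, hinge-style penalty on the slacks: under-predictions are penalized for every observation (the event is known to happen at or after the observed time), whereas over-predictions are only penalized for uncensored observations. The sketch below implements this empirical penalty with NumPy; the function name and toy data are illustrative, and the \(\frac{1}{2}\|\boldsymbol{\beta}\|^2\) term and the actual constrained optimization are omitted.

```python
import numpy as np

def right_censored_slack_penalty(g, t, delta, C=1.0):
    """Slack penalty implied by the right-censored constraints above.

    Under-predictions (g < t) are penalized for all observations, since the
    event occurs at or after the observed time; over-predictions (g > t) are
    only penalized when the event was observed (delta == 1).
    """
    under = np.maximum(0.0, t - g)          # xi' slack, all observations
    over = delta * np.maximum(0.0, g - t)   # xi* slack, uncensored only
    return C * np.sum(under + over)

# Toy example: the censored observation (delta = 0) is not penalized for
# predicting a time later than its censoring time.
t = np.array([5.0, 3.0, 7.0])
delta = np.array([1, 0, 1])
g = np.array([6.0, 6.0, 6.5])
print(right_censored_slack_penalty(g, t, delta))
```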
15.2.2 Ranking SSVMs
Support vector machines can be used to estimate rankings by penalizing discordant predictions. Recall the definition of concordance from Chapter 8: ranking predictions for a pair of comparable observations \((i, j)\), where \(t_i < t_j \wedge \delta_i = 1\), are called concordant if \(r_i > r_j\), where \(r_i, r_j\) are the predicted ranks for observations \(i\) and \(j\) respectively and a higher value implies greater risk. Using the prognostic index as a ranking prediction (Section 6.2), a pair of observations is concordant if \(g(\mathbf{x}_i) > g(\mathbf{x}_j)\) when \(t_i < t_j\), leading to:
\[ \begin{aligned} & \min_{\boldsymbol{\beta}, \alpha, \boldsymbol{\xi}} \frac{1}{2}\|\boldsymbol{\beta}\|^2 + \gamma\sum_{i =1}^n \xi_i \\ & \textrm{subject to} \begin{dcases} g(\mathbf{x}_i) - g(\mathbf{x}_j) & \geq -\xi_i, \forall (i,j) \in CP \\ \xi_i & \geq 0, i = 1,...,n \\ \end{dcases} \end{aligned} \]
where \(CP\) is the set of comparable pairs defined by \(CP = \{(i, j) : t_i < t_j \wedge \delta_i = 1\}\). Given the number of pairs, the optimization problem quickly becomes difficult to solve with a very long runtime. To solve this problem, Van Belle et al. (2011) found an efficient reduction that sorts observations in order of outcome time and then compares each observation \(i\) with the observation that has the next-smallest survival time, skipping over censored observations; formally, \(j(i) := \mathop{\mathrm{arg\,max}}_{j \in 1,...,n} \{t_j : t_j < t_i \wedge \delta_j = 1\}\). This is visualized in Figure 15.4, where six observations are sorted by outcome time from smallest (left) to largest (right). Starting from right to left, each observation is paired with the first uncensored outcome to its left, continuing until the start of the sequence. In order for all observations to be used in the optimization, the algorithm treats the first outcome as uncensored, hence observation \(2\) is compared to observation \(1\).
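The reduction itself is straightforward to implement; the sketch below computes \(j(i)\) for each observation with NumPy, treating the earliest outcome as uncensored as described above (the function name and toy data are illustrative).

```python
import numpy as np

def nearest_uncensored_predecessor(t, delta):
    """For each observation i, find j(i): the observation with the largest
    outcome time strictly smaller than t[i], skipping censored observations.
    The earliest outcome is treated as uncensored; observations with no
    valid comparator get -1."""
    t, delta = np.asarray(t, float), np.asarray(delta).copy()
    delta[np.argmin(t)] = 1  # treat the earliest outcome as uncensored
    j = np.full(len(t), -1, dtype=int)
    for i in range(len(t)):
        candidates = np.where((t < t[i]) & (delta == 1))[0]
        if len(candidates) > 0:
            j[i] = candidates[np.argmax(t[candidates])]
    return j

# Toy data: times and event indicators (1 = event observed, 0 = censored).
t = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
delta = np.array([0, 1, 0, 1, 0, 1])
print(nearest_uncensored_predecessor(t, delta))  # [-1  0  1  1  3  3]
```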
Using this reduction, the algorithm becomes
\[ \begin{aligned} & \min_{\boldsymbol{\beta}, \alpha, \boldsymbol{\xi}} \frac{1}{2}\|\boldsymbol{\beta}\|^2 + \gamma\sum_{i =1}^n \xi_i \\ & \textrm{subject to} \begin{dcases} g(\mathbf{x}_i) - g(\mathbf{x}_{j(i)}) & \geq t_i - t_{j(i)} - \xi_i \\ \xi_i & \geq 0 \end{dcases} \end{aligned} \]
\(\forall i = 1,...,n\). Note the updated right hand side of the constraint, which plays a similar role to the \(\epsilon\) parameter by allowing ‘mistakes’ in predictions without penalty.
Predictions for a new observation \(\mathbf{x}^*\) are calculated as,
\[ \hat{g}(\mathbf{x}^*) = \sum^n_{i=1} \mu_i(K(\mathbf{x}_i, \mathbf{x}^*) - K(\mathbf{x}_{j(i)}, \mathbf{x}^*)) + \alpha \]
where \(\mu_i\) are again Lagrange multipliers.
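Given \(j(i)\) and a kernel, this prediction is a direct sum; a minimal NumPy sketch follows, with made-up multipliers and intercept (in practice these come from solving the dual problem) and a radial kernel chosen only for illustration.

```python
import numpy as np

def rbf_kernel(a, b, omega=1.0):
    return np.exp(-omega * np.sum((a - b) ** 2))

def predict_ranking_ssvm(x_new, X_train, j, mu, alpha, kernel=rbf_kernel):
    # g_hat(x*) = sum_i mu_i * (K(x_i, x*) - K(x_{j(i)}, x*)) + alpha
    return sum(
        mu[i] * (kernel(X_train[i], x_new) - kernel(X_train[j[i]], x_new))
        for i in range(len(mu))
    ) + alpha

# Illustrative values only; mu and alpha are not estimated here.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
j = np.array([0, 0, 1, 2])   # comparator indices from the reduction above
mu = np.array([0.0, 0.3, 0.2, 0.5])
print(predict_ranking_ssvm(np.array([1.5]), X_train, j, mu, alpha=0.0))
```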
15.2.3 Hybrid SSVMs
Finally, Van Belle et al. (2011) noted that the ranking algorithm could be updated to add the constraints of the regression model, thus providing a model that simultaneously optimizes for ranking whilst providing continuous values that can be interpreted as survival time predictions. This results in the hybrid SSVM with constraints \(\forall i = 1,...,n\):
\[ \begin{aligned} & \min_{\boldsymbol{\beta}, \alpha, \boldsymbol{\xi}, \boldsymbol{\xi}', \boldsymbol{\xi}^*} \frac{1}{2}\|\boldsymbol{\beta}\|^2 + \textcolor{CornflowerBlue}{\gamma\sum_{i =1}^n \xi_i} + \textcolor{Rhodamine}{C \sum^n_{i=1}(\xi_i' + \xi_i^*)} \\ & \textrm{subject to} \begin{dcases} \textcolor{CornflowerBlue}{g(\mathbf{x}_i) - g(\mathbf{x}_{j(i)})} & \textcolor{CornflowerBlue}{\geq t_i - t_{j(i)} - \xi_i} \\ \textcolor{Rhodamine}{g(\mathbf{x}_i)} & \textcolor{Rhodamine}{\leq t_i + \xi^*_i, i:\delta_i=1} \\ \textcolor{Rhodamine}{g(\mathbf{x}_i)} & \textcolor{Rhodamine}{\geq t_i - \xi'_i} \\ \textcolor{CornflowerBlue}{\xi_i}, \textcolor{Rhodamine}{\xi_i', \xi_i^*} & \geq 0 \\ \end{dcases} \end{aligned} \]
The blue parts of the equation make up the ranking model and the red parts the regression model. \(\gamma\) is the penalty associated with the ranking objective and \(C\) is the penalty associated with the regression objective. Setting \(\gamma = 0\) results in the regression SSVM and \(C = 0\) results in the ranking SSVM. Hence, fitting the hybrid model and tuning these parameters is an efficient way to automatically detect which SSVM is best suited to a given task.
Once the model is fit, a prediction for given features \(\mathbf{x}^* \in \mathbb{R}^p\) can be made using the equation below, again with the ranking and regression contributions highlighted in blue and red respectively.
\[ \hat{g}(\mathbf{x}^*) = \sum^n_{i=1} \textcolor{CornflowerBlue}{\mu_i(K(\mathbf{x}_i, \mathbf{x}^*) - K(\mathbf{x}_{j(i)}, \mathbf{x}^*))} + \textcolor{Rhodamine}{\mu^*_i K(\mathbf{x}_i, \mathbf{x}^*) - \delta_i\mu_i'K(\mathbf{x}_i, \mathbf{x}^*)} + \alpha \]
where \(\mu_i, \mu_i^*, \mu_i'\) are Lagrange multipliers and \(K\) is a chosen kernel function, which may have further hyperparameters to select or tune.
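As a software note, off-the-shelf implementations of these models exist. For example, scikit-survival provides fast (kernel) survival SVMs in which a single `rank_ratio` parameter plays a role broadly analogous to the \(\gamma\)/\(C\) trade-off above, with `rank_ratio = 1` giving a pure ranking objective and `rank_ratio = 0` a pure regression objective. The sketch below is based on scikit-survival's documented interface and synthetic right-censored data; it is illustrative only and the package's internal formulation is not identical to the equations in this chapter.

```python
import numpy as np
from sksurv.svm import FastKernelSurvivalSVM
from sksurv.util import Surv

# Synthetic right-censored data, purely for illustration.
rng = np.random.default_rng(7)
n, p = 200, 4
X = rng.normal(size=(n, p))
risk = X[:, 0] - 0.5 * X[:, 1]
event_time = rng.exponential(scale=np.exp(-risk))
cens_time = rng.exponential(scale=2.0, size=n)
y = Surv.from_arrays(event=event_time <= cens_time,
                     time=np.minimum(event_time, cens_time))

# rank_ratio = 0.5 weights the ranking and regression objectives equally;
# tuning it (along with the kernel and regularization) selects between a
# ranking, regression, or hybrid SSVM.
model = FastKernelSurvivalSVM(alpha=1.0, rank_ratio=0.5, kernel="rbf",
                              optimizer="rbtree", max_iter=50, tol=1e-5,
                              random_state=0)
model.fit(X, y)
print(model.predict(X[:5]))
```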