5  What are Survival Measures?

Abstract
TODO (150-200 WORDS)
This page is a work in progress and minor changes will be made over time.

In this part of the book we discuss one of the most important parts of the machine learning workflow: model evaluation (Foss and Kotthoff 2024). The next few chapters introduce different metrics that can be used to measure a model’s performance, but before that we briefly discuss why model evaluation is so important.

In the simplest case, without evaluation there is no way to know if predictions from a trained machine learning model are any good. Whether one uses a simple Kaplan-Meier estimator, a complex neural network, or anything in between, there is no guarantee any of these methods will actually make useful predictions for a given dataset. This could be because the dataset is inherently difficult for any model to learn from, perhaps because it is very ‘noisy’, or because a model is simply ill-suited to the task, for example using a Cox Proportional Hazards model when its proportional hazards assumption is violated. Evaluation is therefore crucial for trusting any predictions made by a model.

5.1 Survival Measures

Evaluation can be used to assess in-sample and out-of-sample performance.

In-sample evaluation measures the quality of a model’s ‘fit’ to data, i.e., whether the model has accurately captured relationships in the training data. However, in-sample measures often cannot be applied to complex machine learning models, so this part of the book omits these measures. Readers who are interested in this area are directed to Collett (2014) and Hosmer Jr, Lemeshow, and May (2011) for discussion on residuals; Choodari-Oskooei, Royston, and Parmar (2012), Kent and O’Quigley (1988) and Royston and Sauerbrei (2004) for \(R^2\)-type measures; and finally Volinsky and Raftery (2000), Hurvich and Tsai (1979), and Liang and Zou (2008) for information criterion measures.

Out-of-sample measures evaluate the quality of model predictions on new data that was previously unseen by the model. By following established statistical procedures for evaluation and ensuring that robust resampling methods are used (James et al. 2013), one can estimate the ‘generalisation error’, which is the expected model performance on new datasets. This is an important concept as it provides confidence about future model performance without limiting results to the current data. Survival measures are classified into measures of:

  • Discrimination (aka ‘separation’) – A model’s discriminatory power refers to how well it separates observations at higher or lower risk of the event. For example, a model with good discrimination will predict that (at a given time) a patient who has died has a higher probability of death than a patient who is still alive.
  • Calibration – Calibration is a loosely defined concept (Collins et al. 2014; Harrell, Lee, and Mark 1996; Rahman et al. 2017; Van Houwelingen 2000) that generally refers to how well a model captures average relationships between predicted and observed values.
  • Predictive Performance – A model is said to have good predictive performance (or sometimes ‘predictive accuracy’) if its predictions for new data are ‘close to’ the truth.

These measures can also be categorised by how they evaluate predictions. Discrimination measures compare predictions pairwise: pairs of observations are formed and the predicted ranking within each pair is compared to the observed outcome. Calibration measures evaluate predictions holistically, looking at some ‘average’ performance across them to give an idea of how well suited the model is to the data. Measures of predictive performance evaluate individual predictions and usually take the sample mean of these scores to estimate the generalisation error.
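
To make the pairwise idea concrete, below is a minimal sketch (in Python, with hypothetical toy data) of Harrell’s concordance index, a common discrimination measure: it counts the comparable pairs of observations for which the predicted risk ranking agrees with the observed ordering of event times.

```python
# Minimal sketch of Harrell's concordance index (C-index) to illustrate
# pairwise evaluation of discrimination. Toy data below are hypothetical.

def concordance_index(times, events, risk_scores):
    """Proportion of comparable pairs whose predicted risk ranking matches
    the observed ordering of event times (tied risks count as 0.5)."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # pair (i, j) is comparable if i has an observed event
            # before j's (possibly censored) time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # higher risk -> earlier event
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied risk scores
    return concordant / comparable

# toy example: 1 = event observed, 0 = censored
times  = [2, 5, 6, 8, 10]
events = [1, 1, 0, 1, 0]
risk   = [0.9, 0.7, 0.4, 0.5, 0.1]   # higher = predicted higher risk
print(concordance_index(times, events, risk))  # 1.0 for a perfect ranking
```

A value of around 0.5 corresponds to a random ranking and 1 to perfect discrimination.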

In the next few chapters we categorise measures by the type of survival prediction they evaluate, which is a more natural taxonomy for selecting measures, but we use the above categories when introducing each measure.

5.2 How are Models Evaluated?

As well as using measures to evaluate a model’s performance on a given dataset, evaluation can also be used to estimate future performance, to compare and select models, and to tune internal processes. In most cases, a model should not be trained, used for prediction, and evaluated on its own; instead, a number of simpler reference models should be trained and evaluated simultaneously on the same data, which is known as a ‘benchmark experiment’. This is especially important for survival models, as all survival measures depend on the censoring distribution and therefore cannot be interpreted out of context or without comparison to other models. Benchmark experiments are used to empirically compare models across the same data and measures, meaning that if one model outperforms another then that model can be selected for future experiments (though simpler models are preferred if the performance difference is marginal). A model is usually said to ‘outperform’ another if it has a lower generalisation error.

The process of model evaluation depends on the measure itself. Measures that are ‘decomposable’ (predictive performance measures) calculate scores for individual predictions and take the sample mean over all scores; ‘aggregate’ measures (discrimination and calibration) instead return a single score over all predictions. The simplest method to estimate the generalisation error is ‘holdout’ resampling, where a dataset \(\mathcal{D}\) is split into non-overlapping subsets for training, \(\mathcal{D}_{train}\), and testing, \(\mathcal{D}_{test}\). The model is trained on \(\mathcal{D}_{train}\) and predictions, \(\hat{\mathbf{y}}\), are made based on the features in \(\mathcal{D}_{test}\). The model is evaluated by using a measure, \(L\), to compare the predictions to the observed data in the test set, \(L(\mathbf{y}_{test}, \hat{\mathbf{y}})\).
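
As an illustration only, the following is a minimal sketch of holdout evaluation, assuming the third-party Python package lifelines and its bundled Rossi recidivism dataset; a Cox model plays the role of the trained model and the concordance index the role of the measure \(L\).

```python
# Minimal holdout sketch, assuming the `lifelines` package is installed.
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi
from lifelines.utils import concordance_index

rossi = load_rossi()                             # D: times ('week'), status ('arrest'), features
train = rossi.sample(frac=0.7, random_state=42)  # D_train
test = rossi.drop(train.index)                   # D_test (non-overlapping)

cph = CoxPHFitter().fit(train, duration_col="week", event_col="arrest")

# Predicted partial hazards: higher values mean higher predicted risk.
# concordance_index expects scores where higher = longer expected survival,
# so the predictions are negated.
risk = cph.predict_partial_hazard(test)
score = concordance_index(test["week"], -risk, test["arrest"])
print(f"Holdout concordance: {score:.3f}")
```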

Where possible, (repeated) k-fold cross-validation (kCV) should be used for more robust estimation of the generalisation error and for model comparison. In kCV, the data is partitioned into \(k\) non-overlapping subsets, or folds (often \(k\) is \(5\) or \(10\)). A model is trained on \(k-1\) folds and evaluated on the held-out fold, and this process is repeated until each of the \(k\) folds has acted as the test set exactly once. The losses computed on each fold are then averaged into a final loss, which provides a good estimate of the generalisation error.
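
The sketch below extends the holdout example to five-fold cross-validation and, in the spirit of a small benchmark experiment, evaluates a Cox model alongside a naive baseline that simply ranks patients by one covariate. It again assumes the lifelines package and the Rossi dataset and is illustrative rather than a recommended workflow.

```python
# Five-fold cross-validation sketch, assuming `lifelines` is installed.
import numpy as np
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi
from lifelines.utils import concordance_index

rossi = load_rossi()
k = 5
rng = np.random.default_rng(1)
folds = rng.permutation(len(rossi)) % k          # assign each row to one of k folds

cox_scores, baseline_scores = [], []
for i in range(k):
    train, test = rossi[folds != i], rossi[folds == i]

    # Cox model trained on the k-1 training folds, evaluated on the held-out fold
    cph = CoxPHFitter().fit(train, duration_col="week", event_col="arrest")
    cox_risk = cph.predict_partial_hazard(test)
    cox_scores.append(concordance_index(test["week"], -cox_risk, test["arrest"]))

    # Naive baseline: rank by number of prior convictions only (no model fitting)
    baseline_scores.append(
        concordance_index(test["week"], -test["prio"], test["arrest"])
    )

# Averaging over folds estimates the generalisation error for each model
print("Cox:     ", np.mean(cox_scores))
print("Baseline:", np.mean(baseline_scores))
```

Here a higher average concordance indicates better discrimination; in a full benchmark experiment several candidate models would be compared on the same folds and measures, with the simpler model preferred when the difference is marginal.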

For the rest of this part of the book we will introduce different survival measures, discuss their advantages and disadvantages, and in Chapter 10 we will provide some recommendations for choosing measures. We will not discuss the general process of model resampling or evaluation further but recommend Casalicchio and Burk (2024) to readers interested in this topic.