9  Distance Measures

Abstract
This chapter discusses measures of the distance between survival time predictions and the ground truth event times. It explains why such evaluations are fundamentally problematic in the presence of censoring and should therefore be avoided.

As discussed in Chapter 5, evaluating the distance between survival time point predictions and the ground truth event times is fundamentally problematic when censoring is present. In particular, there is no principled way to define a metric that accurately reflects the average discrepancy between predicted survival times and all event times. Restricting evaluation to uncensored observations and applying standard regression measures such as the mean absolute error (MAE) can introduce substantial bias (Wang et al. 2019). Moreover, Qi et al. (2023) show that naïvely applying the MAE to survival data can induce model performance rankings that are inconsistent with those obtained from the oracle MAE computed using fully observed event times; this remains true even when using inverse probability of censoring–weighted variants of the MAE. A pseudo-observation–based approach has been shown, under controlled simulation settings, to induce the same model rankings as the oracle MAE (Qi et al. 2023). However, this approach is not intended to, nor does it, recover a meaningful measure of absolute distance between predicted survival times and the ground truth event times.
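
As a toy illustration of the bias described above, the following sketch simulates exponential event and censoring times and compares the oracle MAE (computed on all event times) with the MAE restricted to uncensored observations, for two constant predictions. The distributions, sample size, seed, and the names `simulate`, `mae`, `model_a`, and `model_b` are assumptions made purely for illustration; this is not the experimental setup of Wang et al. (2019) or Qi et al. (2023).

```python
import random


def simulate(n, event_mean, censor_mean, seed):
    """Draw (true_event_time, is_uncensored) pairs under independent censoring."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # random.expovariate takes a rate, i.e. 1 / mean.
        event_time = rng.expovariate(1.0 / event_mean)
        censor_time = rng.expovariate(1.0 / censor_mean)
        # The event is observed only if it happens before censoring.
        data.append((event_time, event_time <= censor_time))
    return data


def mae(times, prediction):
    """Mean absolute error of a single constant prediction."""
    return sum(abs(t - prediction) for t in times) / len(times)


# Hypothetical setup: event and censoring times both have mean 10.
data = simulate(n=20000, event_mean=10.0, censor_mean=10.0, seed=1)

# In a simulation the true event time is known for everyone (the "oracle").
all_times = [t for t, _ in data]
# In practice only the uncensored event times are available.
uncensored_times = [t for t, observed in data if observed]

# Two hypothetical models that each predict one constant survival time.
predictions = {"model_a": 7.0, "model_b": 3.5}

oracle = {name: mae(all_times, p) for name, p in predictions.items()}
naive = {name: mae(uncensored_times, p) for name, p in predictions.items()}

for name in predictions:
    print(f"{name}: oracle MAE = {oracle[name]:.2f}, "
          f"uncensored-only MAE = {naive[name]:.2f}")
```

In this setup the uncensored observations are skewed towards short event times, so the restricted MAE favours the model that predicts the shorter time, whereas the oracle MAE favours the other model.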

Given these limitations, this book recommends avoiding distance-based evaluation of survival time predictions. Instead, discrimination-based measures (Chapter 6) may be used to evaluate survival time predictions, provided it is made clear that this evaluates only the relative risk of the event and is not an attempt to quantify the distance between predictions and truth.
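
To make the distinction concrete, the sketch below applies one common discrimination measure, Harrell's concordance index, to survival time predictions. Using this particular measure is an assumption for illustration and it may differ from the variants presented in Chapter 6; the function name `harrell_c` and the example data are hypothetical.

```python
def harrell_c(observed_times, event_indicators, predicted_times):
    """Harrell's C for predicted survival times (longer time = lower risk).

    A pair (i, j) is comparable when i experienced the event and did so
    strictly before the observed time of j. It is concordant when the
    prediction for i is also shorter; tied predictions count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(observed_times)
    for i in range(n):
        if not event_indicators[i]:
            continue  # a censored observation cannot be the earlier member
        for j in range(n):
            if observed_times[i] < observed_times[j]:
                comparable += 1
                if predicted_times[i] < predicted_times[j]:
                    concordant += 1.0
                elif predicted_times[i] == predicted_times[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical data: the second observation is censored at time 4.
observed_times = [2.0, 4.0, 6.0, 8.0]
events = [True, False, True, True]

# Predictions that are ten times too long but correctly ordered.
predicted = [20.0, 40.0, 60.0, 80.0]
# The same predictions inflated by a further factor of ten.
scaled = [10.0 * p for p in predicted]

c_predicted = harrell_c(observed_times, events, predicted)
c_scaled = harrell_c(observed_times, events, scaled)
print(c_predicted, c_scaled)
```

Both sets of predictions receive the same concordance because only their ordering matters, which is why such a measure says nothing about how far the predicted times are from the true event times.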

9.1 Conclusion

Key takeaways
  • Survival time point predictions cannot be evaluated using standard distance-based error measures in a principled way when censoring is present.
  • Existing approaches can support relative comparison of models but do not resolve the fundamental ambiguity of unobserved event times.
  • As a result, discrimination-based evaluation should be preferred, with careful interpretation of what is, and is not, being evaluated.