6 Discrimination

Abstract

TODO (150-200 WORDS)

Minor changes expected!

This page is a work in progress and minor changes will be made over time.

This chapter discusses ‘discrimination measures’, which evaluate how well models separate observations into different risk groups. A model is said to have good discrimination if it correctly predicts that one observation is at higher risk of the event of interest than another, where the prediction is ‘correct’ if the observation predicted to be at higher risk does indeed experience the event sooner.

In the survival setting, the ‘risk’ is taken to be the continuous ranking prediction. All discrimination measures are ranking measures, which means that the exact predicted value is irrelevant, only its relative ordering is required. For example given predictions \(\{100,2,299.3\}\), only their rankings, \(\{2,1,3\}\), are used by measures of discrimination.

This chapter begins with time-independent measures (Section 6.1), which measure concordance between pairs of observations at a single observed time point. The next section focuses on time-dependent measures (Section 6.2), which are primarily AUC-type measures that evaluate discrimination over all possible unique time-points and integrate the results for a single metric.

6.1 Time-Independent Measures

The simplest form of discrimination measures are concordance indices, which, in general, measure the proportion of cases in which the model correctly ranks a pair of observations according to their risk. These measures may be best understood in terms of two key definitions: ‘comparable’, and ‘concordant’.

Definition 6.1 (Concordance) Let \((i,j)\) be a pair of observations with outcomes \(\{(t_i,\delta_i),(t_j,\delta_j)\}\) and let \(r_i,r_j \in \mathbb{R}\) be their respective risk predictions. Then \((i,j)\) are called (F. E. J. Harrell et al. 1984; F. E. Harrell, Califf, and Pryor 1982):

Comparable if \(t_i < t_j\) and \(\delta_i = 1\);
Concordant if \(r_i > r_j\).

Note that this book defines risk rankings such that a higher value implies higher risk of event and thus lower expected survival time (Chapter 5), hence a pair is concordant if \(\mathbb{I}(t_i < t_j, r_i > r_j)\). Other sources may instead assume that higher values imply lower risk of event and hence a pair would be concordant if \(\mathbb{I}(t_i < t_j, r_i < r_j)\).

Concordance measures then estimate the probability of a pair being concordant, given that they are comparable:

\[ P(r_i > r_j | t_i < t_j \cap \delta_i) \tag{6.1}\]

These measures are referred to as time independent when \(r_i,r_j\) is not a function of time as once the observations are organized into comparable pairs, the observed survival times can be ignored. The time-dependent case is covered in Section 6.2.1.

While various definitions of a ‘Concordance index’ (C-index) exist (discussed in the next section), they all represent a weighted proportion of the number of concordant pairs over the number of comparable pairs. As such, a C-index value will always be between \([0, 1]\) with \(1\) indicating perfect separation, \(0.5\) indicating no separation, and \(0\) being separation in the ‘wrong direction’, i.e. all high risk observations being ranked lower than all low risk observations.

Concordance measures may either be reported as a value in \([0,1]\), a percentage, or as ‘discriminatory power’, which refers to the percentage improvement of a model’s discrimination above the baseline value of \(0.5\). For example, if a model has a concordance of \(0.8\) then its discriminatory power is \((0.8-0.5)/0.5 = 60\%\). This representation of discrimination provides more information by encoding the model’s improvement over some baseline although is often confused with reporting concordance as a percentage (e.g. reporting a concordance of 0.8 as 80%). In theory this representation could result in a negative value, however this would indicate that \(C<0.5\), which would indicate serious problems with the model that should be addressed before proceeding with further analysis. Representing measures as a percentage over a baseline is a common method to improve measure interpretability and closely relates to the ERV representation of scoring rules (Section 8.4).

6.1.1 Concordance Indices

Common concordance indices in survival analysis can be expressed as a general measure:

Let \(\boldsymbol{\hat{r}} = ({\hat{r}}_1 \ {\hat{r}}_2 \cdots {\hat{r}}_{n})^\top\) be predicted risks, \((\mathbf{t}, \boldsymbol{\delta}) = ((t_1, \delta_1) \ (t_2, \delta_2) \cdots (t_n, \delta_n))^\top\) be observed outcomes, let \(W\) be some weighting function, and let \(\tau\) be a cut-off time. Then, the time-independent (‘ind’) survival concordance index is defined by,

\[ C_{ind}(\hat{\mathbf{r}}, \mathbf{t}, \boldsymbol{\delta}|\tau) = \frac{\sum_{i\neq j} W(t_i)\mathbb{I}(t_i < t_j, \hat{r}_i > \hat{r}_j, t_i < \tau)\delta_i}{\sum_{i\neq j}W(t_i)\mathbb{I}(t_i < t_j, t_i < \tau)\delta_i} \tag{6.2}\]

The choice of \(W\) specifies a particular variation of the c-index (see below). The use of the cut-off \(\tau\) mitigates against decreased sample size (and therefore high variance) over time due to the removal of censored observations (see Figure 6.1). For \(\tau\) to be comparable across datasets, a common choice would be to set \(\tau\) as the time at which 80%, or perhaps 90% of the data have been censored or experienced the event.

There are multiple methods for dealing with tied predictions and times. Strictly, tied times are incomparable given the definition of ‘comparable’ given above and hence are usually ignored in the numerator. On the other hand, ties in the prediction are more problematic but a common method is to set a value of \(0.5\) for observations when \(r_i = r_j\) (Therneau and Atkinson 2024). Specific concordance indices can be constructed by assigning a weighting scheme for \(W\) which generally depends on the Kaplan-Meier estimate of the survival function of the censoring distribution fit on training data, \(\hat{G}_{KM}\), or the Kaplan-Meier estimate for the survival function of the survival distribution fit on training data, \(\hat{S}_{KM}\), or both. Measures that use \(\hat{G}_{KM}\) are referred to as Inverse Probability of Censoring Weighted (IPCW) measures as the estimated censoring distribution is utilised to weight the measure in order to compensate for removed censored observations. This is visualised in Figure 6.1 where \(\hat{G}_{KM}\), \(\hat{G}_{KM}^{-2}\), and \(\hat{S}_{KM}\) are computed based on the whas dataset (Hosmer Jr, Lemeshow, and May 2011).

Line graph with three lines in green, red, and blue. x-axis is labelled 't' and ranges from 0 to 6000. y-axis is labelled 'W(t)' and ranges from 0 to 5. Legend for lines is titled 'W' with entries 'KMG' for the red line, 'KMG^-2' for the green line, and 'KMS' for the blue line. The blue line starts at (0,1), moves to around (1000, 0.5) then is relatively flat. The red line roughly linearly decreases from (0,1) to (6000,0). The green line sharply increases between (1, 1000) to (5, 3500). A vertical gray line passes through (0, 1267). — Figure 6.1: Weighting functions obtained on the `whas` dataset. x-axis is follow-up time. y-axis is outputs from one of three weighting functions: \(\hat{G}_{KM}\), survival function based on the censoring distribution of the `whas` dataset (red), and \(\hat{G}_{KM}^{-2}\) (green), \(\hat{S}_{KM}\), marginal survival function based on original `whas` dataset (blue), . The vertical gray line at \(t = \tau=1267\) represents the point at which \(\hat{G}(t)<0.6\).

The following are a few of the weights that have been proposed for the concordance index:

\(W(t_i) = 1\): Harrell’s concordance index, \(C_H\) (F. E. J. Harrell et al. 1984; F. E. Harrell, Califf, and Pryor 1982), which is widely accepted to be the most common survival measure and imposes no weighting on the definition of concordance. The original measure given by Harrell has no cut-off, \(\tau = \infty\), however applying a cut-off is now more widely accepted in practice.
\(W(t_i) = [\hat{G}_{KM}(t_i)]^{-2}\): Uno’s C, \(C_U\) (Uno et al. 2011).
\(W(t_i) = \hat{S}_{KM}(t_i)\) (Therneau and Atkinson 2024)
\(W(t_i) = \hat{S}_{KM}(t_i)/\hat{G}_{KM}(t_i)\) (Schemper, Wakounig, and Heinze 2009)

All methods assume that censoring is conditionally-independent of the event given the features (Section 3.3), otherwise weighting by \(\hat{S}_{KM}\) or \(\hat{G}_{KM}\) would not be applicable. It is assumed here that \(\hat{S}_{KM}\) and \(\hat{G}_{KM}\) are estimated on the training data and not the testing data (though the latter may be seen in some implementations, e.g. Therneau (2015)).

6.1.2 Choosing a C-index

With multiple choices of weighting available, choosing a specific measure might seem daunting. Matters are only made worse by debate in the literature, reflecting uncertainty in measure choice and interpretation. In practice, when a suitable cut-of \(\tau\) is chosen, all these weightings perform very similarly (Rahman et al. 2017; Schmid and Potapov 2012). For example, Table 6.1 uses the whas data again to compare Harrell’s C with measures that include IPCW weighting, when no cutoff is applied (top row) and when a cutoff is applied when \(\hat{G}(t)=0.6\) (grey line in Figure 6.1). The results are almost identical when the cutoff is applied but still not massively different without the cutoff.

Table 6.1: Comparing C-index measures (calculated on the whas dataset using a Cox model with three-fold cross-validation) with no cut-off (top) and a cut-off when \(\hat{G}(t)=0.6\) (bottom). First column is Harrell’s C, second is the weighting \(1/\hat{G}(t)\), third is Uno’s C.

	\(W=1\)	\(W= G^{-1}\)	\(W=G^{-2}\)
\(\tau=\infty\)	0.74	0.73	0.71
\(\tau=1267\)	0.76	0.75	0.75

On the other hand, if a poor choice is selected for \(\tau\) (cutting off too late) then IPCW measures can be highly unstable (Rahman et al. 2017), for example the variance of Uno’s C drastically increases with increased censoring (Schmid and Potapov 2012).

In practice, all C-index metrics provide an intuitive measure of discrimination and as such the choice of C-index is less important than the transparency in reporting. ‘C-hacking’ (R. Sonabend, Bender, and Vollmer 2022) is the deliberate, unethical procedure of calculating multiple C-indices and to selectively report one or more results to promote a particular model or result, whilst ignoring any negative findings. For example, calculating Harrell’s C and Uno’s C but only reporting the measure that shows a particular model of interest is better than another (even if the other metric shows the reverse effect). To avoid ‘C-hacking’:

the choice of C-index should be made before experiments have begun and the choice of C-index should be clearly reported;
when ranking predictions are composed (?sec-car) from distribution predictions, the composition method should be chosen and clearly described before experiments have begun.

As the C-index is highly dependent on censoring within a dataset, C-index values between experiments are not directly comparable, instead comparisons are limited to comparing model rankings, for example conclusions such as “model A outperformed model B with respect to Harrell’s C in this experiment”.

6.2 Time-Dependent Measures

In the time-dependent case, where the metrics are computed based on specific survival times, the majority of measures are based on the Area Under the Curve, with one exception which is a simpler concordance index.

6.2.1 Concordance Indices

In contrast to the measures described above, Antolini’s C (Antolini, Boracchi, and Biganzoli 2005) provides a time-dependent (‘dep’) formula for the concordance index. The definition of ‘comparable’ is the same for Antolini’s C, however, concordance is now determined using the individual predicted survival probabilities calculated at the smaller event time in the pair:

\[ P(\hat{S}_i(t_i) < \hat{S}_j(t_i) | t_i < t_j \cap \delta_i) \]

Note that observations are concordant when \(\hat{S}_i(t_i) < \hat{S}_j(t_i)\) as at the time \(t_i\), observation \(i\) has experienced the event and observation \(j\) has not, hence the expected survival probability for \(\hat{S}_i(t_i)\) should be as close to 0 as possible (noting inherent randomness may prevent the perfect \(\hat{S}_i(t_i)=0\) prediction) but otherwise should be less than \(\hat{S}_j(t_i)\) as \(j\) is still ‘alive’. Once again this probability is estimated with a metric that could include a cut-off and different weighting schemes (though this is not included in Antolini’s original definition):

\[ C_{dep}(\hat{\mathbf{S}}, \mathbf{t}, \boldsymbol{\delta}|\tau) = \frac{\sum_{i\neq j} W(t_i)\mathbb{I}(t_i < t_j, \hat{S}_i(t_i) < \hat{S}_j(t_i), t_i < \tau)\delta_i}{\sum_{i\neq j}W(t_i)\mathbb{I}(t_i < t_j, t_i < \tau)\delta_i} \]

where \(\boldsymbol{\hat{S}} = ({\hat{S}}_1 \ {\hat{S}}_2 \cdots {\hat{S}}_{n})^\top\).

Antolini’s C provides an intuitive method to evaluate the discrimination of a model based on distribution predictions without depending on compositions to ranking predictions.

6.2.2 Area Under the Curve

AUC, or AUROC, measures calculate the Area Under the Receiver Operating Characteristic (ROC) Curve, which is a plot of the sensitivity (or true positive rate (TPR)) against \(1 - \textit{specificity}\) (or true negative rate (TNR)) at varying thresholds (described below) for the predicted probability (or risk) of event. Figure 6.2 visualises ROC curves for two classification models. The blue line is a featureless baseline that has no discrimination. The red line is a decision tree with better discrimination as it comes closer to the top-left corner.

Image shows graph with '1 - Specificity' on the x-axis from 0 to 1 and 'Sensitivity' on the y-axis from 0 to 1. There is a blue line from the bottom left (0,0) to the top right (1,1) of the graph and a red line that forms a curve from (0,0) to around (0.2,0.8) then (1,1). — Figure 6.2: ROC Curves for a classification example. Red is a decision tree with good discrimination as it ‘hugs’ the top-left corner. Blue is a featureless baseline with no discrimination as it sits on \(y = x\).

In a classification setting with no censoring, the AUC has the same interpretation as Harrell’s C (Uno et al. 2011). AUC measures for survival analysis were developed to provide a time-dependent measure of discriminatory ability (Patrick J. Heagerty, Lumley, and Pepe 2000). In a survival setting it can reasonably be expected for a model to perform differently over time and therefore time-dependent measures are advantageous. Computation of AUC estimators is complex and as such there are limited accessible metrics available off-shelf.

The AUC, TPR, and TNR are derived from the confusion matrix in a binary classification setting. Let \(y_i,\hat{y}_i \in \{0, 1\}\) be the true and predicted binary outcomes respectively for some observation \(i\). The confusion matrix is then given by:

	\(y_i = 1\)	\(y_i = 0\)
\(\hat{y}_i = 1\)	TP	FP
\(\hat{y}_i = 0\)	FN	TN

where \(TN := \sum_i \mathbb{I}(y_i = 0, \hat{y}_i = 0)\) is the number of true negatives, \(TP := \sum_i \mathbb{I}(y_i = 1, \hat{y}_i = 1)\) is the number true positives, \(FP := \sum_i \mathbb{I}(y_i = 0, \hat{y}_i = 1)\) is the number of false positives, and \(FN := \sum_i \mathbb{I}(y_i = 1, \hat{y}_i = 0)\) is the number of false negatives. From these are derived

\[ \begin{aligned} & TPR := \frac{TP}{TP + FN} \\ & TNR := \frac{TN}{TN + FP} \end{aligned} \]

In classification, a probabilistic prediction of an event can be thresholded to obtain a deterministic prediction. For a predicted \(\hat{p} := \hat{P}(b = 1)\), and threshold \(\alpha\), the thresholded binary prediction is \(\hat{b} := \mathbb{I}(\hat{p} > \alpha)\). This is achieved in survival analysis by thresholding the linear predictor at a given time for different values of the threshold and different values of the time. All measures of TPR, TNR and AUC are in the range \([0,1]\) with larger values preferred.

Weighting the linear predictor was proposed by Uno \(\textit{et al.}\) (2007) (Uno et al. 2007) and provides a method for estimating TPR and TNR via

\[ \begin{split} &TPR_U(\hat{\boldsymbol{\eta}}, \mathbf{t}, \boldsymbol{\delta}| \tau, \alpha) = \frac{\sum^n_{i=1} \delta_i \mathbb{I}(k(\hat{\eta}_i) > \alpha, t_i \leq \tau)[\hat{G}_{KM}(t_i)]^{-1}}{\sum^n_{i=1}\delta_i\mathbb{I}(t_i \leq \tau)[\hat{G}_{KM}(t_i)]^{-1}} \end{split} \]

and

\[ \begin{split} &TNR_U(\hat{\boldsymbol{\eta}}, \mathbf{t}| \tau, \alpha) \mapsto \frac{\sum^n_{i=1} \mathbb{I}(k(\hat{\eta}_i) \leq \alpha, t_i > \tau)}{\sum^n_{i=1}\mathbb{I}(t_i > \tau)} \end{split} \]

where \(\boldsymbol{\hat{\eta}} = ({\hat{\eta}}_1 \ {\hat{\eta}}_2 \cdots {\hat{\eta}}_{n})^\top\) is a vector of estimated linear predictors, \(\tau\) is the time at which to evaluate the measure, \(\alpha\) is a cut-off for the linear predictor, and \(k\) is a known, strictly increasing, differentiable function. \(k\) is chosen depending on the model choice, for example if the fitted model is PH then \(k(x) = 1 - \exp(-\exp(x))\) (Uno et al. 2007). Similarities can be drawn between these equations and Uno’s concordance index, in particular the use of IPCW. Censoring is again assumed to be at least random once conditioned on features. Plotting \(TPR_U\) against \(1 - TNR_U\) for varying values of \(\alpha\) provides the ROC.

The second method, which appears to be more prominent in the literature, is derived from Heagerty and Zheng (2005) (Patrick J. Heagerty and Zheng 2005). They define four distinct classes, in which observations are split into controls and cases.

An observation is a case at a given time-point if they are dead, otherwise they are a control. These definitions imply that all observations begin as controls and (hypothetically) become cases over time. Cases are then split into incident or cumulative and controls are split into static or dynamic. The choice between modelling static or dynamic controls is dependent on the question of interest. Modelling static controls implies that a ‘subject does not change disease status’ (Patrick J. Heagerty and Zheng 2005), and few methods have been developed for this setting (Kamarudin, Cox, and Kolamunnage-Dona 2017), as such the focus here is on dynamic controls. The incident/cumulative cases choice is discussed in more detail below.¹

¹ All measures discussed in this section evaluate model discrimination from ‘markers’, which may be a predictive marker (model predictions) or a prognostic marker (a single covariate). This section always defines a marker as a ranking prediction, which is valid for all measures discussed here with the exception of one given at the end.

The TNR for dynamic cases is defined as

\[ TNR_D(\hat{\mathbf{r}}, N | \alpha, \tau) = P(\hat{r}_i \leq \alpha | t_i > \tau) \] where \(\boldsymbol{\hat{r}} = ({\hat{r}}_1 \ {\hat{r}}_2 \cdots {\hat{r}}_{n})^\top\) is some prediction, which is usually deterministic, often the linear predictor \(\hat{\eta}\), however could be a predicted survival probability – though in the latter case there is not necessarily an advantage over Antolini’s C. Cumulative and incident versions of the TPR are respectively defined by

\[ TPR_C(\hat{\mathbf{r}}, N | \alpha, \tau) = P(\hat{r}_i > \alpha | t_i \leq \tau) \] and

\[ TPR_I(\hat{\mathbf{r}}, N | \alpha, \tau) = P(\hat{r}_i > \alpha | t_i = \tau) \]

A ‘true negative’ occurs when the risk is below a certain threshold and the event has yet to occur, in contrast a ‘true positive’ occurs when the event has already occurred and therefore the risk is expected to be above the threshold. Estimation of these quantities depends on non-parametric estimators, such as the Kaplan-Meier and Akritas estimator (?sec-surv-models-uncond), further details are not provided here.

The choice between the incident/dynamic (I/D) and cumulative/dynamic (C/D) measures primarily relates to the use-case. The C/D measures are preferred if a specific time-point is of interest (Patrick J. Heagerty and Zheng 2005) and is implemented in several applications for this purpose (Kamarudin, Cox, and Kolamunnage-Dona 2017). The I/D measures are preferred when the true survival time is known and discrimination is desired at the given event time (Patrick J. Heagerty and Zheng 2005).

Defining a time-specific AUC is now possible with

\[ AUC(\hat{\mathbf{r}}, N | \tau) = \int^1_0 TPR(\hat{\mathbf{r}}, N | 1 - TNR^{-1}(p|\tau), \tau) \ dp \]

Finally, integrating over all time-points produces a time-dependent AUC and as usual a cut-off is applied for the upper limit,

\[ AUC^*(\hat{\mathbf{r}},N|\tau^*) = \int^{\tau^*}_0 AUC(\hat{\mathbf{r}},N|\tau)\frac{2\hat{p}_{KM}(\tau)\hat{S}_{KM}(\tau)}{1 - \hat{S}_{KM}^2(\tau^*)} \ d\tau \] where \(\hat{S}_{KM},\hat{p}_{KM}\) are survival and mass functions estimated with a Kaplan-Meier model on training data.

Since Heagerty and Zheng’s paper, other methods for calculating the time-dependent AUC have been devised, including by Chambless and Diao (Chambless and Diao 2006), Song and Zhou (Song and Zhou 2008), and Hung and Chiang (Hung and Chiang 2010). These either stem from the Heagerty and Zheng paper or ignore the case/control distinction and derive the AUC via different estimation methods of TPR and TNR. Blanche \(\textit{et al.}\) (2012) (Blanche, Latouche, and Viallon 2012) surveyed these and concluded ‘’regarding the choice of the retained definition for cases and controls, no clear guidance has really emerged in the literature’‘, but agree with Heagerty and Zeng on the use of C/D for clinical trials and I/D for ’pure’ evaluation of the marker. Blanche \(\textit{et al.}\) (2013) (Blanche, Dartigues, and Jacqmin-Gadda 2013) published a survey of C/D AUC measures with an emphasis on non-parametric estimators with marker-dependent censoring, including their own Conditional IPCW (CIPCW) AUC, which is not discussed further here as it cannot be used for evaluating predictions (R. E. B. Sonabend 2021).

Reviews of AUC measures have produced (sometimes markedly) different results (Blanche, Latouche, and Viallon 2012; L. Li, Greene, and Hu 2018; Kamarudin, Cox, and Kolamunnage-Dona 2017) with no clear consensus on how and when these measures should be used. The primary advantage of these measures is to extend discrimination metrics to be time-dependent. However, it is unclear how to interpret a threshold of a linear predictor and moreover if this is even the ‘correct’ quantity to threshold, especially when survival distribution predictions are the more natural object to evaluate over time.

6.3 Extensions

6.3.1 Competing risks

Discrimination measures are usually extended to the competing risk setting by evaluating cause-specific probabilities individually and then potentially summing or averaging over cause-specific measures (Geloven et al. 2022; Lee et al. 2018; Bender et al. 2021; Alberge et al. 2025).

To recap formulae from Chapter 4, given \(q\) possible events then the cause-specific hazard for an event is defined as \(h_{e}\):

\[ h_{e}(\tau) = \lim_{\Delta \tau \to 0} \frac{P(\tau \leq Y \leq \tau + \Delta \tau, E = e\ |\ Y \geq \tau)}{\Delta \tau}, \; e = 1, \dots, q. \]

where \(Y\) is the random variable representing the time-to-event and \(E\in \{1,\ldots,k\}\) is the random variable with realizations \(e\), which denotes one of \(k\) competing events that can occur at event time \(Y\).

In a single-event setting, a pair of observations, \((i,j)\), with outcomes \(\{(t_i,\delta_i),(t_j,\delta_j)\}\), and predicted risk predictions, \(r_i,r_j \in \mathbb{R}\), are called comparable if \(t_i < t_j\) and \(\delta_i = 1\); and concordant if also \(r_i > r_j\). In a competing risks setting, for an event of interest \(e\), the observations are (Geloven et al. 2022):

Comparable if \(t_i < t_j\) and \(\delta_{ie} = 1\); and
Concordant if \(r_i > r_j\)

Where \(\delta_{ie} = 1\) is equivalent to \(\mathbb{I}(Y_i \leq C_i \wedge E_i = e)\).

The usual definition of concordance measures then follow given this additional conditioning on the event of interest \(e\). In practice, given that competing risks models often estimate cause-specific hazard functions, it is a variation of Antolini’s time-dependent concordance measure that suits the competing risks setting best. Hence for an event \(e\), the measure is given by:

\[ C(\hat{\mathbf{h}}_e, \mathbf{t}, \boldsymbol{\delta}|\tau) = \frac{\sum_{i\neq j} W(t_i)\mathbb{I}(t_i < t_j, \hat{h}_{e_i}(t_i) > \hat{h}_{e_j}(t_i), t_i < \tau)\delta_{ie}}{\sum_{i\neq j}W(t_i)\mathbb{I}(t_i < t_j, t_i < \tau)\delta_{ie}} \]

where \(\hat{\mathbf{h}}_e = (\hat{h}_{e_1} \ \hat{h}_{e_2} \cdots \hat{h}_{e_n})^\top\) are cause specific hazards for individual observations risk of event \(e\).

6.3.2 Other censoring and truncation types

AUC and concordance indices are designed to rank observations, or at least evaluate how well a model discriminates between two risk groups. This is a substantial challenge when there is interval censoring in the data as the ‘true’ ranks of observations are unknown. Let \(i,j\) be two observations with interval censoring times \((l_i, r_j), (l_j, r_j)\) respectively. Given valid combinations (r > l) then these observations may coincide with one of six combinations:

\(l_i < r_i < l_j < r_j\)
\(l_i < l_j < r_i < r_j\)
\(l_i < l_j < r_j < r_i\)
\(l_j < r_j < l_i < r_i\)
\(l_j < l_i < r_j < r_i\)
\(l_j < l_i < r_i < r_j\)

Of these, only case (1) and (4) can be used in a standard concordance index as the intervals are non-overlapping. For all other cases, conditional probabilities have to be substituted or imputation used to estimate when in an interval the event takes place (Tsouprou 2015; Ying Wu and Cook 2020). After such estimation, interpretation of any metric is difficult, as it’s unclear if the original prediction is being evaluated or the guesswork that went into the evaluation measure. In this case it is likely better to use more straightforward methods that do not require these extra steps.

When it comes to time-dependent AUC measures, the TNR and TPR can be extended to accommodate interval-censoring, by updating the equations to only include contributions from observations when the event is guaranteed to have occurred (\(r_i \leq \tau\)) or not (\(l_i > \tau\)) (J. Li and Ma 2011):

\[ TNR_D(\hat{\mathbf{r}}, N | \alpha, \tau) = P(\hat{r}_i \leq \alpha | l_i > \tau) \]

and

\[ TPR_C(\hat{\mathbf{r}}, N | \alpha, \tau) = P(\hat{r}_i > \alpha | r_i \leq \tau) \]

In the presence of interval censoring, estimation depends on more complex estimators such as the nonparametric maximum likelihood estimator, which can be challenging to fit and splines could be used instead (Yuan Wu et al. 2020). However, splines themselves require modelling and any metric that requires modelling to estimate introduces significant difficulties in its interpretation. As a more simplistic (yet sometimes realistic on average) alternative, one could impute the outcome time as the midpoint between the interval margins, \(t_i = (l_i + r_i)/2\) (Jacqmin-Gadda et al. 2016).

As left-censoring is a specialised case of interval-censoring (with \(l_i = 0\)) the above formulae can be used to handle left-censored data as well.

6.3.3 Truncation

When only right-censoring is present, the objective of concordance measures is to estimate the probability that the ranking of two observations’ risk is correctly specified (6.1). The naive method to deal with left-truncation is to simply ignore it and use 6.2 as usual. Doing so yields an estimator, \(C_N\), that converges in probability to

\[ P(r_i > r_j | \delta_i, t_i < t_j, t_i \geq t^L_i, t_j \geq t^L_j) \tag{6.3}\]

where \(t^L_i, t^L_j\) are the left-truncation times for \(i\) and \(j\) respectively (which are \(0\) if the observation is not left-truncated). This is considered a ‘naive’ estimator as it may introduce bias (Hartman et al. 2022), to see why this is the case consider that the right-hand-side of 6.3 is true in one of two cases (ignoring ties):

\(t_i < t_j^L < t_j\);
\(t_j^L < t_i < t_j\).

In the first case, \(j\) enters the study after \(i\) has experienced the outcome and the observations never overlap in the data at the same time Hartman et al. (2022) demonstrate that evaluating discrimination in the presence of these ‘nonoverlapping intervals’ creates bias if truncation is dependent on the predictors (for the same reason, methods for estimation need to be adapted as well). Therefore, the data should be further restricted to overlapping intervals, yielding the estimator

\[ C_{LT}(\hat{\mathbf{r}}, \mathbf{t}, \boldsymbol{\delta}, \mathbf{t}^L|\tau) = \frac{\sum_{i\neq j} \mathbb{I}(t_i < t_j, \hat{r}_i > \hat{r}_j, t_i < \tau, t_i \geq t^L_j)\delta_i}{\sum_{i\neq j}\mathbb{I}(t_i < t_j, t_i < \tau, t_i \geq t^L_j)\delta_i} \]

In contrast to 6.2, no generic weighting term is included, which is due to the complexity of estimating IPC weights in the context of left-truncation (Therneau and Atkinson 2024). Interestingly, experiments have shown that naively applying Harrell’s C (6.2 with \(W=1\)) may be a better estimator than \(C_{LT}\) with lower bias (Hartman et al. 2022). Whilst more complex IPC measures have been proposed, they do not seem to be in common usage anywhere whereas \(C_N\) and \(C_{LT}\) are both used in published research and software (McGough et al. 2021; Therneau 2015). As has been seen throughout this book, left-truncation research is still nascent. No evidence of methods for right-truncation could be found.

Alberge, Julie, Vincent Maladière, Olivier Grisel, Judith Abécassis, and Gaël Varoquaux. 2025. “P52 - Survival Models: Proper Scoring Rule and Stochastic Optimization with Competing Risks.” Journal of Epidemiology and Population Health 73: 203083. https://doi.org/https://doi.org/10.1016/j.jeph.2025.203083.

Antolini, Laura, Patrizia Boracchi, and Elia Biganzoli. 2005. “A time-dependent discrimination index for survival data.” Statistics in Medicine 24 (24): 3927–44. https://doi.org/10.1002/sim.2427.

Bender, Andreas, David Rügamer, Fabian Scheipl, and Bernd Bischl. 2021. “A General Machine Learning Framework for Survival Analysis.” In Machine Learning and Knowledge Discovery in Databases, edited by Frank Hutter, Kristian Kersting, Jefrey Lijffijt, and Isabel Valera, 158–73. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-67664-3_10.

Blanche, Paul, Jean-François Dartigues, and Hélène Jacqmin-Gadda. 2013. “Review and comparison of ROC curve estimators for a time-dependent outcome with marker-dependent censoring.” Biometrical Journal 55 (5): 687–704. https://doi.org/10.1002/bimj.201200045.

Blanche, Paul, Aurélien Latouche, and Vivian Viallon. 2012. “Time-dependent AUC with right-censored data: a survey study,” October. https://doi.org/10.1007/978-1-4614-8981-8_11.

Chambless, Lloyd E, and Guoqing Diao. 2006. “Estimation of time-dependent area under the ROC curve for long-term risk prediction.” Statistics in Medicine 25 (20): 3474–86. https://doi.org/10.1002/sim.2299.

Geloven, Nan van, Daniele Giardiello, Edouard F Bonneville, Lucy Teece, Chava L Ramspek, Maarten van Smeden, Kym I E Snell, et al. 2022. “Validation of Prediction Models in the Presence of Competing Risks: A Guide Through Modern Methods.” BMJ 377. https://doi.org/10.1136/bmj-2021-069249.

Harrell, F E Jr, K L Lee, R M Califf, D B Pryor, and R A Rosati. 1984. “Regression modelling strategies for improved prognostic prediction.” Statistics in Medicine 3 (2): 143–52. https://doi.org/10.1002/sim.4780030207.

Harrell, Frank E., Robert M. Califf, and David B. Pryor. 1982. “Evaluating the yield of medical tests.” JAMA 247 (18): 2543–46. http://dx.doi.org/10.1001/jama.1982.03320430047030.

Hartman, Nicholas, Sehee Kim, Kevin He, and John D. Kalbfleisch. 2022. “Concordance Indices with Left-Truncated and Right-Censored Data.” Biometrics 79 (3): 1624–34. https://doi.org/10.1111/biom.13714.

Heagerty, Patrick J., Thomas Lumley, and Margaret S. Pepe. 2000. “Time-Dependent ROC Curves for Censored Survival Data and a Diagnostic Marker.” Biometrics 56 (2): 337–44. https://doi.org/10.1111/j.0006-341X.2000.00337.x.

Heagerty, Patrick J, and Yingye Zheng. 2005. “Survival Model Predictive Accuracy and ROC Curves.” Biometrics 61 (1): 92–105. https://doi.org/10.1111/j.0006-341X.2005.030814.x.

Hosmer Jr, David W, Stanley Lemeshow, and Susanne May. 2011. Applied survival analysis: regression modeling of time-to-event data. Vol. 618. John Wiley & Sons.

Hung, Hung, and Chin-Tsang Chiang. 2010. “Estimation methods for time-dependent AUC models with survival data.” The Canadian Journal of Statistics / La Revue Canadienne de Statistique 38 (1): 8–26. http://www.jstor.org/stable/27805213.

Jacqmin-Gadda, Hélène, Paul Blanche, Emilie Chary, Célia Touraine, and Jean-François Dartigues. 2016. “Receiver Operating Characteristic Curve Estimation for Time to Event with Semicompeting Risks and Interval Censoring.” Statistical Methods in Medical Research 25 (6): 2750–66. https://doi.org/10.1177/0962280214531691.

Kamarudin, Adina Najwa, Trevor Cox, and Ruwanthi Kolamunnage-Dona. 2017. “Time-dependent ROC curve analysis in medical research: Current methods and applications.” BMC Medical Research Methodology 17 (1): 1–19. https://doi.org/10.1186/s12874-017-0332-6.

Lee, Changhee, William Zame, Jinsung Yoon, and Mihaela Van der Schaar. 2018. “DeepHit: A Deep Learning Approach to Survival Analysis With Competing Risks.” Proceedings of the AAAI Conference on Artificial Intelligence 32 (1). https://doi.org/10.1609/aaai.v32i1.11842.

Li, Jialiang, and Shuangge Ma. 2011. “Time-Dependent ROC Analysis Under Diverse Censoring Patterns.” Statistics in Medicine 30 (11): 1266–77. https://doi.org/https://doi.org/10.1002/sim.4178.

Li, Liang, Tom Greene, and Bo Hu. 2018. “A simple method to estimate the time-dependent receiver operating characteristic curve and the area under the curve with right censored data.” Statistical Methods in Medical Research 27 (8): 2264–78. https://doi.org/10.1177/0962280216680239.

McGough, Sarah F., Devin Incerti, Svetlana Lyalina, Ryan Copping, Balasubramanian Narasimhan, and Robert Tibshirani. 2021. “Penalized Regression for Left-Truncated and Right-Censored Survival Data.” Statistics in Medicine 40 (25): 5487–5500. https://doi.org/https://doi.org/10.1002/sim.9136.

Rahman, M. Shafiqur, Gareth Ambler, Babak Choodari-Oskooei, and Rumana Z. Omar. 2017. “Review and evaluation of performance measures for survival prediction models in external validation settings.” BMC Medical Research Methodology 17 (1): 1–15. https://doi.org/10.1186/s12874-017-0336-2.

Schemper, Michael, Samo Wakounig, and Georg Heinze. 2009. “The estimation of average hazard ratios by weighted Cox regression.” Statistics in Medicine 28 (19): 2473–89. https://doi.org/10.1002/sim.3623.

Schmid, Matthias, and Sergej Potapov. 2012. “A comparison of estimators to evaluate the discriminatory power of time-to-event models.” Statistics in Medicine 31 (23): 2588–2609. https://doi.org/10.1002/sim.5464.

Sonabend, Raphael Edward Benjamin. 2021. “A Theoretical and Methodological Framework for Machine Learning in Survival Analysis: Enabling Transparent and Accessible Predictive Modelling on Right-Censored Time-to-Event Data.” PhD, University College London (UCL). https://discovery.ucl.ac.uk/id/eprint/10129352/.

Sonabend, Raphael, Andreas Bender, and Sebastian Vollmer. 2022. “Avoiding C-hacking when evaluating survival distribution predictions with discrimination measures.” Edited by Zhiyong Lu. Bioinformatics 38 (17): 4178–84. https://doi.org/10.1093/bioinformatics/btac451.

Song, Xiao, and Xiao-Hua Zhou. 2008. “A semiparametric approach for the covariate specific ROC curve with survival outcome.” Statistica Sinica 18 (July): 947–65.

Therneau, Terry M. 2015. “A Package for Survival Analysis in S.” https://cran.r-project.org/package=survival.

Therneau, Terry M., and Elizabeth Atkinson. 2024. “Concordance.” https://cran.r-project.org/web/packages/survival/vignettes/concordance.pdf.

Tsouprou, Sofia. 2015. “Measures of Discrimination and Predictive Accuracy for Interval Censored Survival Data.” Master's thesis.

Uno, Hajime, Tianxi Cai, Michael J. Pencina, Ralph B. D’Agostino, and L J Wei. 2011. “On the C-statistics for Evaluating Overall Adequacy of Risk Prediction Procedures with Censored Survival Data.” Statistics in Medicine 30 (10): 1105–17. https://doi.org/10.1002/sim.4154.

Uno, Hajime, Tianxi Cai, Lu Tian, and L J Wei. 2007. “Evaluating Prediction Rules for t-Year Survivors with Censored Regression Models.” Journal of the American Statistical Association 102 (478): 527–37. http://www.jstor.org/stable/27639883.

Wu, Ying, and Richard J Cook. 2020. “Assessing the Accuracy of Predictive Models with Interval-Censored Data.” Biostatistics 23 (1): 18–33. https://doi.org/10.1093/biostatistics/kxaa011.

Wu, Yuan, Xiaofei Wang, Jiaxing Lin, Beilin Jia, and Kouros Owzar. 2020. “Predictive Accuracy of Markers or Risk Scores for Interval Censored Survival Data.” Statistics in Medicine 39 (18): 2437–46. https://doi.org/https://doi.org/10.1002/sim.8547.