5 Event-history Analysis
This page is a work in progress and major changes will be made over time.
This section generalises the single-event setting by viewing time-to-event data as data in which one could observe multiple, potentially mutually exclusive events. In this generalisation, data is sometimes refered to event-history data and its analysis as event-history analysis (EHA). Table 5.1 classifies the different types of data that could occur in EHA, differentiated between the number of occurences of the events (rows) and whether the events are of same or different types (columns).
Same event type | Competing event types | |
---|---|---|
One/first event | single event | competing risks |
Multiple events | recurrent events | multi state |
One can think about event history in terms of transitions between different states, as illustrated in Figure 5.1. Usually, a subject starts out in state 0 (for example, ‘healthy’) and transitions to different states from there. States from which further transitions are possible are called transient, otherwise a state is called terminal. If no transition happens within the follow-up time, the subject remains in the initial state, in event history terms the observation is censored for this transition.
In the single event setting (Figure 5.1, left panel), a subject can only transition to one state (the event of interest). In the competing risks setting (Figure 5.1, middle panel, see Section 5.1), a subject could transition to any of the \(K\) mutually exclusive states, thus the subject is at risk for a transition to multiple states. However, once one of them occurs, it is impossible to transition to another (for example, death from different causes). In both cases, the event needs not be terminal, we just end the observation after the first occurence of an event.
In the most general case, the multi-state setting (Figure 5.1 right panel, see Section 5.2), there are multiple transient and terminal states with potential back transitions (for example, moving between different stages of an illness with the possibility of (partial) recovery and death as terminal event). The recurrent events setting is as a special case of the multi-state setting but could also be treated as a more general case of the single event setting with correlated observations (Section 5.3 and Figure 5.3) .
5.1 Competing Risks
In constrast to single-event survival analysis, competing risks are concerned with the time to the first of many, mutually exclusive events. Mutually exclusive here means that once one of the events occurs, we can not observe the others (at the same time). As in the single event case, the events may not necessarily be terminal (such as death), instead they just signify the end of the period of interest for a given observation.
Formally, let \(Y\) be the time-to-event as before. In the competing risks setting, there is now an additional random variable \(E\in \{1,\ldots,k\}\) with realisations \(e\), which denotes one of \(k\) competing events that can occur at event time \(Y\). The survival outcome notation, \((T, \Delta)\) has to be extended to accomodate for the possibility of competing risks. Previously, we defined the status indicator \(\Delta \in \{0,1\}\) for right-censored data. In the competing risks setting we have \(\Delta = I(Y_i \leq C_I) \wedge E_i = e\), such that \(\Delta \in \{0, 1, \ldots, k\}\) where \(\Delta = 0\) indicates censoring as before and \(\Delta = e\) means the \(e\)th event occurred.
A common approach to competing-risks analysis is known as cause-specific hazards, where \(h_{e}\) is defined as the hazard for the transition from initial state \(0\) to state \(e =1,\ldots,k\):
\[ h_{e}(\tau) = \lim_{\Delta \tau \to 0} \frac{P(\tau \leq Y \leq \tau + \Delta \tau, E = e\ |\ Y \geq \tau)}{\Delta \tau}, \; e = 1, \dots, k. \]
As competing events are mutually exclusive at any time \(\tau\), it is useful to further define the all-cause hazard which is the hazard of any event occuring:
\[ h(\tau) = \sum_{e = 1}^{k}h_{e}(\tau) \]
the all-cause cumulative hazard
\[ H(\tau) = \int_{0}^\tau h(u)\ du = \sum_{e=1}^{k}\int_{0}^\tau h_{e}(u)\ du = \sum_{e=1}^k H_{e}(\tau) \]
and the all-cause survival probability
\[ S(\tau) = P(Y > \tau) = \exp(-H(\tau)) \]
While notationally identical to the survival probability in the single-event case, the all-cause survival probability in the competing risks setting implies that none of the events occurred before \(\tau\). It is often not estimated directly but calculated from the cause specific hazards instead.
Another important quantity in competing risks analysis is the Cumulative Incidence Function (CIF), which denotes the probability of experiencing an event \(e\) before time \(\tau\)
\[ F_{e}(\tau) = P(Y \leq \tau, E = e) = \int_{0}^\tau S(u)h_{e}(u)\ \mathrm{d}u. \]
Intuitively the contributions in the integral consist of the probability of not experiencing any event before time \(u\), \(S(u)\), multiplied with the hazard of experiencing an event of type \(e\), \(h_{oe}(u)\) immediately at time \(u\). Calculating the CIF (in addition to the cause-specific hazards) is particularly important when making statements about the absolute risk (probability) for an event of type \(e\), as it depends on all cause-specific hazards via \(S(\tau)\).
This will be particularly relevant when assesing how features affect the probability for observing a specific event. While some features might strongly affect some of the cause-specific hazards, if an event has overall low prevalence, then the absolute probability of observing the event will still be low. Hence the CIF is useful for incorporating this information into an interpretable quantity.
5.2 Multi-state Models
Multi-state models can be considered the most general type of survival data, as other types (single-event, competing risks, recurrent events) can be viewed as special cases.
A common multi-state model, the illness-death model, is illustrated in Figure 5.2, left panel. The states are ‘healthy’ (State 0), with illness (State 1), or death (State 2). Healthy subjects can transition directly to death (\(0\rightarrow 1\)) or after previous illness (\(0\rightarrow 1 \rightarrow 2\)), without the possibility of back-transitions. Similar to competing risks, individual transitions are characterised by transition-specific hazards, denoted by \(h_{\ell\rightarrow e}(\tau)\) or short \(h_{\ell e}(\tau)\), where \(\ell\) is the starting state and \(e\) the end state for a transition, \(\ell, e \in \{0,\ldots,k\}\). Remaining in the same state (transition into the same sate \(h_{\ell\ell}(\tau)\)) is equivalent to being censored for any transition possible from that state.
The right panel of Figure 5.2 depicts a more general multi-state setting with back-transitions. One can think of transitions between a healthy state (0) to different disease progressions (1, 2) with possibilty of improvement and recovery, and possible transition to a terminal state (3).
When modeling such data, interest often lies in estimation of the transition hazards, but also in transition probabilities \(P_{\ell e}(\zeta, \tau)\), that is the probability to transition from state \(\ell\) to \(e\) between time points \(\zeta\) and \(\tau\) or state occupation probabilities \(P_e(\tau)\), the probability to occupy state \(e\) at time \(\tau\). Transition probabilities are often summarized in a matrix of transtion probabilities \(\mathbf{P}(\zeta, \tau)\), where rows indicate starting states and columns indicate end states.
For example, for the illness-death model, the transition matrix would be given as
\[\mathbf{P}(\zeta, \tau) := \begin{pmatrix} P_{00}(\zeta, \tau) & P_{01}(\zeta, \tau) & P_{02}(\zeta, \tau)\\ \bullet & P_{11}(\zeta, \tau) & P_{12}(\zeta, \tau)\\ \bullet & \bullet & P_{22}(\zeta, \tau)\\ \end{pmatrix}. \]
\(\bullet\) indicates impossible transitions, as in the illness-death model there are no transitions from higher to lower states.
5.3 Recurrent Events
The recurrent events setting is defined by the same event type occurring multiple times. While the recurrent events setting can be viewed as a special case of the multi-state setting (Section 5.2), we treat it here separately as it can also be approached as single event data with correlation. Some examples of recurrent events include recurrence of infectious diseases, episodes of epilepsy, and replacements of machine parts.
Table 5.2 shows data of chronic granolotomous disease (CGD) for two subjects. Note the start-stop notation of the data (\(t_{start}, t_{stop}\) respectively) that indicates the time in which a subject enters the risk set for the kth event number. The column \(\delta\) indicates whether the subject experienced the \(k\)th outbreak or was censored. For example, Subject 1 (first three rows) experienced the first outbreak (first row) after 219 days, at which time they entered the risk set for event number 2. The second event (second row) was experienced at 373 days and the subject was then censored for the third event (third row) after 414 days. This notation is equivalent to a multi-state representation of the recurrent events process, where each event number represents a different state. In Figure 5.3, right panel, this is refered to as “Clock forward” approach.
cgd
(Therneau 2015) time-to-event dataset. Rows are individual observation units (\(ID\)), \(k\) is the \(k\)th event occurrence, \(t_{start}\) and \(t_{stop}\) mark the beginning and end of the “at risk time” for the \(k\)th event occurence, \(\delta\) is the event indicator and \(gap = t_{stop} - t_{start}\).
ID | \(k\) | \(t_{start}\) | \(t_{stop}\) | \(\delta\) | \(gap\) |
---|---|---|---|---|---|
1 | 1 | 0 | 219 | 1 | 219 |
1 | 2 | 219 | 373 | 1 | 154 |
1 | 3 | 373 | 414 | 0 | 41 |
7 | 1 | 0 | 292 | 1 | 292 |
7 | 2 | 292 | 364 | 0 | 72 |
For recurrent events data without competing (terminal) events, the “Clock reset” approach (Figure 5.3, left panel) can be used. Instead of representing the data in the start-stop notation, the gap time is calculated instead (see fourth column in Table 5.2). In this representation, we only consider how long the subject was at risk for the \(k\)th event, not at which time (on the calendar time scale). It has the advantage that standard methods for analysis of single-event data can be used for modeling. However, the resulting dependency between multiple data points from one subject (at risk for multiple events at the same time) needs to be taken into account in the modeling step.
The distribution of event recurrences per subject is often right-skewed in real data sets, which implies that there are only few events past event number \(e^*\). In practice, therefore, the data is often summarized such that we have event numbers \(e = 1,\ldots, e^*_+\), where \(e^*_+\) represents category “\(e^*\) or higher”.