Survival Analysis — An Introduction
Before diving in, I would like to mention that the aim of this series on survival analysis is to bring together the large amount of information already available in the public domain and to give the intuition behind the concepts. Please check the reference section for sources.
Let us start by defining the basic terms used in survival analysis.
Survival analysis is a family of statistical methods for analysing and predicting the time until one or more events occur.
The event can be death, occurrence of a disease, marriage, malfunctioning of a machine, etc. The event of interest should be defined unambiguously and should occur at a specific point in time.
The time from the start of the study until the occurrence of the event is called the survival time.
When we do not observe the true survival time of a subject and know only part of it, the observation is said to be censored. Censoring happens when:
- The event does not occur during the study.
- The individual withdraws from the study.
- The individual is lost to follow-up during the study.
There are three types of censoring:
- Right censoring
- Left censoring
- Interval censoring
Right censoring is the most common form of censoring. It is best explained with an example.
Consider 3 patients A, B and C who are part of a clinical study.
Patient A: Experiences the event before the end of the study.
Patient B: Does not experience the event during the study.
Patient C: Withdraws from the study midway, so we do not know whether the patient experienced the event after withdrawing.
Since patient A experienced the event during the study, the survival time is known without ambiguity, and hence the survival time of patient A is not censored. For patients B and C, we do not know the exact time of the event; we know only that the patient had not experienced the event while under observation. So the survival time is cut off at the time up to which they were observed. This is called right censoring. It follows that the patient's true survival time is greater than or equal to the observed (censored) survival time.
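The three patients above can be stored as pairs of observed time and event indicator, which is a common way to hold right-censored data. The following is a minimal sketch: the 12-month study length, the individual times, and the names `Observation` and `describe` are made up for illustration and do not come from a real study.

```python
from dataclasses import dataclass

STUDY_LENGTH = 12  # hypothetical study duration, in months


@dataclass
class Observation:
    patient: str
    time: float  # time under observation, in months
    event: bool  # True if the event was observed, False if right-censored


# Hypothetical values mirroring patients A, B and C.
observations = [
    Observation("A", 7, True),              # event observed at month 7
    Observation("B", STUDY_LENGTH, False),  # no event by the end of the study
    Observation("C", 5, False),             # withdrew at month 5
]


def describe(obs: Observation) -> str:
    """Summarise what is known about the true survival time T."""
    if obs.event:
        return f"{obs.patient}: T = {obs.time}"
    return f"{obs.patient}: T >= {obs.time} (right-censored)"


for obs in observations:
    print(describe(obs))
```

For censored patients the stored time is only a lower bound on the true survival time, which is why the indicator has to travel with it.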
Left censoring is when the actual survival time is less than the observed survival time. Let us illustrate this with the example below.
Suppose we place an individual under study for a particular viral infection. Upon testing, we find that the patient has already been infected by the virus. Since we do not know the exact time at which the patient was infected, we know only that the survival time is shorter than the time of the test. This is called left censoring.
In a similar study of viral infection, suppose the patient tested negative at time t1 and later tested positive at time t2. The patient was infected somewhere between t1 and t2, but we cannot find the exact time of infection, so we record only the interval. This type of censoring is called interval censoring.
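One way to see how the three types relate is to store every observation as a pair of bounds on the true survival time. The sketch below assumes that convention; the function name `censoring_type` and the sample values are hypothetical and chosen only for illustration.

```python
import math


def censoring_type(lower: float, upper: float) -> str:
    """Classify an observation from the bounds lower <= T <= upper."""
    if lower < 0 or upper < lower:
        raise ValueError("bounds must satisfy 0 <= lower <= upper")
    if lower == upper:
        return "exact"              # event time observed directly
    if math.isinf(upper):
        return "right-censored"     # only a lower bound is known
    if lower == 0:
        return "left-censored"      # only an upper bound is known
    return "interval-censored"      # event happened between two visits


# Hypothetical observations, in months since the start of the study.
samples = {
    "event seen at month 7": (7, 7),
    "withdrew at month 5": (5, math.inf),
    "already infected at first test, month 3": (0, 3),
    "negative at month 4, positive at month 6": (4, 6),
}

for label, (lower, upper) in samples.items():
    print(f"{label}: {censoring_type(lower, upper)}")
```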
Since right censoring is the most commonly occurring scenario, in the upcoming sections the word censoring refers to right censoring unless stated otherwise.
Mathematical Notations in Survival Analysis:
T: A random variable for survival time.
t: A specific value of interest for T. For example, with t = x, the event (T > t) means that the survival time exceeds x units of time. Time is measured in units of days, weeks, months or years.
d: An indicator variable denoting whether the event or censoring was observed.
(d = 1 denotes the occurrence of the event, d = 0 denotes censoring)
Survival data are generally described and modelled in terms of two related functions, namely the survival function and the hazard function.
The survival (or survivor) function, S(t) = P(T > t), is the probability that the event does not occur from the time of origin of the study until the specific time t.
The survivor function has the following characteristics:
- S(t) never increases as t increases.
- S(t) = 1 at the start of the study, i.e. at t = 0.
- S(∞) = 0. In theory, sooner or later everyone will experience the event.
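These characteristics can be checked on a concrete case. The sketch below assumes, purely for illustration, that survival times follow an exponential distribution with rate `lam`, for which S(t) = exp(-lam * t); the rate value and the function name are not taken from the article.

```python
import math


def survival_exponential(t: float, lam: float = 0.5) -> float:
    """S(t) = P(T > t) for an exponential survival time with rate lam."""
    return math.exp(-lam * t)


times = [0, 1, 2, 5, 10, 50]
values = [survival_exponential(t) for t in times]

for t, s in zip(times, values):
    print(f"S({t}) = {s:.4f}")

# S starts at 1, never increases, and becomes negligible for large t.
starts_at_one = values[0] == 1.0
never_increases = all(a >= b for a, b in zip(values, values[1:]))
print(starts_at_one, never_increases)
```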
The hazard function, h(t), is the instantaneous rate at which the event occurs at time t, given that the individual has survived up to that time. In other words, it gives the risk of having the event at any instant, conditional on survival until that instant.
P(t<T≤t+∆t | T>t) is the probability that the event occurs somewhere between t and t+∆t, given that it has not occurred by time t.
Dividing this probability by ∆t and letting ∆t approach 0 gives the instantaneous risk of having the event:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t \mid T > t)}{\Delta t}$$

It is important to note that the hazard function is a rate, not a probability, because it is divided by time. Hence the hazard value ranges from 0 to ∞.
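The limit can be illustrated numerically. The sketch below again assumes an exponential survival time, this time with a hypothetical rate `LAM = 2.0`, and uses the identity P(t < T ≤ t+∆t | T > t) = (S(t) − S(t+∆t)) / S(t). As ∆t shrinks, the ratio approaches the rate 2, a value above 1 that no probability could take.

```python
import math

LAM = 2.0  # hypothetical event rate per unit time


def survival(t: float) -> float:
    """S(t) for an exponential survival time with rate LAM."""
    return math.exp(-LAM * t)


def conditional_event_prob(t: float, dt: float) -> float:
    """P(t < T <= t + dt | T > t) = (S(t) - S(t + dt)) / S(t)."""
    return (survival(t) - survival(t + dt)) / survival(t)


def approx_hazard(t: float, dt: float) -> float:
    """Finite-dt approximation of the hazard at time t."""
    return conditional_event_prob(t, dt) / dt


for dt in (1.0, 0.1, 0.01, 0.001):
    prob = conditional_event_prob(1.0, dt)
    print(f"dt = {dt}: probability = {prob:.5f}, "
          f"probability / dt = {approx_hazard(1.0, dt):.5f}")
```

The conditional probability itself always stays between 0 and 1; only after dividing by ∆t does the quantity become a rate that can exceed 1.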
Relationship between Survival and Hazard function:
Unlike the survival function, the hazard is difficult to compute directly. One way around this is to find a relationship between the survival and hazard functions and to calculate the hazard from the survival function.
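The cumulative hazard function H(t) is the hazard accumulated from the start of the study up to time t,

$$H(t) = \int_0^t h(u)\,du \tag{1}$$

Let us define f(t), the probability density function of T, as,

$$f(t) = \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t)}{\Delta t} \tag{2}$$

Since P(t < T ≤ t+∆t) = S(t) − S(t+∆t), Equation 2 gives,

$$f(t) = -\frac{dS(t)}{dt} \tag{3}$$

Now let us differentiate the log of S(t),

$$\frac{d}{dt}\log S(t) = \frac{1}{S(t)}\,\frac{dS(t)}{dt} = -\frac{f(t)}{S(t)} \tag{4}$$

The first equality follows from the chain rule in differentiation and the second from Equation 3.
Next, expand the conditional probability in the definition of the hazard function,

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t \mid T > t)}{\Delta t} = \lim_{\Delta t \to 0} \frac{P\big((t < T \le t + \Delta t) \cap (T > t)\big)}{P(T > t)\,\Delta t} \tag{5}$$

In the above equation, (t<T≤t+∆t) is a subset of (T>t). Hence,
P( (t<T≤t+∆t) ∩ (T>t) ) = P(t<T≤t+∆t)
We know that,
S(t) = P(T>t)
Substituting both into Equation 5 and using Equation 2,

$$h(t) = \frac{1}{S(t)} \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t)}{\Delta t} = \frac{f(t)}{S(t)} \tag{6}$$

Using Equation 4, Equation 6 becomes,

$$h(t) = -\frac{d}{dt}\log S(t) \tag{7}$$

Taking the integral from 0 to t on both sides and using S(0) = 1, so that log S(0) = 0,

$$H(t) = -\log S(t) \quad\Longleftrightarrow\quad S(t) = e^{-H(t)} \tag{8}$$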
This concludes the proof that establishes the relationship between survival and hazard function.
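From the final relationship, it is intuitively evident that the survival function decreases as the cumulative hazard increases: the more risk accumulated up to time t, the lower the probability of surviving beyond t.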
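The standard identity S(t) = exp(−H(t)), where H(t) is the integral of the hazard from 0 to t, can also be checked numerically. The sketch below assumes a Weibull hazard with made-up shape and scale parameters, integrates it with a simple midpoint rule, and compares the result with the known closed-form Weibull survival function. The parameter values and function names are chosen only for illustration.

```python
import math

SHAPE, SCALE = 1.5, 10.0  # hypothetical Weibull parameters


def hazard(t: float) -> float:
    """Weibull hazard h(t) = (k / s) * (t / s) ** (k - 1)."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1)


def survival_closed_form(t: float) -> float:
    """Weibull survival function S(t) = exp(-(t / s) ** k)."""
    return math.exp(-((t / SCALE) ** SHAPE))


def cumulative_hazard(t: float, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of H(t), the integral of h from 0 to t."""
    width = t / steps
    return sum(hazard((i + 0.5) * width) for i in range(steps)) * width


for t in (1.0, 5.0, 10.0, 20.0):
    from_hazard = math.exp(-cumulative_hazard(t))
    print(f"t = {t}: exp(-H(t)) = {from_hazard:.6f}, "
          f"closed-form S(t) = {survival_closed_form(t):.6f}")
```

The two columns should agree up to the numerical integration error, and both shrink as the accumulated hazard grows.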
In this tutorial we covered the fundamentals of survival analysis and became familiar with the terms and notation it uses. In the upcoming posts we will look into the different statistical methods available for estimating and comparing the survival and hazard functions.