Let’s start with what survival data is. In survival data, individuals are observed over a period of time until an event of interest, known as failure, takes place. This event can be death, the occurrence of a disease, a malfunction, etc. If failure does not happen during the observation period, the observation is considered censored.
Imagine that we are observing a sample of dispensing machines over a period of time. Let’s assume that the researcher has created one column, doe, with the date when each machine enters the observation period and a second column, dox, with the date when each machine breaks down or when the observation period ends. To run survival analysis, Stata requires two variables: (1) a time variable, which is the number of time units (e.g. days) that each individual has been at risk, and (2) an outcome variable that can take two values: 1 if failure occurs and 0 if the observation is censored.
In our example the time variable would be the number of days from the date the machine enters the observation period until it breaks down (failure) or the observation period ends (censoring). Note that censoring means that the machine did not break down during the observation period.
The command used to declare your data to be survival data is stset. The syntax for our example would be:
stset surv_days, failure(broken==1)
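The command above takes for granted that surv_days already exists. Here is a minimal sketch of how it could be built from the two date columns, assuming that doe and dox are stored as numeric Stata daily dates (not strings) and that broken is coded 1 for a breakdown and 0 for censoring. The variable label and the sanity check are my own additions for illustration.

```stata
* Assumption: doe and dox are numeric daily dates (e.g. %td format),
* so their difference is a number of days.
generate surv_days = dox - doe
label variable surv_days "Days at risk"

* Sanity check: no machine should exit before it enters.
assert surv_days >= 0 if !missing(surv_days)

* Assumption: broken is 1 if the machine broke down, 0 if censored.
stset surv_days, failure(broken==1)
```

If the dates were imported as text, they would need to be converted to Stata dates before taking the difference.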
In this post I am talking about single-failure-per-subject data; in other words, we are only interested in time to first failure. Most survival data will be single-failure since most of the time what you observe is death, and as far as I know a person can only die once. But if you were observing a particular illness, a person could fall ill several times.
Alternatively, Stata allows survival data to be declared using the date of origin and the date of exit instead of the survival time. Following our example, the syntax would look like this:
stset dox, failure(broken==1) origin(doe)
I always suggest running stdescribe, which produces a report with the number of subjects, the time at risk and the number of failures. This allows you to check that the survival data was declared correctly.
Now that we have learned how to prepare a dataset for survival analysis, how do we analyze the data? In this post I will only cover the most basic analysis: the Kaplan-Meier survivor function. Plotting Kaplan-Meier curves is very useful when we want to compare survival times between groups of individuals. Continuing with our example, imagine that we are interested in comparing the performance of machines from two different companies (1=healthysnacks; 2=nuttysnacks).
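As a rough sketch of that check, the commands below can be run right after stset. The variables _t0, _t, _d and _st are the ones stset creates; listing only the first five rows is an arbitrary choice for illustration.

```stata
* Summary of subjects, time at risk and failures.
stdescribe

* Typing stset on its own redisplays how the data are currently declared.
stset

* Inspect the variables stset created next to the raw dates:
* _t0 = entry time, _t = exit time, _d = failure indicator,
* _st = 1 if the record is used in the analysis.
list doe dox _t0 _t _d _st in 1/5
```

If the number of failures or the total time at risk looks off, the stset declaration is usually the first place to look.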
We can use the command sts graph with the by() option:
sts graph, by(brand)
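If brand is stored as the numeric codes 1 and 2, attaching value labels first makes the legend easier to read. The following is only a sketch: the label name brandlbl and the axis titles are hypothetical choices of mine, and it assumes brand does not already have value labels attached.

```stata
* Assumption: brand is numeric with 1 = healthysnacks, 2 = nuttysnacks.
label define brandlbl 1 "healthysnacks" 2 "nuttysnacks"
label values brand brandlbl

* Kaplan-Meier curves by brand, with explicit axis titles.
sts graph, by(brand) xtitle("Days at risk") ytitle("Proportion not broken")
```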
It is pretty clear from the plot that nuttysnacks machines have longer survival times, as their survival curve lies above the one for healthysnacks. Take 200 days, for instance: the estimated survival probability for healthysnacks machines is about .25, compared to about .60 for nuttysnacks machines. In other words, by day 200 only about 25% of the healthysnacks machines have not broken down, while about 60% of the nuttysnacks machines are still working. I recommend accompanying this visual representation with the log-rank test for equality of survivor functions, using the sts test command. This test provides the p-value for the null hypothesis that both survivor functions are equal. In our example the p-value is .0180, so at the 5% level we would reject the null hypothesis and conclude that the survival of the two brands differs significantly, with nuttysnacks machines lasting longer before they break down.
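Both steps can be sketched in two commands. The time points passed to at() are arbitrary values I picked for illustration, and the numbers you get back will of course depend on your own data.

```stata
* Kaplan-Meier survivor estimates for each brand at selected days,
* useful for reading off values such as survival at day 200.
sts list, by(brand) at(0 100 200 300)

* Log-rank test for equality of survivor functions across brands.
sts test brand
```

The sts list output gives the survivor function for each brand at the requested times, so the figures read off the plot can be checked against the estimates.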