Integrating survival curves
· 9 minutes read
A survival curve \(S(t)\)
represents the probability of a given event to happen at a given time point \(t\)
.
It is often estimated using the KaplanMeier (productlimit) estimator.
However, sometimes it is useful to calculate the integral of a survival function, e.g. when computing restricted mean survival time.
The restricted mean is a measure of average survival from time 0 to a specified time point, and may be estimated as the area under the survival curve up to that point.
Formally, the restricted mean survival time \(\mu\)
is the mean of the survival time, limited to some horizon time \(t^*\)
.
It equals the area under the survival curve \(S(t)\)
from time zero to \(t^*\)
:
$$ \mu = \int_0^{t^*} S(u) \ du $$
Most interestingly, the difference in restricted mean survival time between groups provides an alternative measure of treatment effect. Remember: the hazard ratio is not (in general) an appropriate measure of treatment effect in randomised trials with time to event outcomes.
The restricted mean survival time can be easily computed analytically if \(S(t)\)
follows a (piecewise) exponential distribution.
Otherwise, we can always compute the integral above numerically.
The goal of this post is to show how to do so.
Data
Every example needs an example dataset: here I’ll be using data from the German breast cancer study, which comes bundled with Stata.
We can import it in R directly from the web:


## # A tibble: 686 x 14
## id hormon x1 x2 x3 x4 x5 x6 x7 rectime censrec x4a
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 70 2 21 2 3 48 66 1814 1 1
## 2 2 1 56 2 12 2 7 61 77 2018 1 1
## 3 3 1 58 2 35 2 9 52 271 712 1 1
## 4 4 1 59 2 17 2 4 60 29 1807 1 1
## 5 5 0 73 2 35 2 1 26 65 772 1 1
## 6 6 0 32 1 57 3 24 0 13 448 1 1
## 7 7 1 59 2 8 2 2 181 0 2172 0 1
## 8 8 0 65 2 16 2 1 192 25 2161 0 1
## 9 9 0 80 2 39 2 30 0 59 471 1 1
## 10 10 0 66 2 18 2 7 0 3 2014 0 1
## # … with 676 more rows, and 2 more variables: x4b <dbl>, x5e <dbl>
For simplicity, I will only include the treatment of interest (hormon
, which represents hormonal therapy) as a covariate in the model.
The survival time is rectime
(in days), and the event indicator variable is censrec
.
I am also creating a new variable hormon2
that will be useful for plotting:


Modelling
I fit a flexible parametric model using the rstpm2
package and assuming 5 degrees of freedom for the baseline hazard:


## Maximum likelihood estimation
##
## Call:
## stpm2(formula = Surv(rectime, censrec) ~ hormon, data = brcancer,
## df = 5)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## (Intercept) 6.82308 0.78776 8.6613 < 2.2e16 ***
## hormon 0.36445 0.12493 2.9173 0.00353 **
## nsx(log(rectime), df = 5)1 5.35779 0.77579 6.9062 4.977e12 ***
## nsx(log(rectime), df = 5)2 5.88972 0.78949 7.4602 8.640e14 ***
## nsx(log(rectime), df = 5)3 5.01742 0.50802 9.8763 < 2.2e16 ***
## nsx(log(rectime), df = 5)4 9.94593 1.53864 6.4641 1.019e10 ***
## nsx(log(rectime), df = 5)5 4.92988 0.36824 13.3875 < 2.2e16 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## 2 log L: 5213.05
The estimated hazard ratio is 0.695… but we don’t care about this measure, right?
A benefit of modelling the baseline hazard (as compared to a Cox model) is that we can easily obtain smooth fitted survival curves for each combination of covariates of interest. For instance, the fitted survival curves for the two treatment arms are:


So easy!
Computing restricted mean survival time
Now, let’s get down to business.
The fitted model is a flexible parametric model; in this setting, it is not straightforward to obtain an analytical solution.
We can, however, use splineinterpolation followed by numerical integration to compute the above integral very easily, and without needing to install any extra package. It will be a threesteps procedure:
 We obtain a fine grid of predictions for each survival curve that we want to integrate (over followup time);
 We obtain a function representing the spline interpolation of each curve;
 We integrate the function that we just obtained.
 Piece of cake!
Ok, there are four steps above, but don’t we deserve some cake?
Step number one:


Here we set \(t^*\)
as 5 years, hence creating an appropriate grid for time and each treatment arm.
Also, we start from .Machine$double.eps
(roughly 2e16) instead of zero as survival times need to be strictly positive; nevertheless, the difference will be negligible.
Then, onto step two. We predict the survival curve based on this grid:


Done.
Step three consists of fitting an interpolating spline for each fitted survival curve:


Here I am using a natural spline for the interpolation, but other methods work well too (in my experience) given that the function we are trying to interpolate is quite regular.
The cool thing about splinefun
is that it returns a function that we can call.
For instance, to obtain fitted (interpolated) survival at one year:


## [1] 0.9072905


## [1] 0.9346556
Finally, step four consist of using the builtin function integrate
to (what a surprise!) integrate the two functions we just obtained:


We now have restricted mean survival time (in days) for both treatment arms:


## [1] 1266.97


## [1] 1406.486
In years, that will be:


## [1] 3.471151


## [1] 3.853386
Calculating the difference between the two is now trivial:


## [1] 139.5155


## [1] 0.3822341
Turns out, women treated with hormonal therapy have a 0.382years longer time to recurrence compared to controls (and focussing on the first five years of followup only). Cool!
We would need to quantify the uncertainty in our estimate (and there is a closedform formula for the standard error of the difference), but that’s a lesson for another day.
What are we actually calculating?
The two fitted survival curves that we are comparing are the following:


RMST_0
is the area below the survival curve of the control arm:


RMST_1
is the area below the survival curve of the treatment arm:


The difference between RMST_1
and RMST_0
is the area between the curves:


data_grid_diff
is a dataset that was reshaped for plotting purposes, but otherwise contains the same information of data_grid
.
Wrapping up
This post showed how to compute restricted mean survival time after fitting a flexible parametric model, but it can be replicated with every model that yields predictions for the survival function \(S(t)\)
.
I started writing this blog post with the idea of describing how to integrate survival functions in general terms but got carried away with the estimation of restricted mean survival time. Nevertheless, knowing how to integrate a survival function can be broadly useful e.g. when comparing different predictions from competing models or when comparing with the truth (if known, as in simulation studies). And it’s terribly easy!
This example used functions that come bundled with R for simplicity, but other functions and packages can be used as well.
For instance, I prefer using the quadinf
function from the pracma
package for numerical integration, as I found it (empirically, of course) to be more robust and reliable than integrate
.
Other approaches for numerical integration can be used as well, such as Gaussian quadrature or more simple trapezoidal rules (and so on).
As always, let me know if I got something horribly wrong. Cheers!