RMST and survival forests
· 5 minutes read
This is a followup from my previous post on integrating survival curves and computing restricted mean survival times (RMST), which concluded with:
[…] it can be replicated with every model that yields predictions for the survival function
\(S(t)\)
.
This made me think: can we actually compute the difference in RMST after fitting a random survival forest? Well, it turns out that yes, we can!
Let’s use the same dataset from the German breast cancer study:


We will be using the randomForestSRC
package to fit a random survival forest.
There are other options (such as the cforest
function from the party
package), but we’ll stick with randomForestSRC
for simplicity.


The model we fit is the same model as before, with just treatment as a binary covariate:


Again, for simplicity, we’ll use the default arguments of rfsrc
(e.g. the number of trees to grow for the algorithm); we also set a seed for reproducibility.
Let’s print the model fit:


## Sample size: 686
## Number of deaths: 299
## Number of trees: 1000
## Forest terminal node size: 15
## Average no. of terminal nodes: 2
## No. of variables tried at each split: 1
## Total no. of variables: 1
## Resampling used to grow trees: swor
## Resample size used to grow trees: 434
## Analysis: RSF
## Family: surv
## Splitting rule: logrank *random*
## Number of random split points: 10
## Error rate: 67.66%
We don’t really care about the model accuracy here, as all of this is just a proof of concept.
randomForestSRC
provides a predict
method, which we can use to obtain prediction on a test dataset of two individuals (one treated, one untreated):


We need to provide a value for rectime
and censrec
, despite it not being used by predict
; by default, predictions are computed at each observed event time.
Interestingly, a variety of predictions are returned (e.g. survival probability, cumulative hazard, etc.). We can easily extract the survival predictions, which is a matrix with as many rows as individuals in the test dataset and as many columns as the number of distinct observed event times:


## [1] "matrix" "array"


## [1] 2 270
We process the prediction to be a tidy dataset:


## rectime 0 1
## 1 72 0.9975730 1.0000000
## 2 98 0.9952686 1.0000000
## 3 113 0.9929122 1.0000000
## 4 120 0.9905855 1.0000000
## 5 160 0.9881725 1.0000000
## 6 169 0.9881725 0.9958175


## # A tibble: 6 x 3
## rectime hormon S_hat_RF
## <dbl> <chr> <dbl>
## 1 72 0 0.998
## 2 72 1 1
## 3 98 0 0.995
## 4 98 1 1
## 5 113 0 0.993
## 6 113 1 1


We can finally plot the fitted survival curves from the random survival forest model:


Not too bad!
Let’s now compute the fitted survival curves from the flexible parametric model, as a comparison:


The two models seem to agree fairly well.
Let’s finally calculate the difference in RMST using the predictions from the random forest model:


Here we use the whole fitted survival curve to fit the spline interpolation. The RMST can finally be calculated as:


## [1] 1260.773


## [1] 1413.315
The difference in RMST will therefore be RMST_1  RMST_0
= 153 days, or equivalently, 0.418 years.
Remember: using a flexible parametric model, the estimated difference in RMST was 140 days (0.382 years).
In conclusion, yes, it is possible to calculate RMST after fitting a random survival forest model. The example that was described above is kind of silly (I mean, we included a single binary covariate in a random forest model), but still kind of useful to illustrate how to do that in R. I guess I satisfied my curiosity — for now…