Alessandro Gasparini

Using simulation to study operating characteristics of stepped wedge cluster-randomised trials

Sat, 22 Feb 2025 23:00:00 GMT

Hi everyone,

We recently published an article in Statistics in Medicine on the use of joint longitudinal-survival modelling to accommodate informative dropout in longitudinal stepped wedge cluster-randomised trials. It’s the result of many moons of work, and I am extremely pleased that it is finally out - make sure to check it out!

While the paper focuses on the use of joint modelling to obtain unbiased treatment effect estimates, there is an interesting aspect of this work that (I think) is probably a bit underrated. Specifically, we developed an algorithm to simulate data from stepped wedge trials with different outcome types (continuous, binary, count) in the presence of dropout, either informative or not. This is implemented in the {simswjm} R package, which is openly available from the Red Door Analytics GitHub page.

Using this implementation of the simulation algorithm, we can use Monte Carlo simulation methods to study the operating characteristics of stepped wedge longitudinal trials in the presence of dropout. For instance, how much efficiency do we lose if some subjects do not complete the trial? What if the dropout rate is different between treated and non-treated participants? Well, we now have the possibility of studying this empirically using these new tools that we developed.

Say you have a stepped wedge trial with a continuous outcome, 4 intervention sequences, 5 periods, and 2 clusters randomised to each intervention sequence. Let’s assume a within-individual intra-class correlation coefficient (ICC) and a between-individuals ICC (same parameters from the simulation study in the paper). Moreover, let’s assume non-informative dropout at a constant rate, for simplicity but also to highlight some side effects of dropout that do not introduce bias in the analysis if unaccounted for.

We can now simulate 10,000 trials, where each cluster recruits between 15 and 150 participants, and study the inflation of standard errors in the presence of dropout. Specifically, we can analyse both the complete data (if no dropout occurred) and the actual observed data (where dropout happens) for each trial using the usual extension of the Hussey and Hughes model introduced by Baio et al. and compare the estimated standard errors of the estimated treatment effect.

If we plot the ratio of these standard errors for each simulated trial, we obtain the following plot:

What we can see here is that, irrespective of how many subjects are recruited by each cluster, standard errors are 15/20% larger in the presence of dropout, on average. Given that dropout is not informative we are not getting biased estimates of the treatment effect (check out the results in the published paper for reference), but we lose efficiency thus affecting statistical inference. This provides a powerful tool that can be used at the design stage to ensure that study power is preserved if dropout was to happen.

Obviously, this is somewhat naive and for illustration purposes only - you can of course tweak all the data-generating parameters above and focus on different aspects or operating characteristics of a trial, but you get the idea. This is not rocket science, but I think it shows (once again!) how powerful of a tool statistical simulation can be. As someone much smarter than I am once said, if you can simulate it then you can understand it. Or something like that, I don’t know.

Anyway, remember to check out the paper and the R package, and just get in touch if you have any thoughts on this. You can also replicate the example above using the R script archived here and posted below:

Standardised survival probabilities in R

Mon, 25 Mar 2024 23:00:00 GMT

Hi everyone,

You might already know that a couple of years ago Betty Syriopoulou published a tutorial paper on standardised survival probabilities, a useful tool to supplement and enhance the reporting of regression models for time-to-event (survival) outcomes.

The background, in short, is that while hazard ratios are ubiquitous in medical research with time-to-event outcomes, they are often misreported, misinterpreted, and have several, well-known limitations; I will not go much into detail, but some relevant references if you want to read more about this topic are:

Betty’s tutorial paper on standardised survival probabilities (of course);
The popular paper by Miguel Hernán on The Hazards of Hazard Ratios;
The paper by Mats Stensrud and Miguel Hernán questioning Why Test for Proportional Hazards.

Standardised survival probabilities provide a useful and easy-to-interpret measure to complement hazard ratios, can be calculated after fitting survival models (and can therefore include covariates to adjust for, e.g., because they are confounders for a certain exposure-outcome association of interest), and can be contrasted to produce standardised survival probability differences (i.e., risk differences) and ratios (i.e., risk ratios). As a bonus, if the confounders that you decide to include in your regression model are sufficient to control for confounding, then the estimated risk differences (or ratios) can be interpreted as population causal effects: nice! More on this causal interpretation here and here.

The tutorial paper is easy to follow and illustrates the utility of such measures very nicely, but most importantly, Stata code to reproduce the analyses and use this method in practice is provided in the supplementary material.

Unfortunately, only Stata code was published with the paper, but Betty and I have been talking about porting this to R for a while now, and guess what: today is your lucky day! We have just published on my GitHub profile a repository with R code that can be used to replicate every result from the paper, including every plot that was included in the tutorial. The GitHub repository can be found here (https://github.com/ellessenne/standsurv-tutorial-r), and the code is released under the MIT license, so you can pretty much do whatever you want with it.

Here is a little preview of what you will be able to accomplish:

…and there you have it, don’t forget to check out the code and go test this out for your next project, predict standardised probabilities, make some nice plots, complement your hazard ratios, and take your survival analyses to the next level.

Finally, please note that the code in the repository is quite specific to this project and not very general, so you will have to do a little bit of work to get this going, but there should be enough comments to give you all the information you will possibly need to extend it. If you have any questions, feel free to get in touch or just open an issue in the GitHub repository. As always, any feedback is more than welcome.

Before we go, here is one more treat for you, if you are still reading: an R package with a general, easy to use implementation of regression standardisation may or may not be in development. Stay tuned for more… 👀

New {rsimsum} who this?

Sat, 02 Mar 2024 23:00:00 GMT

Hi everyone,

Let’s get straight to business: a new release of the {rsimsum} package just landed on CRAN, and you can install it with

install.packages("rsimsum")

It’s a quite large release that is more than a year in the making, with lots of new features, a few bug fixes, and a lot of internal housekeeping (that you don’t need to worry about, this should not affect any user-facing behaviour – but please get in touch if it does). I’ll summarise a few things worth highlighting in this blog post; everything else is listed in the changelog page on the package website.

As always, I am extremely grateful to all users of {rsimsum} who use and break the package and come up with bug reports and feature suggestions: the package is much better because of you. And if you have further bug reports or suggestions, or any feedback on the package really, feel free to get in touch (contact details here) or post on GitHub.

Relative bias

We start with relative bias, which is now a supported summary measure that is reported by default, including Monte Carlo errors. I hinted at this in a previous blog post, and it’s finally here!

Let’s have a look at this in more detail. Specifically, we use the MIsim dataset, which is bundled with {rsimsum}:

library(rsimsum)
data("MIsim", package = "rsimsum")
s <- simsum(
  data = MIsim,
  estvarname = "b",
  true = 0.50,
  se = "se",
  methodvar = "method",
  ref = "CC"
)

If we summarise this, we can now pick bias and rbias as performance measures:

summary(s, stats = c("bias", "rbias"))

Values are:
    Point Estimate (Monte Carlo Standard Error)

Bias in point estimate:
              CC         MI_LOGT             MI_T
 0.0168 (0.0048) 0.0009 (0.0042) -0.0012 (0.0043)

Relative bias in point estimate:
              CC         MI_LOGT             MI_T
 0.0335 (0.0096) 0.0018 (0.0083) -0.0024 (0.0085)

…and that’s it, as easy as that.

Of course, this won’t work if the true value of the estimand is zero, but that is not surprising (as we divide by zero):

simsum(
  data = MIsim,
  estvarname = "b",
  true = 0,
  se = "se",
  methodvar = "method",
  ref = "CC"
) |>
  summary(stats = "rbias")

Values are:
    Point Estimate (Monte Carlo Standard Error)

Relative bias in point estimate:
        CC   MI_LOGT      MI_T
 NaN (NaN) NaN (NaN) NaN (NaN)

Nonetheless, this provides a useful alternative that allows quantifying bias on a relative scale (compared to the absolute scale of plain bias). If you are looking for more details, formulae are described in the introductory vignette (which has been updated accordingly).

Nested loop plots

Nested loop plots can now accommodate designs that are not fully factorial. Let’s start with a fully factorial design, as in the nlp dataset:

data("nlp", package = "rsimsum")
s.nlp <- simsum(
  data = nlp,
  estvarname = "b",
  true = 0,
  se = "se",
  methodvar = "model",
  by = c("baseline", "ss", "esigma")
)

library(ggplot2)
autoplot(s.nlp, stats = "bias", type = "nlp")

This has not changed since the previous release. If we however have an incomplete design, missing scenarios will now be empty:

nlp.subset <- subset(nlp, !(nlp$ss == 100 & nlp$esigma == 2))
s.nlp.subset <- simsum(
  data = nlp.subset,
  estvarname = "b",
  true = 0,
  se = "se",
  methodvar = "model",
  by = c("baseline", "ss", "esigma")
)
autoplot(s.nlp.subset, stats = "bias", type = "nlp")

This should allow using nested loop plots in a wider variety of scenarios. Many thanks to Mike Sweeting for the suggestion.

Zipper plots

Lastly, it is now possible to further customise the appearance of zipper plots; specifically, the horizontal lines that are used to denote the estimated confidence intervals (based on Monte Carlo errors) of coverage probability under each scenario. The default behaviour hasn’t changed, with yellow bands:

data(relhaz, package = "rsimsum")
relhaz <- subset(relhaz, relhaz$baseline == "Weibull" & relhaz$model != "RP(2)")
s.zip <- simsum(
  data = relhaz,
  estvarname = "theta",
  se = "se",
  true = -0.50,
  methodvar = "model",
  by = "n",
  x = TRUE
)
autoplot(s.zip, type = "zip")

What is nice is that now you can use the zip_ci_colors argument of autoplot() to customise this. If you pass a single colour name (or hex value), that will be used throughout:

autoplot(s.zip, type = "zip", zip_ci_colours = "purple")

If you pass two values, those will be used for optimal and sub-optimal coverage, respectively:

autoplot(s.zip, type = "zip", zip_ci_colours = c("green", "red"))

Finally, if you pass three values, those will be used for optimal coverage, under-coverage, and over-coverage, respectively:

autoplot(s.zip, type = "zip", zip_ci_colours = c("green", "blue", "red"))

Of course, I would suggest picking better colours for a publication-worthy plot, but you get the gist. Many thanks to Lorenzo Guizzaro for the suggestion.

The blog is back

Tue, 02 Jan 2024 23:00:00 GMT

Hi everyone,

Long time no see, right?

The last post I wrote on this blog is almost a year old, and a lot has happened in the meantime, behind the scenes. I won’t go into details (if you know, you know!), but 2023 has been a tough year for a bunch of different reasons. But we are back, baby!, and this time I will try to contribute content more consistently.

If you are new around here, you’ll notice that things look a bit different. For the keen observers in the audience this won’t be a surprise, but this blog is now powered by the amazing Quarto, with a custom, hand-crafted theme to bring the old website design into 2024. Gone is the old International Orange; welcome to Very Peri, the former Pantone color of the year for 2022. The new main typeface is Atkinson Hyperlegible for improved accessibility, and Ubuntu Condensed and Fira Code are used for headers and code, respectively.

I think the new design is great, and I hope you’ll like it too. Please have a look around and let me know if you spot anything broken.

I am keeping this post short and sweet, so that’s it for now, but I’ll be back very soon with some news I have already started working on. In the meantime, hope you had a nice holiday season and a happy new year!

Creating dumbbell plots in R

Mon, 09 Jan 2023 23:00:00 GMT

Hi everyone,

Happy new year! I hope you had a relaxing holiday season and that 2023 is treating you well so far.

Well, here’s another treat for you: today we are going to make a dumbbell plot from scratch, using our dear old friend {ggplot2}. Something quick and easy to get going in 2023, but fun nonetheless - and hopefully, useful too. Let’s start by defining what a dumbbell plot actually is:

A dumbbell plot (also known as a dumbbell chart, or connected dot plot) is great for displaying changes between two points in time, two conditions, or differences between two groups.

Source: amcharts.com

You might have seen this before in one of the nice visualisations that the OECD publishes from time to time:

As you can see, this is an intuitive way of showing how a certain metric has changed between two points in time. Let’s get going then, shall we?

For this example, we will use data on monthly step counts that yours truly logged in 2021 and 2022. One of the things I wanted to do more of in 2022, compared to 2021, was walking; will I have succeeded with that? Well, we’ll find out soon.

I extracted this data from Garmin Connect, as I have been wearing a Garmin watch for the past few years now, and this is stored in a dataset named dt:

head(dt)

# A tibble: 6 × 3
  Month     X2021  X2022
         
1 January  114171 194624
2 February 118548 223310
3 March    105853 224946
4 April    172499 206213
5 May      158913 246563
6 June     166119 244314

A very simple dataset, not much to see here. Let’s start building our plot: first, we create a ggplot object and put the different months on the vertical axis:

library(ggplot2)

db_plot <- ggplot(dt, aes(y = Month))
db_plot

Not much to see yet. Then, we add a set of points for 2021 data:

db_plot <- db_plot +
  geom_point(aes(x = X2021, color = "2021"))
db_plot

Same as before, but we add data for 2022:

db_plot <- db_plot +
  geom_point(aes(x = X2022, color = "2022"))
db_plot

It’s coming together nicely, isn’t it? Now, we add a segment to join the two sets of points:

db_plot <- db_plot +
  geom_segment(aes(yend = Month, x = X2021, xend = X2022))
db_plot

And there you have it. Thank you for reading and… wait! We are not done here, of course - now we need to turn this into a nice plot.

One problem here is that the segment overlaps the data points: that looks ugly. To solve this, we need to rebuild the plot but add the segment geometry first:

ggplot(dt, aes(y = Month)) +
  geom_segment(aes(yend = Month, x = X2021, xend = X2022)) +
  geom_point(aes(x = X2021, color = "2021")) +
  geom_point(aes(x = X2022, color = "2022"))

Already better! Then, let’s make the data points larger and change the colour of the bar to grey:

ggplot(dt, aes(y = Month)) +
  geom_segment(aes(yend = Month, x = X2021, xend = X2022), color = "grey50") +
  geom_point(aes(x = X2021, color = "2021"), size = 3) +
  geom_point(aes(x = X2022, color = "2022"), size = 3)

Let’s add a better scale for the horizontal axis; for this, we use the comma() function from the {scales} package:

library(scales)

ggplot(dt, aes(y = Month)) +
  geom_segment(aes(yend = Month, x = X2021, xend = X2022), color = "grey50") +
  geom_point(aes(x = X2021, color = "2021"), size = 3) +
  geom_point(aes(x = X2022, color = "2022"), size = 3) +
  scale_x_continuous(labels = comma)

Let’s label the plot correctly:

ggplot(dt, aes(y = Month)) +
  geom_segment(aes(yend = Month, x = X2021, xend = X2022), color = "grey50") +
  geom_point(aes(x = X2021, color = "2021"), size = 3) +
  geom_point(aes(x = X2022, color = "2022"), size = 3) +
  scale_x_continuous(labels = comma) +
  labs(x = "Steps", y = "", color = "Year")

Now, the final touches: let’s change the theme of the plot and tidy things up a little:

ggplot(dt, aes(y = Month)) +
  geom_segment(aes(yend = Month, x = X2021, xend = X2022), color = "grey50") +
  geom_point(aes(x = X2021, color = "2021"), size = 3) +
  geom_point(aes(x = X2022, color = "2022"), size = 3) +
  scale_x_continuous(labels = comma) +
  theme_bw(base_size = 12) +
  theme(legend.position = "bottom", plot.margin = unit(x = rep(1, 4), units = "lines")) +
  labs(x = "Steps", y = "", color = "Year", title = "Monthly Steps Walked, 2022 vs 2021")

We can also simplify the above by turning our data into long format:

library(tidyr)

dt_long <- pivot_longer(data = dt, cols = starts_with("X"))
dt_long$name <- factor(dt_long$name, levels = c("X2021", "X2022"), labels = c("2021", "2022"))
head(dt_long)

# A tibble: 6 × 3
  Month    name   value
        
1 January  2021  114171
2 January  2022  194624
3 February 2021  118548
4 February 2022  223310
5 March    2021  105853
6 March    2022  224946

The required code is very similar to what was used above, but we can now easily modify the colour palette too, and improve the title using {ggtext}:

library(ggtext)

ggplot(dt, aes(y = Month)) +
  geom_segment(aes(yend = Month, x = X2021, xend = X2022), color = "grey50") +
  geom_point(data = dt_long, aes(x = value, color = name), size = 3) +
  scale_x_continuous(labels = comma) +
  scale_color_manual(values = c("#F5DF4D", "#6667AB")) +
  theme_bw(base_size = 12) +
  theme(
    legend.position = "none",
    plot.title = element_markdown(),
    plot.margin = unit(x = rep(1, 4), units = "lines")
  ) +
  labs(
    x = "Steps",
    y = "",
    color = "Year",
    title = "Monthly Steps Walked,
    2022
    vs
    2021"
  )

And yes, for all you colours nerds out there: the two hex codes are Pantone’s colours of the year for 2021 and 2022, “Illuminating” and “Very Peri”. Fitting, right?

And yes, I did walk more in 2022 compared to 2021, it turns out! Well, except for October, but to be fair, I did run 120 km that month in 2021 compared to a shameful 0 km in 2022, so it could have been worse…

So there you have it: a short tutorial on building a dumbbell plot from scratch using {ggplot2} and other freely available tools, and making it nice (subjectively, of course). Other options for making dumbbell plots in R do exist, of course, such as the {ggalt} package - make sure you check that out too. And until next time, take care!

Monte Carlo errors for relative bias

Sat, 12 Nov 2022 23:00:00 GMT

Hi everyone,

Today we will be talking about simulation studies and Monte Carlo errors. Yeah, I know, you heard it all before, but trust me… there will be something interesting in here, I swear!

Let’s start with the basics: Monte Carlo simulation is a stochastic procedure, thus for each run of a simulation study, we are likely to get slightly different results. And that’s fine! However, we want to be able to reproduce simulation results. Sure, we could set the seed of the random numbers generator, but what if we want the results to be similar regardless of what seed we set?

This concept is often referred to as statistical reproducibility: in other words, we want to minimise simulation error for results to be similar across replications.

This simulation error is often referred to as Monte Carlo error. And as you might already know, the {rsimsum} package can calculate Monte Carlo error for a variety of performance measures, such as bias.

However, what if you want to calculate Monte Carlo error for any potential performance measure out there? Well, today is your lucky day: as suggested by White, Royston, and Wood (2011) in the settings of multiple imputation, we can calculate the Monte Carlo error using a jackknife procedure to our simulation results. That is, the Monte Carlo error for any performance measure will be the standard error of the mean of the pseudo values for that statistic, computed by omitting one simulation repetition at a time.

Throughout this post, I will introduce all the above concepts, and I will show you how to use the jackknife to compute the Monte Carlo standard error for relative bias. We will also validate our calculations, which is always a good thing to do!

Jackknife resampling

The jackknife technique is, technically speaking, a cross-validation technique and thus it can be considered to be a form of resampling. Specifically, given a sample of size n, a jackknife estimator can be built by aggregating the parameter estimates from each subsample of size (n - 1), where the n^th observation is omitted each time. Loosely speaking, this is not far from the concept of leave-one-out cross-validation.

If we define a given statistics θ, for a sample of size n, we can calculate n jackknife replicates, denoted with θⁿ, each computed by omitting the n^th observation. Then, we can calculate the mean and the variance (and thus standard error) of such jackknife replicates, and there you go.

I know it is a short and not comprehensive at all description: if you want to read more about this, check the Wikipedia page on jackknife resampling and this paper by Bradley Efron.

Relative bias

We now move on to relative bias. First, let’s define bias: according to Morris, White, and Crowther, that is defined (for a certain estimand θ) as: and estimated by assuming n repetitions.

This is routinely computed (and reported) for, anecdotally, most simulation studies in the literature. However, the magnitude of bias depends on the magnitude of θ; it is thus not necessarily easy to compare across studies with different data-generating mechanisms. That’s why it is sometimes interesting to compute relative bias, usually defined as and estimated by The obvious downside is that relative bias, as defined here, cannot be calculated when the true θ = 0, but that’s beyond the scope of this blog post.

Let’s not illustrate how to use the jackknife to calculate the Monte Carlo standard error of relative bias. To do so, we will use a dataset that comes bundled with {rsimsum} for illustration purposes:

library(rsimsum)
data("MIsim")
str(MIsim)

tibble [3,000 × 4] (S3: tbl_df/tbl/data.frame)
 $ dataset: num [1:3000] 1 1 1 2 2 2 3 3 3 4 ...
 $ method : chr [1:3000] "CC" "MI_T" "MI_LOGT" "CC" ...
 $ b      : num [1:3000] 0.707 0.684 0.712 0.349 0.406 ...
 $ se     : num [1:3000] 0.147 0.126 0.141 0.16 0.141 ...
 - attr(*, "label")= chr "simsum example: data from a simulation study comparing 3 ways to handle missing"

MIsim <- subset(MIsim, MIsim$method == "CC")

Remember that, for this specific dataset, the true value of the estimand is θ = 0.5. We will also be using a single method (in this case, CC) for simplicity, but this can obviously be generalised. Bias can be easily computed using the simsum() function:

sb <- simsum(data = MIsim, estvarname = "b", true = 0.5, se = "se")
tidy(summary(sb, stats = "bias"))

  stat        est        mcse       lower      upper
1 bias 0.01676616 0.004778676 0.007400129 0.02613219

Let’s now calculate relative bias, by hand, together with bias (to compare with the output of simsum()):

data.frame(
  bias = mean(MIsim$b) - 0.5,
  relative_bias = (mean(MIsim$b) - 0.5) / 0.5
)

        bias relative_bias
1 0.01676616    0.03353232

We get the same results: good! Then, let’s use the jackknife to calculate the Monte Carlo standard error.

First, let’s calculate all jackknife (leave-one-out) replicates and plot their distribution:

jk.estimate <- vapply(
  X = seq_along(MIsim$b),
  FUN = function(x) (mean(MIsim$b[-x] - 0.5) / 0.5), # Relative bias
  FUN.VALUE = numeric(1)
)
hist(x = jk.estimate, xlab = "Jackknife Replicates", main = "")

Then, let’s calculate the jackknife standard error, using the following formula:

n <- length(jk.estimate)
relative_bias_mcse <- sqrt(((n - 1) / n) * sum((jk.estimate - mean(jk.estimate))^2))
relative_bias_mcse

[1] 0.009557351

…and there you have it! We can also validate this procedure by calculating the jackknife Monte Carlo standard errors for bias:

jk.estimate <- vapply(
  X = seq_along(MIsim$b),
  FUN = function(x) mean(MIsim$b[-x] - 0.5), # Bias
  FUN.VALUE = numeric(1)
)
n <- length(jk.estimate)
bias_mcse <- sqrt(((n - 1) / n) * sum((jk.estimate - mean(jk.estimate))^2))
bias_mcse

[1] 0.004778676

Recall the results from simsum():

tidy(summary(sb, stats = "bias"))

  stat        est        mcse       lower      upper
1 bias 0.01676616 0.004778676 0.007400129 0.02613219

Once again, this is exactly the same, which is good.

Actually…

It turns out, we don’t actually need the jackknife to obtain Monte Carlo standard errors (MCSE) for relative bias (RB). It can be shown (as usual, exercise left to the reader) that we can obtain a closed-form formula:

where RB hat is the estimated relative bias computed using the estimator above. Let’s compare this with the jackknife estimate. First, recall the estimated MCSE using the jackknife:

relative_bias_mcse

[1] 0.009557351

Using the formula above:

n <- length(MIsim$b)
rb <- (mean(MIsim$b) - 0.5) / 0.5

sqrt(1 / (n * (n - 1)) * sum(((MIsim$b - 0.5) / 0.5 - rb)^2))

[1] 0.009557351

…and they’re the same. Cool!

If you are still not convinced, let me show you something else. What is another method that one could use to estimate standard errors for any statistics you can pretty much think of? Well, but of course, it’s our best friend the bootstrap!

Let’s use non-parametric bootstrap to calculate the standard error of our relative bias estimator. For that, we use the {boot} package in R:

library(boot)

set.seed(387456)
rbfun <- function(data, i, .true) mean((data[i] - .true) / .true)
bse <- boot(data = MIsim$b, rbfun, .true = 0.5, R = 50000)

Note that rbfun() is a function we define to calculate relative bias, that we use R = 50000 bootstrap samples to ensure convergence of the procedure, and that we set a seed (for reproducibility). The results of the bootstrap are printed below:

bse


ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = MIsim$b, statistic = rbfun, R = 50000, .true = 0.5)


Bootstrap Statistics :
      original        bias    std. error
t1* 0.03353232 -4.320035e-05 0.009571821

Once again: pretty close, not exactly the same (as the bootstrap is still a stochastic procedure), but close enough that we can be confident that they converge to the same value.

Finally, let me reassure you that the bootstrap has converged (and, side note, you should remember to do that too when you use it):

cmsds <- vapply(
  X = seq_along(as.numeric(bse$t)),
  FUN = function(i) sd(bse$t[1:i]),
  FUN.VALUE = numeric(1)
)
plot(
  x = cmsds,
  type = "l",
  ylab = "Boostrap SE",
  xlab = "Bootstrap Iteration"
)

…looks like the bootstrap converged, and that, probably, we did not need so many bootstrap samples after all. Luckily computations were so inexpensive that we could afford it (the whole thing took just a few seconds on my laptop), so that’s ultimately fine.

Wrap-up

And there you have it, examples of how to use the jackknife (and the bootstrap!) to estimate standard errors for a given metric of interest. Time to go wild and apply this to your settings!

In conclusion, I think these are very powerful tools that every statistician should be at least familiar with; there is plenty of literature on the topic, let me know if you’d like some references. Hope you learned something from this, I sure did by writing up all of this – and if you are still reading, here’s some breaking news:

Relative bias with Monte Carlo errors will be available in the next release of {rsimsum}! Coming soon to your nearest CRAN server…

Thanks for reading, and until next time, take care!

R on a Raspberry Pi Zero W

Mon, 25 Jul 2022 22:00:00 GMT

If you know me, you should know that I am obsessed with tiny computers and that some time ago I got myself a Raspberry Pi Zero W on which I run Pi-hole to block ads and tracking within our home network. How tiny is a Pi Zero W, you might wonder? Well, here it is:

And yes, that is a 13-inch MacBook Pro in the back and that’s a microSD card for storage:

I mean, look how tiny and cute it looks in its red-and-white case:

The whole case is barely thicker than the MacBook Pro itself, while the board is exactly 65 by 30 mm. It is, of course, not a powerhouse, being powered by a single-core, 1 GHz, 32-bit CPU with 512 MB of RAM, but that is more than enough for Pi-hole and some tinkering. And did I mention that it costs only $10? No? Well, isn’t that great?

Anyway, what you might not know is that, being powered by a full Debian-based distribution, you can install a variety of software using the apt package manager. Including R which, however, is often lagging behind the latest CRAN release (I could only install R 3.5.2 using apt).

Interestingly, a couple of days ago while I was browsing the schedule for rstudio::conf(2022), I came across the R for the Raspberry Pi project. It aims to provide up-to-date builds of R for Raspberry Pi computers, which can be installed in few simple steps. I mean, I obviously had to check it out!

…here we are, connecting to the headless Pi via SSH on my macOS terminal:

Nice, even though the latest available version (for my Pi Zero W, at least) is only R 4.1.2.

The next obvious question is: can you actually do something with R on such a low-powered computer?

To answer this question, I set up a short benchmark script that simulates data from a logistic regression model, for 1,000 subjects, and then fits the corresponding, true model 100 times (using the {microbenchmark} package). The script is available in the following Gist if you want to try running it on your machine for comparison:

Here are the results:

The median time was 106.4 milliseconds, with an interquartile interval of 105.9 to 118.4 milliseconds. Not bad! For comparison, let’s run the benchmark script on my 2019, 13-inch MacBook Pro (2.4 GHz, quad-core i5 processor, an i5-8279U, with 16 GB of RAM). That can be done in just a couple of lines, directly from your R session:

gist_url <- "https://gist.githubusercontent.com/ellessenne/5e0c35a7a0625d9dd5bbe8c284d52155/raw/9392820a5f53d8f438a19f1518e30eeca1df30f3/pizw-bench.R"
source(gist_url)

Unit: milliseconds
                expr      min       lq     mean  median       uq      max neval
 logistic regression 2.364326 2.817422 3.720309 3.29978 3.823631 12.73771   100

Okay, I guess the Pi Zero W is slow compared to my relatively modern laptop, approximately 30 to 40 times slower… but hey! It’s machine learning on a $10 computer, isn’t that awesome?

It’s clearly not enough for large projects, but a fantastic option for democratising data science, thanks to open-source software and a tiny $10 computer. If you want to read more about the concept, don’t forget to check out this early draft by the team of Jeff Leek on the importance of democratising data science education. The future is bright!

Numerical integration in generalised linear mixed models

Fri, 17 Jun 2022 22:00:00 GMT

Hi everyone!

I was in class a few weeks ago helping with a course on longitudinal data analysis, and towards the end of the course, we introduced generalised linear mixed models (GLMMs).

Loosely speaking, GLMMs generalise linear mixed models (LMMs) in the same way that generalised linear models (GLMs) extend the standard linear regression model: by allowing to model outcomes that follow distributions other than the gaussian, such as Poisson (e.g., for count data) or binomial (e.g., for binary outcomes – looking at you, logistic regression). Using proper notation, in GLMMs the response for the i^th subject at the j^th measurement occasion, , is assumed to follow any distribution from the exponential family such that where is a known link function and are the subject-specific random effects, assumed to follow a multivariate distribution with zero mean and variance-covariance matrix , . Given that random effects are subject-specific, responses within each individual will be correlated.

Let’s go one step forward by defining the GLMM-equivalent of a logistic regression model, where the link function is a logistic function, and assuming a random intercept only (for simplicity): Here can represent any set of fixed effects covariates, such as a treatment assignment, time, or something like that, a vector of regression coefficients, and is univariately normal. Note that interpretation of the fixed effects is similar to that of regression coefficients from a logistic regression model, as they can be interpreted as log odds ratios. The main difference, however, is that this interpretation is conditional on random effects being set to zero – loosely speaking, this interpretation holds for an average subject (in terms of random effects). It is fundamentally important to keep this in mind when interpreting the results of LMMs and GLMMs!

To estimate a GLMM, we can use the maximum likelihood approach. The joint probability density function of is given by where the random effect components are however latent (e.g., unobserved and unobservable). In order to calculate the individual contributions to the likelihood , we thus need to integrate out the density of the random effects: …which unfortunately does not have a closed, analytical form for GLMMs.

Luckily, several methods have been proposed throughout the years to approximate the value of integrals that do not have a closed-form; this is generally referred to as numerical integration.

Now, this is where we go off a tangent; buckle up, boys.

Numerical integration

According to Wikipedia,

“[…] numerical integration comprises a broad family of algorithms for calculating the numerical value of a definite integral.”

As I mentioned above, the basic problem that numerical integration aims to solve is to approximate the value of a definite integral such as to a given degree of accuracy.

The term numerical integration first appeared in 1915 (who knew?!) and, as you can imagine, a plethora of approaches have been proposed throughout the years to solve this problem.

Several approaches are based on deriving interpolating functions that are easy to integrate, such as polynomials of low degree (e.g., linear or quadratic). The simplest approach of this kind is the so-called rectangle rule (or midpoint rule), where the interpolating function is a constant function (i.e. a polynomial of degree zero) that passes through the midpoint: Of course, the smart thing about this approach is that we can divide the integral into a large number of sub-intervals, thus increasing the accuracy of the approximation. Let’s illustrate this by assuming that we are trying to integrate a standard normal distribution (the dnorm() function in R), e.g. . With five sub-intervals, the approximation would look like:

And the approximated integral would be 1.59684 (the true value is one, given that we are integrating a distribution). With more sub-intervals, e.g. 15, the accuracy of this approximation improves:

Now, the approximated integral is 1.00003, pretty much spot on. We can test how many sub-intervals are reguired to get a good approximation:

We can see that by using approximately 8-10 sub-intervals or more we can get a (very!) good approximation. Now, intuitively this works so well because we are trying to integrate a simple function; what if we had a more complex function, such as that we want to integrate between zero and 20?

As you can see from the plot above, the performance here is much worse than before.

The simple rectangle rule can now be improved by using more complex interpolating functions (e.g., using a polynomial of degree one, leading to a trapezoidal rule, or a polynomial of degree two, leading to a Simpson’s rule). We will however jump a million steps ahead (…sorry! The Wikipedia article on numerical integration includes a bunch more details if you want to read more about this), and talk about more elaborate numerical integration methods that are actually used when estimating the likelihood function of a GLMM.

Specifically, two approaches are routinely used, the first of which is Laplace approximation. The Laplace approximation uses a second-order Taylor series expansion, based on re-writing the intractable integrand from the likelihood contributions, which allows deriving closed-form expressions for the approximation. This is generally the fastest approach, and it can be shown that the approximation is asymptotically exact for an increasing number of observations per random effect. See, e.g., here for more details on the Laplace approximation in the settings of GLMMs.

Nevertheless, in practical applications, the accuracy of the Laplace approximation may still be of concern. Thus, a second approach for numerical integration is often used to obtain more accurate approximations (while however being more computationally demanding): Gaussian quadrature.

A quadrature rule is an approximation of a definite integral of a function that is stated as a weighted sum of function values at specified points within the domain of integration. For instance, when integrating a function over the domain , such rule (using points) is defined as: This rule is exact for polynomials of degree or less. For our specific problem, the domain of integration is (the domain of the normal distribution of the random effects); this leads to a so-called Gauss-Hermite quadrature rule, which given the normal distribution that we are trying to approximate, has optimal properties. Specifically, Gauss-Hermite integration with a function kernel of for a normal distribution with mean and variance , leads to the following rule: with the density of a standard normal distribution (such as dnorm() in R).

Let’s get back to the example that we used above, integrating a standard normal distribution, to visualise Gauss-Hermite quadrature. This can be visualised, for 5-points quadrature, as:

For 11-points quadrature:

Testing convergence of this procedure:

As before, already with 7 quadrature points we get a good approximation of the density. Anecdotally, a larger number of quadrature points might be required to get a good approximation in practice, which would require more function evaluations and therefore a larger computational burden. Furthermore, this problem becomes exponentially more complex in higher dimensions, e.g., with more than one random effect: a univariate quadrature rule with 5 points requires 5 function evaluations, while a bivariate one with the same number of points requires 5²=25 evaluations every single time the intractable integral is evaluated. Interestingly, Pinheiro and Bates showed that it is possible to adapt the integration procedure to be more computationally efficient in higher dimensions, by centring and scaling the quadrature procedure using conditional moments of the random effects distribution.

Now, we arrived at the end of this detour in the world of numerical integration; there are many more details that I skipped here, but the references linked above should provide a good starting point for the interested reader. Let’s get back to business.

Numerical integration in GLMMs

Remember that the individual contributions to the likelihood for a GLMM are intractable, thus requiring numerical integration:

By now you can probably appreciate that accuracy of this approximation is key, but in practical terms, what does this mean? In other words, what happens if we fail to accurately approximate that integral?

Yes, my friends, it is finally time to get to the meat of this post, a mere 2,000 words later!

Simulation? Simulation!

We can try to answer that question using statistical simulation. If you know me, you might have noticed that I think that simulations are a fantastic tool for learning about complex statistical concepts. What better than this to further prove the point, then?

Let’s start by describing the protocol of this simulation study. We use the ADEMP framework to structure the protocol (and if you’re not on the ADEMP bandwagon, jump on, there’s still some space left).

Aim

The aim of this simulation study is to test the accuracy of different numerical integration methods in terms of estimation of fixed and random effects in GLMMs.

Data-generating mechanism

We use a single data generating mechanism for simplicity. We simulate binary outcomes for the i^th subject, j^th occasion from the following data-generating model:

Note that:

We include a binary treatment which is assigned a random at baseline by drawing from a Bernoulli random variable with a success probability of 0.5;
Time between each measurement, in years, for each subject, is simulated by drawing from a Uniform(0, 1) random variable;
I allow for a maximum of 100 measurements per subject, and truncate follow-up after 10 years;
is a random intercept, simulated by drawing a subject-specific value from a normal distribution with mean zero and standard deviation = 4;
The regression parameters are assigned the values -2, -1, 0.5, and -0.1 for , , , and , respectively.

Finally, every simulated dataset includes 500 subjects.

Estimands

The estimands of interest are the regression parameters and the standard deviation of the random intercept .

Methods

We fit the true, data-generating GLMM as implemented in the glmer() function from the {lme4} package in R, where we vary, however, the numerical integration method. In glmer(), this is defined by the nAGQ argument. We use the following methods for numerical integration:

nAGQ = 0, corresponding to a faster but less exact form of parameter estimation for GLMMs by optimizing the random effects and the fixed-effects coefficients in the penalised iteratively reweighted least-squares step;
nAGQ = 1, corresponding to the Laplace approximation. This is the default for glmer();
nAGQ = k, with k number of points used for the adaptive Gauss-Hermite quadrature method. I test values of k from 2 to 10, and then from 15 to 50 at steps of 5, for a total of 17 possible values of k. Remember that, as the glmer() documentation points out, larger values of k produce greater accuracy in the evaluation of the log-likelihood at the expense of speed.

In total, 19 models are fit to each simulated dataset and compared with this simulation study.

Performance measure

The main performance measure of interest is bias in the estimands of interest, to test the accuracy of the different numerical integration methods.

Number of repetitions

An important issue to keep in mind when running simulation studies is that of Monte Carlo error. Loosely speaking, there is a certain amount of randomness in the simulation and therefore uncertainty in the estimation of the performance measures of interest, thus we have to make sure that we can estimate them accurately (e.g., with low Monte Carlo error). This can be done iteratively, e.g., by running repetitions until the Monte Carlo error drops below a certain threshold (see here for more details).

For simplicity, we take a different approach here. What we do is the following:

Running 20 repetitions of this simulation study;
Using these 20 repetitions to estimate empirical standard errors and average model-based standard errors, for each estimand, and taking the largest value (denoted with );
Using the value that was just estimated to estimate how many repetitions would be needed to constrain Monte Carlo error to be less than 0.01, using the formula n_sim = ;
Rounding up the value of n_sim to the nearest 100, to be conservative.

n_sim was finally estimated to be 1700, in this case, given .

Results

At last, here are the results of this simulation study.

First, we assess whether all Monte Carlo errors for bias are below the threshold of 0.01. The plot below shows that the number of repetitions of this simulation study was enough to constrain Monte Carlo error within what we deemed acceptable.

Note that (Intercept) corresponds to , trt corresponds to , time corresponds to , and trt:time corresponds to . Unsurprisingly, sd__(Intercept) represents the standard deviation of the random intercept, .

Second, we study bias for the fixed effects (e.g., the regression coefficients of the GLMM). The following plot depicts bias with 95% confidence intervals (based on Monte Carlo standard errors) for all fixed effects and across all integration methods:

This shows that the fast-but-approximate method yields biased results, and that so does the Laplace approximation (in these settings, and except for the treatment by time interaction term). Overall, this shows that a larger number of points for the adaptive quadrature (e.g., 10) is required to obtain unbiased results for all regression coefficients.

Finally, the next plot shows bias for the standard deviation of the random intercept:

Again, we can see that a larger number of quadrature points is required to get a good approximation with no bias.

Another thing that is interesting to study here is estimation time. As mentioned many times before, the larger the number of quadrature points the greater the accuracy, at the cost of additional computational complexity. But how much overhead do we have in these settings?

I’m glad you asked: the following plot depicts the distribution of estimation times for each GLMMs under each integration approach:

The first thing to note is that the fast-but-approximate method is actually really fast! Second, and as expected, with more quadrature points the median estimation time was also larger. No surprises so far. Overall, though, glmer() seemed to be really fast in the settings, with estimation times rarely exceeding 25 seconds.

An actual example

Before we wrap up, let’s compare the different methods in practice by analysing a real dataset. For this, we use data from the 1989 Bangladesh fertility survey, which can be obtained directly from Stata:

library(haven)
library(dplyr)

bangladesh <- read_dta("http://www.stata-press.com/data/r17/bangladesh.dta")
glimpse(bangladesh)

Rows: 1,934
Columns: 8
$ district  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ c_use     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, …
$ urban     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ age       18.440001, -5.559990, 1.440001, 8.440001, -13.559900, -11.559…
$ child1    0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0…
$ child2    0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ child3    1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0…
$ children  3, 0, 2, 3, 0, 0, 3, 3, 1, 3, 0, 0, 1, 3, 3, 3, 0, 3, 1, …

We will fit a GLMM model for a binary outcome trying to study contraceptive use by urban residence, age and number of children; we also include a random intercept by district.

For illustration purposes, we fit three models:

GLMM estimated with the fast-but-approximate (nAGQ = 0) approach;
GLMM estimated with the Laplace approximation (nAGQ = 1);
GLMM estimated with adaptive Gauss-Hermite quadrature using 50 points (nAGQ = 50).

The models can be fit with the following code:

library(lme4)

f.0 <- glmer(c_use ~ urban + age + child1 + child2 + child3 + (1 | district), family = binomial(), data = bangladesh, nAGQ = 0)
f.1 <- glmer(c_use ~ urban + age + child1 + child2 + child3 + (1 | district), family = binomial(), data = bangladesh, nAGQ = 1)
f.50 <- glmer(c_use ~ urban + age + child1 + child2 + child3 + (1 | district), family = binomial(), data = bangladesh, nAGQ = 50)

…and we use the {broom.mixed} package to tidy, summarise, and compare the results from the three models:

library(broom.mixed)
library(ggplot2)

bind_rows(
  tidy(f.0, conf.int = TRUE) %>% mutate(nAGQ = "f.0"),
  tidy(f.1, conf.int = TRUE) %>% mutate(nAGQ = "f.1"),
  tidy(f.50, conf.int = TRUE) %>% mutate(nAGQ = "f.50")
) %>%
  ggplot(aes(x = nAGQ, y = estimate)) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 1 / 3) +
  geom_point() +
  scale_y_continuous(n.breaks = 5) +
  theme_bw(base_size = 12) +
  facet_wrap(~term, scales = "free_y") +
  labs(x = "Method (nAGQ)", y = "Point Estimate (95% C.I.)")

Model coefficients from the three integration approaches are very similar in this case, thus we can be confident that the results of this analysis are likely not affected by that. Let’s thus print the summary of the model using adaptive quadrature:

summary(f.50)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 50) [glmerMod]
 Family: binomial  ( logit )
Formula: c_use ~ urban + age + child1 + child2 + child3 + (1 | district)
   Data: bangladesh

     AIC      BIC   logLik deviance df.resid 
  2427.7   2466.6  -1206.8   2413.7     1927 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.8140 -0.7662 -0.5087  0.9973  2.7239 

Random effects:
 Groups   Name        Variance Std.Dev.
 district (Intercept) 0.2156   0.4643  
Number of obs: 1934, groups:  district, 60

Fixed effects:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.689295   0.147759 -11.433  < 2e-16 ***
urban        0.732285   0.119486   6.129 8.86e-10 ***
age         -0.026498   0.007892  -3.358 0.000786 ***
child1       1.116005   0.158092   7.059 1.67e-12 ***
child2       1.365893   0.174669   7.820 5.29e-15 ***
child3       1.344030   0.179655   7.481 7.37e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
       (Intr) urban  age    child1 child2
urban  -0.288                            
age     0.449 -0.044                     
child1 -0.588  0.054 -0.210              
child2 -0.634  0.090 -0.382  0.489       
child3 -0.751  0.093 -0.675  0.538  0.623

Furthermore, we can interpret exponentiated fixed effect coefficients as odds ratios:

exp(fixef(f.50))

(Intercept)       urban         age      child1      child2      child3 
  0.1846497   2.0798277   0.9738498   3.0526345   3.9192232   3.8344665

This shows that contraceptive use is higher in urban versus rural areas, with double the odds, that older women have lower odds of using contraceptives, and that the odds of using contraceptives are much higher in women with one or more children (odds ratios between 3 and 4 for women with 1, 2, or 3 children versus women with no children) – all else being equal. Nonetheless, remember that these results are conditional on a random intercept of zero, e.g., loosely speaking, for an average district.

Finally, there is also significant heterogeneity in contraceptive use between districts, with a variance for the random intercept of 0.2156. The distribution of the (log-) baseline odds of contraceptive use across districts can be visualised as:

ggplot(data = coef(f.50)$district, aes(x = `(Intercept)`)) +
  geom_density() +
  theme_bw(base_size = 12) +
  labs(x = "Baseline log-odds of contraceptive use", y = "Density")

Closing thoughts

In conclusion, we saw how easy it is to get biased fixed effect estimates in GLMMs if we use a numerical integration method that is not accurate enough. Of course, this is just a single scenario and not a comprehensive study of estimation methods for GLMMs, but it should nonetheless show the importance of considering the estimation procedure when using (somewhat) more advanced statistical methods. This is, unfortunately, often overlooked in practice. My friendly suggestion would be to fit multiple models with different integration techniques, and if no significant differences are observed, then great: you got yourself some results that are robust to the numerical integration method being used.

This wraps up the longest blog post I’ve ever written (at ~3,500 words): if you made it this far, thanks for reading, and I hope you found this educational. And as always, please do get in touch if I got something wrong here.

Finally, R code for the simulation study, if you want to replicate the analysis or adapt it to other settings, is of course freely available on a GitHub repository here.

Cheers!

How to lose your talent in a simple step

Mon, 09 May 2022 22:00:00 GMT

Zoë Schiffer from The Verge reports that the director of machine learning at Apple is leaving the company due to its return to work policy:

Ian Goodfellow, Apple’s director of machine learning, is leaving the company due to its return to work policy. In a note to staff, he said “I believe strongly that more flexibility would have been the best policy for my team.” He was likely the company’s most cited ML expert.
— Zoë Schiffer (@ZoeSchiffer) May 7, 2022

I have had this conversation with peers and co-workers over and over again in the past few months: there is no going back to pre-pandemic life. And while I value in-person activities (looking at you, in-person research seminars), I am totally sure that organisations failing to accommodate what people want will lose talent to more flexible workplaces.

Update: as of May 18, 2022, it looks like he is joining Alphabet’s DeepMind. What a loss for Apple.

Maximum likelihood estimation with {torch}

Wed, 27 Apr 2022 22:00:00 GMT

Hi everyone!

Today’s blog post is a long time in the making, as I have been playing around with what we’re going to see today for quite a while now.

Let’s start with {torch}: what is that? Well, {torch} is an R package wrapping the libtorch C++ library underlying the PyTorch open-source machine learning framework. It provides a variety of tools for developing machine learning methods, but there’s more: what we will focus on here is automatic differentiation and general-purpose optimisers.

Having these tools at our disposal lets us implement maximum likelihood estimation with state of the art tools. I will illustrate this using a simple parametric survival model, but as you can imagine, this generalises to more complex methods.

Parametric survival model

We will optimise an exponential survival model, for simplicity, whose log-likelihood function can be written as

Here is the event indicator variable, is the model design matrix, are regression coefficients, and is the observed time. Note that includes an intercept, which corresponds to the rate parameter on the log scale.

Other parametric distributions (such as Weibull, Gompertz, etc.) are equally easy to implement, let me know if you fancy trying this out for yourself!

Simulating data

Let’s start by simulating some data. To simulate survival data from a given parametric distribution, I use the inversion method as described in Bender et al., assuming a single binary covariate (e.g., a binary treatment):

set.seed(183475683) # For reproducibility

N <- 100000
lambda <- 0.2
beta <- -0.5
covs <- data.frame(id = seq(N), trt = stats::rbinom(N, 1L, 0.5))

# Inversion method for survival times:
u <- runif(N)
T <- -log(u) / (lambda * exp(covs$trt * beta))

We also apply administrative censoring at time 5:

d <- as.numeric(T <= 5)
T <- pmin(T, 5)
s1 <- data.frame(id = seq(N), eventtime = T, status = d)
dd <- merge(covs, s1)

We simulate data for 10^{5} subjects (which should be plenty enough to get pretty close to the truth if our implementation is correct), a rate parameter of 0.2 and a regression coefficient of -0.5. If interested, a more general implementation of this method can be found in the {simsurv} package.

As a test, let’s fit and plot a Kaplan-Meier survival curve:

library(survival)

KM <- survfit(Surv(eventtime, status) ~ 1, data = dd)
plot(KM, xlab = "Time", ylab = "Survival")

Looks alright!

Likelihood implementation

Now, we implement the (log-) likelihood function using {torch}. The important thing to remember here is that {torch} uses tensors, on which we need to operate e.g. using torch_multiply for matrix multiplication:

library(torch)

log_likelihood <- function(par, data, status, time) {
  ll <- torch_multiply(status, torch_mm(data, par)) -
    torch_multiply(torch_exp(torch_mm(data, par)), time) +
    torch_multiply(status, torch_log(time))
  ll <- -torch_sum(ll)
  return(ll)
}

As a test, let’s define starting values for the model parameters (e.g., fixing their values at 1) and calculate the value of the (negative) log-likelihood function:

xx <- torch_tensor(matrix(c(1, 1), ncol = 1))
log_likelihood(
  par = xx,
  data = torch_tensor(data = model.matrix(~trt, data = dd)),
  status = torch_tensor(matrix(dd$status, ncol = 1)),
  time = torch_tensor(matrix(dd$eventtime, ncol = 1))
)

torch_tensor
1.71723e+06
[ CPUFloatType{} ]

Looking good so far.

Likelihood optimisation

The final step consists of implementing the algorithm to optimise the likelihood. We start by re-defining starting values:

x_star <- torch_tensor(matrix(c(1, 1), ncol = 1), requires_grad = TRUE, )

Here we need to use the argument requires_grad = TRUE to use automatic differentiation and get gradients for free.

Next, we pick a general-purpose optimiser:

optimizer <- optim_lbfgs(params = x_star, line_search_fn = "strong_wolfe")

We pick the L-BFGS algorithm with strong Wolfe conditions for the line search algorithm, but any would do. Note that a comparable algorithm is implemented in base R as optim()’s L-BFGS-B method.

We also need to define one extra function that will be used in the optimisation loop to make each step towards the optimum:

one_step <- function() {
  optimizer$zero_grad()
  value <- log_likelihood(
    par = x_star,
    data = torch_tensor(data = model.matrix(~trt, data = dd)),
    status = torch_tensor(matrix(dd$status, ncol = 1)),
    time = torch_tensor(matrix(dd$eventtime, ncol = 1))
  )
  value$backward(retain_graph = TRUE)
  value
}

We finally have all the bits to actually optimise the likelihood.

We define the required precision as eps = 1e-6, and we loop until the difference in log-likelihood is less than (or equal to) eps:

eps <- 1e-6 # Precision
converged <- FALSE # Used to stop the loop
last_val <- Inf # Need a value to compare to for the first iteration
i <- 0 # Iterations counter

while (!converged) {
  i <- i + 1
  obj_val <- optimizer$step(one_step)
  if (as.logical(torch_less_equal(torch_abs(obj_val - last_val), eps))) {
    print(i) # This will print how many iterations were required before stopping
    converged <- TRUE
    break
  }
  if (i >= 10000) {
    # For safety
    stop("Did not converge after 10000 iterations", call. = FALSE)
  }
  last_val <- obj_val
}

[1] 3

That’s it! The results of the optimisation are contained in the x_star object:

x_star

torch_tensor
-1.6160
-0.4951
[ CPUFloatType{2,1} ][ requires_grad = TRUE ]

…remember that the true values that we simulated data from were:

log(lambda)

[1] -1.609438

# and
beta

[1] -0.5

Which is pretty close to what we had estimated. Of course, this is a single replication only, and we might want to test this with smaller sample sizes. Nevertheless, the test sample size is large enough that I would feel comfortable with this implementation.

Conclusions

One thing that is missing from the implementation above is the estimation of confidence intervals for the model parameters.

We get the gradients for free, so that should be straightforward after inverting the Hessian matrix at the optimum. Despite that, the R interface does not implement (yet) direct calculation of the Hessian via the torch.autograd.functional.hessian function so we need to work a little harder for that.

Specifically, we have to differentiate the gradients again to obtain the Hessian matrix:

ll <- log_likelihood(
  par = x_star,
  data = torch_tensor(data = model.matrix(~trt, data = dd)),
  status = torch_tensor(matrix(dd$status, ncol = 1)),
  time = torch_tensor(matrix(dd$eventtime, ncol = 1))
)
grad <- autograd_grad(ll, x_star, retain_graph = TRUE, create_graph = TRUE)[[1]]

# Using base R matrix here for simplicity
hess <- matrix(data = NA, nrow = length(x_star), ncol = length(x_star))
for (d in 1:length(grad)) {
  hess[d, ] <- as_array(autograd_grad(grad[d], x_star, retain_graph = TRUE)[[1]])
}

The variance-covariance matrix for the model coefficients will now be the inverse of the Hessian:

vcv <- solve(hess)

To wrap up, the fitted model coefficients (with standard errors) will be:

results <- data.frame(beta = as_array(x_star), se = sqrt(diag(vcv)))
results

        beta          se
1 -1.6159527 0.005639768
2 -0.4950813 0.008709507

Hopefully, the {torch} package in R will soon port the automatic Hessian calculation, which will simplify things further.

Finally, for comparison, we fit the same model using the equivalent R implementation from the (experimental) {streg} package:

library(streg)
expfit <- streg(
  formula = Surv(eventtime, status) ~ trt,
  data = dd,
  distribution = "exp"
)
summary(expfit)

Exponential regression -- log-relative hazard form

N. of subjects  = 100000 
N. of failures  = 54174 
Time at risk    = 345727.5 

Log likelihood  = -131655.9 

             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.615311   0.005637 -286.54   <2e-16 ***
trt         -0.495611   0.008707  -56.92   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This also uses automatic differentiation (via {TMB}), but nevertheless… pretty close, isn’t it?

That’s it for today, and as always, thank you for reading and feel free to get in touch if I got something terribly wrong or if you just want to have a chat about it. Cheers!

Assessing convergence of a simulation study

Thu, 30 Dec 2021 23:00:00 GMT

Hey everyone,

The blog is back! And just in time to wrap up 2021.

Let’s get straight to business, shall we? Today we’re going to talk about ways of assessing whether your simulation study has converged. This is something that’s been on my mind for a while, and I finally got some focused time to write about it.

To do so, let’s first define what we mean with converged:

We define a simulation study as converged if:

Estimation of our key performance measures is stable, and

Monte Carlo error is small enough.

We are going to use the MIsim dataset, which you should be familiar with by now if you’ve been here before, and which comes bundled with the {rsimsum} package:

library(rsimsum)
data("MIsim", package = "rsimsum")
head(MIsim)

# A tibble: 6 × 4
  dataset method      b    se
         
1       1 CC      0.707 0.147
2       1 MI_T    0.684 0.126
3       1 MI_LOGT 0.712 0.141
4       2 CC      0.349 0.160
5       2 MI_T    0.406 0.141
6       2 MI_LOGT 0.429 0.136

To find out more about this data, type ?MIsim in your R console after loading the {rsimsum} package.

Interestingly, this dataset already includes a column named dataset indexing each repetition of the simulation study. If that was not the case, we should create such a column at this stage.

Let’s also load the {tidyverse} package for data wrangling and visualisation with {ggplot2}:

library(tidyverse)

The first step consists of computing performance measures cumulatively, e.g. for the first i repetitions (where i goes from 10 to the total number of repetitions that we run for our study, in this case, 1000). We use 10 here as the starting point as we assume that we should run at least 10 repetitions to get any useful result.

This can be easily done using our good ol’ friend the simsum function and map_ from the {purrr} package; specifically, we use map_dfr() to get a dataset obtained by stacking rows instead of a list object. Notice also that we only focus on bias as the performance measure of interest here; this could be done, in principle, for any other performance measure.

full_results <- map_dfr(
  .x = 10:1000,
  .f = function(i) {
    s <- simsum(
      data = filter(MIsim, dataset <= i),
      estvarname = "b",
      se = "se",
      true = 0.5,
      methodvar = "method",
      ref = "CC"
    )
    s <- summary(s)
    results <- rsimsum:::tidy.summary.simsum(s)
    results <- filter(results, stat == "bias") %>%
      mutate(i = i)
    return(results)
  }
)
head(full_results)

  stat         est       mcse  method       lower      upper  i
1 bias 0.006621292 0.04267397      CC -0.07701815 0.09026074 10
2 bias 0.016931728 0.03830584 MI_LOGT -0.05814635 0.09200980 10
3 bias 0.008187965 0.03047714    MI_T -0.05154612 0.06792205 10
4 bias 0.009806267 0.03873124      CC -0.06610556 0.08571810 11
5 bias 0.021526182 0.03495223 MI_LOGT -0.04697892 0.09003129 11
6 bias 0.016194059 0.02870663    MI_T -0.04006990 0.07245802 11

We can use these results to plot estimated bias (and corresponding Monte Carlo standard errors) over repetition numbers:

library(hrbrthemes) # Provide some nicer themes and colour scales

ggplot(full_results, aes(x = i, y = est, color = method)) +
  geom_line() +
  scale_color_ipsum() +
  theme_ipsum_rc(base_size = 12) +
  theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
  labs(x = "Repetition #", y = "Estimated Bias", color = "Method", title = "Bias over (cumulative) repetition number")

ggplot(full_results, aes(x = i, y = mcse, color = method)) +
  geom_line() +
  scale_color_ipsum() +
  theme_ipsum_rc(base_size = 12) +
  theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
  labs(x = "Repetition #", y = "Estimated Monte Carlo S.E.", color = "Method", title = "Monte Carlo S.E. over (cumulative) repetition number")

What shall we be looking for in these plots?

Basically, if (and when) the curves flatten. Remember flatten the curve? That applies here as well.

For Monte Carlo standard errors, we want to also check when they reach a threshold of uncertainty (e.g. 0.01) that we are willing to accept. We can also add this threshold to the plot to help our intuition:

ggplot(full_results, aes(x = i, y = mcse, color = method)) +
  geom_hline(yintercept = 0.01, color = "red", linetype = "dotted") +
  geom_line() +
  scale_color_ipsum() +
  theme_ipsum_rc(base_size = 12) +
  theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
  labs(x = "Repetition #", y = "Estimated Monte Carlo S.E.", color = "Method", title = "Monte Carlo S.E. over (cumulative) repetition number")

The plots above seem to be stable enough after ~500 repetitions, while Monte Carlo errors cross our threshold after ~250 repetitions. If I had to interpret this, I would be satisfied with the convergence of the study.

The dataset with results (full_results) includes confidence intervals for estimated bias, at each (cumulative) repetition number, based on Monte Carlo standard errors. We can, therefore, further enhance the first plot introduced above:

ggplot(full_results, aes(x = i, y = est)) +
  geom_ribbon(aes(ymin = lower, ymax = upper, fill = method), alpha = 1 / 5) +
  geom_line(aes(color = method)) +
  scale_fill_ipsum() +
  scale_color_ipsum() +
  theme_ipsum_rc(base_size = 12) +
  theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
  labs(x = "Repetition #", y = "Estimated Bias", color = "Method", fill = "Method")

Adding confidence intervals surely enhances our perception of stability.

Now, we could stop here and call it a day, but where’s the fun in that? Let’s take it to the next level.

I think that an even better way of assessing convergence is to check whether the incremental value of an additional repetition affects the results of the study (or not). When running additional repetitions stop adding value (e.g. changing the results), then we can (safely?) assume we have converged to stable results.

Let’s, therefore, calculate this difference (e.g. bias at i^th iteration versus bias at the (i-1)^th iteration), and let’s do it for both estimated bias and Monte Carlo standard error:

full_results <- arrange(full_results, method, i) %>%
  group_by(method) %>%
  mutate(
    lag_est = lag(est),
    lag_mcse = lag(mcse),
    diff_est = est - lag_est,
    diff_mcse = mcse - lag_mcse
  )

This difference is what we now decide to plot:

p1 <- ggplot(full_results, aes(x = i, y = diff_est, color = method)) +
  geom_line() +
  scale_color_ipsum() +
  theme_ipsum_rc(base_size = 12) +
  theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
  labs(x = "Repetition #", y = "Difference", color = "Method", title = expression(Bias ~ difference:~ i^th - (i - 1)^th))
p1

p2 <- ggplot(full_results, aes(x = i, y = diff_mcse, color = method)) +
  geom_line() +
  scale_color_ipsum() +
  theme_ipsum_rc(base_size = 12) +
  theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
  labs(x = "Repetition #", y = "Difference", color = "Method", title = expression(Monte ~ Carlo ~ S.E. ~ difference:~ i^th - (i - 1)^th))
p2

We save both plot objects p1 and p2, as we might be using that again later on. What we want here are the curves to flatten around zero, which seems to be the case for this example.

The y-scale of the plot is highly influenced by larger differences early on, so we could decide to focus on a narrower range around zero, e.g. for point estimates:

p1 + coord_cartesian(ylim = c(-0.003, 0.003))

With this, we can confirm what we suspected a few plots ago: that yes, the incremental value of running more than ~500 repetitions is somewhat marginal here.

To wrap up, there is one final thing I would like to mention here: all we did is post-hoc. We already run the study, hence all we get to do is to check whether the results are stable and precise enough (in our loose terms defined above).

The implication is an interesting one, in my opinion: you do not need a bunch of iterations if you can avoid it, especially if each repetition is expensive (computationally speaking).

What you could do, if you really wanted to, is to define a stopping rule such as:

After N (e.g. 10) consecutive iterations with a difference below a certain threshold (for the key performance measure), stop the study;
After you reach a given precision for bias (in terms of Monte Carlo error), stop the study;
You name it.

Of course, there are other things to consider such as the sequentiality of repetitions, especially if there are missing values (e.g. due to non-convergence of some iterations) – this is not meant to be, in any way, a comprehensive take on the topic. Feel free to reach out, as always, if you have any comments. Nevertheless, I think I will get back to this topic eventually, so stay tuned for that.

That’s all from me for today, then it must be closing time: talk to you soon and, in the meantime, take care!

rsimsum 0.11.0

Mon, 25 Oct 2021 22:00:00 GMT

Hello!

It’s been a while since the last post on this website… don’t worry (I am sure you didn’t), I’m still here, just been busy with a bunch of life- and work-related things.

This post is to introduce the latest release of the {rsimsum} R package, version 0.11.0, which landed on CRAN last week, on October 20^th.

This is a minor release, with some bug fixes and (more interestingly) the introduction of a new feature that was suggested in #22 on GitHub by Li Ge: print() methods for summary objects now invisibly return the formatted tables that are printed to the console.

Ok, but why should I care about that?

It’s simple: you can print subset of results (e.g. for a presentation) much more easily, as the internals of {rsimsum} will deal with some of the formatting for you.

Here’s an example, using the MIsim dataset that comes bundled with the package:

library(rsimsum)
data("MIsim", package = "rsimsum")

We summarise this simulation study as showed in the documentation here:

s <- simsum(
  data = MIsim,
  estvarname = "b",
  se = "se",
  true = 0.5,
  methodvar = "method",
  ref = "CC"
)
sums <- summary(s, stats = c("bias", "cover"))
sums

Values are:
    Point Estimate (Monte Carlo Standard Error)

Bias in point estimate:
              CC         MI_LOGT             MI_T
 0.0168 (0.0048) 0.0009 (0.0042) -0.0012 (0.0043)

Coverage of nominal 95% confidence interval:
              CC         MI_LOGT            MI_T
 0.9430 (0.0073) 0.9490 (0.0070) 0.9430 (0.0073)

This is the standard workflow, and we focus on bias and coverage probability for simplicity. At this point, we could copy-paste a subset of the above results in our slides-making tool of choice.

However, that’s not particularly user-friendly, nor easily reproducible (e.g. if you run more iterations and need to update your results). Here’s where invisibly-returned formatted tables come handy:

output <- print(sums)

Values are:
    Point Estimate (Monte Carlo Standard Error)

Bias in point estimate:
              CC         MI_LOGT             MI_T
 0.0168 (0.0048) 0.0009 (0.0042) -0.0012 (0.0043)

Coverage of nominal 95% confidence interval:
              CC         MI_LOGT            MI_T
 0.9430 (0.0073) 0.9490 (0.0070) 0.9430 (0.0073)

Note here that the output of sums is printed once again, but it is also stored in a variable named output, which contains the formatted tables for each summary statistic of interest:

str(output)

List of 2
 $ Bias in point estimate                     :'data.frame':    1 obs. of  3 variables:
  ..$ CC     : chr "0.0168 (0.0048)"
  ..$ MI_LOGT: chr "0.0009 (0.0042)"
  ..$ MI_T   : chr "-0.0012 (0.0043)"
 $ Coverage of nominal 95% confidence interval:'data.frame':    1 obs. of  3 variables:
  ..$ CC     : chr "0.9430 (0.0073)"
  ..$ MI_LOGT: chr "0.9490 (0.0070)"
  ..$ MI_T   : chr "0.9430 (0.0073)"

With this data at our disposal, we can finally print a better table using the general-purpose kable() function from the {knitr} package:

w <- output$`Bias in point estimate`
knitr::kable(
  x = w,
  align = rep("c", ncol(w))
)

CC	MI_LOGT	MI_T
0.0168 (0.0048)	0.0009 (0.0042)	-0.0012 (0.0043)

Of course we would need to further improve on this for production (e.g. it now spans the whole width of the enclosing

, we might want to style it and resize it accordingly using css), but it is already a clear improvement. Most interestingly, consider including all of the above in a dynamically updated slides deck (e.g. created using the {xaringan} package): much better than before, with very little extra work, and plenty of room for further adjustments if you wish. Not bad!

Finally, the example above is for the summary.simsum() method: this is implemented for multiple estimands as well (in the summary.multisimsum() function), with an additional layer of nesting for the formatted output. I’m sure it’ll be straightforward to figure that out if you want to try it out, let me know if it isn’t.

That’s all for now, hope you find this useful and if you have further suggestions for features you’d like to see in {rsimsum}, don’t hesitate to get in touch or to open an issue on GitHub.

Simulation and Twitter riddles

Thu, 22 Apr 2021 22:00:00 GMT

Hi everyone!

Earlier today I stumbled upon this tweet by Sean J. Taylor:

Without doing the math or looking it up, approximately how many coin flips do you need to observe in order to be 95% confident you know Prob(Heads) to within +/- 1%?
— Sean J. Taylor (@seanjtaylor) April 22, 2021

…and I asked myself:

Can we answer this using simulation?

The answer is that yes, yes we can! Let’s see how we can do that using R (of course). Looks like this is now primarily a statistical simulation blog, but hey…

Let’s start by loading some packages and setting an RNG seed for reproducibility:

library(tidyverse)
library(hrbrthemes)
set.seed(3756)

Then, we define a function that we will use to replicate this experiment:

simfun <- function(i, .prob, .diff = 0.02, .N = 100000, .alpha = 0.05) {
  draw <- rbinom(n = .N, size = 1, prob = .prob)
  df <- tibble(
    n = seq(.N),
    x = cumsum(draw),
    mean = x / n,
    lower = (1 + (n - x + 1) / (x * qf(.alpha / 2, 2 * x, 2 * (n - x + 1))))^(-1),
    upper = (1 + (n - x) / ((x + 1) * qf(1 - .alpha / 2, 2 * (x + 1), 2 * (n - x))))^(-1),
    width = upper - lower
  )
  out <- tibble(i = i, prob = .prob, diff = .diff, y = min(which(df$width <= .diff)))
  return(out)
}

This function:

Repeates the experiment for 1 to 100,000 (.N) draws from a binomial distribution (e.g. our coin toss), with success probability .prob;
Estimates the cumulative proportion of heads (our ones) across all number of draws;
Estimates confidence intervals using the exact formula;
Calculates the width of the confidence interval. For a +/- 1% precision, that corresponds to a width of 2% (or 0.02, depending on the scale being used);
Finally, returns the first number of draws where the width is 0.02 (or less). This will tell us how many draws we needed to draw to get a confidence interval that is narrow enough for our purpose.

Then, we run the experiment B = 200 times, with different values of .prob (as we want to show the required sample sizes over different success probabilities):

B <- 200

results <- map_dfr(.x = seq(B), .f = function(j) {
  simfun(i = j, .prob = runif(1))
})

Let’s plot the results (using a LOESS smoother) versus the success probability:

ggplot(results, aes(x = prob, y = y)) +
  geom_point() +
  geom_smooth(method = "loess", size = 1, color = "#575FCF") +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::comma) +
  coord_cartesian(xlim = c(0, 1)) +
  theme_ipsum(base_size = 12, base_family = "Inconsolata") +
  theme(plot.margin = unit(rep(1, 4), "lines")) +
  labs(x = "Success probability", y = "Required sample size")

Looks like — as one would expect — the largest sample size required is for a success probability of 50%. What is the solution to the riddle then? As any statistician would confirm, it depends: if the coin is fair, then we would need approximately 10,000 tosses. Otherwise, the required sample size can be much smaller, even a tenth. Cool!

Finally, let’s repeat the experiment for a precision of +/- 3%:

results_3p <- map_dfr(.x = seq(B), .f = function(j) {
  simfun(i = j, .diff = 0.06, .prob = runif(1))
})

Comparing with the previous results (code to produce the plot omitted for simplicity):

For a required precision of +/- 3% the sample size that we would need is much smaller, but the shape of the curve (i.e. versus the success probability) remains the same.

Finally, the elephant in the room: of course we could answer this using our old friend mathematics, or with a bit of Bayesian thinking. But where’s the fun in that? This simulation approach lets us play around with parameters and assumptions (e.g. what would happen if we assume a difference confidence level .alpha?), and it’s quite intuitive too. And yeah, let’s be honest: programming and running the experiment is just much more fun!

As always, please do point out all my errors on Twitter. Cheers!

A short ode to simulation

Thu, 25 Feb 2021 23:00:00 GMT

I have been slowly catching up with talks from rstudio::global, and I was so impressed with a couple of presentations on statistical simulation that I had to write something about it!

Yes, you read it right, you’re not dreaming: statistical simulation is now cool!

As you might now, I am also a strong supporter of statistical simulation. Traditionally you might use it to test a new statistical method you are developing, or to compare different methods in unusual settings. You might also want to try to see where, when, and if a method breaks in edge cases.

However, two other (very important) use-cases are highlighted in the talks that I mentioned above:

Simulation for learning and teaching,
Simulation to drive agile infrastructures.

The first scenario is described very well by Chelsea Parlett-Pelleriti:

There are three main take-home messages:

Statistical simulation encourages exploration,
Tests intuition,
And empowers a deeper understanding of complex statistical methods.

I stand by all of these points.

In fact, that’s a great and accessible way to learn: by trying, and trying, and trying again until you finally get it. And if you follow a simulation exercise, it’s even easier: you can modify the parameters at will (starting from a solid foundation), explore, and follow your intuition. I mean, isn’t this the scientific method all along?

The second talk is by Richard Vogg:

During the talk, he gives a couple of examples that highlight the importance of being able to compose data at will:

When you cannot share the real dataset (e.g. for privacy reason), it is useful to have real-like data that can be shared more freely. There’s a decent amount of research (and growing interest) on the topic, see e.g. this paper by Dan Quintana;
When you don’t have the data you’re supposed to be building pipelines for (yet), it is useful to have data that is similar to what you expect receiving. That gives you a head start to start e.g. prototyping;
When you’re running internal training courses for staff, it is useful to use datasets that resemble what you actually work with (e.g. transactions, clients, etc.).

…the more you think about it, the more you realise how useful that can be!

In conclusion, both talks are excellent, straight to the point, and engaging. I recommend you give it a look, it will (literally) take just 10 minutes of your time.

See you soon!

Simulating survival times from a mixture cure model

Sat, 16 Jan 2021 23:00:00 GMT

Last Friday I was talking with a friend about simulating survival times from an experiment where most (if not all) events are happening within a short amount of time from inception. In other words, an experiment where the survival probability drops rapidly during the first days, staying then approximately flat until the end of the study.

This made me think about mixed cure fraction models, that is a model of the kind:

with being the proportion cured and the survival function for the uncured subjects. Assuming follows an exponential distribution (for simplicity and from now onward), this corresponds to a survival curve like the following:

where . With this model, the cure fraction creates an asymptote for the survival function, as % of subjects are deemed to be cured; of course, can be estimated from data. Interpretation of mixed cure fraction models is awkward, as one is assuming that subjects are split into cured and uncured at the beginning of the follow-up (which might not be the most realistic assumption), but discussing cure models is outside the scope of this blog post (and there are many papers you could find on the topic).

Back to the question, how to simulate from this survival function?

Well, as with most things these days, we can use the inversion method! First, we actually have to define which individuals are cured and which are not. To do so, we draw from a Bernoulli distribution with success probability :

set.seed(4875) # for reproducibility
n <- 100000
pi <- 0.2
cured <- rbinom(n = n, size = 1, prob = pi)
prop.table(table(cured))

cured
      0       1 
0.80074 0.19926

Then, we simulate survival times e.g. from an exponential distribution (with ). We use the formulae described in Bender et al. (which is in fact the inversion method applied to survival functions):

lambda <- 0.5
u <- runif(n = n)
S <- -log(u) / lambda

Finally, we assign to cured individuals an infinite survival time (as they will not experience the event ever):

S[cured == 1] <- Inf

Of course, now we are required to censor individuals, say after 20 years of follow-up:

status <- S <= 20
S <- pmin(S, 20)
df <- data.frame(S, status)

…and there you have it, it’s done.

What we want to do next is checking that we are actually simulating from the model we think we are simulating from. Let’s first check by fitting a Kaplan-Meier curve:

library(survival)
KM <- survfit(Surv(S, status) ~ 1, data = df)
plot(KM)
abline(h = pi, col = "red")

…which looks similar to the theoretical curve depicted above. Good!

Then, we could try to fit the data-generating model using maximum likelihood. The density function of the cure model is defined as

where and is the density function for the non-cured subjects (e.g. from the exponential distribution, from our example above). Therefore, the likelihood contribution for the i^th subject can be defined as:

In R, that (actually, the log-likelihood) can be defined as:

log_likelihood <- function(par, t, d) {
  lambda <- exp(par[1])
  pi <- boot::inv.logit(par[2])
  f <- (1 - pi) * lambda * exp(-lambda * t)
  S <- pi + (1 - pi) * exp(-lambda * t)
  L <- (f^d) * (S^(1 - d))
  l <- log(L)
  return(-sum(l))
}

Some comments on the above function:

I chose to model the parameter on the log-scale, and on the logit scale. By doing so, we don’t have to constrain the optimisation routine to respect the boundaries of the parameters, as and ;
f contains the density function of an exponential distribution (which is );
S contains the survival function of an exponential distribution;
I take the log of the likelihood function (which generally behaves better when optimising numerically) and return the negative log-likelihood value (as R’s optim minimises the objective function by default).

Now, all we have to do is define some starting values stpar and then run optim.

stpar <- c(ln_lambda = 0, logit_pi = 0)
optim(par = stpar, fn = log_likelihood, t = df$S, d = df$status, method = "L-BFGS-B")

$par
 ln_lambda   logit_pi 
-0.6945243 -1.3909736 

$value
[1] 185584.3

$counts
function gradient 
       8        8 

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

The true values were:

log(lambda)

[1] -0.6931472

boot::logit(pi)

[1] -1.386294

…which (once again) are close enough to the fitted values. It does seem like we are actually simulating from the mixture cure model described above.

In conclusion, obviously this might not fit the actual experiment too well and therefore we might have to try other approaches before considering our work done. For instance, we could simulate from an hazard function that goes to zero after a given amount of time : this approach can easily be implemented using the {simsurv} package. The proof is left as an exercise for the interested reader.

Cheers!

Will R work on Apple Silicon?

Tue, 17 Nov 2020 23:00:00 GMT

Apple recently released new entry level devices using Apple-designed chips based on the ARM64 architecture, the so-called Apple Silicon that was introduced at WWDC earlier this year.

Reviews just started coming out, and everyone seems to be praising the performance of both optimised and translated software running on these new low-powered devices. Synthetic benchmarks and real-life tests are matching the performance of e.g. a MacBook Air to a 2019 16” MacBook Pro or even a 2017 iMac Pro. Which is mad!

Should I just run out and buy a new Mac?

As an R user, should you just run out and buy a new macOS device with Apple Silicon? Well, not so fast, tiger! A post on the R-developer blog dives deeper on the current status of R on Apple Silicon:

R seems to be running fine through the translation layer, but of course that is not optimised and performance should be worse than running natively;
R can already run on ARM32/ARM64 devices. Heck, it can even run on a Raspberry Pi Zero with a single-core 1 GHz CPU and 512 MB of RAM! However, before it can be compiled to Apple Silicon the whole stack of compilers will need to be updated/ported to the new architecture; work seems to be underway, so I believe it won’t take long.
There are inconsistencies with how NA/NaN are handled and propagated, but that is platform-specific and it is already complicated in the x86/x64 world. Writing code that can reliably preserve NA and NaN values will require ad-hoc checks.

This is just my quick executive summary, read through the blog post above for more details.

Some personal considerations

If you know me, you know how much I like tiny, silent computers. I mean, the MacBook Adorable is my favourite macOS laptop I never got to own and I got so close to buying a Surface Go 2 when it was announced earlier this year. If I had to get a desktop computer, I would probably get an Intel NUC or build into a small Mini-ITX case. And don’t get me started on Raspberry Pi boards! You bet I am very excited about the idea of a next-generation, fanless MacBook Air with great performance and excellent battery life!

Another potentially big thing is the Neural Engine that is embedded into Apple Silicon chips. Will R be able to link to that dedicated chip to accelerate machine learning (and potentially other) computations? Time will tell…

Finally, my understanding is that tools like brew and git are currently broken(-ish) on Apple Silicon. I am sure they will be updated for the new architecture soon, though.

Nevertheless, I am happy that the Mac is now an exciting platform once again after years of stagnation. I look forward to seeing how this evolves in the coming months, and I am definitely not upgrading my current laptop any time soon.

Update #1: 2020-11-23

There is now a thread on the R-SIG-Mac mailing list where Prof. Brian Ripley gives first impressions and benchmarks on building CRAN’s R 4.0.3 on a M1 MacBook Air with 8 GB of RAM.

First results seems very promising: building on top of Rosetta is actually faster than building on a 2016 2.0 GHz i5 MacBook Pro, but lots of work remains to be done to have a build ready for general use.

Furthermore, did you see those performance benchmarks for Mac-optimised TensorFlow training on Apple Silicon? I guess yes, machine learning computations will be greatly accelerated on new Apple hardware once fully optimised!

Multiple linear regression with {TMB}

Wed, 30 Sep 2020 22:00:00 GMT

I have been recently reading about the TMB framework to implement statistical models using automatic differentiation (and with a bunch of nice built-in features).

Automatic differentiation (in brief) is an algorithmic method that automagically and efficiently yields accurate derivatives of a given value. This is very interesting, as by using automatic differentiation we can easily get gradients and the Hessian matrix. Which are extremely useful when fitting models using the maximum likelihood method! On top of that, automatic differentiation is generally more efficient and accurate than symbolic or numerical differentiation.

There’s a bunch of examples of using {TMB} in practice, some are very simple and straightforward to follow, some are much more complex. I am just getting started, so I went straight to the linear regression example on the TMB documentation webpage. The C++ template for that simple model is:

#include 
template<class Type>
Type objective_function<Type>::operator() ()
{
  DATA_VECTOR(Y);
  DATA_VECTOR(x);
  PARAMETER(a);
  PARAMETER(b);
  PARAMETER(logSigma);
  ADREPORT(exp(2*logSigma));
  Type nll = -sum(dnorm(Y, a+b*x, exp(logSigma), true));
  return nll;
}

…which can be compiled from within R and passed to any general-purpose optimiser (such as optim) to obtain maximum likelihood estimates for a linear regression model.

It is however interesting to generalise this to any number of covariates, using matrix-by-array multiplication to efficiently scale up any problem. This is relevant to implement general statistical modelling packages. Surprisingly, I had to fiddle around way more than I expected to do that: therefore, I thought about writing a blog post (which hopefully could be useful to others trying to get started with TMB)!

Disclaimer: I am no {TMB} nor C++ expert, so forgive me if I am missing something and let me know if there’s anything that needs fixing here.

So, let’s get started. We first need to write our C++ function for the negative log-likelihood function, which is a simple adaptation of the template we saw before for a single-covariate model:

#include 
template<class Type>
Type objective_function<Type>::operator() ()
{
  DATA_VECTOR(Y);
  DATA_MATRIX(X);
  PARAMETER_VECTOR(b);
  PARAMETER(logSigma);
  Type nll = sum(dnorm(Y, X*b , exp(logSigma), true));
  return -nll;
}

Here we needed to change X to a DATA_MATRIX(), and b to a PARAMETER_VECTOR. Note that X here should be the design matrix of the model, e.g. obtained using the model.matrix() function in R, and that the matrix-array multiplication in C++ X*b is not the element-wise operation, and is equivalent to X %*% b in R. This is it: easy right?

Let’s now verify that this works as expected. We can compile the C++ function and dynamically link it using the tools from the {TMB} package (assuming model.cpp contains the function defined above):

library(TMB)
compile("model.cpp")
dyn.load(dynlib("model"))

We need to simulate some data now, e.g. with a single covariate for simplicity:

set.seed(42)
n <- 1000
x <- rnorm(n = n, mean = 50, sd = 10)
e <- rnorm(n = n)
b0 <- 10
b1 <- 0
y <- b0 + b1 * x + e

We create a dataset and a model matrix too:

df <- data.frame(y, x)
X <- model.matrix(y ~ x, data = df)
head(X)

  (Intercept)        x
1           1 63.70958
2           1 44.35302
3           1 53.63128
4           1 56.32863
5           1 54.04268
6           1 48.93875

Now we can use the MakeADFun() from {TMB} to construct the R object that brings all of this together:

f <- MakeADFun(
  data = list(X = X, Y = y),
  parameters = list(b = rep(0, ncol(X)), logSigma = 0),
  silent = TRUE
)

This passes the data X and y and sets the default values of the model parameters to zeros for all regression coefficients (b0 and b1 in this case) and for the log of the standard deviation of the residual errors.

We can now easily get the value of the (negative) log-likelihood function at the default parameters, gradients, and the Hessian matrix:

f$fn(x = f$par)

[1] 51351.44

f$gr(x = f$par)

          [,1]      [,2]      [,3]
[1,] -9994.682 -497251.6 -99865.01

f$he(x = f$par)

         [,1]       [,2]      [,3]
[1,]  1000.00   49741.76  19989.36
[2,] 49741.76 2574646.64 994503.22
[3,] 19989.36  994503.22 201730.02

The cool part is, we didn’t even have to define analytical formulae for gradients and the Hessian and we got it for free with automatic differentiation! Now all we have to do is to pass the negative log-likelihood function and the gradients to e.g. optim() and that’s all:

fit <- optim(par = f$par, fn = f$fn, gr = f$gr, method = "L-BFGS-B")
fit

$par
           b            b     logSigma 
 9.945838718  0.000981937 -0.014587405 

$value
[1] 1404.351

$counts
function gradient 
      35       35 

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

If we exponentiate the value of log(Sigma), we obtain the standard deviation of the residuals on the proper scale: 0.9855. Compare this with the true values (b0 = 10, b1 = 0, log(Sigma) = 0) and with the results of the least squares estimator:

summary(lm(y ~ x, data = df))


Call:
lm(formula = y ~ x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9225 -0.6588 -0.0083  0.6628  3.5877 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 9.9458454  0.1579727  62.959   <2e-16 ***
x           0.0009818  0.0031133   0.315    0.753    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9865 on 998 degrees of freedom
Multiple R-squared:  9.964e-05, Adjusted R-squared:  -0.0009023 
F-statistic: 0.09945 on 1 and 998 DF,  p-value: 0.7526

We are getting really close here!

Let’s now generalise this to multiple covariates, say 4 normally-distributed covariates:

set.seed(42)
n <- 1000
x1 <- rnorm(n = n, mean = 10, sd = 10)
x2 <- rnorm(n = n, mean = 20, sd = 10)
x3 <- rnorm(n = n, mean = 30, sd = 10)
x4 <- rnorm(n = n, mean = 40, sd = 10)
e <- rnorm(n = n)
b0 <- 10
b1 <- 1
b2 <- 2
b3 <- 3
b4 <- 4
y <- b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + e
df <- data.frame(y, x1, x2, x3, x4)
X <- model.matrix(y ~ x1 + x2 + x3 + x4, data = df)
head(X)

  (Intercept)        x1       x2        x3       x4
1           1 23.709584 43.25058 32.505781 33.14338
2           1  4.353018 25.24122 27.220760 32.07286
3           1 13.631284 29.70733 12.752643 35.92996
4           1 16.328626 23.76973  9.932951 28.51329
5           1 14.042683 10.04067 17.081917 51.15760
6           1  8.938755 14.02517 33.658382 31.20543

All we have to do is re-create the f object using the MakeADFun() function and the new data:

f <- MakeADFun(
  data = list(X = X, Y = y),
  parameters = list(b = rep(0, ncol(X)), logSigma = 0),
  silent = TRUE
)

Let’s now fit this with optim() and compare with the least squares estimator:

optim(par = f$par, fn = f$fn, gr = f$gr, method = "L-BFGS-B")

$par
         b          b          b          b          b   logSigma 
9.90325169 0.99992371 2.00125295 2.99795882 4.00296201 0.01661648 

$value
[1] 1435.576

$counts
function gradient 
     103      103 

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

summary(lm(y ~ x1 + x2 + x3 + x4, data = df))


Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3354 -0.7382  0.0161  0.6893  3.0315 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 9.903788   0.177893   55.67   <2e-16 ***
x1          0.999929   0.003218  310.72   <2e-16 ***
x2          2.001249   0.003277  610.69   <2e-16 ***
x3          2.997959   0.003134  956.53   <2e-16 ***
x4          4.002949   0.003266 1225.83   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.019 on 995 degrees of freedom
Multiple R-squared:  0.9997,    Adjusted R-squared:  0.9997 
F-statistic: 7.294e+05 on 4 and 995 DF,  p-value: < 2.2e-16

Very close once again, and if you noticed, we didn’t have to change a single bit of code from the first example with a single covariate. Isn’t that cool?

I know, I know, maybe it’s cool for a specific type of person only, but hey…

Another nice thing is that given that we have gradients (in compiled code!), maximising the likelihood will be faster and generally more accurate if using a method that relies on gradients, such as the limited-memory modification of the BFGS quasi-Newton method that we chose here. See the following benchmark as a comparison:

library(bench)
bench::mark(
  "with gradients" = optim(par = f$par, fn = f$fn, gr = f$gr, method = "L-BFGS-B"),
  "without gradients" = optim(par = f$par, fn = f$fn, method = "L-BFGS-B"),
  iterations = 20,
  relative = TRUE,
  check = FALSE
)

# A tibble: 2 × 6
  expression          min median `itr/sec` mem_alloc `gc/sec`
                          
1 with gradients     1      1         3.85         1      NaN
2 without gradients  3.89   3.83      1            1      Inf

In this case (a small and simple toy problem) the optimisation using gradients is ~4 times faster (on my quad-core MacBook Pro with 16 Gb of RAM). And all of that for free, just using automatic differentiation!

Remember that, if not provided with a gradient function, optim() will use finite differences to calculate the gradients numerically. Then, the benchmark above gives us a direct comparison with numerical differentiation too.

To wrap up, hats off to the creators of the {TMB} package for providing such an easy and powerful framework, and I didn’t even scratch the surface here: this is just a quick and dirty toy example with multiple linear regression, which doesn’t event needs maximum likelihood.

Anyway, I’m definitely going to come back to automatic differentiation and {TMB} in future blog posts, stay tuned for that. Cheers!

Update: Benjamin Christoffersen shared on Twitter a link to a thread on Cross Validated where the differences between maximum likelihood and least squares are discussed in more detail. It’s a very interesting read, remember to check it out!

September packages updates

Wed, 02 Sep 2020 22:00:00 GMT

Hey! I hope y’all had a good summer, it sure has been something…

Let’s get straight to business: this is a short post to announce that new releases of {rsimsum} and {KMunicate} just landed on CRAN! {rsimsum} is now at version 0.9.1, while {KMunicate} is now at version 0.1.0.

Both are mostly maintenance releases: some small bugs have been squashed, and new (hopefully useful) customisation options have been added to {KMunicate}; some typos in the documentation have been fixed too. If you want to read more, all details can be found in the NEWS files of {rsimsum} and {KMunicate}.

If you have either package already installed, just use update.packages() to obtain the new version; otherwise, install.packages("rsimsum") and install.packages("KMunicate") will do the trick. As always, feedback on the new releases is very much appreciated.

That’s all: I promised this was going to be short, and less than 150 words later I believe I delivered. But don’t worry, I’ll be back soon (-ish) with a follow-up on my post on academic conferences in the year 2020 after a whole season of remote conferences. In the meanwhile, take care and be safe!

Introducing the {KMunicate} package

Thu, 06 Aug 2020 22:00:00 GMT

A few weeks ago, Tim Morris launched a competition on Twitter challenging his followers to write some code to create KMunicate-style Kaplan-Meier plots. I (obviously) took on the challenge, and I must admit: I got slightly carried away… hence now introducing the {KMunicate} R package.

{KMunicate} is now on CRAN, and the development version lives on my GitHub profile. You can install the CRAN version as usual:

install.packages("KMunicate")

Alternatively, you can install the dev version of {KMunicate} from GitHub with:

# install.packages("devtools")
devtools::install_github("ellessenne/KMunicate-package")

What’s the KMunicate style?

KMunicate-style Kaplan-Meier plots include confidence intervals for each fitted curve and an extended table beneath the main plot including the number of individuals at risk at each time and the cumulative number of events and censoring events. Here’s an example from the KMunicate study itself:

As you might imagine, that’s quite a lot of work to produce a plot like that.

Well, not anymore!

The {KMunicate} package lets you create such a plot with a single line of code. Isn’t that great?

Let’s illustrate the basic functionality of {KMunicate} with an example. We’ll be using once again data from the German breast cancer study, which is conveniently bundled with {KMunicate}:

data(brcancer, package = "KMunicate")
head(brcancer)

  id hormon x1 x2 x3 x4 x5 x6  x7 rectime censrec x4a x4b        x5e
1  1      0 70  2 21  2  3 48  66    1814       1   1   0 0.69767630
2  2      1 56  2 12  2  7 61  77    2018       1   1   0 0.43171051
3  3      1 58  2 35  2  9 52 271     712       1   1   0 0.33959553
4  4      1 59  2 17  2  4 60  29    1807       1   1   0 0.61878341
5  5      0 73  2 35  2  1 26  65     772       1   1   0 0.88692045
6  6      0 32  1 57  3 24  0  13     448       1   1   1 0.05613476

The survival time is in rectime, and the event indicator variable is censrec; the treatment variable is hormon, a binary covariate.

First, we fit the survival curve by treatment arm using the Kaplan-Meier estimator:

library(survival)
fit <- survfit(Surv(rectime, censrec) ~ hormon, data = brcancer)
fit

Call: survfit(formula = Surv(rectime, censrec) ~ hormon, data = brcancer)

           n events median 0.95LCL 0.95UCL
hormon=0 440    205   1528    1296    1814
hormon=1 246     94   2018    1918      NA

The plot that can be obtained via the plot method is ok but needs a bit of work to be good enough for a publication. For instance, this is the default:

plot(fit)

No bueno, right? Let’s improve it a bit:

plot(fit, col = 1:2, lty = 1:2, conf.int = TRUE)
legend("bottomleft",
  col = 1:2, lty = 1:2,
  legend = c("Control", "Treatment"), bty = "n"
)

This is better, but still not great: the area defined by the confidence intervals is not shaded, and there is still no risk table.

Here’s when the {KMunicate} package comes to the rescue. First, we need to define the breaks for the x-axis; the risk table with be computed at those breaks. Say we want breaks every year:

time_breaks <- seq(0, max(brcancer$rectime), by = 365)
time_breaks

[1]    0  365  730 1095 1460 1825 2190 2555

Then, all we have to do is to

library(ggplot2)
library(KMunicate)

KMunicate(fit = fit, time_scale = time_breaks)

Easy peasy!

We might want to get proper arm labels too:

brcancer$hormon <- factor(brcancer$hormon, levels = 0:1, labels = c("Control", "Treatment"))
fit <- survfit(Surv(rectime, censrec) ~ hormon, data = brcancer)

KMunicate(fit = fit, time_scale = time_breaks)

Nice. Next, we’ll show how to customise the plot.

Customising KMunicate-style plots

First, we might want to customise colours to use a colour-blind friendly palette via the .color_scale and .fill_scale arguments:

KMunicate(
  fit = fit,
  time_scale = time_breaks,
  .color_scale = ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2"),
  .fill_scale = ggplot2::scale_fill_brewer(type = "qual", palette = "Dark2")
)

Then, we might want to use a custom font, such as my latest obsession Victor Mono, via the .ff argument:

KMunicate(
  fit = fit,
  time_scale = time_breaks,
  .color_scale = ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2"),
  .fill_scale = ggplot2::scale_fill_brewer(type = "qual", palette = "Dark2"),
  .ff = "Victor Mono"
)

Finally, we customise the overall theme using e.g. theme_minimal from the {ggplot2} package:

KMunicate(
  fit = fit,
  time_scale = time_breaks,
  .color_scale = ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2"),
  .fill_scale = ggplot2::scale_fill_brewer(type = "qual", palette = "Dark2"),
  .ff = "Victor Mono",
  .theme = ggplot2::theme_minimal(base_family = "Victor Mono")
)

When overriding the default theme, we need to re-define the font for the main plot using the base_family argument of a theme_* component. Overall, I think this is a much better plot!

Exporting plots

The final step consists of exporting a plot for later use, e.g. in manuscripts or presentations. That’s straightforward, being the output of KMunicate() a ggplot2-type object: all we have to do is use the ggplot2::ggsave function, e.g. in the next block of code.

p <- KMunicate(
  fit = fit,
  time_scale = time_breaks,
  .color_scale = ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2"),
  .fill_scale = ggplot2::scale_fill_brewer(type = "qual", palette = "Dark2"),
  .ff = "Victor Mono",
  .theme = ggplot2::theme_minimal(base_family = "Victor Mono")
)

ggplot2::ggsave(p, filename = "export.png", height = 6, width = 6, dpi = 300)

Closing remarks

Further details on {KMunicate} can be found on its website, with more examples and a better explanation of the different arguments and customisation options. Let me know if you find the package useful, and if you find any bug (I’m sure there’ll be some) please file an issue on GitHub.

And what about Tim’s challenge that led to the inception of {KMunicate}, you might ask? Well, I got myself a beautiful hand-crafted wooden spoon:

Isn’t that great!?

Weekly blog news

Wed, 03 Jun 2020 22:00:00 GMT

…here we go again with the weekly update on blog maintenance and housekeeping. I promise the fun content will be back soon!

Talks section is live

As I mentioned in the previous blog post, I have been working on a new section will a list of talks I have given in the past few years, including slides (whenever possible).

Well, that section now live! The cool thing is that it builds dynamically using Hugo content, and it can easily be updated by simply adding a new markdown file with the appropriate YAML header. I think the overly popular Academic theme uses a similar approach for the various kind of contents it supports, but I am not sure.

Analytics are gone

I had been thinking about this for a while, and I finally decided to remove all analytics from all the websites I run: basically, this website and the pkgdown websites for the {rsimsum} and {comorbidity} packages.

I mean, it’s cool to see how many people visit each every month, and I was flattered to see people from all over the world accessing my website… but I don’t really need any of that data (funny thing, coming from a data all the things kind of person). You don’t need further tracking while you visit the web, that for sure.

Incidentally, I recently bought a Raspberry Pi Zero W on which I am now running Pi-hole, a DNS sinkhole that blocks an incredible amount of junk at a network level. I mean, look at all that crap:

… and that’s just over the past 24 hours!

An additional bonus is less JavaScript to load, which means that… the website is even faster! Yeah, I know, it must be an ongoing joke by now.

Anyway

I want to play around with the Raspberry Pi more, so I will be probably writing about it more. It’s a fun, tiny little computer that can do all sorts of computer stuff — and if you didn’t know, I have an irrational obsession for tiny and cute little computers!

I will keep it short: this is all for now, see you next week for a new episode of blog updates weekly, a new series on Netflix. Just kidding, Netflix is totally not paying for this… right? Happy to be proved wrong here though 😂 Cheers!