```
library(rsimsum)
data("MIsim", package = "rsimsum")
s <- simsum(
data = MIsim,
estvarname = "b",
true = 0.50,
se = "se",
methodvar = "method",
ref = "CC"
)
```

You might already know that a couple of years ago Betty Syriopoulou published a tutorial paper on standardised survival probabilities, a useful tool to supplement and enhance the reporting of regression models for time-to-event (survival) outcomes.

The background, in short, is that while hazard ratios are ubiquitous in medical research with time-to-event outcomes, they are often misreported, misinterpreted, and have several, well-known limitations; I will not go much into detail, but some relevant references if you want to read more about this topic are:

- Betty’s tutorial paper on standardised survival probabilities (of course);
- The popular paper by Miguel Hernán on
*The Hazards of Hazard Ratios*; - The paper by Mats Stensrud and Miguel Hernán questioning
*Why Test for Proportional Hazards*.

Standardised survival probabilities provide a useful and easy-to-interpret measure to complement hazard ratios, can be calculated after fitting survival models (and can therefore include covariates to adjust for, e.g., because they are confounders for a certain exposure-outcome association of interest), and can be contrasted to produce standardised survival probability differences (i.e., risk differences) and ratios (i.e., risk ratios). As a bonus, if the confounders that you decide to include in your regression model are sufficient to control for confounding, then the estimated risk differences (or ratios) can be interpreted as population causal effects: *nice*! More on this causal interpretation here and here.

The tutorial paper is easy to follow and illustrates the utility of such measures very nicely, but most importantly, Stata code to reproduce the analyses and use this method in practice is provided in the supplementary material.

Unfortunately, only Stata code was published with the paper, but Betty and I have been talking about porting this to R for a while now, and guess what: today is your lucky day! We have just published on my GitHub profile a repository with R code that can be used to replicate every result from the paper, including every plot that was included in the tutorial. The GitHub repository can be found here (https://github.com/ellessenne/standsurv-tutorial-r), and the code is released under the MIT license, so you can pretty much do whatever you want with it.

Here is a little preview of what you will be able to accomplish:

…and there you have it, don’t forget to check out the code and go test this out for your next project, predict standardised probabilities, make some nice plots, complement your hazard ratios, and take your survival analyses to the next level.

Finally, please note that the code in the repository is quite specific to this project and not very general, so you will have to do a little bit of work to get this going, but there should be enough comments to give you all the information you will possibly need to extend it. If you have any questions, feel free to get in touch or just open an issue in the GitHub repository. As always, any feedback is more than welcome.

Before we go, here is one more treat for you, if you are still reading: an R package with a general, easy to use implementation of regression standardisation *may or may not* be in development. Stay tuned for more… 👀

Let’s get straight to business: a new release of the {rsimsum} package just landed on CRAN, and you can install it with

`install.packages("rsimsum")`

It’s a quite large release that is more than a year in the making, with lots of new features, a few bug fixes, and a lot of internal housekeeping (that you don’t need to worry about, this should not affect any user-facing behaviour – but please get in touch if it does). I’ll summarise a few things worth highlighting in this blog post; everything else is listed in the changelog page on the package website.

As always, I am extremely grateful to all users of {rsimsum} who use and break the package and come up with bug reports and feature suggestions: the package is much better because of you. And if you have further bug reports or suggestions, or any feedback on the package really, feel free to get in touch (contact details here) or post on GitHub.

We start with relative bias, which is now a supported summary measure that is reported by default, including Monte Carlo errors. I hinted at this in a previous blog post, and it’s finally here!

Let’s have a look at this in more detail. Specifically, we use the `MIsim`

dataset, which is bundled with {rsimsum}:

```
library(rsimsum)
data("MIsim", package = "rsimsum")
s <- simsum(
data = MIsim,
estvarname = "b",
true = 0.50,
se = "se",
methodvar = "method",
ref = "CC"
)
```

If we summarise this, we can now pick `bias`

and `rbias`

as performance measures:

`summary(s, stats = c("bias", "rbias"))`

```
Values are:
Point Estimate (Monte Carlo Standard Error)
Bias in point estimate:
CC MI_LOGT MI_T
0.0168 (0.0048) 0.0009 (0.0042) -0.0012 (0.0043)
Relative bias in point estimate:
CC MI_LOGT MI_T
0.0335 (0.0096) 0.0018 (0.0083) -0.0024 (0.0085)
```

…and that’s it, as easy as that.

Of course, this won’t work if the true value of the estimand is zero, but that is not surprising (as we divide by zero):

```
simsum(
data = MIsim,
estvarname = "b",
true = 0,
se = "se",
methodvar = "method",
ref = "CC"
) |>
summary(stats = "rbias")
```

```
Values are:
Point Estimate (Monte Carlo Standard Error)
Relative bias in point estimate:
CC MI_LOGT MI_T
NaN (NaN) NaN (NaN) NaN (NaN)
```

Nonetheless, this provides a useful alternative that allows quantifying bias on a relative scale (compared to the absolute scale of *plain* bias). If you are looking for more details, formulae are described in the introductory vignette (which has been updated accordingly).

Nested loop plots can now accommodate designs that are not fully factorial. Let’s start with a fully factorial design, as in the `nlp`

dataset:

```
data("nlp", package = "rsimsum")
s.nlp <- simsum(
data = nlp,
estvarname = "b",
true = 0,
se = "se",
methodvar = "model",
by = c("baseline", "ss", "esigma")
)
library(ggplot2)
autoplot(s.nlp, stats = "bias", type = "nlp")
```

This has not changed since the previous release. If we however have an incomplete design, missing scenarios will now be empty:

```
nlp.subset <- subset(nlp, !(nlp$ss == 100 & nlp$esigma == 2))
s.nlp.subset <- simsum(
data = nlp.subset,
estvarname = "b",
true = 0,
se = "se",
methodvar = "model",
by = c("baseline", "ss", "esigma")
)
autoplot(s.nlp.subset, stats = "bias", type = "nlp")
```

This should allow using nested loop plots in a wider variety of scenarios. Many thanks to Mike Sweeting for the suggestion.

Lastly, it is now possible to further customise the appearance of zipper plots; specifically, the horizontal lines that are used to denote the estimated confidence intervals (based on Monte Carlo errors) of coverage probability under each scenario. The default behaviour hasn’t changed, with yellow bands:

```
data(relhaz, package = "rsimsum")
relhaz <- subset(relhaz, relhaz$baseline == "Weibull" & relhaz$model != "RP(2)")
s.zip <- simsum(
data = relhaz,
estvarname = "theta",
se = "se",
true = -0.50,
methodvar = "model",
by = "n",
x = TRUE
)
autoplot(s.zip, type = "zip")
```

What is nice is that now you can use the `zip_ci_colors`

argument of `autoplot()`

to customise this. If you pass a single colour name (or hex value), that will be used throughout:

`autoplot(s.zip, type = "zip", zip_ci_colours = "purple")`

If you pass two values, those will be used for optimal and sub-optimal coverage, respectively:

`autoplot(s.zip, type = "zip", zip_ci_colours = c("green", "red"))`

Finally, if you pass three values, those will be used for optimal coverage, under-coverage, and over-coverage, respectively:

`autoplot(s.zip, type = "zip", zip_ci_colours = c("green", "blue", "red"))`

Of course, I would suggest picking better colours for a publication-worthy plot, but you get the gist. Many thanks to Lorenzo Guizzaro for the suggestion.

Long time no see, right?

The last post I wrote on this blog is almost a year old, and a lot has happened in the meantime, behind the scenes. I won’t go into details (if you know, you know!), but 2023 has been a tough year for a bunch of different reasons. But *we are back, baby!*, and this time I will try to contribute content more consistently.

If you are new around here, you’ll notice that things look a bit different. For the keen observers in the audience this won’t be a surprise, but this blog is now powered by the amazing Quarto, with a custom, hand-crafted theme to bring the old website design into 2024. Gone is the old International Orange; welcome to Very Peri, the former Pantone color of the year for 2022. The new main typeface is Atkinson Hyperlegible for improved accessibility, and Ubuntu Condensed and Fira Code are used for headers and code, respectively.

I think the new design is great, and I hope you’ll like it too. Please have a look around and let me know if you spot anything broken.

I am keeping this post short and sweet, so that’s it for now, but I’ll be back very soon with some news I have already started working on. In the meantime, hope you had a nice holiday season and a happy new year!

]]>Happy new year! I hope you had a relaxing holiday season and that 2023 is treating you well so far.

Well, here’s another treat for you: today we are going to make a *dumbbell plot* from scratch, using our dear old friend {ggplot2}. Something quick and easy to get going in 2023, but fun nonetheless - and hopefully, useful too. Let’s start by defining what a dumbbell plot actually is:

A dumbbell plot (also known as a dumbbell chart, or connected dot plot) is great for displaying changes between two points in time, two conditions, or differences between two groups.

Source: amcharts.com

You might have seen this before in one of the nice visualisations that the OECD publishes from time to time:

As you can see, this is an intuitive way of showing how a certain metric has changed between two points in time. Let’s get going then, shall we?

For this example, we will use data on monthly step counts that yours truly logged in 2021 and 2022. One of the things I wanted to do more of in 2022, compared to 2021, was walking; will I have succeeded with that? Well, we’ll find out soon.

I extracted this data from Garmin Connect, as I have been wearing a Garmin watch for the past few years now, and this is stored in a dataset named `dt`

:

`head(dt)`

```
# A tibble: 6 × 3
Month X2021 X2022
<fct> <dbl> <dbl>
1 January 114171 194624
2 February 118548 223310
3 March 105853 224946
4 April 172499 206213
5 May 158913 246563
6 June 166119 244314
```

A very simple dataset, not much to see here. Let’s start building our plot: first, we create a `ggplot`

object and put the different months on the vertical axis:

```
library(ggplot2)
db_plot <- ggplot(dt, aes(y = Month))
db_plot
```

Not much to see yet. Then, we add a set of points for 2021 data:

```
db_plot <- db_plot +
geom_point(aes(x = X2021, color = "2021"))
db_plot
```

Same as before, but we add data for 2022:

```
db_plot <- db_plot +
geom_point(aes(x = X2022, color = "2022"))
db_plot
```

It’s coming together nicely, isn’t it? Now, we add a segment to join the two sets of points:

```
db_plot <- db_plot +
geom_segment(aes(yend = Month, x = X2021, xend = X2022))
db_plot
```

And there you have it. Thank you for reading and… wait! We are not done here, of course - now we need to turn this into a nice plot.

One problem here is that the segment overlaps the data points: that looks ugly. To solve this, we need to rebuild the plot but add the segment geometry first:

```
ggplot(dt, aes(y = Month)) +
geom_segment(aes(yend = Month, x = X2021, xend = X2022)) +
geom_point(aes(x = X2021, color = "2021")) +
geom_point(aes(x = X2022, color = "2022"))
```

Already better! Then, let’s make the data points larger and change the colour of the *bar* to grey:

```
ggplot(dt, aes(y = Month)) +
geom_segment(aes(yend = Month, x = X2021, xend = X2022), color = "grey50") +
geom_point(aes(x = X2021, color = "2021"), size = 3) +
geom_point(aes(x = X2022, color = "2022"), size = 3)
```

Let’s add a better scale for the horizontal axis; for this, we use the `comma()`

function from the {scales} package:

```
library(scales)
ggplot(dt, aes(y = Month)) +
geom_segment(aes(yend = Month, x = X2021, xend = X2022), color = "grey50") +
geom_point(aes(x = X2021, color = "2021"), size = 3) +
geom_point(aes(x = X2022, color = "2022"), size = 3) +
scale_x_continuous(labels = comma)
```

Let’s label the plot correctly:

```
ggplot(dt, aes(y = Month)) +
geom_segment(aes(yend = Month, x = X2021, xend = X2022), color = "grey50") +
geom_point(aes(x = X2021, color = "2021"), size = 3) +
geom_point(aes(x = X2022, color = "2022"), size = 3) +
scale_x_continuous(labels = comma) +
labs(x = "Steps", y = "", color = "Year")
```

Now, the final touches: let’s change the theme of the plot and tidy things up a little:

```
ggplot(dt, aes(y = Month)) +
geom_segment(aes(yend = Month, x = X2021, xend = X2022), color = "grey50") +
geom_point(aes(x = X2021, color = "2021"), size = 3) +
geom_point(aes(x = X2022, color = "2022"), size = 3) +
scale_x_continuous(labels = comma) +
theme_bw(base_size = 12) +
theme(legend.position = "bottom", plot.margin = unit(x = rep(1, 4), units = "lines")) +
labs(x = "Steps", y = "", color = "Year", title = "Monthly Steps Walked, 2022 vs 2021")
```

We can also simplify the above by turning our data into long format:

```
library(tidyr)
dt_long <- pivot_longer(data = dt, cols = starts_with("X"))
dt_long$name <- factor(dt_long$name, levels = c("X2021", "X2022"), labels = c("2021", "2022"))
head(dt_long)
```

```
# A tibble: 6 × 3
Month name value
<fct> <fct> <dbl>
1 January 2021 114171
2 January 2022 194624
3 February 2021 118548
4 February 2022 223310
5 March 2021 105853
6 March 2022 224946
```

The required code is very similar to what was used above, but we can now easily modify the colour palette too, and improve the title using {ggtext}:

```
library(ggtext)
ggplot(dt, aes(y = Month)) +
geom_segment(aes(yend = Month, x = X2021, xend = X2022), color = "grey50") +
geom_point(data = dt_long, aes(x = value, color = name), size = 3) +
scale_x_continuous(labels = comma) +
scale_color_manual(values = c("#F5DF4D", "#6667AB")) +
theme_bw(base_size = 12) +
theme(
legend.position = "none",
plot.title = element_markdown(),
plot.margin = unit(x = rep(1, 4), units = "lines")
) +
labs(
x = "Steps",
y = "",
color = "Year",
title = "Monthly Steps Walked,
<span style='color:#6667AB;'>2022</span>
vs
<span style='color:#F5DF4D;'>2021</span>"
)
```

And yes, for all you colours nerds out there: the two hex codes are Pantone’s colours of the year for 2021 and 2022, “Illuminating” and “Very Peri”. Fitting, right?

And yes, *I did walk more in 2022 compared to 2021*, it turns out! Well, except for October, but to be fair, I did run 120 km that month in 2021 compared to a shameful 0 km in 2022, so it could have been worse…

So there you have it: a short tutorial on building a dumbbell plot from scratch using {ggplot2} and other freely available tools, and making it *nice* (subjectively, of course). Other options for making dumbbell plots in R *do exist*, of course, such as the {ggalt} package - make sure you check that out too. And until next time, take care!

Today we will be talking about simulation studies and Monte Carlo errors. Yeah, I know, you heard it all before, but trust me… there will be something interesting in here, I swear!

Let’s start with the basics: Monte Carlo simulation is a stochastic procedure, thus for each run of a simulation study, we are likely to get *slightly* different results. And that’s fine! However, we want to be able to reproduce simulation results. Sure, we could *set the seed of the random numbers generator*, but what if we want the results to be similar regardless of what seed we set?

This concept is often referred to as

statistical reproducibility: in other words, we want to minimise simulation error for results to be similar across replications.

This simulation error is often referred to as *Monte Carlo error*. And as you might already know, the {rsimsum} package can calculate Monte Carlo error for a variety of performance measures, such as bias.

However, what if you want to calculate Monte Carlo error for *any* potential performance measure out there? Well, today is your lucky day: as suggested by White, Royston, and Wood (2011) in the settings of multiple imputation, we can calculate the Monte Carlo error using a jackknife procedure to our simulation results. That is, the Monte Carlo error for any performance measure will be the standard error of the mean of the pseudo values for that statistic, computed by omitting one simulation repetition at a time.

Throughout this post, I will introduce all the above concepts, and I will show you how to use the jackknife to compute the Monte Carlo standard error for relative bias. We will also validate our calculations, which is always a good thing to do!

The jackknife technique is, *technically speaking*, a cross-validation technique and thus it can be considered to be a form of resampling. Specifically, given a sample of size *n*, a jackknife estimator can be built by aggregating the parameter estimates from each subsample of size *(n - 1)*, where the *n*^{th} observation is omitted each time. Loosely speaking, this is not far from the concept of leave-one-out cross-validation.

If we define a given statistics θ, for a sample of size *n*, we can calculate *n* jackknife replicates, denoted with θ^{n}, each computed by omitting the *n*^{th} observation. Then, we can calculate the mean and the variance (and thus standard error) of such jackknife replicates, and there you go.

I know it is a short and *not comprehensive at all* description: if you want to read more about this, check the Wikipedia page on jackknife resampling and this paper by Bradley Efron.

We now move on to relative bias. First, let’s define bias: according to Morris, White, and Crowther, that is defined (for a certain estimand θ) as: and estimated by assuming *n* repetitions.

This is routinely computed (and reported) for, anecdotally, most simulation studies in the literature. However, the magnitude of bias depends on the magnitude of θ; it is thus not necessarily easy to compare across studies with different data-generating mechanisms. That’s why it is sometimes interesting to compute relative bias, usually defined as and estimated by The obvious downside is that relative bias, as defined here, cannot be calculated when the true θ = 0, but that’s beyond the scope of this blog post.

Let’s not illustrate how to use the jackknife to calculate the Monte Carlo standard error of relative bias. To do so, we will use a dataset that comes bundled with {rsimsum} for illustration purposes:

```
library(rsimsum)
data("MIsim")
str(MIsim)
```

```
tibble [3,000 × 4] (S3: tbl_df/tbl/data.frame)
$ dataset: num [1:3000] 1 1 1 2 2 2 3 3 3 4 ...
$ method : chr [1:3000] "CC" "MI_T" "MI_LOGT" "CC" ...
$ b : num [1:3000] 0.707 0.684 0.712 0.349 0.406 ...
$ se : num [1:3000] 0.147 0.126 0.141 0.16 0.141 ...
- attr(*, "label")= chr "simsum example: data from a simulation study comparing 3 ways to handle missing"
```

`MIsim <- subset(MIsim, MIsim$method == "CC")`

Remember that, for this specific dataset, the true value of the estimand is θ = 0.5. We will also be using a single method (in this case, `CC`

) for simplicity, but this can obviously be generalised. Bias can be easily computed using the `simsum()`

function:

```
sb <- simsum(data = MIsim, estvarname = "b", true = 0.5, se = "se")
tidy(summary(sb, stats = "bias"))
```

```
stat est mcse lower upper
1 bias 0.01676616 0.004778676 0.007400129 0.02613219
```

Let’s now calculate relative bias, by hand, together with bias (to compare with the output of `simsum()`

):

```
data.frame(
bias = mean(MIsim$b) - 0.5,
relative_bias = (mean(MIsim$b) - 0.5) / 0.5
)
```

```
bias relative_bias
1 0.01676616 0.03353232
```

We get the same results: good! Then, let’s use the jackknife to calculate the Monte Carlo standard error.

First, let’s calculate all jackknife (leave-one-out) replicates and plot their distribution:

```
jk.estimate <- vapply(
X = seq_along(MIsim$b),
FUN = function(x) (mean(MIsim$b[-x] - 0.5) / 0.5), # Relative bias
FUN.VALUE = numeric(1)
)
hist(x = jk.estimate, xlab = "Jackknife Replicates", main = "")
```

Then, let’s calculate the jackknife standard error, using the following formula:

```
n <- length(jk.estimate)
relative_bias_mcse <- sqrt(((n - 1) / n) * sum((jk.estimate - mean(jk.estimate))^2))
relative_bias_mcse
```

`[1] 0.009557351`

…and there you have it! We can also validate this procedure by calculating the jackknife Monte Carlo standard errors for bias:

```
jk.estimate <- vapply(
X = seq_along(MIsim$b),
FUN = function(x) mean(MIsim$b[-x] - 0.5), # Bias
FUN.VALUE = numeric(1)
)
n <- length(jk.estimate)
bias_mcse <- sqrt(((n - 1) / n) * sum((jk.estimate - mean(jk.estimate))^2))
bias_mcse
```

`[1] 0.004778676`

Recall the results from `simsum()`

:

`tidy(summary(sb, stats = "bias"))`

```
stat est mcse lower upper
1 bias 0.01676616 0.004778676 0.007400129 0.02613219
```

Once again, this is exactly the same, which is good.

It turns out, we *don’t* actually need the jackknife to obtain Monte Carlo standard errors (MCSE) for relative bias (RB). It can be shown (as usual, *exercise left to the reader*) that we can obtain a closed-form formula:

where RB hat is the estimated relative bias computed using the estimator above. Let’s compare this with the jackknife estimate. First, recall the estimated MCSE using the jackknife:

`relative_bias_mcse`

`[1] 0.009557351`

Using the formula above:

```
n <- length(MIsim$b)
rb <- (mean(MIsim$b) - 0.5) / 0.5
sqrt(1 / (n * (n - 1)) * sum(((MIsim$b - 0.5) / 0.5 - rb)^2))
```

`[1] 0.009557351`

…and they’re the same. Cool!

If you are still not convinced, let me show you something else. What is another method that one could use to estimate standard errors *for any statistics you can pretty much think of*? Well, but of course, it’s our best friend the bootstrap!

Let’s use non-parametric bootstrap to calculate the standard error of our relative bias estimator. For that, we use the {boot} package in R:

```
library(boot)
set.seed(387456)
rbfun <- function(data, i, .true) mean((data[i] - .true) / .true)
bse <- boot(data = MIsim$b, rbfun, .true = 0.5, R = 50000)
```

Note that `rbfun()`

is a function we define to calculate relative bias, that we use R = 50000 bootstrap samples to ensure convergence of the procedure, and that we set a seed (for reproducibility). The results of the bootstrap are printed below:

`bse`

```
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = MIsim$b, statistic = rbfun, R = 50000, .true = 0.5)
Bootstrap Statistics :
original bias std. error
t1* 0.03353232 -4.320035e-05 0.009571821
```

Once again: pretty close, not exactly the same (as the bootstrap is still a stochastic procedure), but close enough that we can be confident that they converge to the same value.

Finally, let me reassure you that the bootstrap has converged (and, *side note*, you should remember to do that too when you use it):

```
cmsds <- vapply(
X = seq_along(as.numeric(bse$t)),
FUN = function(i) sd(bse$t[1:i]),
FUN.VALUE = numeric(1)
)
plot(
x = cmsds,
type = "l",
ylab = "Boostrap SE",
xlab = "Bootstrap Iteration"
)
```

…looks like the bootstrap converged, and that, probably, we did not need so many bootstrap samples after all. Luckily computations were so inexpensive that we could afford it (the whole thing took just a few seconds on my laptop), so that’s ultimately fine.

And there you have it, examples of how to use the jackknife (and the bootstrap!) to estimate standard errors for a given metric of interest. Time to go wild and apply this to your settings!

In conclusion, I think these are very powerful tools that every statistician should be at least familiar with; there is plenty of literature on the topic, let me know if you’d like some references. Hope you learned something from this, I sure did by writing up all of this – and if you are still reading, here’s some breaking news:

Relative bias with Monte Carlo errors will be available in the next release of {rsimsum}! Coming soon to your nearest CRAN server…

Thanks for reading, and until next time, take care!

And yes, that is a 13-inch MacBook Pro in the back and that’s a microSD card for storage:

I mean, look how tiny and cute it looks in its red-and-white case:

The *whole* case is barely thicker than the MacBook Pro itself, while the board is exactly 65 by 30 mm. It is, *of course*, not a powerhouse, being powered by a single-core, 1 GHz, 32-bit CPU with 512 MB of RAM, but that is more than enough for Pi-hole and some tinkering. And did I mention that it costs only $10? No? Well, isn’t that great?

Anyway, what you might not know is that, being powered by a full Debian-based distribution, you can install a variety of software using the `apt`

package manager. Including R which, however, is often lagging behind the latest CRAN release (I could only install R 3.5.2 using `apt`

).

Interestingly, a couple of days ago while I was browsing the schedule for rstudio::conf(2022), I came across the R for the Raspberry Pi project. It aims to provide up-to-date builds of R for Raspberry Pi computers, which can be installed in few simple steps. I mean, I obviously had to check it out!

…here we are, connecting to the headless Pi via SSH on my macOS terminal:

Nice, even though the latest available version (for my Pi Zero W, at least) is only R 4.1.2.

The next obvious question is: can you actually do something with R on such a low-powered computer?

To answer this question, I set up a short benchmark script that simulates data from a logistic regression model, for 1,000 subjects, and then fits the corresponding, true model 100 times (using the {microbenchmark} package). The script is available in the following Gist if you want to try running it on your machine for comparison:

Here are the results:

The median time was 106.4 milliseconds, with an interquartile interval of 105.9 to 118.4 milliseconds. Not bad! For comparison, let’s run the benchmark script on my 2019, 13-inch MacBook Pro (2.4 GHz, quad-core i5 processor, an i5-8279U, with 16 GB of RAM). That can be done in just a couple of lines, directly from your R session:

```
gist_url <- "https://gist.githubusercontent.com/ellessenne/5e0c35a7a0625d9dd5bbe8c284d52155/raw/9392820a5f53d8f438a19f1518e30eeca1df30f3/pizw-bench.R"
source(gist_url)
```

```
Unit: milliseconds
expr min lq mean median uq max neval
logistic regression 2.364326 2.817422 3.720309 3.29978 3.823631 12.73771 100
```

Okay, I guess the Pi Zero W is *slow* compared to my relatively modern laptop, approximately 30 to 40 times slower… but hey! It’s machine learning on a $10 computer, isn’t that awesome?

It’s clearly not enough for large projects, but a fantastic option for democratising data science, thanks to open-source software and a tiny $10 computer. If you want to read more about the concept, don’t forget to check out this early draft by the team of Jeff Leek on the importance of *democratising data science education*. The future is bright!

I was in class a few weeks ago helping with a course on longitudinal data analysis, and towards the end of the course, we introduced generalised linear mixed models (GLMMs).

Loosely speaking, GLMMs generalise linear mixed models (LMMs) in the same way that generalised linear models (GLMs) extend the standard linear regression model: by allowing to model outcomes that follow distributions other than the gaussian, such as Poisson (e.g., for count data) or binomial (e.g., for binary outcomes – looking at you, *logistic regression*). Using proper notation, in GLMMs the response for the i^{th} subject at the j^{th} measurement occasion, , is assumed to follow any distribution from the exponential family such that where is a known link function and are the subject-specific random effects, assumed to follow a multivariate distribution with zero mean and variance-covariance matrix , . Given that random effects are subject-specific, responses within each individual will be correlated.

Let’s go one step forward by defining the GLMM-equivalent of a logistic regression model, where the link function is a logistic function, and assuming a random intercept only (for simplicity): Here can represent any set of *fixed effects* covariates, such as a treatment assignment, time, or something like that, a vector of regression coefficients, and is univariately normal. Note that interpretation of the fixed effects is similar to that of regression coefficients from a logistic regression model, as they can be interpreted as log odds ratios. The main difference, however, is that this interpretation is *conditional on random effects being set to zero* – loosely speaking, this interpretation holds for an *average* subject (in terms of random effects). It is fundamentally important to keep this in mind when interpreting the results of LMMs and GLMMs!

To estimate a GLMM, we can use the maximum likelihood approach. The joint probability density function of is given by where the random effect components are however latent (e.g., unobserved and unobservable). In order to calculate the individual contributions to the likelihood , we thus need to integrate out the density of the random effects: …which unfortunately does not have a closed, analytical form for GLMMs.

Luckily, several methods have been proposed throughout the years to approximate the value of integrals that do not have a closed-form; this is generally referred to as *numerical integration*.

Now, this is where we go off a tangent; buckle up, boys.

“[…] numerical integration comprises a broad family of algorithms for calculating the numerical value of a definite integral.”

As I mentioned above, the basic problem that numerical integration aims to solve is to approximate the value of a definite integral such as to a given degree of accuracy.

The term *numerical integration* first appeared in 1915 (who knew?!) and, as you can imagine, a plethora of approaches have been proposed throughout the years to solve this problem.

Several approaches are based on deriving interpolating functions that are easy to integrate, such as polynomials of low degree (e.g., linear or quadratic). The simplest approach of this kind is the so-called *rectangle rule* (or *midpoint rule*), where the interpolating function is a constant function (i.e. a polynomial of degree zero) that passes through the midpoint: Of course, the smart thing about this approach is that we can divide the integral into a large number of sub-intervals, thus increasing the accuracy of the approximation. Let’s illustrate this by assuming that we are trying to integrate a standard normal distribution (the `dnorm()`

function in R), e.g. . With five sub-intervals, the approximation would look like:

And the approximated integral would be 1.59684 (the true value is one, given that we are integrating a distribution). With more sub-intervals, e.g. 15, the accuracy of this approximation improves:

Now, the approximated integral is 1.00003, pretty much spot on. We can test how many sub-intervals are reguired to get a good approximation:

We can see that by using approximately 8-10 sub-intervals or more we can get a (very!) good approximation. Now, intuitively this works so well because we are trying to integrate a simple function; what if we had a more complex function, such as that we want to integrate between zero and 20?

As you can see from the plot above, the performance here is much worse than before.

The simple rectangle rule can now be improved by using more complex interpolating functions (e.g., using a polynomial of degree one, leading to a *trapezoidal rule*, or a polynomial of degree two, leading to a *Simpson’s rule*). We will however jump a million steps ahead (…sorry! The Wikipedia article on numerical integration includes a bunch more details if you want to read more about this), and talk about more elaborate numerical integration methods that are actually used when estimating the likelihood function of a GLMM.

Specifically, two approaches are routinely used, the first of which is Laplace approximation. The Laplace approximation uses a second-order Taylor series expansion, based on re-writing the intractable integrand from the likelihood contributions, which allows deriving closed-form expressions for the approximation. This is generally the fastest approach, and it can be shown that the approximation is asymptotically exact for an increasing number of observations per random effect. See, e.g., here for more details on the Laplace approximation in the settings of GLMMs.

Nevertheless, in practical applications, the accuracy of the Laplace approximation may still be of concern. Thus, a second approach for numerical integration is often used to obtain more accurate approximations (while however being more computationally demanding): Gaussian quadrature.

A quadrature rule is an approximation of a definite integral of a function that is stated as a weighted sum of function values at specified points within the domain of integration. For instance, when integrating a function over the domain , such rule (using points) is defined as: This rule is *exact* for polynomials of degree or less. For our specific problem, the domain of integration is (the domain of the normal distribution of the random effects); this leads to a so-called Gauss-Hermite quadrature rule, which given the normal distribution that we are trying to approximate, has optimal properties. Specifically, Gauss-Hermite integration with a function kernel of for a normal distribution with mean and variance , leads to the following rule: with the density of a standard normal distribution (such as `dnorm()`

in R).

Let’s get back to the example that we used above, integrating a standard normal distribution, to visualise Gauss-Hermite quadrature. This can be visualised, for 5-points quadrature, as:

For 11-points quadrature:

Testing convergence of this procedure:

As before, already with 7 quadrature points we get a good approximation of the density. Anecdotally, a larger number of quadrature points might be required to get a good approximation in practice, which would require more function evaluations and therefore a larger computational burden. Furthermore, this problem becomes exponentially more complex in higher dimensions, e.g., with more than one random effect: a univariate quadrature rule with 5 points requires 5 function evaluations, while a bivariate one with the same number of points requires 5^{2}=25 evaluations every single time the intractable integral is evaluated. Interestingly, Pinheiro and Bates showed that it is possible to *adapt* the integration procedure to be more computationally efficient in higher dimensions, by centring and scaling the quadrature procedure using conditional moments of the random effects distribution.

Now, we arrived at the end of this detour in the world of numerical integration; there are many more details that I skipped here, but the references linked above should provide a good starting point for the *interested reader*. Let’s get back to business.

Remember that the individual contributions to the likelihood for a GLMM are intractable, thus requiring numerical integration:

By now you can probably appreciate that accuracy of this approximation is key, but in practical terms, what does this mean? In other words,

what happens if we fail to accurately approximate that integral?

Yes, my friends, it is finally time to get to the meat of this post, *a mere 2,000 words later*!

We can try to answer that question using statistical simulation. If you know me, you might have noticed that I think that simulations are a fantastic tool for learning about complex statistical concepts. What better than this to further prove the point, then?

Let’s start by describing the protocol of this simulation study. We use the ADEMP framework to structure the protocol (and if you’re not on the ADEMP bandwagon, jump on, there’s still some space left).

The aim of this simulation study is to test the accuracy of different numerical integration methods in terms of estimation of fixed and random effects in GLMMs.

We use a single data generating mechanism for simplicity. We simulate binary outcomes for the i^{th} subject, j^{th} occasion from the following data-generating model:

Note that:

We include a binary treatment which is assigned a random at baseline by drawing from a Bernoulli random variable with a success probability of 0.5;

Time between each measurement, in years, for each subject, is simulated by drawing from a Uniform(0, 1) random variable;

I allow for a maximum of 100 measurements per subject, and truncate follow-up after 10 years;

is a random intercept, simulated by drawing a subject-specific value from a normal distribution with mean zero and standard deviation = 4;

The regression parameters are assigned the values -2, -1, 0.5, and -0.1 for , , , and , respectively.

Finally, every simulated dataset includes 500 subjects.

The estimands of interest are the regression parameters and the standard deviation of the random intercept .

We fit the true, data-generating GLMM as implemented in the `glmer()`

function from the {lme4} package in R, where we vary, however, the numerical integration method. In `glmer()`

, this is defined by the `nAGQ`

argument. We use the following methods for numerical integration:

`nAGQ = 0`

, corresponding to a faster but less exact form of parameter estimation for GLMMs by optimizing the random effects and the fixed-effects coefficients in the penalised iteratively reweighted least-squares step;`nAGQ = 1`

, corresponding to the Laplace approximation. This is the default for`glmer()`

;`nAGQ = k`

, with`k`

number of points used for the adaptive Gauss-Hermite quadrature method. I test values of`k`

from 2 to 10, and then from 15 to 50 at steps of 5, for a total of 17 possible values of`k`

. Remember that, as the`glmer()`

documentation points out,*larger values of*.`k`

produce greater accuracy in the evaluation of the log-likelihood at the expense of speed

In total, 19 models are fit to each simulated dataset and compared with this simulation study.

The main performance measure of interest is bias in the estimands of interest, to test the accuracy of the different numerical integration methods.

An important issue to keep in mind when running simulation studies is that of Monte Carlo error. Loosely speaking, there is a certain amount of randomness in the simulation and therefore uncertainty in the estimation of the performance measures of interest, thus we have to make sure that we can estimate them accurately (e.g., with low Monte Carlo error). This can be done iteratively, e.g., by running repetitions until the Monte Carlo error drops below a certain threshold (see here for more details).

For simplicity, we take a different approach here. What we do is the following:

Running 20 repetitions of this simulation study;

Using these 20 repetitions to estimate empirical standard errors and average model-based standard errors, for each estimand, and taking the largest value (denoted with );

Using the value that was just estimated to estimate how many repetitions would be needed to constrain Monte Carlo error to be less than 0.01, using the formula n

_{sim}= ;Rounding up the value of n

_{sim}to the nearest 100, to be conservative.

n_{sim} was finally estimated to be 1700, in this case, given .

At last, here are the results of this simulation study.

First, we assess whether all Monte Carlo errors for bias are below the threshold of 0.01. The plot below shows that the number of repetitions of this simulation study was enough to constrain Monte Carlo error within what we deemed acceptable.

Note that `(Intercept)`

corresponds to , `trt`

corresponds to , `time`

corresponds to , and `trt:time`

corresponds to . Unsurprisingly, `sd__(Intercept)`

represents the standard deviation of the random intercept, .

Second, we study bias for the fixed effects (e.g., the regression coefficients of the GLMM). The following plot depicts bias with 95% confidence intervals (based on Monte Carlo standard errors) for all fixed effects and across all integration methods:

This shows that the *fast-but-approximate* method yields biased results, and that so does the Laplace approximation (in these settings, and except for the treatment by time interaction term). Overall, this shows that a larger number of points for the adaptive quadrature (e.g., 10) is required to obtain unbiased results for all regression coefficients.

Finally, the next plot shows bias for the standard deviation of the random intercept:

Again, we can see that a larger number of quadrature points is required to get a good approximation with no bias.

Another thing that is interesting to study here is estimation time. As mentioned many times before, the larger the number of quadrature points the greater the accuracy, at the cost of additional computational complexity. But how much overhead do we have in these settings?

I’m glad you asked: the following plot depicts the distribution of estimation times for each GLMMs under each integration approach:

The first thing to note is that the *fast-but-approximate* method is actually *really fast*! Second, and as expected, with more quadrature points the median estimation time was also larger. No surprises so far. Overall, though, `glmer()`

seemed to be really fast in the settings, with estimation times rarely exceeding 25 seconds.

Before we wrap up, let’s compare the different methods in practice by analysing a real dataset. For this, we use data from the 1989 Bangladesh fertility survey, which can be obtained directly from Stata:

```
library(haven)
library(dplyr)
bangladesh <- read_dta("http://www.stata-press.com/data/r17/bangladesh.dta")
glimpse(bangladesh)
```

```
Rows: 1,934
Columns: 8
$ district <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ c_use <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, …
$ urban <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ age <dbl> 18.440001, -5.559990, 1.440001, 8.440001, -13.559900, -11.559…
$ child1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0…
$ child2 <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ child3 <dbl> 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0…
$ children <dbl+lbl> 3, 0, 2, 3, 0, 0, 3, 3, 1, 3, 0, 0, 1, 3, 3, 3, 0, 3, 1, …
```

We will fit a GLMM model for a binary outcome trying to study contraceptive use by urban residence, age and number of children; we also include a random intercept by district.

For illustration purposes, we fit three models:

GLMM estimated with the

*fast-but-approximate*(`nAGQ = 0`

) approach;GLMM estimated with the Laplace approximation (

`nAGQ = 1`

);GLMM estimated with adaptive Gauss-Hermite quadrature using 50 points (

`nAGQ = 50`

).

The models can be fit with the following code:

```
library(lme4)
f.0 <- glmer(c_use ~ urban + age + child1 + child2 + child3 + (1 | district), family = binomial(), data = bangladesh, nAGQ = 0)
f.1 <- glmer(c_use ~ urban + age + child1 + child2 + child3 + (1 | district), family = binomial(), data = bangladesh, nAGQ = 1)
f.50 <- glmer(c_use ~ urban + age + child1 + child2 + child3 + (1 | district), family = binomial(), data = bangladesh, nAGQ = 50)
```

…and we use the {broom.mixed} package to tidy, summarise, and compare the results from the three models:

```
library(broom.mixed)
library(ggplot2)
bind_rows(
tidy(f.0, conf.int = TRUE) %>% mutate(nAGQ = "f.0"),
tidy(f.1, conf.int = TRUE) %>% mutate(nAGQ = "f.1"),
tidy(f.50, conf.int = TRUE) %>% mutate(nAGQ = "f.50")
) %>%
ggplot(aes(x = nAGQ, y = estimate)) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 1 / 3) +
geom_point() +
scale_y_continuous(n.breaks = 5) +
theme_bw(base_size = 12) +
facet_wrap(~term, scales = "free_y") +
labs(x = "Method (nAGQ)", y = "Point Estimate (95% C.I.)")
```

Model coefficients from the three integration approaches are very similar in this case, thus we can be confident that the results of this analysis are likely not affected by that. Let’s thus print the summary of the model using adaptive quadrature:

`summary(f.50)`

```
Generalized linear mixed model fit by maximum likelihood (Adaptive
Gauss-Hermite Quadrature, nAGQ = 50) [glmerMod]
Family: binomial ( logit )
Formula: c_use ~ urban + age + child1 + child2 + child3 + (1 | district)
Data: bangladesh
AIC BIC logLik deviance df.resid
2427.7 2466.6 -1206.8 2413.7 1927
Scaled residuals:
Min 1Q Median 3Q Max
-1.8140 -0.7662 -0.5087 0.9973 2.7239
Random effects:
Groups Name Variance Std.Dev.
district (Intercept) 0.2156 0.4643
Number of obs: 1934, groups: district, 60
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.689295 0.147759 -11.433 < 2e-16 ***
urban 0.732285 0.119486 6.129 8.86e-10 ***
age -0.026498 0.007892 -3.358 0.000786 ***
child1 1.116005 0.158092 7.059 1.67e-12 ***
child2 1.365893 0.174669 7.820 5.29e-15 ***
child3 1.344030 0.179655 7.481 7.37e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) urban age child1 child2
urban -0.288
age 0.449 -0.044
child1 -0.588 0.054 -0.210
child2 -0.634 0.090 -0.382 0.489
child3 -0.751 0.093 -0.675 0.538 0.623
```

Furthermore, we can interpret exponentiated fixed effect coefficients as odds ratios:

`exp(fixef(f.50))`

```
(Intercept) urban age child1 child2 child3
0.1846497 2.0798277 0.9738498 3.0526345 3.9192232 3.8344665
```

This shows that contraceptive use is higher in urban versus rural areas, with double the odds, that older women have lower odds of using contraceptives, and that the odds of using contraceptives are much higher in women with one or more children (odds ratios between 3 and 4 for women with 1, 2, or 3 children versus women with no children) – all else being equal. Nonetheless, remember that these results are conditional on a random intercept of zero, e.g., loosely speaking, for *an average district*.

Finally, there is also significant heterogeneity in contraceptive use between districts, with a variance for the random intercept of 0.2156. The distribution of the (log-) baseline odds of contraceptive use across districts can be visualised as:

```
ggplot(data = coef(f.50)$district, aes(x = `(Intercept)`)) +
geom_density() +
theme_bw(base_size = 12) +
labs(x = "Baseline log-odds of contraceptive use", y = "Density")
```

In conclusion, we saw how easy it is to get biased fixed effect estimates in GLMMs if we use a numerical integration method that is not accurate enough. Of course, this is just a single scenario and not a comprehensive study of estimation methods for GLMMs, but it should nonetheless show the importance of considering the estimation procedure when using (somewhat) more advanced statistical methods. This is, unfortunately, often overlooked in practice. My *friendly* suggestion would be to fit multiple models with different integration techniques, and if no significant differences are observed, then great: you got yourself some results that are robust to the numerical integration method being used.

This wraps up the longest blog post I’ve ever written (at ~3,500 words): if you made it this far, thanks for reading, and I hope you found this educational. And as always, please do get in touch if I got something wrong here.

Finally, R code for the simulation study, if you want to replicate the analysis or adapt it to other settings, is of course freely available on a GitHub repository here.

Cheers!

Ian Goodfellow, Apple’s director of machine learning, is leaving the company due to its return to work policy. In a note to staff, he said “I believe strongly that more flexibility would have been the best policy for my team.” He was likely the company’s most cited ML expert.

— Zoë Schiffer (@ZoeSchiffer) May 7, 2022

I have had this conversation with peers and co-workers over and over again in the past few months: there is no going back to pre-pandemic life. And while I value *in-person* activities (looking at you, in-person research seminars), I am totally sure that organisations failing to accommodate what people want will lose talent to more flexible workplaces.

*Update: as of May 18, 2022, it looks like he is joining Alphabet’s DeepMind.* *What a loss for Apple.*

Today’s blog post is a long time in the making, as I have been playing around with what we’re going to see today for quite a while now.

Let’s start with {torch}: what is that? Well, {torch} is an R package wrapping the `libtorch`

C++ library underlying the PyTorch open-source machine learning framework. It provides a variety of tools for developing machine learning methods, but there’s more: what we will focus on here is automatic differentiation and general-purpose optimisers.

Having these tools at our disposal lets us implement maximum likelihood estimation with state of the art tools. I will illustrate this using a simple parametric survival model, but as you can imagine, this generalises to more complex methods.

We will optimise an exponential survival model, for simplicity, whose log-likelihood function can be written as

Here is the event indicator variable, is the model design matrix, are regression coefficients, and is the observed time. Note that includes an intercept, which corresponds to the rate parameter on the log scale.

Other parametric distributions (such as Weibull, Gompertz, etc.) are equally easy to implement, let me know *if you fancy trying this out for yourself*!

Let’s start by simulating some data. To simulate survival data from a given parametric distribution, I use the inversion method as described in Bender *et al*., assuming a single binary covariate (e.g., a binary treatment):

```
set.seed(183475683) # For reproducibility
N <- 100000
lambda <- 0.2
beta <- -0.5
covs <- data.frame(id = seq(N), trt = stats::rbinom(N, 1L, 0.5))
# Inversion method for survival times:
u <- runif(N)
T <- -log(u) / (lambda * exp(covs$trt * beta))
```

We also apply administrative censoring at time 5:

```
d <- as.numeric(T <= 5)
T <- pmin(T, 5)
s1 <- data.frame(id = seq(N), eventtime = T, status = d)
dd <- merge(covs, s1)
```

We simulate data for 10^{5} subjects (which should be plenty enough to get pretty close to the truth if our implementation is correct), a rate parameter of 0.2 and a regression coefficient of -0.5. If interested, a more general implementation of this method can be found in the {simsurv} package.

As a test, let’s fit and plot a Kaplan-Meier survival curve:

```
library(survival)
KM <- survfit(Surv(eventtime, status) ~ 1, data = dd)
plot(KM, xlab = "Time", ylab = "Survival")
```

Looks alright!

Now, we implement the (log-) likelihood function using {torch}. The important thing to remember here is that {torch} uses tensors, on which we need to operate e.g. using `torch_multiply`

for matrix multiplication:

```
library(torch)
log_likelihood <- function(par, data, status, time) {
ll <- torch_multiply(status, torch_mm(data, par)) -
torch_multiply(torch_exp(torch_mm(data, par)), time) +
torch_multiply(status, torch_log(time))
ll <- -torch_sum(ll)
return(ll)
}
```

As a test, let’s define starting values for the model parameters (e.g., fixing their values at 1) and calculate the value of the (negative) log-likelihood function:

```
xx <- torch_tensor(matrix(c(1, 1), ncol = 1))
log_likelihood(
par = xx,
data = torch_tensor(data = model.matrix(~trt, data = dd)),
status = torch_tensor(matrix(dd$status, ncol = 1)),
time = torch_tensor(matrix(dd$eventtime, ncol = 1))
)
```

```
torch_tensor
1.71723e+06
[ CPUFloatType{} ]
```

Looking good so far.

The final step consists of implementing the algorithm to optimise the likelihood. We start by re-defining starting values:

`x_star <- torch_tensor(matrix(c(1, 1), ncol = 1), requires_grad = TRUE, )`

Here we need to use the argument `requires_grad = TRUE`

to use automatic differentiation and get gradients *for free*.

Next, we pick a general-purpose optimiser:

`optimizer <- optim_lbfgs(params = x_star, line_search_fn = "strong_wolfe")`

We pick the L-BFGS algorithm with strong Wolfe conditions for the line search algorithm, but any would do. Note that a comparable algorithm is implemented in base R as `optim()`

’s `L-BFGS-B`

method.

We also need to define one extra function that will be used in the optimisation loop to make each step towards the optimum:

```
one_step <- function() {
optimizer$zero_grad()
value <- log_likelihood(
par = x_star,
data = torch_tensor(data = model.matrix(~trt, data = dd)),
status = torch_tensor(matrix(dd$status, ncol = 1)),
time = torch_tensor(matrix(dd$eventtime, ncol = 1))
)
value$backward(retain_graph = TRUE)
value
}
```

We finally have all the bits to actually optimise the likelihood.

We define the required precision as `eps = 1e-6`

, and we loop until the difference in log-likelihood is less than (or equal to) `eps`

:

```
eps <- 1e-6 # Precision
converged <- FALSE # Used to stop the loop
last_val <- Inf # Need a value to compare to for the first iteration
i <- 0 # Iterations counter
while (!converged) {
i <- i + 1
obj_val <- optimizer$step(one_step)
if (as.logical(torch_less_equal(torch_abs(obj_val - last_val), eps))) {
print(i) # This will print how many iterations were required before stopping
converged <- TRUE
break
}
if (i >= 10000) {
# For safety
stop("Did not converge after 10000 iterations", call. = FALSE)
}
last_val <- obj_val
}
```

`[1] 3`

That’s it! The results of the optimisation are contained in the `x_star`

object:

`x_star`

```
torch_tensor
-1.6160
-0.4951
[ CPUFloatType{2,1} ][ requires_grad = TRUE ]
```

…remember that the true values that we simulated data from were:

`log(lambda)`

`[1] -1.609438`

```
# and
beta
```

`[1] -0.5`

Which is pretty close to what we had estimated. Of course, this is a single replication only, and we might want to test this with smaller sample sizes. Nevertheless, the test sample size is large enough that I would feel comfortable with this implementation.

One thing that is missing from the implementation above is the estimation of confidence intervals for the model parameters.

We get the gradients *for free*, so that should be straightforward after inverting the Hessian matrix at the optimum. Despite that, the R interface does not implement (yet) direct calculation of the Hessian via the `torch.autograd.functional.hessian`

function so we need to work a little harder for that.

Specifically, we have to differentiate the gradients again to obtain the Hessian matrix:

```
ll <- log_likelihood(
par = x_star,
data = torch_tensor(data = model.matrix(~trt, data = dd)),
status = torch_tensor(matrix(dd$status, ncol = 1)),
time = torch_tensor(matrix(dd$eventtime, ncol = 1))
)
grad <- autograd_grad(ll, x_star, retain_graph = TRUE, create_graph = TRUE)[[1]]
# Using base R matrix here for simplicity
hess <- matrix(data = NA, nrow = length(x_star), ncol = length(x_star))
for (d in 1:length(grad)) {
hess[d, ] <- as_array(autograd_grad(grad[d], x_star, retain_graph = TRUE)[[1]])
}
```

The variance-covariance matrix for the model coefficients will now be the inverse of the Hessian:

`vcv <- solve(hess)`

To wrap up, the fitted model coefficients (with standard errors) will be:

```
results <- data.frame(beta = as_array(x_star), se = sqrt(diag(vcv)))
results
```

```
beta se
1 -1.6159527 0.005639768
2 -0.4950813 0.008709507
```

Hopefully, the {torch} package in R will soon port the automatic Hessian calculation, which will simplify things further.

Finally, for comparison, we fit the same model using the equivalent R implementation from the (experimental) {streg} package:

```
library(streg)
expfit <- streg(
formula = Surv(eventtime, status) ~ trt,
data = dd,
distribution = "exp"
)
summary(expfit)
```

```
Exponential regression -- log-relative hazard form
N. of subjects = 100000
N. of failures = 54174
Time at risk = 345727.5
Log likelihood = -131655.9
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.615311 0.005637 -286.54 <2e-16 ***
trt -0.495611 0.008707 -56.92 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

This also uses automatic differentiation (via {TMB}), but nevertheless… pretty close, isn’t it?

That’s it for today, and as always, thank you for reading and feel free to get in touch if I got something terribly wrong or if you just want to have a chat about it. Cheers!

The blog is back! And just in time to wrap up 2021.

Let’s get straight to business, shall we? Today we’re going to talk about ways of assessing whether your simulation study has *converged*. This is something that’s been on my mind for a while, and I finally got some focused time to write about it.

To do so, let’s first define what we mean with *converged*:

We define a simulation study as

convergedif:

- Estimation of our key performance measures is stable, and
- Monte Carlo error is small enough.

We are going to use the `MIsim`

dataset, which you should be familiar with by now if you’ve been here before, and which comes bundled with the {rsimsum} package:

```
library(rsimsum)
data("MIsim", package = "rsimsum")
head(MIsim)
```

```
# A tibble: 6 × 4
dataset method b se
<dbl> <chr> <dbl> <dbl>
1 1 CC 0.707 0.147
2 1 MI_T 0.684 0.126
3 1 MI_LOGT 0.712 0.141
4 2 CC 0.349 0.160
5 2 MI_T 0.406 0.141
6 2 MI_LOGT 0.429 0.136
```

To find out more about this data, type `?MIsim`

in your R console after loading the {rsimsum} package.

Interestingly, this dataset already includes a column named `dataset`

indexing each repetition of the simulation study. If that was not the case, we should create such a column at this stage.

Let’s also load the {tidyverse} package for data wrangling and visualisation with {ggplot2}:

`library(tidyverse)`

The first step consists of computing performance measures *cumulatively*, e.g. for the first `i`

repetitions (where `i`

goes from 10 to the total number of repetitions that we run for our study, in this case, 1000). We use 10 here as the starting point as we assume that we should run at least 10 repetitions to get any useful result.

This can be easily done using our good ol’ friend the `simsum`

function and `map_`

from the {purrr} package; specifically, we use `map_dfr()`

to get a dataset obtained by stacking rows instead of a list object. Notice also that we only focus on *bias* as the performance measure of interest here; this could be done, in principle, for any other performance measure.

```
full_results <- map_dfr(
.x = 10:1000,
.f = function(i) {
s <- simsum(
data = filter(MIsim, dataset <= i),
estvarname = "b",
se = "se",
true = 0.5,
methodvar = "method",
ref = "CC"
)
s <- summary(s)
results <- rsimsum:::tidy.summary.simsum(s)
results <- filter(results, stat == "bias") %>%
mutate(i = i)
return(results)
}
)
head(full_results)
```

```
stat est mcse method lower upper i
1 bias 0.006621292 0.04267397 CC -0.07701815 0.09026074 10
2 bias 0.016931728 0.03830584 MI_LOGT -0.05814635 0.09200980 10
3 bias 0.008187965 0.03047714 MI_T -0.05154612 0.06792205 10
4 bias 0.009806267 0.03873124 CC -0.06610556 0.08571810 11
5 bias 0.021526182 0.03495223 MI_LOGT -0.04697892 0.09003129 11
6 bias 0.016194059 0.02870663 MI_T -0.04006990 0.07245802 11
```

We can use these results to plot estimated bias (and corresponding Monte Carlo standard errors) over repetition numbers:

```
library(hrbrthemes) # Provide some nicer themes and colour scales
ggplot(full_results, aes(x = i, y = est, color = method)) +
geom_line() +
scale_color_ipsum() +
theme_ipsum_rc(base_size = 12) +
theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
labs(x = "Repetition #", y = "Estimated Bias", color = "Method", title = "Bias over (cumulative) repetition number")
```

```
ggplot(full_results, aes(x = i, y = mcse, color = method)) +
geom_line() +
scale_color_ipsum() +
theme_ipsum_rc(base_size = 12) +
theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
labs(x = "Repetition #", y = "Estimated Monte Carlo S.E.", color = "Method", title = "Monte Carlo S.E. over (cumulative) repetition number")
```

What shall we be looking for in these plots?

Basically, if (and when) the curves *flatten*. Remember *flatten the curve*? That applies here as well.

For Monte Carlo standard errors, we want to also check when they reach a threshold of uncertainty (e.g. 0.01) that we are willing to accept. We can also add this threshold to the plot to help our intuition:

```
ggplot(full_results, aes(x = i, y = mcse, color = method)) +
geom_hline(yintercept = 0.01, color = "red", linetype = "dotted") +
geom_line() +
scale_color_ipsum() +
theme_ipsum_rc(base_size = 12) +
theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
labs(x = "Repetition #", y = "Estimated Monte Carlo S.E.", color = "Method", title = "Monte Carlo S.E. over (cumulative) repetition number")
```

The plots above seem to be stable enough after ~500 repetitions, while Monte Carlo errors cross our threshold after ~250 repetitions. If I had to interpret this, I would be satisfied with the convergence of the study.

The dataset with results (`full_results`

) includes confidence intervals for estimated bias, at each (cumulative) repetition number, based on Monte Carlo standard errors. We can, therefore, further enhance the first plot introduced above:

```
ggplot(full_results, aes(x = i, y = est)) +
geom_ribbon(aes(ymin = lower, ymax = upper, fill = method), alpha = 1 / 5) +
geom_line(aes(color = method)) +
scale_fill_ipsum() +
scale_color_ipsum() +
theme_ipsum_rc(base_size = 12) +
theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
labs(x = "Repetition #", y = "Estimated Bias", color = "Method", fill = "Method")
```

Adding confidence intervals surely enhances our perception of *stability*.

Now, we could stop here and call it a day, but where’s the fun in that? Let’s take it to the next level.

I think that an even better way of assessing convergence is to check whether *the incremental value of an additional repetition* affects the results of the study (or not). When running additional repetitions stop adding value (e.g. changing the results), then we can (safely?) assume we have converged to stable results.

Let’s, therefore, calculate this difference (e.g. bias at i^{th} iteration versus bias at the (i-1)^{th} iteration), and let’s do it for both estimated bias and Monte Carlo standard error:

```
full_results <- arrange(full_results, method, i) %>%
group_by(method) %>%
mutate(
lag_est = lag(est),
lag_mcse = lag(mcse),
diff_est = est - lag_est,
diff_mcse = mcse - lag_mcse
)
```

This *difference* is what we now decide to plot:

```
p1 <- ggplot(full_results, aes(x = i, y = diff_est, color = method)) +
geom_line() +
scale_color_ipsum() +
theme_ipsum_rc(base_size = 12) +
theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
labs(x = "Repetition #", y = "Difference", color = "Method", title = expression(Bias ~ difference:~ i^th - (i - 1)^th))
p1
```

```
p2 <- ggplot(full_results, aes(x = i, y = diff_mcse, color = method)) +
geom_line() +
scale_color_ipsum() +
theme_ipsum_rc(base_size = 12) +
theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
labs(x = "Repetition #", y = "Difference", color = "Method", title = expression(Monte ~ Carlo ~ S.E. ~ difference:~ i^th - (i - 1)^th))
p2
```

We save both plot objects `p1`

and `p2`

, as we might be using that again later on. What we want here are the curves to flatten around zero, which seems to be the case for this example.

The y-scale of the plot is highly influenced by larger differences early on, so we could decide to focus on a narrower range around zero, e.g. for point estimates:

`p1 + coord_cartesian(ylim = c(-0.003, 0.003))`

With this, we can confirm what we suspected a few plots ago: that *yes, the incremental value of running more than ~500 repetitions is somewhat marginal here*.

To wrap up, there is one final thing I would like to mention here: all we did is *post-hoc*. We already run the study, hence all we get to do is to check whether the results are stable and precise enough (in our loose terms defined above).

The implication is an interesting one, in my opinion: you *do not* need a bunch of iterations if you can avoid it, especially if each repetition is expensive (computationally speaking).

What you could do, if you really wanted to, is to define a *stopping rule* such as:

After N (e.g. 10) consecutive iterations with a difference below a certain threshold (for the key performance measure), stop the study;

After you reach a given precision for bias (in terms of Monte Carlo error), stop the study;

You name it.

Of course, there are other things to consider such as the sequentiality of repetitions, especially if there are missing values (e.g. due to non-convergence of some iterations) – this is not meant to be, in any way, a comprehensive take on the topic. Feel free to reach out, as always, if you have any comments. Nevertheless, I think I will get back to this topic eventually, so stay tuned for that.

That’s all from me for today, then it must be closing time: talk to you soon and, in the meantime, take care!

]]>It’s been a while since the last post on this website… don’t worry (I am sure you didn’t), I’m still here, just been busy with a bunch of life- and work-related things.

This post is to introduce the latest release of the {rsimsum} R package, version 0.11.0, which landed on CRAN last week, on October 20^{th}.

This is a minor release, with some bug fixes and (more interestingly) the introduction of a new feature that was suggested in #22 on GitHub by Li Ge: `print()`

methods for summary objects now invisibly return the formatted tables that are printed to the console.

Ok, but why should I care about that?

It’s simple: you can print subset of results (e.g. for a presentation) much more easily, as the internals of {rsimsum} will deal with some of the formatting for you.

Here’s an example, using the `MIsim`

dataset that comes bundled with the package:

```
library(rsimsum)
data("MIsim", package = "rsimsum")
```

We summarise this simulation study as showed in the documentation here:

```
s <- simsum(
data = MIsim,
estvarname = "b",
se = "se",
true = 0.5,
methodvar = "method",
ref = "CC"
)
sums <- summary(s, stats = c("bias", "cover"))
sums
```

```
Values are:
Point Estimate (Monte Carlo Standard Error)
Bias in point estimate:
CC MI_LOGT MI_T
0.0168 (0.0048) 0.0009 (0.0042) -0.0012 (0.0043)
Coverage of nominal 95% confidence interval:
CC MI_LOGT MI_T
0.9430 (0.0073) 0.9490 (0.0070) 0.9430 (0.0073)
```

This is the standard workflow, and we focus on bias and coverage probability for simplicity. At this point, we could copy-paste a subset of the above results in our slides-making tool of choice.

However, that’s not particularly user-friendly, nor easily reproducible (e.g. if you run more iterations and need to update your results). Here’s where invisibly-returned formatted tables come handy:

`output <- print(sums)`

```
Values are:
Point Estimate (Monte Carlo Standard Error)
Bias in point estimate:
CC MI_LOGT MI_T
0.0168 (0.0048) 0.0009 (0.0042) -0.0012 (0.0043)
Coverage of nominal 95% confidence interval:
CC MI_LOGT MI_T
0.9430 (0.0073) 0.9490 (0.0070) 0.9430 (0.0073)
```

Note here that the output of `sums`

is printed once again, but it is also stored in a variable named `output`

, which contains the formatted tables for each summary statistic of interest:

`str(output)`

```
List of 2
$ Bias in point estimate :'data.frame': 1 obs. of 3 variables:
..$ CC : chr "0.0168 (0.0048)"
..$ MI_LOGT: chr "0.0009 (0.0042)"
..$ MI_T : chr "-0.0012 (0.0043)"
$ Coverage of nominal 95% confidence interval:'data.frame': 1 obs. of 3 variables:
..$ CC : chr "0.9430 (0.0073)"
..$ MI_LOGT: chr "0.9490 (0.0070)"
..$ MI_T : chr "0.9430 (0.0073)"
```

With this data at our disposal, we can finally print a better table using the general-purpose `kable()`

function from the {knitr} package:

```
w <- output$`Bias in point estimate`
knitr::kable(
x = w,
align = rep("c", ncol(w))
)
```

CC | MI_LOGT | MI_T |
---|---|---|

0.0168 (0.0048) | 0.0009 (0.0042) | -0.0012 (0.0043) |

Of course we would need to further improve on this for *production* (e.g. it now spans the whole width of the enclosing `<div>`

, we might want to style it and resize it accordingly using css), but it is already a clear improvement. Most interestingly, consider including all of the above in a dynamically updated slides deck (e.g. created using the {xaringan} package): much better than before, with very little extra work, and plenty of room for further adjustments if you wish. Not bad!

Finally, the example above is for the `summary.simsum()`

method: this is implemented for multiple estimands as well (in the `summary.multisimsum()`

function), with an additional layer of nesting for the formatted output. I’m sure it’ll be straightforward to figure that out if you want to try it out, let me know if it isn’t.

That’s all for now, hope you find this useful and if you have further suggestions for features you’d like to see in {rsimsum}, don’t hesitate to get in touch or to open an issue on GitHub.

]]>Earlier today I stumbled upon this tweet by Sean J. Taylor:

Without doing the math or looking it up, approximately how many coin flips do you need to observe in order to be 95% confident you know Prob(Heads) to within +/- 1%?

— Sean J. Taylor (@seanjtaylor) April 22, 2021

…and I asked myself:

Can we answer this using simulation?

The answer is that *yes, yes we can*! Let’s see how we can do that using R (of course). Looks like this is now primarily a statistical simulation blog, but hey…

Let’s start by loading some packages and setting an RNG seed for reproducibility:

```
library(tidyverse)
library(hrbrthemes)
set.seed(3756)
```

Then, we define a function that we will use to replicate this experiment:

```
simfun <- function(i, .prob, .diff = 0.02, .N = 100000, .alpha = 0.05) {
draw <- rbinom(n = .N, size = 1, prob = .prob)
df <- tibble(
n = seq(.N),
x = cumsum(draw),
mean = x / n,
lower = (1 + (n - x + 1) / (x * qf(.alpha / 2, 2 * x, 2 * (n - x + 1))))^(-1),
upper = (1 + (n - x) / ((x + 1) * qf(1 - .alpha / 2, 2 * (x + 1), 2 * (n - x))))^(-1),
width = upper - lower
)
out <- tibble(i = i, prob = .prob, diff = .diff, y = min(which(df$width <= .diff)))
return(out)
}
```

This function:

Repeates the experiment for 1 to 100,000 (

`.N`

) draws from a binomial distribution (e.g. our coin toss), with success probability`.prob`

;Estimates the cumulative proportion of heads (our ones) across all number of draws;

Estimates confidence intervals using the exact formula;

Calculates the width of the confidence interval. For a +/- 1% precision, that corresponds to a width of 2% (or 0.02, depending on the scale being used);

Finally, returns the first number of draws where the width is 0.02 (or less). This will tell us how many draws we needed to draw to get a confidence interval that is narrow enough for our purpose.

Then, we run the experiment `B = 200`

times, with different values of `.prob`

(as we want to show the required sample sizes over different success probabilities):

```
B <- 200
results <- map_dfr(.x = seq(B), .f = function(j) {
simfun(i = j, .prob = runif(1))
})
```

Let’s plot the results (using a LOESS smoother) versus the success probability:

```
ggplot(results, aes(x = prob, y = y)) +
geom_point() +
geom_smooth(method = "loess", size = 1, color = "#575FCF") +
scale_x_continuous(labels = scales::percent) +
scale_y_continuous(labels = scales::comma) +
coord_cartesian(xlim = c(0, 1)) +
theme_ipsum(base_size = 12, base_family = "Inconsolata") +
theme(plot.margin = unit(rep(1, 4), "lines")) +
labs(x = "Success probability", y = "Required sample size")
```

Looks like — as one would expect — the largest sample size required is for a success probability of 50%. What is the solution to the riddle then? As any statistician would confirm, *it depends*: if the coin is *fair*, then we would need approximately 10,000 tosses. Otherwise, the required sample size can be much smaller, even a tenth. Cool!

Finally, let’s repeat the experiment for a precision of +/- 3%:

```
results_3p <- map_dfr(.x = seq(B), .f = function(j) {
simfun(i = j, .diff = 0.06, .prob = runif(1))
})
```

Comparing with the previous results (code to produce the plot omitted for simplicity):

For a required precision of +/- 3% the sample size that we would need is much smaller, but the shape of the curve (i.e. versus the success probability) remains the same.

Finally, the elephant in the room: of course we could answer this using our old friend *mathematics*, or with a bit of Bayesian thinking. But where’s the fun in that? This simulation approach lets us play around with parameters and assumptions (e.g. what would happen if we assume a difference confidence level `.alpha`

?), and it’s quite intuitive too. And yeah, let’s be honest: programming and running the experiment is just *much* more fun!

*As always, please do point out all my errors on Twitter. Cheers!*

Yes, you read it right, you’re not dreaming: statistical simulation is now cool!

As you might now, I am also a strong supporter of statistical simulation. Traditionally you might use it to test a new statistical method you are developing, or to compare different methods in unusual settings. You might also want to try to see where, when, and if a method breaks in edge cases.

However, two other (very important) use-cases are highlighted in the talks that I mentioned above:

Simulation for learning and teaching,

Simulation to drive agile infrastructures.

The first scenario is described very well by Chelsea Parlett-Pelleriti:

There are three main take-home messages:

Statistical simulation encourages exploration,

Tests intuition,

And empowers a deeper understanding of complex statistical methods.

I stand by all of these points.

In fact, that’s a great and accessible way to learn: by trying, and trying, and trying again until you finally get it. And if you follow a simulation exercise, it’s even easier: you can modify the parameters at will (starting from a solid foundation), explore, and follow your intuition. I mean, isn’t this the *scientific method* all along?

The second talk is by Richard Vogg:

During the talk, he gives a couple of examples that highlight the importance of being able to *compose data* at will:

When you cannot share the real dataset (e.g. for privacy reason), it is useful to have

*real-like*data that can be shared more freely. There’s a decent amount of research (and growing interest) on the topic, see e.g. this paper by Dan Quintana;When you don’t have the data you’re supposed to be building pipelines for (yet), it is useful to have data that is similar to what you expect receiving. That gives you a head start to start e.g. prototyping;

When you’re running internal training courses for staff, it is useful to use datasets that resemble what you actually work with (e.g. transactions, clients, etc.).

…the more you think about it, the more you realise how useful that can be!

In conclusion, both talks are excellent, straight to the point, and engaging. I recommend you give it a look, it will (literally) take just 10 minutes of your time.

See you soon!

]]>This made me think about mixed cure fraction models, that is a model of the kind:

with being the proportion cured and the survival function for the uncured subjects. Assuming follows an exponential distribution (for simplicity and from now onward), this corresponds to a survival curve like the following:

where . With this model, the cure fraction creates an asymptote for the survival function, as % of subjects are deemed to be *cured*; of course, can be estimated from data. Interpretation of mixed cure fraction models is awkward, as one is assuming that subjects are split into *cured* and *uncured* at the beginning of the follow-up (which might not be the most realistic assumption), but discussing cure models is outside the scope of this blog post (and there are many papers you could find on the topic).

Back to the question, how to simulate from this survival function?

Well, as with most things these days, we can use the inversion method! First, we actually have to define which individuals are cured and which are not. To do so, we draw from a Bernoulli distribution with success probability :

```
set.seed(4875) # for reproducibility
n <- 100000
pi <- 0.2
cured <- rbinom(n = n, size = 1, prob = pi)
prop.table(table(cured))
```

```
cured
0 1
0.80074 0.19926
```

Then, we simulate survival times e.g. from an exponential distribution (with ). We use the formulae described in Bender *et al.* (which is in fact the inversion method applied to survival functions):

```
lambda <- 0.5
u <- runif(n = n)
S <- -log(u) / lambda
```

Finally, we assign to cured individuals an infinite survival time (as they will not experience the event ever):

`S[cured == 1] <- Inf`

Of course, now we are required to censor individuals, say after 20 years of follow-up:

```
status <- S <= 20
S <- pmin(S, 20)
df <- data.frame(S, status)
```

…and there you have it, it’s done.

What we want to do next is checking that we are actually simulating from the model we think we are simulating from. Let’s first check by fitting a Kaplan-Meier curve:

```
library(survival)
KM <- survfit(Surv(S, status) ~ 1, data = df)
plot(KM)
abline(h = pi, col = "red")
```

…which looks similar to the theoretical curve depicted above. Good!

Then, we could try to fit the data-generating model using maximum likelihood. The density function of the cure model is defined as

where and is the density function for the non-cured subjects (e.g. from the exponential distribution, from our example above). Therefore, the likelihood contribution for the i^{th} subject can be defined as:

In R, that (actually, the log-likelihood) can be defined as:

```
log_likelihood <- function(par, t, d) {
lambda <- exp(par[1])
pi <- boot::inv.logit(par[2])
f <- (1 - pi) * lambda * exp(-lambda * t)
S <- pi + (1 - pi) * exp(-lambda * t)
L <- (f^d) * (S^(1 - d))
l <- log(L)
return(-sum(l))
}
```

Some comments on the above function:

I chose to model the parameter on the log-scale, and on the logit scale. By doing so, we don’t have to constrain the optimisation routine to respect the boundaries of the parameters, as and ;

`f`

contains the density function of an exponential distribution (which is );`S`

contains the survival function of an exponential distribution;I take the log of the likelihood function (which generally behaves better when optimising numerically) and return the negative log-likelihood value (as R’s

`optim`

minimises the objective function by default).

Now, all we have to do is define some starting values `stpar`

and then run `optim`

.

```
stpar <- c(ln_lambda = 0, logit_pi = 0)
optim(par = stpar, fn = log_likelihood, t = df$S, d = df$status, method = "L-BFGS-B")
```

```
$par
ln_lambda logit_pi
-0.6945243 -1.3909736
$value
[1] 185584.3
$counts
function gradient
8 8
$convergence
[1] 0
$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
```

The true values were:

`log(lambda)`

`[1] -0.6931472`

`boot::logit(pi)`

`[1] -1.386294`

…which (once again) are close enough to the fitted values. It does seem like we are actually simulating from the mixture cure model described above.

In conclusion, obviously this might not fit the actual experiment too well and therefore we might have to try other approaches before considering our work done. For instance, we could simulate from an hazard function that goes to zero after a given amount of time : this approach can easily be implemented using the {simsurv} package. *The proof is left as an exercise for the interested reader*.

Cheers!

]]>Reviews just started coming out, and everyone seems to be praising the performance of both optimised and translated software running on these new low-powered devices. Synthetic benchmarks and real-life tests are matching the performance of e.g. a MacBook Air to a 2019 16” MacBook Pro or even a 2017 iMac Pro. Which is mad!

As an R user, should you just run out and buy a new macOS device with Apple Silicon? Well, not so fast, tiger! A post on the R-developer blog dives deeper on the current status of R on Apple Silicon:

R seems to be running fine through the translation layer, but of course that is not optimised and performance

*should*be worse than running natively;R can already run on ARM32/ARM64 devices. Heck, it can even run on a Raspberry Pi Zero with a single-core 1 GHz CPU and 512 MB of RAM! However, before it can be compiled to Apple Silicon the whole stack of compilers will need to be updated/ported to the new architecture; work seems to be underway, so I believe it won’t take long.

There are inconsistencies with how

`NA`

/`NaN`

are handled and propagated, but that is platform-specific and it is already*complicated*in the x86/x64 world. Writing code that can reliably preserve`NA`

and`NaN`

values will require ad-hoc checks.

This is just my quick executive summary, read through the blog post above for more details.

If you know me, you know how much I like tiny, silent computers. I mean, the MacBook Adorable is my favourite macOS laptop I never got to own and I got *so* close to buying a Surface Go 2 when it was announced earlier this year. If I had to get a desktop computer, I would probably get an Intel NUC or build into a small Mini-ITX case. And don’t get me started on Raspberry Pi boards! You bet I am very excited about the idea of a next-generation, fanless MacBook Air with great performance and excellent battery life!

Another potentially big thing is the *Neural Engine* that is embedded into Apple Silicon chips. Will R be able to link to that dedicated chip to accelerate machine learning (and potentially other) computations? Time will tell…

Finally, my understanding is that tools like `brew`

and `git`

are currently broken(-ish) on Apple Silicon. I am sure they will be updated for the new architecture soon, though.

Nevertheless, I am happy that the Mac is now an exciting platform once again after years of stagnation. I look forward to seeing how this evolves in the coming months, and I am definitely not upgrading my current laptop any time soon.

There is now a thread on the R-SIG-Mac mailing list where Prof. Brian Ripley gives first impressions and benchmarks on building CRAN’s R 4.0.3 on a M1 MacBook Air with 8 GB of RAM.

First results seems very promising: building on top of Rosetta is actually faster than building on a 2016 2.0 GHz i5 MacBook Pro, but lots of work remains to be done to have a build ready for general use.

Furthermore, did you see those performance benchmarks for Mac-optimised TensorFlow training on Apple Silicon? I guess *yes*, machine learning computations will be greatly accelerated on new Apple hardware once fully optimised!

Automatic differentiation (in brief) is an algorithmic method that *automagically* and efficiently yields accurate derivatives of a given value. This is very interesting, as by using automatic differentiation we can easily get gradients and the Hessian matrix. Which are extremely useful when fitting models using the maximum likelihood method! On top of that, automatic differentiation is generally more efficient and accurate than symbolic or numerical differentiation.

There’s a bunch of examples of using {TMB} in practice, some are very simple and straightforward to follow, some are much more complex. I am just getting started, so I went straight to the linear regression example on the TMB documentation webpage. The C++ template for that simple model is:

```
#include <TMB.hpp>
template<class Type>
Type objective_function<Type>::operator() ()
{
DATA_VECTOR(Y);
DATA_VECTOR(x);
PARAMETER(a);
PARAMETER(b);
PARAMETER(logSigma);
ADREPORT(exp(2*logSigma));
Type nll = -sum(dnorm(Y, a+b*x, exp(logSigma), true));
return nll;
}
```

…which can be compiled from within R and passed to any general-purpose optimiser (such as `optim`

) to obtain maximum likelihood estimates for a linear regression model.

It is however interesting to generalise this to any number of covariates, using matrix-by-array multiplication to efficiently scale up any problem. This is relevant to implement general statistical modelling packages. Surprisingly, I had to fiddle around way more than I expected to do that: therefore, I thought about writing a blog post (which hopefully could be useful to others trying to get started with TMB)!

*Disclaimer:* I am no {TMB} nor C++ expert, so forgive me if I am missing something and let me know if there’s anything that needs fixing here.

So, let’s get started. We first need to write our C++ function for the negative log-likelihood function, which is a simple adaptation of the template we saw before for a single-covariate model:

```
#include <TMB.hpp>
template<class Type>
Type objective_function<Type>::operator() ()
{
DATA_VECTOR(Y);
DATA_MATRIX(X);
PARAMETER_VECTOR(b);
PARAMETER(logSigma);
Type nll = sum(dnorm(Y, X*b , exp(logSigma), true));
return -nll;
}
```

Here we needed to change `X`

to a `DATA_MATRIX()`

, and `b`

to a `PARAMETER_VECTOR`

. Note that `X`

here should be the design matrix of the model, e.g. obtained using the `model.matrix()`

function in R, and that the matrix-array multiplication in C++ `X*b`

is *not* the element-wise operation, and is equivalent to `X %*% b`

in R. This is it: easy right?

Let’s now verify that this works as expected. We can compile the C++ function and dynamically link it using the tools from the {TMB} package (assuming `model.cpp`

contains the function defined above):

```
library(TMB)
compile("model.cpp")
dyn.load(dynlib("model"))
```

We need to simulate some data now, e.g. with a single covariate for simplicity:

```
set.seed(42)
n <- 1000
x <- rnorm(n = n, mean = 50, sd = 10)
e <- rnorm(n = n)
b0 <- 10
b1 <- 0
y <- b0 + b1 * x + e
```

We create a dataset and a model matrix too:

```
df <- data.frame(y, x)
X <- model.matrix(y ~ x, data = df)
head(X)
```

```
(Intercept) x
1 1 63.70958
2 1 44.35302
3 1 53.63128
4 1 56.32863
5 1 54.04268
6 1 48.93875
```

Now we can use the `MakeADFun()`

from {TMB} to construct the R object that brings all of this together:

```
f <- MakeADFun(
data = list(X = X, Y = y),
parameters = list(b = rep(0, ncol(X)), logSigma = 0),
silent = TRUE
)
```

This passes the data `X`

and `y`

and sets the default values of the model parameters to zeros for all regression coefficients (`b0`

and `b1`

in this case) and for the log of the standard deviation of the residual errors.

We can now easily get the value of the (negative) log-likelihood function at the default parameters, gradients, and the Hessian matrix:

`f$fn(x = f$par)`

`[1] 51351.44`

`f$gr(x = f$par)`

```
[,1] [,2] [,3]
[1,] -9994.682 -497251.6 -99865.01
```

`f$he(x = f$par)`

```
[,1] [,2] [,3]
[1,] 1000.00 49741.76 19989.36
[2,] 49741.76 2574646.64 994503.22
[3,] 19989.36 994503.22 201730.02
```

The cool part is, we didn’t even have to define analytical formulae for gradients and the Hessian and we got it *for free* with automatic differentiation! Now all we have to do is to pass the negative log-likelihood function and the gradients to e.g. `optim()`

and that’s all:

```
fit <- optim(par = f$par, fn = f$fn, gr = f$gr, method = "L-BFGS-B")
fit
```

```
$par
b b logSigma
9.945838718 0.000981937 -0.014587405
$value
[1] 1404.351
$counts
function gradient
35 35
$convergence
[1] 0
$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
```

If we exponentiate the value of `log(Sigma)`

, we obtain the standard deviation of the residuals on the proper scale: 0.9855. Compare this with the true values (`b0`

= 10, `b1`

= 0, `log(Sigma)`

= 0) and with the results of the least squares estimator:

`summary(lm(y ~ x, data = df))`

```
Call:
lm(formula = y ~ x, data = df)
Residuals:
Min 1Q Median 3Q Max
-2.9225 -0.6588 -0.0083 0.6628 3.5877
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.9458454 0.1579727 62.959 <2e-16 ***
x 0.0009818 0.0031133 0.315 0.753
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9865 on 998 degrees of freedom
Multiple R-squared: 9.964e-05, Adjusted R-squared: -0.0009023
F-statistic: 0.09945 on 1 and 998 DF, p-value: 0.7526
```

We are getting really close here!

Let’s now generalise this to multiple covariates, say 4 normally-distributed covariates:

```
set.seed(42)
n <- 1000
x1 <- rnorm(n = n, mean = 10, sd = 10)
x2 <- rnorm(n = n, mean = 20, sd = 10)
x3 <- rnorm(n = n, mean = 30, sd = 10)
x4 <- rnorm(n = n, mean = 40, sd = 10)
e <- rnorm(n = n)
b0 <- 10
b1 <- 1
b2 <- 2
b3 <- 3
b4 <- 4
y <- b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + e
df <- data.frame(y, x1, x2, x3, x4)
X <- model.matrix(y ~ x1 + x2 + x3 + x4, data = df)
head(X)
```

```
(Intercept) x1 x2 x3 x4
1 1 23.709584 43.25058 32.505781 33.14338
2 1 4.353018 25.24122 27.220760 32.07286
3 1 13.631284 29.70733 12.752643 35.92996
4 1 16.328626 23.76973 9.932951 28.51329
5 1 14.042683 10.04067 17.081917 51.15760
6 1 8.938755 14.02517 33.658382 31.20543
```

All we have to do is re-create the `f`

object using the `MakeADFun()`

function and the new data:

```
f <- MakeADFun(
data = list(X = X, Y = y),
parameters = list(b = rep(0, ncol(X)), logSigma = 0),
silent = TRUE
)
```

Let’s now fit this with `optim()`

and compare with the least squares estimator:

`optim(par = f$par, fn = f$fn, gr = f$gr, method = "L-BFGS-B")`

```
$par
b b b b b logSigma
9.90325169 0.99992371 2.00125295 2.99795882 4.00296201 0.01661648
$value
[1] 1435.576
$counts
function gradient
103 103
$convergence
[1] 0
$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
```

`summary(lm(y ~ x1 + x2 + x3 + x4, data = df))`

```
Call:
lm(formula = y ~ x1 + x2 + x3 + x4, data = df)
Residuals:
Min 1Q Median 3Q Max
-3.3354 -0.7382 0.0161 0.6893 3.0315
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.903788 0.177893 55.67 <2e-16 ***
x1 0.999929 0.003218 310.72 <2e-16 ***
x2 2.001249 0.003277 610.69 <2e-16 ***
x3 2.997959 0.003134 956.53 <2e-16 ***
x4 4.002949 0.003266 1225.83 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.019 on 995 degrees of freedom
Multiple R-squared: 0.9997, Adjusted R-squared: 0.9997
F-statistic: 7.294e+05 on 4 and 995 DF, p-value: < 2.2e-16
```

Very close once again, and if you noticed, *we didn’t have to change a single bit of code from the first example with a single covariate*. Isn’t that cool?

*I know, I know, maybe it’s cool for a specific type of person only, but hey…*

Another nice thing is that given that we have gradients (in compiled code!), maximising the likelihood will be faster and generally more accurate if using a method that relies on gradients, such as the limited-memory modification of the BFGS quasi-Newton method that we chose here. See the following benchmark as a comparison:

```
library(bench)
bench::mark(
"with gradients" = optim(par = f$par, fn = f$fn, gr = f$gr, method = "L-BFGS-B"),
"without gradients" = optim(par = f$par, fn = f$fn, method = "L-BFGS-B"),
iterations = 20,
relative = TRUE,
check = FALSE
)
```

```
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 with gradients 1 1 3.85 1 NaN
2 without gradients 3.89 3.83 1 1 Inf
```

In this case (a small and simple toy problem) the optimisation using gradients is ~4 times faster (on my quad-core MacBook Pro with 16 Gb of RAM). And all of that for free, just using automatic differentiation!

Remember that, if not provided with a gradient function, `optim()`

will use finite differences to calculate the gradients numerically. Then, the benchmark above gives us a direct comparison with numerical differentiation too.

To wrap up, hats off to the creators of the {TMB} package for providing such an easy and powerful framework, and I didn’t even scratch the surface here: this is just a quick and dirty toy example with multiple linear regression, which doesn’t event needs maximum likelihood.

Anyway, I’m definitely going to come back to automatic differentiation and {TMB} in future blog posts, stay tuned for that. Cheers!

*Update:* Benjamin Christoffersen shared on Twitter a link to a thread on Cross Validated where the differences between maximum likelihood and least squares are discussed in more detail. It’s a very interesting read, remember to check it out!

Let’s get straight to business: this is a short post to announce that new releases of {rsimsum} and {KMunicate} just landed on CRAN! {rsimsum} is now at version 0.9.1, while {KMunicate} is now at version 0.1.0.

Both are mostly maintenance releases: some small bugs have been squashed, and new (hopefully useful) customisation options have been added to {KMunicate}; some typos in the documentation have been fixed too. If you want to read more, all details can be found in the NEWS files of {rsimsum} and {KMunicate}.

If you have either package already installed, just use `update.packages()`

to obtain the new version; otherwise, `install.packages("rsimsum")`

and `install.packages("KMunicate")`

will do the trick. As always, feedback on the new releases is very much appreciated.

That’s all: I promised this was going to be short, and less than 150 words later I believe I delivered. But don’t worry, I’ll be back soon (*-ish*) with a follow-up on my post on academic conferences in the year 2020 after a whole season of remote conferences. In the meanwhile, take care and be safe!

{KMunicate} is now on CRAN, and the development version lives on my GitHub profile. You can install the CRAN version as usual:

`install.packages("KMunicate")`

Alternatively, you can install the dev version of {KMunicate} from GitHub with:

```
# install.packages("devtools")
devtools::install_github("ellessenne/KMunicate-package")
```

KMunicate-style Kaplan-Meier plots include confidence intervals for each fitted curve and an extended table beneath the main plot including the number of individuals at risk at each time and the cumulative number of events and censoring events. Here’s an example from the KMunicate study itself:

As you might imagine, that’s quite a lot of work to produce a plot like that.

The {KMunicate} package lets you create such a plot with a *single line of code*. Isn’t that great?

Let’s illustrate the basic functionality of {KMunicate} with an example. We’ll be using once again data from the German breast cancer study, which is conveniently bundled with {KMunicate}:

```
data(brcancer, package = "KMunicate")
head(brcancer)
```

```
id hormon x1 x2 x3 x4 x5 x6 x7 rectime censrec x4a x4b x5e
1 1 0 70 2 21 2 3 48 66 1814 1 1 0 0.69767630
2 2 1 56 2 12 2 7 61 77 2018 1 1 0 0.43171051
3 3 1 58 2 35 2 9 52 271 712 1 1 0 0.33959553
4 4 1 59 2 17 2 4 60 29 1807 1 1 0 0.61878341
5 5 0 73 2 35 2 1 26 65 772 1 1 0 0.88692045
6 6 0 32 1 57 3 24 0 13 448 1 1 1 0.05613476
```

The survival time is in `rectime`

, and the event indicator variable is `censrec`

; the treatment variable is `hormon`

, a binary covariate.

First, we fit the survival curve by treatment arm using the Kaplan-Meier estimator:

```
library(survival)
fit <- survfit(Surv(rectime, censrec) ~ hormon, data = brcancer)
fit
```

```
Call: survfit(formula = Surv(rectime, censrec) ~ hormon, data = brcancer)
n events median 0.95LCL 0.95UCL
hormon=0 440 205 1528 1296 1814
hormon=1 246 94 2018 1918 NA
```

The plot that can be obtained via the `plot`

method is ok but needs a bit of work to be good enough for a publication. For instance, this is the default:

`plot(fit)`

*No bueno*, right? Let’s improve it a bit:

```
plot(fit, col = 1:2, lty = 1:2, conf.int = TRUE)
legend("bottomleft",
col = 1:2, lty = 1:2,
legend = c("Control", "Treatment"), bty = "n"
)
```

This is better, but still not great: the area defined by the confidence intervals is not shaded, and there is still no risk table.

Here’s when the {KMunicate} package comes to the rescue. First, we need to define the breaks for the x-axis; the risk table with be computed at those breaks. Say we want breaks every year:

```
time_breaks <- seq(0, max(brcancer$rectime), by = 365)
time_breaks
```

`[1] 0 365 730 1095 1460 1825 2190 2555`

Then, all we have to do is to

```
library(ggplot2)
library(KMunicate)
KMunicate(fit = fit, time_scale = time_breaks)
```

*Easy peasy!*

We might want to get proper arm labels too:

```
brcancer$hormon <- factor(brcancer$hormon, levels = 0:1, labels = c("Control", "Treatment"))
fit <- survfit(Surv(rectime, censrec) ~ hormon, data = brcancer)
KMunicate(fit = fit, time_scale = time_breaks)
```

Nice. Next, we’ll show how to customise the plot.

First, we might want to customise colours to use a colour-blind friendly palette via the `.color_scale`

and `.fill_scale`

arguments:

```
KMunicate(
fit = fit,
time_scale = time_breaks,
.color_scale = ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2"),
.fill_scale = ggplot2::scale_fill_brewer(type = "qual", palette = "Dark2")
)
```

Then, we might want to use a custom font, such as my latest obsession Victor Mono, via the `.ff`

argument:

```
KMunicate(
fit = fit,
time_scale = time_breaks,
.color_scale = ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2"),
.fill_scale = ggplot2::scale_fill_brewer(type = "qual", palette = "Dark2"),
.ff = "Victor Mono"
)
```

Finally, we customise the overall theme using e.g. `theme_minimal`

from the {ggplot2} package:

```
KMunicate(
fit = fit,
time_scale = time_breaks,
.color_scale = ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2"),
.fill_scale = ggplot2::scale_fill_brewer(type = "qual", palette = "Dark2"),
.ff = "Victor Mono",
.theme = ggplot2::theme_minimal(base_family = "Victor Mono")
)
```

When overriding the default theme, we need to re-define the font for the main plot using the `base_family`

argument of a `theme_*`

component. Overall, I think this is a much better plot!

The final step consists of exporting a plot for later use, e.g. in manuscripts or presentations. That’s straightforward, being the output of `KMunicate()`

a `ggplot2`

-type object: all we have to do is use the `ggplot2::ggsave`

function, e.g. in the next block of code.

```
p <- KMunicate(
fit = fit,
time_scale = time_breaks,
.color_scale = ggplot2::scale_colour_brewer(type = "qual", palette = "Dark2"),
.fill_scale = ggplot2::scale_fill_brewer(type = "qual", palette = "Dark2"),
.ff = "Victor Mono",
.theme = ggplot2::theme_minimal(base_family = "Victor Mono")
)
ggplot2::ggsave(p, filename = "export.png", height = 6, width = 6, dpi = 300)
```

Further details on {KMunicate} can be found on its website, with more examples and a better explanation of the different arguments and customisation options. Let me know if you find the package useful, and if you find any bug (I’m sure there’ll be some) please file an issue on GitHub.

And what about Tim’s challenge that led to the inception of {KMunicate}, you might ask? Well, I got myself a beautiful hand-crafted wooden spoon:

Isn’t that great!?

As I mentioned in the previous blog post, I have been working on a new section will a list of talks I have given in the past few years, including slides (whenever possible).

Well, that section now live! The cool thing is that it builds dynamically using Hugo content, and it can easily be updated by simply adding a new markdown file with the appropriate YAML header. I think the overly popular Academic theme uses a similar approach for the various kind of contents it supports, but I am not sure.

I had been thinking about this for a while, and I finally decided to remove all analytics from all the websites I run: basically, this website and the `pkgdown`

websites for the {rsimsum} and {comorbidity} packages.

I mean, it’s cool to see how many people visit each every month, and I was flattered to see people from all over the world accessing my website… but I don’t really need any of that data (funny thing, coming from a *data all the things* kind of person). You don’t need further tracking while you visit the web, that for sure.

Incidentally, I recently bought a Raspberry Pi Zero W on which I am now running Pi-hole, a DNS sinkhole that blocks an incredible amount of junk at a network level. I mean, look at all that crap:

… and that’s just over the past 24 hours!

An additional bonus is *less* JavaScript to load, which means that… the website is even faster! Yeah, I know, it must be an ongoing joke by now.

I want to play around with the Raspberry Pi more, so I will be probably writing about it more. It’s a fun, tiny little computer that can do all sorts of computer stuff — and if you didn’t know, I have an irrational obsession for tiny and cute little computers!

I will keep it short: this is all for now, see you next week for a new episode of *blog updates weekly*, a new series on Netflix. Just kidding, Netflix is totally not paying for this… right? Happy to be proved wrong here though 😂 Cheers!

Notice anything different around here? Yes, your eyes are not failing you… we have a new shiny theme!

I wrote a little bit about the design of the original theme when introducing the blog, but after a little while, I wasn’t really happy with it. I mean, it was quite nice (I think) and performed decently, but I really, really wanted to build something from scratch. Let’s call it my own kind of *social distancing*…

That’s why a few weeks ago I started refactoring the website: I branched out and started building a new Hugo theme from scratch. The design is really similar to the previous one, with the largest difference being the typography: I wanted to reduce the overhead that comes with importing web fonts, so I went for a system font stack. This will *not* look the same everywhere, but it will be faster and (most importantly) it will feel *native* on every device.

I also decided to use Bulma.io to build the website: that was loads of fun (🙃), and I think I can finally say that I get CSS frameworks. It’s still a quite heavy dependency (as Bootstrap was), but there are options to stip it down to the bare essentials to increase performance; I started having a look into it, and I might end up fiddling more around that idea…

So, what’s changed around here?

In practice, not much: the navbar is now a proper navbar, and (as I mentioned before) I abandoned my beloved Noto Serif (I’ll find a good use for you, one day!).

The colours have also been slightly tweaked, as I have been reading a bit more about the science of colour contrast and good practices in web design. Unsurprisingly, colour science is a whole lot of science… but nevertheless, the whole website should *feel* similar. No drastic changes here.

On the top left corner (and in the tab bar on your browser) you’ll notice we have a new logo/favicon! The icon is made by Vectors Market from www.flaticon.com, and I customised it to match the theme of the website. Moreover, it should now look better on mobile devices.

I am also now using Hugo’s default syntax highlighting engine Chroma instead of the `highlight.js`

library; Chroma is server-side and blazing fast, and we don’t have to import unnecessary JavaScript anymore. Yeah, most JavaScript is now gone, so performance should be even better. This also allowed me to set up a different highlighting style for *light* and *dark* mode: GitHub and Dracula, respectively. This should match the overall style of the website a bit better. The big downside of Chroma is that it doesn’t support Stata syntax highlighting (yet), but I guess I can live with that.

I also took the opportunity to start implementing a new section of the website collecting a list of talks I have given in the past few years, ever since I started my PhD basically. The whole thing is built on-the-go using Hugo’s templates and a specific content type, which was quite cool (and fun) to implement. It is not online yet, as I have to figure out some issues with Netlify/Hugo messing up HTML slides… but it’s *coming soon*™!

This is pretty much it… not a lot has changed on the front, but under the hood, well, I let you be the judge:

I would call that *a decent amount of changes* if you ask me.

As always, let me know if you spot something that looks odd and take care!

]]>