Introduction;
Planning a simulation study;
Coding a simulation study;
Analysing a simulation study;
Reporting and visualising results;
Wrap-up.
Principal Statistical Methodologist at Red Door Analytics, here in Stockholm;
PhD in Biostatistics from the University of Leicester, UK;
Affiliated researcher at the Department of Medical Epidemiology and Biostatistics at Karolinska Institutet;
Co-chair of the openstatsware working group, recognised by both ASA BIOP and EFSPI;
Associate Editor for the journal Biostatistics.
Note that, from now on, I will be using the terms Monte Carlo simulation study and simulation study interchangeably.
In a simulation study, we design an experiment in which we:
1. Generate data from a known distribution/model (so that we know the truth);
2. Analyse the data;
3. Compare the analysis results with the truth.
We (generally) repeat steps 1-3 B times;
Finally, after running the experiment B times, we collect and analyse the results to draw our conclusions.
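A minimal sketch of this experiment in Python; the data-generating model, the analysis step, and the true value below are placeholders for illustration only:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility
B = 1000           # number of repetitions of the experiment
theta_true = 5.0   # the known truth (here: a population mean)
n = 50             # observations per simulated dataset

estimates = np.empty(B)
for b in range(B):
    # Step 1: generate data from a known model
    y = rng.normal(loc=theta_true, scale=2.0, size=n)
    # Step 2: analyse the data (here: estimate the mean)
    estimates[b] = y.mean()

# Step 3, over all repetitions: compare results with the truth
bias = estimates.mean() - theta_true
print(f"Estimated bias: {bias:.4f}")
```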
Monte Carlo simulation studies provide an invaluable tool for statistical and biostatistical research — and are great for learning/understanding.
They can help to answer questions such as:
Is an estimator biased in a finite sample?
Do confidence intervals for a given parameter achieve the desired nominal level of coverage?
How does a newly developed method compare to an established one?
What is the power to detect a desired effect size under complex experimental settings and analysis methods?
…you name it.
Monte Carlo standard errors of performance measures are often not computed/reported;
Reproducibility of results;
Dissemination of results.
As statisticians, we should follow our own advice (and could do better) when it comes to the design, reporting, and reproducibility of our own experiments.
The problem is that the average statistical training does not teach enough about practical aspects of simulation studies.
A is for Aims
D is for Data-generating mechanisms
E is for Estimands
M is for Methods
P is for Performance measures
I is for Implementation
Carefully thinking and planning a simulation study makes your life easier (and more reproducible) in the long run.
Before starting with a simulation study, we need to know what we want to learn. For instance:
Whether/when an estimation procedure converges?
Consistency? (large-sample properties)
Small-sample issues?
Correct variance formula? (e.g. when derived from asymptotic theory)
Comparison with other approaches?
Robustness to model misspecification?
The data-generating mechanisms [DGM] are processes that we assume generate the data under study. For instance, the process underlying a clinical trial (eligibility, randomisation, etc.) or an observational study.
Defining each DGM is aim-specific, but we often vary several factors at once: sample size, true value of the estimand of interest, additional DGM-related parameters, …
We can vary them factorially, or use e.g. fractional factorial designs for greater efficiency: we do need to be pragmatic, after all.
The estimand is what we are estimating, the target of our inference.
It is typically a parameter (or a function of parameters), but it might also be a summary measure (e.g. predictive performance) or a null hypothesis.
For instance, the estimand of interest in a simulation study mimicking a clinical trial might be the treatment effect.
This depends once again on the aims of the simulation study, hence the importance of defining that from the start.
Methods are often what we are studying, e.g. a newly-developed method or a method that we aim to apply in unusual settings.
A comparator, e.g. a gold-standard method, is included more often than not; however, it is not always necessary to include one.
It is also recommended to include familiar methods as a check.
Let’s define the chosen estimand as \(\theta\), which is estimated by a method as \(\hat{\theta}\) with standard error \(\widehat{\text{se}}(\hat{\theta})\). Performance measures of interest might target:
Properties of the estimator, such as bias, empirical standard error, mean squared error;
Properties of the standard error of the estimator, such as average model-based standard error;
Properties of a confidence interval for \(\hat{\theta}\), such as coverage probability and power;
Convergence rates or computational speed.
Bias quantifies how far the estimator is, on average, from the true value \(\theta\): \(\text{bias} = E(\hat{\theta}) - \theta\).
For instance, the maximum likelihood estimator is asymptotically unbiased.
The empirical standard error describes the precision of an estimator, and is defined as \(\sqrt{\text{var}(\hat{\theta})}\).
All else being equal, we prefer more precise estimators, i.e. those with lower variance.
The mean squared error (MSE) combines bias and emp. SE in an overall performance measure
It is defined as \(\text{MSE} = \text{bias}^2 + (\text{emp. SE})^2\).
Coverage probability of a confidence interval is defined as the proportion of confidence intervals containing the true value: \(P(\hat{\theta}_{\text{low}} \le \theta \le \hat{\theta}_{\text{upp}})\).
For a 95% CI, this probability should be 95%; poor coverage might arise because of (1) a biased estimator, (2) biased model-based SEs, or (3) poor distributional assumptions.
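As a sketch, these performance measures can be computed in Python from the vectors of estimates and standard errors returned by a simulation; here both are mocked up by drawing \(\hat{\theta}\) directly from an assumed normal sampling distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
B = 10_000
theta = 2.0   # true value of the estimand
sd = 0.5      # true sampling SD of the estimator

# Stand-ins for real simulation output: B estimates and model-based SEs
theta_hat = rng.normal(theta, sd, size=B)
se_hat = np.full(B, sd)  # here the model-based SE happens to be correct

bias = theta_hat.mean() - theta
emp_se = theta_hat.std(ddof=1)
mse = bias**2 + emp_se**2
# Coverage of 95% Wald-type confidence intervals
low, upp = theta_hat - 1.96 * se_hat, theta_hat + 1.96 * se_hat
coverage = np.mean((low <= theta) & (theta <= upp))

print(f"bias={bias:.3f} empSE={emp_se:.3f} MSE={mse:.3f} coverage={coverage:.3f}")
```

With a correctly specified standard error, as in this mock-up, coverage should land close to the nominal 95%.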
Implementation covers some of the very practical details of simulation studies:
Number of repetitions (not the number of observations in the simulated data!);
Software package;
Computational details.
We have a study which can be summarised by the following DAG:
Where \(X\) is the exposure, \(Y\) the outcome, and \(Z\) a confounder.
Note that the exposure and the outcome are independent: there is no edge between \(X\) and \(Y\).
The aim of this simulation study is pedagogical.
We know that there should be no association between \(X\) and \(Y\) once we adjust for \(Z\) in our analysis.
However:
Is it actually true that once we adjust for \(Z\) we observe no association between \(X\) and \(Y\)?
What happens if we fail to adjust for a \(Z\) in our analysis?
\(X\) is a binary treatment variable, \(Z\) is a continuous confounder, and \(Y\) is a continuous outcome;
\(Z \sim N(0, 1)\);
\(X\) depends on \(Z\) according to a logistic model: \[ \log \frac{P(X = 1 \mid Z)}{P(X = 0 \mid Z)} = \gamma_0 + \gamma_1 Z \]
\(Y\) depends on \(Z\) according to a linear model: \[ Y = \alpha_0 + \alpha_1 Z + \varepsilon \] with \(\varepsilon \sim N(0, 1)\).
We fix the parameters \(\gamma_0 = 1\), \(\gamma_1 = 3\), \(\alpha_0 = 10\).
For \(\alpha_1\), we actually simulate two scenarios:
\(\alpha_1 = 5\): in this scenario (DGM = 1), \(Z\) is actually a confounder;
\(\alpha_1 = 0\): in the second scenario (DGM = 2), we remove the edge between \(Z\) and \(Y\) as well. Here \(Z\) is not a confounder anymore.
Finally, we generate \(N = 200\) independent subjects for each simulated dataset.
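This DGM can be sketched in Python as a small generator function (the function name is my own choice, not part of the study):

```python
import numpy as np

def simulate_data(n=200, alpha1=5.0, rng=None):
    """One dataset from the example DGM.

    Z ~ N(0, 1); X | Z from a logistic model with gamma0 = 1, gamma1 = 3;
    Y = alpha0 + alpha1 * Z + eps with alpha0 = 10, eps ~ N(0, 1).
    There is no direct effect of X on Y, so the true beta1 = 0.
    """
    rng = rng if rng is not None else np.random.default_rng()
    z = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(-(1.0 + 3.0 * z)))  # P(X = 1 | Z)
    x = rng.binomial(1, p)
    y = 10.0 + alpha1 * z + rng.normal(size=n)
    return x, y, z

# DGM 1 (Z is a confounder); pass alpha1=0.0 for DGM 2
x, y, z = simulate_data(rng=np.random.default_rng(2024))
```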
We have a single estimand of interest here, the regression coefficient \(\beta_1\) (which we define how to estimate in the next slide) that quantifies the association between \(X\) and \(Y\).
Under certain structural assumptions (see e.g. John et al. (2019), DOI: 10.1186/s12874-019-0858-x), the regression coefficient \(\beta_1\) can be interpreted as the average causal effect (ACE) of the exposure \(X\) on the outcome \(Y\).
Remember that, according to our data-generating mechanisms, we expect no association between \(X\) and \(Y\).
We estimate \(\beta_1\) using linear regression. Specifically, we fit the following two models:
A model that fails to adjust for the observed confounder \(Z\) (model 1): \[ Y = \beta_0 + \beta_1 X + \varepsilon \]
A model that properly adjusts for \(Z\) (model 2): \[ Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon \]
The key performance measure of interest is bias, defined as \[ E(\hat{\beta}_1) - \beta_1 = E(\hat{\beta}_1), \] since (recall our DGMs) the true \(\beta_1 = 0\).
We might also be interested in the rejection rate of a significance test for \(\beta_1\) at the \(\alpha = 0.05\) level, i.e. a test of the null hypothesis \(\beta_1 = 0\). Since the true \(\beta_1 = 0\) under both DGMs, this rejection rate is the empirical type I error rate rather than power.
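Putting DGM 1, both models, and these performance measures together, a self-contained Python sketch (plain least squares with Wald tests; B is kept modest here for speed):

```python
import numpy as np

def ols(X, y):
    """Least-squares fit; returns coefficients and their standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

rng = np.random.default_rng(2023)
B, n = 2000, 200
est = np.empty((B, 2))  # beta1-hat under model 1 and model 2
rej = np.empty((B, 2))  # rejection of H0: beta1 = 0 at alpha = 0.05

for b in range(B):
    # DGM 1: Z is a genuine confounder (alpha1 = 5)
    z = rng.normal(size=n)
    x = rng.binomial(1, 1 / (1 + np.exp(-(1 + 3 * z))))
    y = 10 + 5 * z + rng.normal(size=n)
    designs = [np.column_stack([np.ones(n), x]),      # model 1: Y ~ X
               np.column_stack([np.ones(n), x, z])]   # model 2: Y ~ X + Z
    for m, X in enumerate(designs):
        beta, se = ols(X, y)
        est[b, m] = beta[1]
        rej[b, m] = abs(beta[1] / se[1]) > 1.96  # Wald test for beta1 = 0

bias = est.mean(axis=0)       # true beta1 = 0, so bias = mean estimate
rejection = rej.mean(axis=0)  # empirical type I error rate
print("bias:", bias, "rejection rate:", rejection)
```

The unadjusted model 1 should show substantial bias and reject almost always, while the adjusted model 2 should be approximately unbiased with a rejection rate near the nominal 0.05.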
The rationale for picking the number of replications is that we need enough to estimate the key performance measures with good precision.
Think of this as having a large enough dataset for an epidemiological study, or power calculations for a clinical trial. It is roughly the same concept.
Our example is simple so we don’t need to worry too much about computational time. We can then run a large number of replications (\(B = 10,000\) or more).
In general, choosing the number of replications \(B = n_{\text{sim}}\) is not trivial: it requires a tradeoff between precision and computational effort.
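One common rule of thumb: the Monte Carlo SE of the estimated bias is \(\text{SD}(\hat{\theta}) / \sqrt{B}\), so given an anticipated SD of the estimator and a target MCSE we can solve for \(B\). A sketch in Python; the numbers are illustrative assumptions:

```python
# MCSE(bias) = sd_theta_hat / sqrt(B)  =>  B = (sd_theta_hat / target_mcse)^2
sd_theta_hat = 0.5    # anticipated SD of the estimator (an assumption)
target_mcse = 0.005   # desired Monte Carlo SE for the bias estimate

B = round((sd_theta_hat / target_mcse) ** 2)
print(B)  # -> 10000
```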
Here is an example from a paper we recently published:
We should also list all practical software details that are required to replicate the simulation:
Software package: R, Stata, etc.
User-written commands?
Arguments of functions or of estimation algorithms?
Numerical precision?
Key (and sometimes implicit) assumptions when implementing a method?
… and so on.
The most important thing to remember is that there is no right way of coding a simulation study.
However, your code should be:
Reproducible;
Immune to the bus factor;
Openly available (as much as possible).
The analysis of a simulation study is not that different from the analysis of a clinical study. You often start with descriptive statistics:
Inspect your data,
Compute summary statistics (mean, median, etc.),
Create some descriptive plots e.g. to check for outliers.
Then, you can move onto estimating the performance measures of interest: bias, coverage, etc.
Now, if you feel fancy, you can go into regression modelling, but that’s a much less established practice.
There are potentially four dimensions (DGMs × estimands × methods × performance measures), so complexity can grow quickly.
You may want to think about:
Which is the comparison of interest?
Across DGMs or across methods?
Can my reporting strategy effectively tell the story of the study?
As with coding before, there is no right way to do this.
Generally, I recommend using plots rather than tables, as tables can easily become daunting.
See the supplementary material for the paper titled Impact of model misspecification in shared frailty survival models by Gasparini et al., available online at:
Warning: it’s not pretty!
With large simulation studies, we really want/need interactive graphics that allow the user to control what is displayed (or not).
Furthermore, more info (e.g., actual estimated values) could appear when hovering over plots.
Frameworks that let you do that are (among others) d3.js, Shiny, Plotly, etc.
We actually did develop and release an interactive Shiny app that does some of that:
GitHub repository: ellessenne/interest
Paper: DOI: 10.52933/jdssv.v1i4.9
Compared to, e.g., clinical trials or observational studies, simulation studies offer one of the most straightforward opportunities for replication.
Therefore:
Always write up your study assuming that other people will try to replicate it. Many actually do: see, e.g., the RepliSims Project (DOI: 10.1098/rsos.231003). The ADEMPI framework can help with this.
Another suggestion is to release the full code, e.g. on GitHub or other repositories. This allows direct replicability (and reproducibility!), and forces you to check (and organise, and document) your code thoroughly. Some journals even require you to share code nowadays.
Heinze et al. proposed a framework for thinking about how methodological work contributes to the evidence base for a method.
Is a method trustworthy enough to be used in practice?
Similar to the well-known phases of drug development.
Aims and simulation design can depend on the phase of research: scope, finite vs large sample, etc.
Recommended reading, DOI: 10.1002/bimj.202200222
Start by planning your simulation study on paper;
Structure and describe your study assuming that other people will try to reproduce it;
Build your code slowly, start small and only then generalise/expand it;
Report Monte Carlo standard errors to quantify uncertainty;
Make sure you’re selling the story of your study well, with engaging and effective visualisations and reporting.
…yes! Let’s recap some key uses:
Invaluable tool for learning (and teaching!) statistics;
Helpful to test/challenge your understanding of methods;
Helpful to test/evaluate new methods you develop;
Helpful to compare alternative methods;
Helpful to assess performance when key assumptions are violated;
…and many more.
All material you have seen today is available online.
Slides: ellessenne.xyz/talks