Multiple linear regression with {TMB}

· 9 minutes read

I have been recently reading about the TMB framework to implement statistical models using automatic differentiation (and with a bunch of nice built-in features).

Automatic differentiation (in brief) is an algorithmic method that automagically and efficiently yields accurate derivatives of a given value. This is very interesting, as by using automatic differentiation we can easily get gradients and the Hessian matrix. Which are extremely useful when fitting models using the maximum likelihood method! On top of that, automatic differentiation is generally more efficient and accurate than symbolic or numerical differentiation.

There’s a bunch of examples of using {TMB} in practice, some are very simple and straightforward to follow, some are much more complex. I am just getting started, so I went straight to the linear regression example on the TMB documentation webpage. The C++ template for that simple model is:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
#include <TMB.hpp>
template<class Type>
Type objective_function<Type>::operator() ()
{
  DATA_VECTOR(Y);
  DATA_VECTOR(x);
  PARAMETER(a);
  PARAMETER(b);
  PARAMETER(logSigma);
  ADREPORT(exp(2*logSigma));
  Type nll = -sum(dnorm(Y, a+b*x, exp(logSigma), true));
  return nll;
}

…which can be compiled from within R and passed to any general-purpose optimiser (such as optim) to obtain maximum likelihood estimates for a linear regression model.

It is however interesting to generalise this to any number of covariates, using matrix-by-array multiplication to efficiently scale up any problem. This is relevant to implement general statistical modelling packages. Surprisingly, I had to fiddle around way more than I expected to do that: therefore, I thought about writing a blog post (which hopefully could be useful to others trying to get started with TMB)!

Disclaimer: I am no {TMB} nor C++ expert, so forgive me if I am missing something and let me know if there’s anything that needs fixing here.

So, let’s get started. We first need to write our C++ function for the negative log-likelihood function, which is a simple adaptation of the template we saw before for a single-covariate model:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
#include <TMB.hpp>
template<class Type>
Type objective_function<Type>::operator() ()
{
  DATA_VECTOR(Y);
  DATA_MATRIX(X);
  PARAMETER_VECTOR(b);
  PARAMETER(logSigma);
  Type nll = sum(dnorm(Y, X*b , exp(logSigma), true));
  return -nll;
}

Here we needed to change X to a DATA_MATRIX(), and b to a PARAMETER_VECTOR. Note that X here should be the design matrix of the model, e.g. obtained using the model.matrix() function in R, and that the matrix-array multiplication in C++ X*b is not the element-wise operation, and is equivalent to X %*% b in R. This is it: easy right?

Let’s now verify that this works as expected. We can compile the C++ function and dynamically link it using the tools from the {TMB} package (assuming model.cpp contains the function defined above):

1
2
3
library(TMB)
compile("model.cpp")
dyn.load(dynlib("model"))

We need to simulate some data now, e.g. with a single covariate for simplicity:

1
2
3
4
5
6
7
set.seed(42)
n <- 1000
x <- rnorm(n = n, mean = 50, sd = 10)
e <- rnorm(n = n)
b0 <- 10
b1 <- 0
y <- b0 + b1 * x + e

We create a dataset and a model matrix too:

1
2
3
df <- data.frame(y, x)
X <- model.matrix(y ~ x, data = df)
head(X)
##   (Intercept)        x
## 1           1 63.70958
## 2           1 44.35302
## 3           1 53.63128
## 4           1 56.32863
## 5           1 54.04268
## 6           1 48.93875

Now we can use the MakeADFun() from {TMB} to construct the R object that brings all of this together:

1
2
3
4
5
f <- MakeADFun(
  data = list(X = X, Y = y),
  parameters = list(b = rep(0, ncol(X)), logSigma = 0),
  silent = TRUE
)

This passes the data X and y and sets the default values of the model parameters to zeros for all regression coefficients (b0 and b1 in this case) and for the log of the standard deviation of the residual errors.

We can now easily get the value of the (negative) log-likelihood function at the default parameters, gradients, and the Hessian matrix:

1
f$fn(x = f$par)
## [1] 51351.44
1
f$gr(x = f$par)
##           [,1]      [,2]      [,3]
## [1,] -9994.682 -497251.6 -99865.01
1
f$he(x = f$par)
##          [,1]       [,2]      [,3]
## [1,]  1000.00   49741.76  19989.36
## [2,] 49741.76 2574646.64 994503.22
## [3,] 19989.36  994503.22 201730.02

The cool part is, we didn’t even have to define analytical formulae for gradients and the Hessian and we got it for free with automatic differentiation! Now all we have to do is to pass the negative log-likelihood function and the gradients to e.g. optim() and that’s all:

1
2
fit <- optim(par = f$par, fn = f$fn, gr = f$gr, method = "L-BFGS-B")
fit
## $par
##            b            b     logSigma 
##  9.945838718  0.000981937 -0.014587405 
## 
## $value
## [1] 1404.351
## 
## $counts
## function gradient 
##       35       35 
## 
## $convergence
## [1] 0
## 
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

If we exponentiate the value of log(Sigma), we obtain the standard deviation of the residuals on the proper scale: 0.9855. Compare this with the true values (b0 = 10, b1 = 0, log(Sigma) = 0) and with the results of the least squares estimator:

1
summary(lm(y ~ x, data = df))
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9225 -0.6588 -0.0083  0.6628  3.5877 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.9458454  0.1579727  62.959   <2e-16 ***
## x           0.0009818  0.0031133   0.315    0.753    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9865 on 998 degrees of freedom
## Multiple R-squared:  9.964e-05,	Adjusted R-squared:  -0.0009023 
## F-statistic: 0.09945 on 1 and 998 DF,  p-value: 0.7526

We are getting really close here!

Let’s now generalise this to multiple covariates, say 4 normally-distributed covariates:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
set.seed(42)
n <- 1000
x1 <- rnorm(n = n, mean = 10, sd = 10)
x2 <- rnorm(n = n, mean = 20, sd = 10)
x3 <- rnorm(n = n, mean = 30, sd = 10)
x4 <- rnorm(n = n, mean = 40, sd = 10)
e <- rnorm(n = n)
b0 <- 10
b1 <- 1
b2 <- 2
b3 <- 3
b4 <- 4
y <- b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + e
df <- data.frame(y, x1, x2, x3, x4)
X <- model.matrix(y ~ x1 + x2 + x3 + x4, data = df)
head(X)
##   (Intercept)        x1       x2        x3       x4
## 1           1 23.709584 43.25058 32.505781 33.14338
## 2           1  4.353018 25.24122 27.220760 32.07286
## 3           1 13.631284 29.70733 12.752643 35.92996
## 4           1 16.328626 23.76973  9.932951 28.51329
## 5           1 14.042683 10.04067 17.081917 51.15760
## 6           1  8.938755 14.02517 33.658382 31.20543

All we have to do is re-create the f object using the MakeADFun() function and the new data:

1
2
3
4
5
f <- MakeADFun(
  data = list(X = X, Y = y),
  parameters = list(b = rep(0, ncol(X)), logSigma = 0),
  silent = TRUE
)

Let’s now fit this with optim() and compare with the least squares estimator:

1
optim(par = f$par, fn = f$fn, gr = f$gr, method = "L-BFGS-B")
## $par
##          b          b          b          b          b   logSigma 
## 9.90325169 0.99992371 2.00125295 2.99795882 4.00296201 0.01661648 
## 
## $value
## [1] 1435.576
## 
## $counts
## function gradient 
##      103      103 
## 
## $convergence
## [1] 0
## 
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
1
summary(lm(y ~ x1 + x2 + x3 + x4, data = df))
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3354 -0.7382  0.0161  0.6893  3.0315 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.903788   0.177893   55.67   <2e-16 ***
## x1          0.999929   0.003218  310.72   <2e-16 ***
## x2          2.001249   0.003277  610.69   <2e-16 ***
## x3          2.997959   0.003134  956.53   <2e-16 ***
## x4          4.002949   0.003266 1225.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.019 on 995 degrees of freedom
## Multiple R-squared:  0.9997,	Adjusted R-squared:  0.9997 
## F-statistic: 7.294e+05 on 4 and 995 DF,  p-value: < 2.2e-16

Very close once again, and if you noticed, we didn’t have to change a single bit of code from the first example with a single covariate. Isn’t that cool?

I know, I know, maybe it’s cool for a specific type of person only, but hey…

Another nice thing is that given that we have gradients (in compiled code!), maximising the likelihood will be faster and generally more accurate if using a method that relies on gradients, such as the limited-memory modification of the BFGS quasi-Newton method that we chose here. See the following benchmark as a comparison:

1
2
3
4
5
6
7
8
library(bench)
bench::mark(
  "with gradients" = optim(par = f$par, fn = f$fn, gr = f$gr, method = "L-BFGS-B"),
  "without gradients" = optim(par = f$par, fn = f$fn, method = "L-BFGS-B"),
  iterations = 20,
  relative = TRUE,
  check = FALSE
)
## # A tibble: 2 x 6
##   expression          min median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>        <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
## 1 with gradients     1      1         4.17         1      NaN
## 2 without gradients  4.35   4.15      1            1      Inf

In this case (a small and simple toy problem) the optimisation using gradients is ~4 times faster (on my quad-core MacBook Pro with 16 Gb of RAM). And all of that for free, just using automatic differentiation!

Remember that, if not provided with a gradient function, optim() will use finite differences to calculate the gradients numerically. Then, the benchmark above gives us a direct comparison with numerical differentiation too.

To wrap up, hats off to the creators of the {TMB} package for providing such an easy and powerful framework, and I didn’t even scratch the surface here: this is just a quick and dirty toy example with multiple linear regression, which doesn’t event needs maximum likelihood.

Anyway, I’m definitely going to come back to automatic differentiation and {TMB} in future blog posts, stay tuned for that. Cheers!

Update: Benjamin Christoffersen shared on Twitter a link to a thread on Cross Validated where the differences between maximum likelihood and least squares are discussed in more detail. It’s a very interesting read, remember to check it out!