Multiple linear regression with {TMB}
· 9 minutes read
I have been recently reading about the TMB framework to implement statistical models using automatic differentiation (and with a bunch of nice builtin features).
Automatic differentiation (in brief) is an algorithmic method that automagically and efficiently yields accurate derivatives of a given value. This is very interesting, as by using automatic differentiation we can easily get gradients and the Hessian matrix. Which are extremely useful when fitting models using the maximum likelihood method! On top of that, automatic differentiation is generally more efficient and accurate than symbolic or numerical differentiation.
There’s a bunch of examples of using {TMB} in practice, some are very simple and straightforward to follow, some are much more complex. I am just getting started, so I went straight to the linear regression example on the TMB documentation webpage. The C++ template for that simple model is:


…which can be compiled from within R and passed to any generalpurpose optimiser (such as optim
) to obtain maximum likelihood estimates for a linear regression model.
It is however interesting to generalise this to any number of covariates, using matrixbyarray multiplication to efficiently scale up any problem. This is relevant to implement general statistical modelling packages. Surprisingly, I had to fiddle around way more than I expected to do that: therefore, I thought about writing a blog post (which hopefully could be useful to others trying to get started with TMB)!
Disclaimer: I am no {TMB} nor C++ expert, so forgive me if I am missing something and let me know if there’s anything that needs fixing here.
So, let’s get started. We first need to write our C++ function for the negative loglikelihood function, which is a simple adaptation of the template we saw before for a singlecovariate model:


Here we needed to change X
to a DATA_MATRIX()
, and b
to a PARAMETER_VECTOR
.
Note that X
here should be the design matrix of the model, e.g. obtained using the model.matrix()
function in R, and that the matrixarray multiplication in C++ X*b
is not the elementwise operation, and is equivalent to X %*% b
in R.
This is it: easy right?
Let’s now verify that this works as expected.
We can compile the C++ function and dynamically link it using the tools from the {TMB} package (assuming model.cpp
contains the function defined above):


We need to simulate some data now, e.g. with a single covariate for simplicity:


We create a dataset and a model matrix too:


## (Intercept) x
## 1 1 63.70958
## 2 1 44.35302
## 3 1 53.63128
## 4 1 56.32863
## 5 1 54.04268
## 6 1 48.93875
Now we can use the MakeADFun()
from {TMB} to construct the R object that brings all of this together:


This passes the data X
and y
and sets the default values of the model parameters to zeros for all regression coefficients (b0
and b1
in this case) and for the log of the standard deviation of the residual errors.
We can now easily get the value of the (negative) loglikelihood function at the default parameters, gradients, and the Hessian matrix:


## [1] 51351.44


## [,1] [,2] [,3]
## [1,] 9994.682 497251.6 99865.01


## [,1] [,2] [,3]
## [1,] 1000.00 49741.76 19989.36
## [2,] 49741.76 2574646.64 994503.22
## [3,] 19989.36 994503.22 201730.02
The cool part is, we didn’t even have to define analytical formulae for gradients and the Hessian and we got it for free with automatic differentiation!
Now all we have to do is to pass the negative loglikelihood function and the gradients to e.g. optim()
and that’s all:


## $par
## b b logSigma
## 9.945838718 0.000981937 0.014587405
##
## $value
## [1] 1404.351
##
## $counts
## function gradient
## 35 35
##
## $convergence
## [1] 0
##
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
If we exponentiate the value of log(Sigma)
, we obtain the standard deviation of the residuals on the proper scale: 0.9855.
Compare this with the true values (b0
= 10, b1
= 0, log(Sigma)
= 0) and with the results of the least squares estimator:


##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## 2.9225 0.6588 0.0083 0.6628 3.5877
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 9.9458454 0.1579727 62.959 <2e16 ***
## x 0.0009818 0.0031133 0.315 0.753
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9865 on 998 degrees of freedom
## Multiple Rsquared: 9.964e05, Adjusted Rsquared: 0.0009023
## Fstatistic: 0.09945 on 1 and 998 DF, pvalue: 0.7526
We are getting really close here!
Let’s now generalise this to multiple covariates, say 4 normallydistributed covariates:


## (Intercept) x1 x2 x3 x4
## 1 1 23.709584 43.25058 32.505781 33.14338
## 2 1 4.353018 25.24122 27.220760 32.07286
## 3 1 13.631284 29.70733 12.752643 35.92996
## 4 1 16.328626 23.76973 9.932951 28.51329
## 5 1 14.042683 10.04067 17.081917 51.15760
## 6 1 8.938755 14.02517 33.658382 31.20543
All we have to do is recreate the f
object using the MakeADFun()
function and the new data:


Let’s now fit this with optim()
and compare with the least squares estimator:


## $par
## b b b b b logSigma
## 9.90325169 0.99992371 2.00125295 2.99795882 4.00296201 0.01661648
##
## $value
## [1] 1435.576
##
## $counts
## function gradient
## 103 103
##
## $convergence
## [1] 0
##
## $message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"


##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## 3.3354 0.7382 0.0161 0.6893 3.0315
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 9.903788 0.177893 55.67 <2e16 ***
## x1 0.999929 0.003218 310.72 <2e16 ***
## x2 2.001249 0.003277 610.69 <2e16 ***
## x3 2.997959 0.003134 956.53 <2e16 ***
## x4 4.002949 0.003266 1225.83 <2e16 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.019 on 995 degrees of freedom
## Multiple Rsquared: 0.9997, Adjusted Rsquared: 0.9997
## Fstatistic: 7.294e+05 on 4 and 995 DF, pvalue: < 2.2e16
Very close once again, and if you noticed, we didn’t have to change a single bit of code from the first example with a single covariate. Isn’t that cool?
I know, I know, maybe it’s cool for a specific type of person only, but hey…
Another nice thing is that given that we have gradients (in compiled code!), maximising the likelihood will be faster and generally more accurate if using a method that relies on gradients, such as the limitedmemory modification of the BFGS quasiNewton method that we chose here. See the following benchmark as a comparison:


## # A tibble: 2 x 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 with gradients 1 1 3.97 1 NaN
## 2 without gradients 4.17 4.20 1 1 Inf
In this case (a small and simple toy problem) the optimisation using gradients is ~4 times faster (on my quadcore MacBook Pro with 16 Gb of RAM). And all of that for free, just using automatic differentiation!
Remember that, if not provided with a gradient function, optim()
will use finite differences to calculate the gradients numerically.
Then, the benchmark above gives us a direct comparison with numerical differentiation too.
To wrap up, hats off to the creators of the {TMB} package for providing such an easy and powerful framework, and I didn’t even scratch the surface here: this is just a quick and dirty toy example with multiple linear regression, which doesn’t event needs maximum likelihood.
Anyway, I’m definitely going to come back to automatic differentiation and {TMB} in future blog posts, stay tuned for that. Cheers!
Update: Benjamin Christoffersen shared on Twitter a link to a thread on Cross Validated where the differences between maximum likelihood and least squares are discussed in more detail. It’s a very interesting read, remember to check it out!