The other day I started working with a colleague on propagating uncertainty for a negative binomial generalized linear model. To my surprise, I found that there are a couple of different ways to define the probability distribution, which led to more than a bit of confusion on my end but also got me messing around with different ways to simulate over-dispersed count data. It also got me working with TMB again, along with the woes of nonlinear optimization.
The first model is a mixture distribution that combines a Poisson distribution with a Gamma distribution to account for the extra variance. From what I have seen this is referred to as an NB1 model, and it looks like this:

$$y_i \sim \mathrm{Poisson}(\lambda_i), \qquad \lambda_i \sim \mathrm{Gamma}\!\left(\mathrm{shape} = \phi,\ \mathrm{rate} = \frac{\phi}{\mu}\right).$$
Our distribution has a mean of $\mu$ that persists through both distributions, but the variance is greater than that of a plain Poisson distribution because of the randomness introduced by the Gamma distribution, which itself has a variance of $\mu^2/\phi$, giving a total variance of $\mu + \mu^2/\phi$. Getting this to work for a linear model with a log link required me to introduce a new random variable $\nu_i$ to account for the over-dispersion, and the full log-link model follows the form

$$y_i \sim \mathrm{Poisson}\!\left(\exp(X_i\beta)\,\nu_i\right), \qquad \nu_i \sim \mathrm{Gamma}(\mathrm{shape} = \phi,\ \mathrm{rate} = \phi),$$
where $X$ is my matrix of covariate data ($X_i$ being its $i$-th row) and $\beta$ is the vector of coefficients for the linear model.
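To make that concrete, here is a minimal R sketch of simulating over-dispersed counts from this mixture under the parameterization above; the sample size, covariate, and parameter values are made up for illustration.

```r
set.seed(1)

n    <- 500
beta <- c(0.5, 1.2)  # intercept and slope (made-up values)
phi  <- 2            # Gamma shape; smaller phi means more over-dispersion

x  <- runif(n, -1, 1)
X  <- cbind(1, x)              # design matrix
mu <- drop(exp(X %*% beta))    # log link: mu_i = exp(X_i beta)

# multiplicative Gamma noise with mean 1 and variance 1/phi
nu <- rgamma(n, shape = phi, rate = phi)

# Poisson draws around the noisy means
y <- rpois(n, lambda = mu * nu)

# the sample variance should be noticeably larger than the sample mean
c(mean = mean(y), var = var(y))
```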
Simulating this data was cake, but when I used TMB to fit the model I found that it often failed to converge depending on the starting values of my parameters. I could get around it by first fitting the model without $\nu_i$ to get good starting values for $\beta$, but even so the model was slow to fit.
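For what it's worth, the workaround looked roughly like this: fit a plain Poisson GLM first and hand its coefficients to the full model as starting values. This is only a sketch; the names in the starting-value list (`beta`, `log_phi`, `log_nu`) are hypothetical stand-ins for whatever the TMB template actually calls them.

```r
set.seed(1)

# quick over-dispersed data, as in the sketch above
n  <- 500
x  <- runif(n, -1, 1)
mu <- exp(0.5 + 1.2 * x)
y  <- rpois(n, mu * rgamma(n, shape = 2, rate = 2))

# step 1: a plain Poisson GLM gives cheap, sensible starting values for beta
start_beta <- unname(coef(glm(y ~ x, family = poisson)))

# step 2: use them in the parameter list handed to TMB::MakeADFun(),
# with the per-observation random effects started at zero
start_pars <- list(beta = start_beta, log_phi = 0, log_nu = rep(0, n))
str(start_pars)
```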
Alternatively, we could model the extra variance directly, without any extra term $\nu_i$, by using a negative binomial function that comes built in with both R and TMB. This saves us the hassle of having to simulate from two different distributions, and the two-parameter model can easily be set up so that it has a mean-setting parameter $\mu$ and a variance parameter $\theta$. The NB2, as we'll call it, looks like this:

$$y_i \sim \mathrm{NB}(\mu,\ \theta), \qquad \mathrm{E}(y_i) = \mu, \qquad \mathrm{Var}(y_i) = \mu + \frac{\mu^2}{\theta}.$$
Fitting a log-linear model here just means estimating $\mu_i = \exp(X_i\beta)$ with the same $X$ and $\beta$, without the $\nu_i$ parameters. The nice thing is that we still get the same distribution as the first model, where $\theta$ corresponds to the Gamma shape $\phi$ in the mixture. The even nicer thing is that I didn't have any problems fitting this model, and it was way faster in TMB since there isn't a whole other set of parameters to estimate.
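As a quick sanity check on that correspondence, here is an R sketch (with made-up values of $\mu$ and $\theta$) comparing draws from R's built-in negative binomial, parameterized by mean and size, with draws from the Poisson-Gamma mixture above.

```r
set.seed(1)

n     <- 1e5
mu    <- 4   # mean (made up)
theta <- 2   # NB2 dispersion / size parameter (made up)

# NB2 directly, via R's mean/size parameterization
y_nb2 <- rnbinom(n, mu = mu, size = theta)

# Poisson-Gamma mixture with Gamma shape phi = theta
y_mix <- rpois(n, mu * rgamma(n, shape = theta, rate = theta))

# both should have mean ~ 4 and variance ~ mu + mu^2/theta = 12
rbind(nb2 = c(mean = mean(y_nb2), var = var(y_nb2)),
      mix = c(mean = mean(y_mix), var = var(y_mix)))
```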
A working version of the code exists here.