# st: overdispersion and underdispersion in nbreg / glm models

From: jhilbe@aol.com
To: statalist@hsphsun2.harvard.edu
Subject: st: overdispersion and underdispersion in nbreg / glm models
Date: Thu, 18 Dec 2008 14:49:19 -0500

I take the Digest, and try to scan through the contents when possible. I'm pleased that I happened to catch your query.

Overdispersion in count models can arise for a wide variety of reasons. Identifying the source of overdispersion can help in finding a remedy for it. In some cases the remedy is such that, when applied, the model is no longer overdispersed. I call this apparent overdispersion. In other situations, the remedy does not eliminate the fact that the data are overdispersed, but it adjusts the model (usually the standard errors) so that the bias resulting from the overdispersion is minimized. Models of this type are ones that have real overdispersion.

In the book, I create a simulated Poisson model with 3 or 4 defined parameters. That is, for example, I define xb = b0 + b1*x1 + b2*x2 + b3*x3 with specific values for b*; e.g., xb = 1 + .5*x1 + .75*x2 - 1.2*x3. The x* values are all separately created random normal deviates; e.g., --gen x1 = invnorm(uniform())-- [[now should be invnormal(runiform())]]. I then use the values of xb in the command --rndpoisx-- or --genpoisson--. The result is a Poisson random variate, xp, structured by the values of xb. Running --glm xp x1 x2 x3, fam(poi)-- results in a Poisson model with parameter estimates and intercept having values very close, if not identical, to the values specified. The Pearson dispersion statistic is also very close to 1.0.

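The simulation just described can be sketched in a few lines. This is only an illustration under stated assumptions: the seed and sample size are arbitrary, and the built-in functions rnormal() and rpoisson() from newer Stata releases stand in for invnorm(uniform()) and --rndpoisx--.

```stata
* Simulated Poisson data with known coefficients (seed and n are arbitrary)
clear
set seed 4321
set obs 50000

* Three independently generated standard normal predictors
generate x1 = rnormal()
generate x2 = rnormal()
generate x3 = rnormal()

* Linear predictor with the coefficient values quoted in the text
generate xb = 1 + 0.5*x1 + 0.75*x2 - 1.2*x3

* Poisson response with mean exp(xb); rpoisson() is a stand-in for -rndpoisx-
generate xp = rpoisson(exp(xb))

* Fit the correctly specified Poisson model
glm xp x1 x2 x3, family(poisson)

* Pearson chi2 divided by the residual degrees of freedom
display "Pearson dispersion = " e(dispers_p)
```

The coefficient table can be checked against 1, .5, .75 and -1.2, and the displayed dispersion against 1.0.
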
I then remodel the data, taking out one of the predictors, let's say x1: --glm xp x2 x3, fam(poi)--. The parameter estimates are generally not the specified ones and, more importantly, the dispersion statistic becomes greater than 1. Sometimes it is substantially greater than 1.

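A self-contained sketch of the omitted-predictor comparison follows. The seed, sample size, and scalar names (disp_full, disp_reduced) are assumptions made for illustration, and rnormal() and rpoisson() again stand in for the older generators.

```stata
* Same data-generating process as described above (seed and n are arbitrary)
clear
set seed 4321
set obs 50000
generate x1 = rnormal()
generate x2 = rnormal()
generate x3 = rnormal()
generate xp = rpoisson(exp(1 + 0.5*x1 + 0.75*x2 - 1.2*x3))

* Full model: all three predictors included
quietly glm xp x1 x2 x3, family(poisson)
scalar disp_full = e(dispers_p)

* Reduced model: x1 left out
quietly glm xp x2 x3, family(poisson)
scalar disp_reduced = e(dispers_p)

* Compare the Pearson dispersion statistics of the two fits
display "Pearson dispersion, full model:    " disp_full
display "Pearson dispersion, reduced model: " disp_reduced
```

Dropping --quietly-- from the second fit shows how the estimates move away from the specified values.
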
What does this tell us? Well, when we are modeling data, we generally don't know what the parameter estimates are going to be in advance. If we do find, though, that the dispersion statistic differs substantially from 1.0, then we know that the model is not well fitted. We may not know why, though. In this case it was because a necessary predictor was missing from the model. In real situations, we hope that a variable is available in the data to remedy the fit; i.e., when it is put into the model, the dispersion closely approximates 1.0. The requisite predictor, however, may not have been collected. Again, in real situations, the missing predictor is one that is required to amend the extra correlation in the data, reflected by the dispersion statistic. All of this discussion is within the context of a Poisson model.

You appear to have modeled the data as negative binomial (NB-2) rather than Poisson. The way you obtained the value of alpha for inclusion in the GLM NB model was correct. What many folks forget, though, is that the NB model can itself be extradispersed. Rather than comparing the variance in the data with mu, as in Poisson, here we compare it with mu+a*mu*mu. The data may have less variance than mu+a*mu*mu, in which case the NB dispersion statistic is <1. Or they may have excessive variance, greater than mu+a*mu*mu, in which case the dispersion statistic is >1.

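The two-step procedure, alpha from --nbreg-- and then --glm-- with that alpha held fixed, might look like the sketch below. The simulated data, the seed, and the value alpha = 0.5 are assumptions chosen only so that the example runs on its own; with real data only the last two steps are needed.

```stata
* Simulated NB-2 data via a Poisson-gamma mixture (all values are illustrative)
clear
set seed 2008
set obs 20000
generate x1 = rnormal()
generate x2 = rnormal()

* Gamma heterogeneity with mean 1 and variance alpha (assumed alpha = 0.5)
local alpha = 0.5
generate nu = rgamma(1/`alpha', `alpha')

* Response with mean mu = exp(xb) and variance mu + alpha*mu^2
generate y = rpoisson(nu*exp(0.5 + 0.4*x1 - 0.3*x2))

* Step 1: maximum likelihood estimate of alpha from -nbreg-
quietly nbreg y x1 x2
local a = e(alpha)
display "ML estimate of alpha = " `a'

* Step 2: hold alpha fixed in the GLM family, then read the Pearson dispersion
glm y x1 x2, family(nbinomial `a')
display "NB Pearson dispersion = " e(dispers_p)
```

A value below 1 points to NB underdispersion and a value above 1 to NB overdispersion, both judged against mu+a*mu*mu rather than against mu.
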
When I discussed the missing predictor and how it affects dispersion, I was focusing on differentiating apparent from real overdispersion. That discussion did not address the NB model; it had to do with eliminating overdispersion from within the Poisson model. Here you are doing something quite different. It appears to be a question as to why adding a particular predictor can change the model from being underdispersed (dispersion statistic <1) to overdispersed (dispersion statistic >1). But here we are referring to NB overdispersion, not Poisson overdispersion.

The addition of the new predictor evidently added considerably more correlation to the data. I suspect that if you display the correlations between all of the variables in the model, the new predictor will be rather highly correlated with one or more of the other variables. It is possible, though rare, that the interaction of the variables is such that the extra correlation does not show up in this manner.

In any case, treat the inclusion of the variable as you would any other predictor: test it using the likelihood ratio test. It likely does not contribute to the model. If so, exclude it, and search for other reasons why there may be underdispersion. It may be that the data are simply NB-underdispersed (as distinct from Poisson-overdispersed), and adjustments can be made to the SEs, e.g., robust SEs. I suggest not scaling in this type of case, for reasons discussed in the book.

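The checks suggested here (correlations among the predictors, a likelihood ratio test of the added variable, and robust SEs) can be sketched as follows. Everything about the data is hypothetical: x3 plays the role of the added predictor and is constructed to be correlated with x1 while having no effect on y, and the names m_small and m_big are arbitrary.

```stata
* Illustrative NB-2 data; x3 is a hypothetical candidate predictor
clear
set seed 1217
set obs 20000
generate x1 = rnormal()
generate x2 = rnormal()
generate x3 = 0.8*x1 + 0.6*rnormal()
generate nu = rgamma(2, 0.5)
generate y  = rpoisson(nu*exp(0.5 + 0.4*x1 - 0.3*x2))

* How strongly is the candidate related to the other predictors?
correlate x1 x2 x3

* Likelihood ratio test of the candidate predictor
quietly nbreg y x1 x2
estimates store m_small
quietly nbreg y x1 x2 x3
estimates store m_big
lrtest m_big m_small

* Robust (sandwich) standard errors for the model without the candidate
nbreg y x1 x2, vce(robust)
```

With real data, the variable list and the two --nbreg-- calls would be replaced by the models actually being compared.
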
Perhaps I overkilled in my explanation, but I thought it important to clarify the relationships involved, and to show why the discussion of the missing predictor is not relevant to the solution of your query.

If you have additional questions, you can contact me directly at hilbe@asu.edu
```
Joseph Hilbe

============================================
Date: Wed, 17 Dec 2008 10:36:00 +0000
Subject: st: overdispersion and underdispersion in nbreg / glm models

Dear Statalisters,

I'd been following Joseph Hilbe's book "Negative Binomial Regression"
(2007) and using some of my own data to try out methods laid out in
the book.

The book suggested that one can look at the Pearson's dispersion
output from the -glm- command to check if one's negative binomial
model is affected by underdispersion or overdispersion.

In the book it says that if one's model is affected by overdispersion,
it could be caused by a missing explanatory variable.  But my model
seems to be suggesting quite the opposite and I am not sure what to
do.

When I added an explanatory variable to the model the Pearson's stats
went from being underdispersed to overdispersed.  Both models are
estimated using the -glm- command with the "family(nb XXX)" option
specified, XXX being the alpha value taken from the -nbreg- command
output.  Although the AIC and BIC of the model with the additional
variable look better (lower), I really don't know which is worse.
What should I do in order to resolve the dispersion problem and,
frankly speaking, are there other things that would tell me which
model is better?  Shall I bootstrap and jackknife???

All suggestions welcomed.

Regards,

- --
Research Fellow
Health Economics Research Unit
University of Aberdeen, UK.
http://www.abdn.ac.uk/heru/
Tel: +44 (0) 1224 553863
Fax: +44 (0) 1224 550926
```