Note: This FAQ is for users of Stata 5. It is not relevant for more recent versions.

This question was originally posed on Statalist.

Stata 5: Why does the goodness-of-fit chi-squared test reported by poisson change when the counts and exposures are grouped differently?

Title   Stata 5: Goodness-of-fit chi-squared test reported by poisson
Author Bill Sribney, StataCorp

Question:

The version 5 documentation indicates the goodness-of-fit chi-squared statistic reported with the results of Poisson regression is a test of the null hypothesis that the dependent variable is Poisson distributed. My question is why this statistic (and perhaps the resulting inference regarding the appropriateness of Poisson regression) varies with the composition of the right-hand-side variables.

Answer:

The goodness-of-fit chi-squared statistic in the poisson command is a simple Pearson's chi-squared statistic:

    \chi^2 = \sum_{i=1}^{N} \frac{(\mathrm{observed}_i - \mathrm{expected}_i)^2}{\mathrm{expected}_i}

where i indexes the observations in the dataset. The degrees of freedom (df) are

    df = N - (#terms in model including the constant)

If you split up or group the counts and exposures differently, you get different cells for the Pearson's chi-squared and thus a different statistic.

Here is an illustration using the first example in the poisson entry, on page 31 of the P–Z Reference Manual:

 . list
    
        airline   injuries         n   XYZowned  
   1.         1         11    0.0950          1  
   2.         2          7    0.1920          0  
   3.         3          7    0.0750          0  
   4.         4         19    0.2078          0  
   5.         5          9    0.1382          0  
   6.         6          4    0.0540          1  
   7.         7          3    0.1292          0  
   8.         8          1    0.0503          0  
   9.         9          3    0.0629          1  
 
 . poisson injuries XYZowned, exposure(n) irr
 
 Iteration 0: Log Likelihood = -23.90184
 Iteration 1: Log Likelihood = -23.032242
 Iteration 2: Log Likelihood = -23.027176
 
 Poisson regression, normalized by n                 Number of obs    =       9
 Goodness-of-fit chi2(7)     =    14.094             Model chi2(1)    =   1.768
 Prob > chi2                 =    0.0495             Prob > chi2      =  0.1836
 Log Likelihood              =   -23.027             Pseudo R2        =  0.0370
 
 ------------------------------------------------------------------------------
 injuries |        IRR   Std. Err.       z     P>|z|       [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
 XYZowned |   1.463467    .406872      1.370   0.171       .8486578    2.523675
 ------------------------------------------------------------------------------
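
The sum in the formula can be checked by hand on these nine observations. What follows is a minimal Python sketch, not Stata code and not StataCorp's implementation; the function names are hypothetical. It assumes the shortcut that, for a model with a constant and a single 0/1 covariate, the fitted Poisson rate in each group is the total count divided by the total exposure. It evaluates the Pearson sum and also the deviance, the other common goodness-of-fit sum for Poisson models. On these data the Pearson formula gives about 13.39 and the deviance about 14.09, the latter being the figure shown in the listing. Both are sums over observations with the same df, so both depend on how counts and exposures are grouped.

```python
import math

# Airline data copied from the listing: injuries, exposure n, XYZowned.
injuries = [11, 7, 7, 19, 9, 4, 3, 1, 3]
exposure = [0.0950, 0.1920, 0.0750, 0.2078, 0.1382,
            0.0540, 0.1292, 0.0503, 0.0629]
xyzowned = [1, 0, 0, 0, 0, 1, 0, 0, 1]


def group_rates(y, n, x):
    """Fitted rates for constant + one 0/1 covariate (assumed shortcut):
    within each group, rate = total count / total exposure."""
    rates = {}
    for g in (0, 1):
        total_y = sum(yi for yi, xi in zip(y, x) if xi == g)
        total_n = sum(ni for ni, xi in zip(n, x) if xi == g)
        rates[g] = total_y / total_n
    return rates


def goodness_of_fit(y, n, x, n_terms=2):
    """Return (pearson, deviance, df) with one cell per observation."""
    rates = group_rates(y, n, x)
    expected = [rates[xi] * ni for ni, xi in zip(n, x)]
    pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, expected))
    deviance = 2 * sum(
        (yi * math.log(yi / mi) if yi > 0 else 0.0) - (yi - mi)
        for yi, mi in zip(y, expected)
    )
    df = len(y) - n_terms  # N - (#terms including the constant)
    return pearson, deviance, df


rates = group_rates(injuries, exposure, xyzowned)
pearson, deviance, df = goodness_of_fit(injuries, exposure, xyzowned)

print("IRR      =", round(rates[1] / rates[0], 4))  # about 1.4635
print("Pearson  =", round(pearson, 3))              # about 13.39
print("deviance =", round(deviance, 3))             # about 14.09
print("df       =", df)                             # 7
```

The shortcut in group_rates does not carry over to models with continuous or multiple covariates, where the fit has to be obtained iteratively.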

Now we will group the data by the unique covariate patterns of the model. In this case that simply amounts to grouping by XYZowned and summing counts (injuries) and exposure (n) within this grouping:

 . collapse (sum) injuries n, by(XYZowned)
    
 . list
    
       XYZowned   injuries           n  
   1.         0         46       .7925  
   2.         1         18       .2119  
 
 . poisson injuries XYZowned, exposure(n) irr
         
 Iteration 0: Log Likelihood = -5.2133484
 Iteration 1: Log Likelihood = -5.2038269
 
 Poisson regression, normalized by n                 Number of obs    =       2
 Goodness-of-fit chi2(0)     =     0.000             Model chi2(1)    =   1.768
 Prob > chi2                 =         .             Prob > chi2      =  0.1836
 Log Likelihood              =    -5.204             Pseudo R2        =  0.1452
 
 ------------------------------------------------------------------------------
 injuries |        IRR   Std. Err.       z     P>|z|       [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
 XYZowned |   1.463466   .4068718      1.370   0.171       .8486574    2.523673
 ------------------------------------------------------------------------------

Note that the IRR and standard error are essentially the same, but the goodness-of-fit test is different. For the Poisson regression itself, the original and collapsed datasets are equivalent, but the original dataset carries more information about the Poisson-ness of the data, because it lets you examine the counts for small portions of exposure.
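
The effect of collapsing can also be reproduced outside Stata. Below is a minimal Python sketch with hypothetical helper names; it assumes the group-rate shortcut that holds only for a constant plus one 0/1 covariate (fitted rate = total count / total exposure in each group). It collapses the nine rows to one row per covariate pattern and evaluates the Pearson sum on both layouts.

```python
# Airline data copied from the first listing: (injuries, exposure n, XYZowned).
rows = [
    (11, 0.0950, 1), (7, 0.1920, 0), (7, 0.0750, 0),
    (19, 0.2078, 0), (9, 0.1382, 0), (4, 0.0540, 1),
    (3, 0.1292, 0), (1, 0.0503, 0), (3, 0.0629, 1),
]


def collapse_by_pattern(data):
    """Sum counts and exposure within each value of the covariate,
    a stand-in for: collapse (sum) injuries n, by(XYZowned)."""
    totals = {}
    for y, n, x in data:
        ty, tn = totals.get(x, (0, 0.0))
        totals[x] = (ty + y, tn + n)
    return [(ty, tn, x) for x, (ty, tn) in sorted(totals.items())]


def irr_and_pearson(data, n_terms=2):
    """Fit by group rates (assumed shortcut), then return the IRR,
    the Pearson sum over the rows of `data`, and its df."""
    rate = {x: y / n for y, n, x in collapse_by_pattern(data)}
    pearson = sum((y - rate[x] * n) ** 2 / (rate[x] * n) for y, n, x in data)
    return rate[1] / rate[0], pearson, len(data) - n_terms


collapsed = collapse_by_pattern(rows)
irr_full, chi2_full, df_full = irr_and_pearson(rows)
irr_coll, chi2_coll, df_coll = irr_and_pearson(collapsed)

print(collapsed)            # two rows: totals 46 and 18 injuries
print(irr_full, irr_coll)   # identical incidence-rate ratios
print(chi2_full, df_full)   # a positive statistic on 7 df
print(chi2_coll, df_coll)   # essentially zero on 0 df
```

With one row per covariate pattern the model reproduces each cell exactly, so every term of the sum vanishes and no degrees of freedom remain for a test.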

When the portions of exposure get too small, the expected counts become small, which is the well-known situation in which the chi-squared approximation for the Pearson statistic becomes unreliable.

Perhaps Stata should automatically group by covariate pattern before computing the Pearson's chi-squared, as lfit does after logistic. But in some cases it is certainly legitimate not to group. This example is close to being one of those cases; the injury counts are just a little too low for some observations.

Note that Pearson’s chi-squared also has a problem when its df becomes large. For poisson, this happens when the number of observations becomes large.

My personal rules of thumb:

  1. If the number of unique covariate patterns is not small (say greater than 20), then group on it for the gof test so that your dataset has only one observation per unique covariate pattern.
  2. Look at predicted (expected) counts. If there are any very small ones (< 2) or lots of small ones (< 5), view Pearson's chi-squared gof test with suspicion.
  3. If the df of the chi-squared is large (more than roughly 50 to 100), take the result with a large grain of salt. (This is true for any chi-squared statistic.)
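
These rules can be turned into a small screening routine. The following Python sketch is only one possible reading of them. The function name gof_cautions is hypothetical, "lots of small ones" is interpreted (an assumption) as more than 20% of the expected counts, and the df cutoff defaults to 50, the lower end of the stated range. The expected counts for the airline example are computed with the group-rate shortcut, which holds only for a constant plus one 0/1 covariate.

```python
def gof_cautions(expected, df, n_obs, n_patterns,
                 lots_fraction=0.2, df_limit=50):
    """Return a list of cautions based on the three rules of thumb.

    lots_fraction and df_limit are illustrative choices, not values
    fixed by the rules themselves.
    """
    cautions = []
    # Rule 1: many covariate patterns and more than one row per pattern.
    if n_patterns > 20 and n_obs > n_patterns:
        cautions.append("group: collapse to one observation per pattern")
    # Rule 2: any expected count below 2, or many below 5.
    very_small = sum(1 for m in expected if m < 2)
    small = sum(1 for m in expected if m < 5)
    if very_small > 0:
        cautions.append("small counts: %d expected count(s) below 2"
                        % very_small)
    elif small > lots_fraction * len(expected):
        cautions.append("small counts: %d of %d expected counts below 5"
                        % (small, len(expected)))
    # Rule 3: large degrees of freedom.
    if df > df_limit:
        cautions.append("large df: %d exceeds %d" % (df, df_limit))
    return cautions


# Airline data copied from the first listing.
injuries = [11, 7, 7, 19, 9, 4, 3, 1, 3]
exposure = [0.0950, 0.1920, 0.0750, 0.2078, 0.1382,
            0.0540, 0.1292, 0.0503, 0.0629]
xyzowned = [1, 0, 0, 0, 0, 1, 0, 0, 1]

# Expected counts under the fitted model (assumed group-rate shortcut).
rate = {}
for g in (0, 1):
    rate[g] = (sum(y for y, x in zip(injuries, xyzowned) if x == g)
               / sum(n for n, x in zip(exposure, xyzowned) if x == g))
expected = [rate[x] * n for n, x in zip(exposure, xyzowned)]

airline = gof_cautions(expected, df=7, n_obs=9, n_patterns=2)
print(airline)
```

For the ungrouped airline data only the small-count rule fires: three of the nine expected counts fall below 5 and none falls below 2.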