



st: interpreting the estat gof command and its Hosmer-Lemeshow version


From   Doug Hess <douglasrhess@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   st: interpreting the estat gof command and its Hosmer-Lemeshow version
Date   Sun, 18 Sep 2011 14:09:40 -0400

Given all the cautions in Hosmer & Lemeshow's book, I'm a bit confused
as to what role and what interpretation should be given to the tests
that -estat gof- produces with and without the -group- option. The
results are below.

Without the grouping option, the Pearson chi2 test gives P>chi2=0.9999.
However, with groups (and the number of groups doesn't seem to matter
unless you have a very large number), the Hosmer-Lemeshow method gives
P>chi2=0.0000.  From the R manual (pp. 958-9) and Hosmer & Lemeshow's
book (p. 150 of the 2000 edition) I gather that the null hypothesis is
the same for both.

So, why the large difference? Is one more appropriate, or do both have
problems when the outcome is somewhat rare (11 percent of observations
have y=1 in my case)?

I see that Stata's R manual says "However, the number of covariate
patterns is close to the number of observations, making the
applicability of the Pearson chi2 test questionable but not
necessarily inappropriate" (p. 958). I have roughly 140,000
observations (households) and roughly 109,000 covariate patterns. If
this difference is important in deciding which of these tests to use,
how close must the number of patterns be to the number of observations
before it matters?  (It may help to know that there are only roughly
70,000 covariate patterns, about half the sample size, if I remove a
half dozen continuous variables, which I am thinking of doing by
collapsing them into one or two scales or factors.)
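
To make the comparison between observations and covariate patterns concrete, here is a minimal sketch. It uses Stata's bundled auto dataset and a toy model purely as stand-ins (they are assumptions for illustration, not the data or model discussed here). -egen, group()- assigns one id per distinct combination of covariate values, so its maximum is the number of patterns in the estimation sample.

```stata
* Stand-in data and model (hypothetical, for illustration only)
sysuse auto, clear
logistic foreign mpg weight

* One id per distinct combination of covariate values in the estimation sample
egen long pattern = group(mpg weight) if e(sample)
quietly summarize pattern

display "covariate patterns       = " r(max)
display "observations             = " e(N)
display "observations per pattern = " e(N)/r(max)
```

With the rounded figures above (roughly 140,000 observations and 109,000 patterns), the same ratio is about 1.3 observations per pattern.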

If it helps, here are some additional details: My logistic model (11
percent of observations are y=1) has an optimal cutoff point for
maximizing the sensitivity and specificity at 0.10, which gives
approximately 75 percent for both sensitivity and specificity.  The area
under the ROC curve is 0.83. I'm using Stata 12.
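
Figures of this kind are usually read off Stata's postestimation commands for logistic models. The sketch below shows one way to get them; the auto dataset and the toy model are stand-ins (assumptions for illustration, not the model described here), and only the 0.10 cutoff is taken from the text.

```stata
* Stand-in data and model (hypothetical, for illustration only)
sysuse auto, clear
logistic foreign mpg weight

* Classification table, including sensitivity and specificity, at cutoff 0.10
estat classification, cutoff(0.10)

* Sensitivity and specificity plotted against the probability cutoff
lsens

* ROC curve and the area under it
lroc
```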

. estat gof

number of observations       =    143585
number of covariate patterns =    108638
Pearson chi2(108575)         = 106784.16
Prob > chi2                  =    0.9999

. estat gof, g(10) table

number of observations =    143585
number of groups =        10
Hosmer-Lemeshow chi2(8) =       322.31
Prob > chi2 =         0.0000


Decile   Pred Prob   Obs y=1   Exp y=1     Total   Diff   % diff
     1       0.019       115       190    14,359     75      65%
     2       0.025       194       315    14,358    121      62%
     3       0.034       305       419    14,359    114      37%
     4       0.044       443       560    14,359    117      26%
     5       0.055       671       704    14,361     33       5%
     6       0.072       864       904    14,355     40       5%
     7       0.100     1,379     1,213    14,359    166      12%
     8       0.163     2,122     1,827    14,358    295      14%
     9       0.302     3,615     3,207    14,359    408      11%
    10       0.856     6,175     6,543    14,358    368       6%
   Sum                15,883    15,883   143,585

I removed the observed and expected columns for y=0 for
formatting/simplicity. The Diff column is the absolute value of Obs
y=1 minus Exp y=1. The last column is Diff as a percentage of
Obs y=1.
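
As a rough check, the Hosmer-Lemeshow statistic can be recomputed from the table. This sketch assumes the usual form of the statistic, the sum over groups of (O - E)^2 / (E * (1 - E/n)), where O and E are the observed and expected counts of y=1 and n is the group total. The variable names (obs1, exp1, total, contrib) are made up for the example. Because the expected counts in the table are rounded to whole numbers, the result should land near the reported 322.31 (roughly 322.7) rather than match it exactly.

```stata
* Counts copied from the decile table (commas removed)
clear
input decile obs1 exp1 total
 1  115  190 14359
 2  194  315 14358
 3  305  419 14359
 4  443  560 14359
 5  671  704 14361
 6  864  904 14355
 7 1379 1213 14359
 8 2122 1827 14358
 9 3615 3207 14359
10 6175 6543 14358
end

* Contribution of each group, combining the y=1 and y=0 cells
generate double contrib = (obs1 - exp1)^2 / (exp1 * (1 - exp1/total))
quietly summarize contrib
scalar hl = r(sum)

* 10 groups give 8 degrees of freedom, as in the output above
display "HL chi2 from table = " %8.2f hl
display "Prob > chi2        = " %6.4f chi2tail(8, hl)
```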

Thank you.

-Doug

