Re: st: estat gof (Hosmer & Lemeshow) after svy:logistic (survey)


From   Stas Kolenikov <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: estat gof (Hosmer & Lemeshow) after svy:logistic (survey)
Date   Mon, 22 Jul 2013 13:55:47 -0500

The defaults of -estat gof- work in fundamentally different ways with
and without -svy-. Without the -svy- prefix, -estat gof- works with
the predicted probabilities in every covariate pattern cell. If you
have a continuous variable in your regression, this may lead to a
disaster, and it probably did in your case: I would have zero trust in
a chi-square with 342 degrees of freedom based on 710 observations.
With 350 covariate patterns, that averages about two observations per
cell. Do you know how many zero cells you have there, at least?
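
As a rough check of how sparse those cells are, here is a minimal
Python sketch that uses only figures from the unweighted output quoted
below. The variable names are illustrative, and the parameter count of
8 assumes the 7 slope coefficients reported as LR chi2(7) plus the
constant.

```python
# Figures copied from the unweighted -estat gof- output quoted below.
n_obs = 710        # number of observations
n_patterns = 350   # number of covariate patterns
n_params = 8       # assumption: 7 slope coefficients plus the constant

# Pearson degrees of freedom: covariate patterns minus estimated parameters.
pearson_df = n_patterns - n_params

# Average number of observations available per covariate pattern cell.
obs_per_pattern = n_obs / n_patterns

print(f"Pearson df: {pearson_df}")                                  # 342
print(f"Observations per covariate pattern: {obs_per_pattern:.2f}")  # 2.03
```

An average of about two observations per cell is far too few for the
chi-square approximation to be reliable.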

After -svy-, -estat gof- reports the Archer-Lemeshow test (note that
this is a different test now, documented in the Stata Journal), which
does essentially what -estat gof, group()- does: it forms quantile
groups of predicted probabilities and looks into the Pearson test
based on these groups (F(9,144) in your case, so 9 numerator degrees
of freedom; OK for your design with # of PSUs - # of strata =
234 - 41 = 193 design degrees of freedom). Basically, the
Archer-Lemeshow test recognizes that many practical designs have only
a dozen or so degrees of freedom, so it conserves them and does not
try to go after all covariate combinations. In your example, doing so
leads to 350 covariate patterns (342 degrees of freedom), which is
simply not sustainable with 193 available degrees of freedom.
-estat gof- is supported after -svy- in Stata 12.
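
The degrees of freedom in the survey version can be reproduced by
hand. The sketch below is an illustration only: it assumes the default
of 10 quantile groups and an F(g-1, f-g+2) reference distribution,
where f is the design degrees of freedom (PSUs minus strata). The
function name archer_lemeshow_df is hypothetical, not a Stata or
library routine.

```python
def archer_lemeshow_df(n_psu, n_strata, groups=10):
    """Return (numerator df, denominator df) for the F-adjusted test.

    Assumes design df f = PSUs - strata and an F(g-1, f-g+2) reference
    distribution with g quantile groups (10 assumed as the default).
    """
    design_df = n_psu - n_strata
    return groups - 1, design_df - groups + 2


# The -if- run quoted below reports 193 PSUs and 41 strata.
print(archer_lemeshow_df(193, 41))  # (9, 144), matching F(9,144)
```

The numerator stays at 9 however many covariate patterns the data
contain, which is how the test conserves the design degrees of freedom.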

I don't know whether the reasons -estat gof- refuses to work after
-svy, subpop()- are conceptual (subsetting the sample may produce low
weighted counts or zero cells; the denominators in the sum of cell
contributions in the Pearson test have random subpopulation size,
which needs to be properly accounted for) or programmatic (StataCorp
did not expect such use, and made -estat gof- overly protective of
what it can work with). I can imagine that once e(V) is posted,
incorporating the uncertainty due to random subpopulation size,
everything else should follow, which is what you force with an
inappropriate -if- statement. (I hope you are aware of the dangers of
using -if- with survey data; see
http://www.stata-journal.com/article.html?article=st0153.)
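
One visible symptom of the -if- problem is in the two survey outputs
quoted below: -if- discards PSUs that have no subpopulation members,
so the design looks smaller than it really is. A minimal Python sketch
of that arithmetic, using only the reported counts (variable names are
illustrative):

```python
n_strata = 41      # same in both runs
psu_subpop = 234   # PSUs kept by svy, subpop()
psu_if = 193       # PSUs left after restricting with -if-

# Design degrees of freedom = number of PSUs minus number of strata.
df_subpop = psu_subpop - n_strata
df_if = psu_if - n_strata

# PSUs that -if- removed from the design altogether.
psus_dropped = psu_subpop - psu_if

print(df_subpop, df_if, psus_dropped)  # 193 152 41
```

The point estimates barely move between the two runs, but the design
information used for variance estimation does.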

-- Stas Kolenikov, PhD, PStat (ASA, SSC)
-- Senior Survey Statistician, Abt SRBI
-- Opinions stated in this email are mine only, and do not reflect the
position of my employer
-- http://stas.kolenikov.name



On Wed, Jul 17, 2013 at 5:23 AM, Ángel Rodríguez Laso
<[email protected]> wrote:
> Dear Statalisters,
>
> Working with Stata 12.1.
>
>
> If I carry out the following logistic regression in a survey setting
> and then type estat gof I get:
>
>
> . svy, subpop(if disdesjub==1 & disdestr==1 & trab==1 & dismy50==1 &
> proxy==2 & edad_c>=60): logistic discAVD edad_c i.sexo i.estud4
> i.difinmes3
> (running logistic on estimation sample)
>
> Survey: Logistic regression
>
> Number of strata   =        41                  Number of obs      =      1727
> Number of PSUs     =       234                  Population size    = 1347,0862
>                                                 Subpop. no. of obs =       710
>                                                 Subpop. size       =    563,75
>                                                 Design df          =       193
>                                                 F(   7,    187)    =      8,32
>                                                 Prob > F           =    0,0000
>
> ------------------------------------------------------------------------------
>              |             Linearized
>      discAVD | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>       edad_c |       1,10       0,02     4,42   0,000         1,05        1,15
>              |
>         sexo |
>           1  |       1,00  (base)
>           2  |       2,60       0,82     3,02   0,003         1,39        4,84
>              |
>       estud4 |
>           0  |       1,00  (base)
>           1  |       0,87       0,32    -0,38   0,704         0,43        1,78
>           2  |       0,90       0,40    -0,24   0,807         0,37        2,16
>           3  |       0,60       0,27    -1,14   0,257         0,24        1,47
>              |
>    difinmes3 |
>           0  |       1,00  (base)
>           1  |       1,59       0,57     1,31   0,190         0,79        3,21
>           2  |       3,33       1,20     3,35   0,001         1,64        6,77
>              |
>        _cons |       0,00       0,00    -5,88   0,000         0,00        0,00
> ------------------------------------------------------------------------------
>
> .
> end of do-file
>
> . estat gof
> estat gof is not allowed after subpopulation estimations
> r(198);
>
>
>
> Then I replace the subpopulation specification with an if statement:
>
>
> . svy: logistic discAVD edad_c i.sexo i.estud4 i.difinmes3 if
> disdesjub==1 & disdestr==1 & trab==1 & dismy50==1 & proxy==2 &
> edad_c>=60
> (running logistic on estimation sample)
>
> Survey: Logistic regression
>
> Number of strata   =        41                  Number of obs      =       710
> Number of PSUs     =       193                  Population size    =    563,75
>                                                 Design df          =       152
>                                                 F(   7,    146)    =      8,35
>                                                 Prob > F           =    0,0000
>
> ------------------------------------------------------------------------------
>              |             Linearized
>      discAVD | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>       edad_c |       1,10       0,02     4,41   0,000         1,05        1,15
>              |
>         sexo |
>           1  |       1,00  (base)
>           2  |       2,60       0,82     3,02   0,003         1,39        4,85
>              |
>       estud4 |
>           0  |       1,00  (base)
>           1  |       0,87       0,32    -0,38   0,707         0,42        1,79
>           2  |       0,90       0,40    -0,25   0,807         0,37        2,16
>           3  |       0,60       0,27    -1,15   0,254         0,24        1,46
>              |
>    difinmes3 |
>           0  |       1,00  (base)
>           1  |       1,59       0,56     1,32   0,189         0,79        3,21
>           2  |       3,33       1,18     3,39   0,001         1,65        6,72
>              |
>        _cons |       0,00       0,00    -5,88   0,000         0,00        0,00
> ------------------------------------------------------------------------------
>
> . estat gof
>
> Logistic model for discAVD, goodness-of-fit test
>
>                      F(9,144) =       110,29
>                      Prob > F =         0,0000
>
>
>
> But if I get rid of the survey specifications, I get:
>
> . logistic discAVD edad_c i.sexo i.estud4 i.difinmes3 if disdesjub==1
> & disdestr==1 & trab==1 & dismy50==1 & proxy==2 & edad_c>=60
>
> Logistic regression                               Number of obs   =        710
>                                                   LR chi2(7)      =      65,87
>                                                   Prob > chi2     =     0,0000
> Log likelihood = -210,78135                       Pseudo R2       =     0,1351
>
> ------------------------------------------------------------------------------
>      discAVD | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>       edad_c |       1,10       0,02     5,28   0,000         1,06        1,14
>              |
>         sexo |
>           1  |       1,00  (base)
>           2  |       1,96       0,56     2,36   0,018         1,12        3,44
>              |
>       estud4 |
>           0  |       1,00  (base)
>           1  |       0,87       0,29    -0,42   0,676         0,45        1,69
>           2  |       0,88       0,40    -0,28   0,781         0,36        2,14
>           3  |       0,52       0,25    -1,37   0,170         0,21        1,32
>              |
>    difinmes3 |
>           0  |       1,00  (base)
>           1  |       1,89       0,61     1,97   0,049         1,00        3,57
>           2  |       3,84       1,39     3,70   0,000         1,88        7,83
>              |
>        _cons |       0,00       0,00    -7,01   0,000         0,00        0,00
> ------------------------------------------------------------------------------
>
> . estat gof
>
> Logistic model for discAVD, goodness-of-fit test
>
>        number of observations =       710
>  number of covariate patterns =       350
>             Pearson chi2(342) =       328,89
>                   Prob > chi2 =         0,6852
>
>
> The last two models don't look terribly different, so what is the
> reason for such a large change in the Hosmer & Lemeshow result? Which
> one should I trust?
>
> Thank you for your time and attention.
>
> Angel Rodriguez-Laso
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/


