Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: estat gof (Hosmer & Lemeshow) after svy:logistic (survey)


From   Steve Samuels <[email protected]>
To   [email protected]
Subject   Re: st: estat gof (Hosmer & Lemeshow) after svy:logistic (survey)
Date   Tue, 23 Jul 2013 17:42:29 -0400

I suggest that you notify Stata support of the different behavior of -estat gof- and subpop() with and without "quietly"

Steve
On Jul 23, 2013, at 9:32 AM, Ángel Rodríguez Laso wrote:

Dear Stas and Steve,

Yes, I'm aware of the limitations of using if instead of subpop. And I
think that there shouldn't be big differences in estat gof in my case
between svy and non svy commands because standard errors are not that
different in both cases (there was a mistake in my previous email
becuause I didn't include [pweight=pond_nacional] in the logistic
version). Now:

. svy, subpop(if disdesjub==1 & disdestr==1 & trab==1 & dismy50==1 &
proxy==2 & edad_c>=60): logistic discAVD edad_c i.sexo i.
> estud4 i.difinmes3
(running logistic on estimation sample)

Survey: Logistic regression

Number of strata   =        41                  Number of obs      =      1727
Number of PSUs     =       234                  Population size    = 1347,0862
                                               Subpop. no. of obs =       710
                                               Subpop. size       =    563,75
                                               Design df          =       193
                                               F(   7,    187)    =      8,32
                                               Prob > F           =    0,0000

------------------------------------------------------------------------------
            |             Linearized
    discAVD | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     edad_c |       1,10       0,02     4,42   0,000         1,05        1,15
            |
       sexo |
         1  |       1,00  (base)
         2  |       2,60       0,82     3,02   0,003         1,39        4,84
            |
     estud4 |
         0  |       1,00  (base)
         1  |       0,87       0,32    -0,38   0,704         0,43        1,78
         2  |       0,90       0,40    -0,24   0,807         0,37        2,16
         3  |       0,60       0,27    -1,14   0,257         0,24        1,47
            |
  difinmes3 |
         0  |       1,00  (base)
         1  |       1,59       0,57     1,31   0,190         0,79        3,21
         2  |       3,33       1,20     3,35   0,001         1,64        6,77
            |
      _cons |       0,00       0,00    -5,88   0,000         0,00        0,00
------------------------------------------------------------------------------


. logistic discAVD edad_c i.sexo i.estud4 i.difinmes3
[pweight=pond_nacional] if disdesjub==1 & disdestr==1 & trab==1 &
dismy5
> 0==1 & proxy==2 & edad_c>=60

Logistic regression                               Number of obs   =        710
                                                 Wald chi2(7)    =      56,60
                                                 Prob > chi2     =     0,0000
Log pseudolikelihood = -171,06907                 Pseudo R2       =     0,1384

------------------------------------------------------------------------------
            |               Robust
    discAVD | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     edad_c |       1,10       0,02     4,58   0,000         1,06        1,15
            |
       sexo |
         1  |       1,00  (base)
         2  |       2,60       0,88     2,82   0,005         1,34        5,04
            |
     estud4 |
         0  |       1,00  (base)
         1  |       0,87       0,33    -0,37   0,715         0,42        1,82
         2  |       0,90       0,45    -0,22   0,826         0,34        2,38
         3  |       0,60       0,29    -1,06   0,290         0,23        1,56
            |
  difinmes3 |
         0  |       1,00  (base)
         1  |       1,59       0,54     1,37   0,169         0,82        3,10
         2  |       3,33       1,26     3,20   0,001         1,59        6,98
            |
      _cons |       0,00       0,00    -6,04   0,000         0,00        0,00
------------------------------------------------------------------------------




Unfortunately, esta gof does not work with pweights.


Interestingly enough, when I run:

. quietly svy, subpop(if disdesjub==1 & disdestr==1 & trab==1 &
dismy50==1 & proxy==2 & edad_c>=60): logistic discAVD edad_c i
> .sexo i.estud4 i.difinmes3
.
end of do-file

. svylogitgof
  Number of observations =                            1727
F-adjusted test statistic = F(9,185) =                0,800
                Prob > F =                           0,617

-svylogitgof- seems to work with subpopolations. I remember having
asked Kelly Archer about svylogitgof and she told me that it had been
superseded by estat gof. I'll write her again and tell about these
problems with estat gof.

Best regards,

Angel Rodriguez Laso

2013/7/22 Stas Kolenikov <[email protected]>:
> The defaults of -estat gof- seem to be working in principally
> different ways with and without -svy-. Without the -svy- prefix,
> -estat gof- goes on to predict the probabilities in every covariate
> pattern cell. If you have a continuous variable in your regression,
> this may lead to a disaster, and it probably did in your case: I would
> have zero trust in a chi-square with 342 degrees of freedom based on
> 710 observations. Really? Do you know how many zero cells you have
> there, at least???
> 
> After -svy-, Archer-Lemeshow (note that this is a different test now,
> documented in Stata Journal) performs essentially what -estat gof,
> group()- does: it produces quantile groups of predicted probabilities,
> and looks into the Pearson test based on these groups (9 in your case;
> OK for your design with # of PSUs - # of strata = 234 - 41 = 193
> effective degrees of freedom). Basically, Archer-Lemeshow test knows
> that a lot of practical designs may have a low dozen degrees of
> freedom, and it conserves the degrees of freedom and does not try to
> go after all covariate combinations. In your example, doing so leads
> to 342 combinations, which is simply not sustainable with 193
> available degrees of freedom. -estat gof- is supported after -svy- in
> Stata 12.
> 
> I don't know if the reasons -estat gof- refuses to work after -svy,
> subpop()- are conceptual (subsetting the sample may produce low
> weighted count or zero cells; the denominators in the sum of cell
> contributions in the Pearson test have random subpopulation size,
> which needs to be properly accounted for) or programmatic (Stata Corp
> did not expect such use, and made -estat gof- overly protective of
> what it can work with). I can imagine that once e(V) is posted, which
> incorporated uncertainty due to random subpopulation size, everything
> else should follow -- which is what you do with an inappropriate -if-
> statement. (I hope you are aware of the dangers of using -if- with
> survey data, see
> http://www.stata-journal.com/article.html?article=st0153.)
> 
> -- Stas Kolenikov, PhD, PStat (ASA, SSC)
> -- Senior Survey Statistician, Abt SRBI
> -- Opinions stated in this email are mine only, and do not reflect the
> position of my employer
> -- http://stas.kolenikov.name
> 
> 
> 
> On Wed, Jul 17, 2013 at 5:23 AM, Ángel Rodríguez Laso
> <[email protected]> wrote:
>> Dear Statalisters,
>> 
>> Working with Stata 12.1.
>> 
>> 
>> If I carry out the following logistic regression in a survey setting
>> and then type estat gof I get:
>> 
>> 
>> . svy, subpop(if disdesjub==1 & disdestr==1 & trab==1 & dismy50==1 &
>> proxy==2 & edad_c>=60): logistic discAVD edad_c i.sexo i. estud4
>> i.difinmes3
>> (running logistic on estimation sample)
>> 
>> Survey: Logistic regression
>> 
>> Number of strata   =        41                  Number of obs      =      1727
>> Number of PSUs     =       234                  Population size    = 1347,0862
>>                                                Subpop. no. of obs =       710
>>                                                Subpop. size       =    563,75
>>                                                Design df          =       193
>>                                                F(   7,    187)    =      8,32
>>                                                Prob > F           =    0,0000
>> 
>> ------------------------------------------------------------------------------
>>             |             Linearized
>>     discAVD | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
>> -------------+----------------------------------------------------------------
>>      edad_c |       1,10       0,02     4,42   0,000         1,05        1,15
>>             |
>>        sexo |
>>          1  |       1,00  (base)
>>          2  |       2,60       0,82     3,02   0,003         1,39        4,84
>>             |
>>      estud4 |
>>          0  |       1,00  (base)
>>          1  |       0,87       0,32    -0,38   0,704         0,43        1,78
>>          2  |       0,90       0,40    -0,24   0,807         0,37        2,16
>>          3  |       0,60       0,27    -1,14   0,257         0,24        1,47
>>             |
>>   difinmes3 |
>>          0  |       1,00  (base)
>>          1  |       1,59       0,57     1,31   0,190         0,79        3,21
>>          2  |       3,33       1,20     3,35   0,001         1,64        6,77
>>             |
>>       _cons |       0,00       0,00    -5,88   0,000         0,00        0,00
>> ------------------------------------------------------------------------------
>> 
>> .
>> end of do-file
>> 
>> . estat gof
>> estat gof is not allowed after subpopulation estimations
>> r(198);
>> 
>> 
>> 
>> Then I change if statements for my subpopulation especifications:
>> 
>> 
>> . svy: logistic discAVD edad_c i.sexo i.estud4 i.difinmes3 if
>> disdesjub==1 & disdestr==1 & trab==1 & dismy50==1 & proxy==2 &
>> edad_c>=60
>> (running logistic on estimation sample)
>> 
>> Survey: Logistic regression
>> 
>> Number of strata   =        41                  Number of obs      =       710
>> Number of PSUs     =       193                  Population size    =    563,75
>>                                                Design df          =       152
>>                                                F(   7,    146)    =      8,35
>>                                                Prob > F           =    0,0000
>> 
>> ------------------------------------------------------------------------------
>>             |             Linearized
>>     discAVD | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
>> -------------+----------------------------------------------------------------
>>      edad_c |       1,10       0,02     4,41   0,000         1,05        1,15
>>             |
>>        sexo |
>>          1  |       1,00  (base)
>>          2  |       2,60       0,82     3,02   0,003         1,39        4,85
>>             |
>>      estud4 |
>>          0  |       1,00  (base)
>>          1  |       0,87       0,32    -0,38   0,707         0,42        1,79
>>          2  |       0,90       0,40    -0,25   0,807         0,37        2,16
>>          3  |       0,60       0,27    -1,15   0,254         0,24        1,46
>>             |
>>   difinmes3 |
>>          0  |       1,00  (base)
>>          1  |       1,59       0,56     1,32   0,189         0,79        3,21
>>          2  |       3,33       1,18     3,39   0,001         1,65        6,72
>>             |
>>       _cons |       0,00       0,00    -5,88   0,000         0,00        0,00
>> ------------------------------------------------------------------------------
>> 
>> . estat gof
>> 
>> Logistic model for discAVD, goodness-of-fit test
>> 
>>                     F(9,144) =       110,29
>>                     Prob > F =         0,0000
>> 
>> 
>> 
>> But if I get rid of the survey especifications, I get:
>> 
>> . logistic discAVD edad_c i.sexo i.estud4 i.difinmes3 if disdesjub==1
>> & disdestr==1 & trab==1 & dismy50==1 & proxy==2 & edad_c>=60
>> 
>> Logistic regression                               Number of obs   =        710
>>                                                  LR chi2(7)      =      65,87
>>                                                  Prob > chi2     =     0,0000
>> Log likelihood = -210,78135                       Pseudo R2       =     0,1351
>> 
>> ------------------------------------------------------------------------------
>>     discAVD | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
>> -------------+----------------------------------------------------------------
>>      edad_c |       1,10       0,02     5,28   0,000         1,06        1,14
>>             |
>>        sexo |
>>          1  |       1,00  (base)
>>          2  |       1,96       0,56     2,36   0,018         1,12        3,44
>>             |
>>      estud4 |
>>          0  |       1,00  (base)
>>          1  |       0,87       0,29    -0,42   0,676         0,45        1,69
>>          2  |       0,88       0,40    -0,28   0,781         0,36        2,14
>>          3  |       0,52       0,25    -1,37   0,170         0,21        1,32
>>             |
>>   difinmes3 |
>>          0  |       1,00  (base)
>>          1  |       1,89       0,61     1,97   0,049         1,00        3,57
>>          2  |       3,84       1,39     3,70   0,000         1,88        7,83
>>             |
>>       _cons |       0,00       0,00    -7,01   0,000         0,00        0,00
>> ------------------------------------------------------------------------------
>> 
>> . estat gof
>> 
>> Logistic model for discAVD, goodness-of-fit test
>> 
>>       number of observations =       710
>> number of covariate patterns =       350
>>            Pearson chi2(342) =       328,89
>>                  Prob > chi2 =         0,6852
>> 
>> 
>> The last two models don't look terribly different, so what is the
>> reason for a such a large change in the Hosmer&Lemeshow result? Which
>> one should I trust?
>> 
>> Thank you for your time and attention.
>> 
>> Angel Rodriguez-Laso
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index