Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: estat gof (Hosmer & Lemeshow) after svy:logistic (survey)


From   Ángel Rodríguez Laso <[email protected]>
To   [email protected]
Subject   Re: st: estat gof (Hosmer & Lemeshow) after svy:logistic (survey)
Date   Wed, 24 Jul 2013 15:40:27 +0200

Thank you, Steve, but the different behavior is not due to the use of
-quietly-. When I used quietly I was using -svylogitgof- instead of
-estat gof-.

Best regards,

Angel Rodriguez-Laso

2013/7/23 Steve Samuels <[email protected]>:
> I suggest that you notify Stata support of the different behavior of -estat gof- and subpop() with and without "quietly"
>
> Steve
> On Jul 23, 2013, at 9:32 AM, Ángel Rodríguez Laso wrote:
>
> Dear Stas and Steve,
>
> Yes, I'm aware of the limitations of using if instead of subpop. And I
> think that there shouldn't be big differences in estat gof in my case
> between svy and non svy commands because standard errors are not that
> different in both cases (there was a mistake in my previous email
> becuause I didn't include [pweight=pond_nacional] in the logistic
> version). Now:
>
> . svy, subpop(if disdesjub==1 & disdestr==1 & trab==1 & dismy50==1 &
> proxy==2 & edad_c>=60): logistic discAVD edad_c i.sexo i.
>> estud4 i.difinmes3
> (running logistic on estimation sample)
>
> Survey: Logistic regression
>
> Number of strata   =        41                  Number of obs      =      1727
> Number of PSUs     =       234                  Population size    = 1347,0862
>                                                Subpop. no. of obs =       710
>                                                Subpop. size       =    563,75
>                                                Design df          =       193
>                                                F(   7,    187)    =      8,32
>                                                Prob > F           =    0,0000
>
> ------------------------------------------------------------------------------
>             |             Linearized
>     discAVD | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>      edad_c |       1,10       0,02     4,42   0,000         1,05        1,15
>             |
>        sexo |
>          1  |       1,00  (base)
>          2  |       2,60       0,82     3,02   0,003         1,39        4,84
>             |
>      estud4 |
>          0  |       1,00  (base)
>          1  |       0,87       0,32    -0,38   0,704         0,43        1,78
>          2  |       0,90       0,40    -0,24   0,807         0,37        2,16
>          3  |       0,60       0,27    -1,14   0,257         0,24        1,47
>             |
>   difinmes3 |
>          0  |       1,00  (base)
>          1  |       1,59       0,57     1,31   0,190         0,79        3,21
>          2  |       3,33       1,20     3,35   0,001         1,64        6,77
>             |
>       _cons |       0,00       0,00    -5,88   0,000         0,00        0,00
> ------------------------------------------------------------------------------
>
>
> . logistic discAVD edad_c i.sexo i.estud4 i.difinmes3
> [pweight=pond_nacional] if disdesjub==1 & disdestr==1 & trab==1 &
> dismy5
>> 0==1 & proxy==2 & edad_c>=60
>
> Logistic regression                               Number of obs   =        710
>                                                  Wald chi2(7)    =      56,60
>                                                  Prob > chi2     =     0,0000
> Log pseudolikelihood = -171,06907                 Pseudo R2       =     0,1384
>
> ------------------------------------------------------------------------------
>             |               Robust
>     discAVD | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>      edad_c |       1,10       0,02     4,58   0,000         1,06        1,15
>             |
>        sexo |
>          1  |       1,00  (base)
>          2  |       2,60       0,88     2,82   0,005         1,34        5,04
>             |
>      estud4 |
>          0  |       1,00  (base)
>          1  |       0,87       0,33    -0,37   0,715         0,42        1,82
>          2  |       0,90       0,45    -0,22   0,826         0,34        2,38
>          3  |       0,60       0,29    -1,06   0,290         0,23        1,56
>             |
>   difinmes3 |
>          0  |       1,00  (base)
>          1  |       1,59       0,54     1,37   0,169         0,82        3,10
>          2  |       3,33       1,26     3,20   0,001         1,59        6,98
>             |
>       _cons |       0,00       0,00    -6,04   0,000         0,00        0,00
> ------------------------------------------------------------------------------
>
>
>
>
> Unfortunately, esta gof does not work with pweights.
>
>
> Interestingly enough, when I run:
>
> . quietly svy, subpop(if disdesjub==1 & disdestr==1 & trab==1 &
> dismy50==1 & proxy==2 & edad_c>=60): logistic discAVD edad_c i
>> .sexo i.estud4 i.difinmes3
> .
> end of do-file
>
> . svylogitgof
>   Number of observations =                            1727
> F-adjusted test statistic = F(9,185) =                0,800
>                 Prob > F =                           0,617
>
> -svylogitgof- seems to work with subpopolations. I remember having
> asked Kelly Archer about svylogitgof and she told me that it had been
> superseded by estat gof. I'll write her again and tell about these
> problems with estat gof.
>
> Best regards,
>
> Angel Rodriguez Laso
>
> 2013/7/22 Stas Kolenikov <[email protected]>:
>> The defaults of -estat gof- seem to be working in principally
>> different ways with and without -svy-. Without the -svy- prefix,
>> -estat gof- goes on to predict the probabilities in every covariate
>> pattern cell. If you have a continuous variable in your regression,
>> this may lead to a disaster, and it probably did in your case: I would
>> have zero trust in a chi-square with 342 degrees of freedom based on
>> 710 observations. Really? Do you know how many zero cells you have
>> there, at least???
>>
>> After -svy-, Archer-Lemeshow (note that this is a different test now,
>> documented in Stata Journal) performs essentially what -estat gof,
>> group()- does: it produces quantile groups of predicted probabilities,
>> and looks into the Pearson test based on these groups (9 in your case;
>> OK for your design with # of PSUs - # of strata = 234 - 41 = 193
>> effective degrees of freedom). Basically, Archer-Lemeshow test knows
>> that a lot of practical designs may have a low dozen degrees of
>> freedom, and it conserves the degrees of freedom and does not try to
>> go after all covariate combinations. In your example, doing so leads
>> to 342 combinations, which is simply not sustainable with 193
>> available degrees of freedom. -estat gof- is supported after -svy- in
>> Stata 12.
>>
>> I don't know if the reasons -estat gof- refuses to work after -svy,
>> subpop()- are conceptual (subsetting the sample may produce low
>> weighted count or zero cells; the denominators in the sum of cell
>> contributions in the Pearson test have random subpopulation size,
>> which needs to be properly accounted for) or programmatic (Stata Corp
>> did not expect such use, and made -estat gof- overly protective of
>> what it can work with). I can imagine that once e(V) is posted, which
>> incorporated uncertainty due to random subpopulation size, everything
>> else should follow -- which is what you do with an inappropriate -if-
>> statement. (I hope you are aware of the dangers of using -if- with
>> survey data, see
>> http://www.stata-journal.com/article.html?article=st0153.)
>>
>> -- Stas Kolenikov, PhD, PStat (ASA, SSC)
>> -- Senior Survey Statistician, Abt SRBI
>> -- Opinions stated in this email are mine only, and do not reflect the
>> position of my employer
>> -- http://stas.kolenikov.name
>>
>>
>>
>> On Wed, Jul 17, 2013 at 5:23 AM, Ángel Rodríguez Laso
>> <[email protected]> wrote:
>>> Dear Statalisters,
>>>
>>> Working with Stata 12.1.
>>>
>>>
>>> If I carry out the following logistic regression in a survey setting
>>> and then type estat gof I get:
>>>
>>>
>>> . svy, subpop(if disdesjub==1 & disdestr==1 & trab==1 & dismy50==1 &
>>> proxy==2 & edad_c>=60): logistic discAVD edad_c i.sexo i. estud4
>>> i.difinmes3
>>> (running logistic on estimation sample)
>>>
>>> Survey: Logistic regression
>>>
>>> Number of strata   =        41                  Number of obs      =      1727
>>> Number of PSUs     =       234                  Population size    = 1347,0862
>>>                                                Subpop. no. of obs =       710
>>>                                                Subpop. size       =    563,75
>>>                                                Design df          =       193
>>>                                                F(   7,    187)    =      8,32
>>>                                                Prob > F           =    0,0000
>>>
>>> ------------------------------------------------------------------------------
>>>             |             Linearized
>>>     discAVD | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
>>> -------------+----------------------------------------------------------------
>>>      edad_c |       1,10       0,02     4,42   0,000         1,05        1,15
>>>             |
>>>        sexo |
>>>          1  |       1,00  (base)
>>>          2  |       2,60       0,82     3,02   0,003         1,39        4,84
>>>             |
>>>      estud4 |
>>>          0  |       1,00  (base)
>>>          1  |       0,87       0,32    -0,38   0,704         0,43        1,78
>>>          2  |       0,90       0,40    -0,24   0,807         0,37        2,16
>>>          3  |       0,60       0,27    -1,14   0,257         0,24        1,47
>>>             |
>>>   difinmes3 |
>>>          0  |       1,00  (base)
>>>          1  |       1,59       0,57     1,31   0,190         0,79        3,21
>>>          2  |       3,33       1,20     3,35   0,001         1,64        6,77
>>>             |
>>>       _cons |       0,00       0,00    -5,88   0,000         0,00        0,00
>>> ------------------------------------------------------------------------------
>>>
>>> .
>>> end of do-file
>>>
>>> . estat gof
>>> estat gof is not allowed after subpopulation estimations
>>> r(198);
>>>
>>>
>>>
>>> Then I change if statements for my subpopulation especifications:
>>>
>>>
>>> . svy: logistic discAVD edad_c i.sexo i.estud4 i.difinmes3 if
>>> disdesjub==1 & disdestr==1 & trab==1 & dismy50==1 & proxy==2 &
>>> edad_c>=60
>>> (running logistic on estimation sample)
>>>
>>> Survey: Logistic regression
>>>
>>> Number of strata   =        41                  Number of obs      =       710
>>> Number of PSUs     =       193                  Population size    =    563,75
>>>                                                Design df          =       152
>>>                                                F(   7,    146)    =      8,35
>>>                                                Prob > F           =    0,0000
>>>
>>> ------------------------------------------------------------------------------
>>>             |             Linearized
>>>     discAVD | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
>>> -------------+----------------------------------------------------------------
>>>      edad_c |       1,10       0,02     4,41   0,000         1,05        1,15
>>>             |
>>>        sexo |
>>>          1  |       1,00  (base)
>>>          2  |       2,60       0,82     3,02   0,003         1,39        4,85
>>>             |
>>>      estud4 |
>>>          0  |       1,00  (base)
>>>          1  |       0,87       0,32    -0,38   0,707         0,42        1,79
>>>          2  |       0,90       0,40    -0,25   0,807         0,37        2,16
>>>          3  |       0,60       0,27    -1,15   0,254         0,24        1,46
>>>             |
>>>   difinmes3 |
>>>          0  |       1,00  (base)
>>>          1  |       1,59       0,56     1,32   0,189         0,79        3,21
>>>          2  |       3,33       1,18     3,39   0,001         1,65        6,72
>>>             |
>>>       _cons |       0,00       0,00    -5,88   0,000         0,00        0,00
>>> ------------------------------------------------------------------------------
>>>
>>> . estat gof
>>>
>>> Logistic model for discAVD, goodness-of-fit test
>>>
>>>                     F(9,144) =       110,29
>>>                     Prob > F =         0,0000
>>>
>>>
>>>
>>> But if I get rid of the survey especifications, I get:
>>>
>>> . logistic discAVD edad_c i.sexo i.estud4 i.difinmes3 if disdesjub==1
>>> & disdestr==1 & trab==1 & dismy50==1 & proxy==2 & edad_c>=60
>>>
>>> Logistic regression                               Number of obs   =        710
>>>                                                  LR chi2(7)      =      65,87
>>>                                                  Prob > chi2     =     0,0000
>>> Log likelihood = -210,78135                       Pseudo R2       =     0,1351
>>>
>>> ------------------------------------------------------------------------------
>>>     discAVD | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
>>> -------------+----------------------------------------------------------------
>>>      edad_c |       1,10       0,02     5,28   0,000         1,06        1,14
>>>             |
>>>        sexo |
>>>          1  |       1,00  (base)
>>>          2  |       1,96       0,56     2,36   0,018         1,12        3,44
>>>             |
>>>      estud4 |
>>>          0  |       1,00  (base)
>>>          1  |       0,87       0,29    -0,42   0,676         0,45        1,69
>>>          2  |       0,88       0,40    -0,28   0,781         0,36        2,14
>>>          3  |       0,52       0,25    -1,37   0,170         0,21        1,32
>>>             |
>>>   difinmes3 |
>>>          0  |       1,00  (base)
>>>          1  |       1,89       0,61     1,97   0,049         1,00        3,57
>>>          2  |       3,84       1,39     3,70   0,000         1,88        7,83
>>>             |
>>>       _cons |       0,00       0,00    -7,01   0,000         0,00        0,00
>>> ------------------------------------------------------------------------------
>>>
>>> . estat gof
>>>
>>> Logistic model for discAVD, goodness-of-fit test
>>>
>>>       number of observations =       710
>>> number of covariate patterns =       350
>>>            Pearson chi2(342) =       328,89
>>>                  Prob > chi2 =         0,6852
>>>
>>>
>>> The last two models don't look terribly different, so what is the
>>> reason for a such a large change in the Hosmer&Lemeshow result? Which
>>> one should I trust?
>>>
>>> Thank you for your time and attention.
>>>
>>> Angel Rodriguez-Laso
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index