RE: st: help needed on discrete-time hazard model

From   "Steichen, Thomas J." <>
To   <>
Subject   RE: st: help needed on discrete-time hazard model
Date   Thu, 18 Oct 2007 16:19:59 -0400

I see nothing wrong with the data generation steps you performed,
so the question is whether this model makes sense.

First, I will speculate that you have brand-specific prices at
the time of each wave. Since cigarette prices tend to rise
fairly uniformly across brands over time, whether from 
manufacturer price increases driven by inflation or from government 
tax increases, there is almost certainly a meaningful correlation 
between wave and price. Thus, having both a "price" variable and 
one or more "wave" variables will lead to confusion in the 

In this model, the "wave2" variable can be thought of as estimating 
the average quit rate differential from the missing wave (wave 1)... 
and this includes an average price differential effect. Likewise,
"wave3" estimates the average quit rate differential of wave 3 from 
wave 1. 
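
To make the wave/price overlap concrete, here is a minimal sketch in plain Python (not Stata, so it runs without the survey data). The three brands and their prices are invented for illustration and are not taken from the study. It assumes every brand rises by the same amount each wave, then shows that price is correlated with wave and that, once the wave average is removed, only a fixed brand-level price difference is left.

```python
# Hypothetical prices for three brands over three waves (illustrative numbers only).
prices = {
    "brand_a": [4.0, 4.5, 5.0],
    "brand_b": [5.0, 5.5, 6.0],
    "brand_c": [6.0, 6.5, 7.0],
}
waves = [1, 2, 3]


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# One (wave, price) record per brand per wave.
records = [(w, p) for series in prices.values() for w, p in zip(waves, series)]
wave_price_corr = pearson([w for w, _ in records], [p for _, p in records])

# Average price in each wave: the part a wave dummy can absorb.
wave_mean = {
    w: sum(series[i] for series in prices.values()) / len(prices)
    for i, w in enumerate(waves)
}

# Price relative to the wave average: what is left for "price" to explain.
within_wave = {
    brand: [p - wave_mean[w] for w, p in zip(waves, series)]
    for brand, series in prices.items()
}

print("correlation(wave, price):", round(wave_price_corr, 3))
print("price minus wave average:", within_wave)
```

Under this assumption the leftover price variation is identical in every wave for a given brand, which is why a price coefficient estimated alongside wave dummies would mostly reflect differences between brands.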

So what does "price" itself estimate in this model? I'd speculate 
it really only estimates how specific brands affect quitting. 
In your logit model, I'd guess that it indicates that subjects 
who smoke higher-than-average-priced brands quit at a lower rate. 
Said differently, those who smoke low-priced brands are more likely 
to quit due to a price increase. However, without knowing exactly
what your variables represent, I can't go beyond speculation.
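
For a sense of scale, the logit coefficient on rPSPPPi reported in the output quoted further down this thread (-.0083975, 95% interval -.0100481 to -.0067468) can be turned into an odds ratio by exponentiating it. The sketch below is plain Python and only does that arithmetic; the units of rPSPPPi are not stated in the thread, so "one unit" here is whatever one unit of that variable happens to be.

```python
import math

# Coefficient and 95% interval for rPSPPPi, copied from the svy: logit output in this thread.
coef = -0.0083975
ci_low, ci_high = -0.0100481, -0.0067468

# In a logit model, exp(coefficient) is the multiplicative change in the odds
# of the outcome (here: quitting) for a one-unit increase in the covariate.
odds_ratio = math.exp(coef)
odds_ratio_ci = (math.exp(ci_low), math.exp(ci_high))

print("odds ratio per unit of rPSPPPi:", round(odds_ratio, 4))
print("95% interval:", tuple(round(v, 4) for v in odds_ratio_ci))
```

An odds ratio slightly below 1 per unit is consistent with the negative sign discussed above; how large that is in practice depends on how many units of rPSPPPi separate the brands or waves being compared.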

I'm less clear why it remains negative when you take the wave 
variables out. If real, it implies that price differential (if
it truly has a positive effect on quitting) wasn't great enough to 
overcome other, competing but correlated issues (not explained by 
any other variable in the model) that caused smokers to continue 
smoking during this time period. If so, price represents the 
increase in ALL of these issues and the ones for continued smoking 
dominated the result.

On a different issue, using or not using the svy: prefix can be
expected to change the estimated coefficients, so no particular importance
should be placed on the fact that a coefficient changed signs
between these two. Without the prefix, you are estimating what 
happened for the specific group of subjects surveyed in this study. 
When you add the weighting via the svy: prefix, you change the 
importance of those individual subjects based on their sampling weights.

For example, you may have surveyed specific subjects who quit
but represent only a very, very small part of the overall population.
If you don't use the survey weights, their behavior may have
a large effect on the sample results but little effect on the 
population results, even to the point of sign reversal.
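
A small numeric sketch of that sign reversal, in plain Python with invented data (none of these records or weights come from the study). It compares quit rates rather than fitting a logit, but the mechanism is the same: a handful of oversampled quitters with tiny weights dominate the unweighted comparison and nearly vanish from the weighted one.

```python
# Each record: (smokes_high_priced_brand, quit, sampling_weight). All values hypothetical.
sample = (
    [(True, 1, 0.1)] * 10      # oversampled rare group: all quit, tiny weight
    + [(True, 1, 10.0)] * 1    # typical high-price smokers: 1 of 10 quit
    + [(True, 0, 10.0)] * 9
    + [(False, 1, 10.0)] * 6   # typical low-price smokers: 6 of 20 quit
    + [(False, 0, 10.0)] * 14
)


def quit_rate(records, high_priced, use_weights):
    """Share of quitters in one price group, optionally weighted."""
    numerator = denominator = 0.0
    for is_high, quit, weight in records:
        if is_high != high_priced:
            continue
        w = weight if use_weights else 1.0
        numerator += w * quit
        denominator += w
    return numerator / denominator


# Difference in quit rate: high-priced brands minus low-priced brands.
unweighted_diff = quit_rate(sample, True, False) - quit_rate(sample, False, False)
weighted_diff = quit_rate(sample, True, True) - quit_rate(sample, False, True)

print("unweighted difference:", round(unweighted_diff, 3))
print("weighted difference:  ", round(weighted_diff, 3))
```

With these made-up numbers the unweighted difference is positive and the weighted difference is negative, so the direction of the apparent price effect flips purely because of the weights.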

On yet another issue, marking pattern SQS as a successful "quit"
seems possibly misleading. Clearly, if price continued to rise
over the time period between waves (which seems likely to me), 
prices were higher in wave 3 than wave 2, yet these individuals 
started smoking again. This seems to suggest that price was not
the most important motivating factor for quitting in wave 2 (or
restarting in wave 3). One can argue that you should code these
subjects as at "risk" for all three waves and as failing to quit.
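
The two codings can be compared side by side. The sketch below is plain Python, not the poster's Stata code; the function name person_periods and its relapse_is_failure switch are hypothetical names introduced here. With the switch off it mirrors the expansion in the quoted setup code (SQS contributes two records, the last one marked as a quit); with it on, SQS contributes three at-risk records and no quit.

```python
VALID_PATTERNS = {"SSS", "SSQ", "SQS", "SQQ"}


def person_periods(pattern, relapse_is_failure=False):
    """Expand a three-wave smoking pattern into (wave, qtsmok) records.

    pattern: one of 'SSS', 'SSQ', 'SQS', 'SQQ' (S = smoking, Q = quit).
    relapse_is_failure: if True, SQS stays at risk for all three waves
    and never counts as a quit (the alternative coding suggested above).
    """
    if pattern not in VALID_PATTERNS:
        raise ValueError(f"unknown pattern: {pattern!r}")

    if pattern == "SQS" and relapse_is_failure:
        return [(1, 0), (2, 0), (3, 0)]

    first_quit = pattern.find("Q")  # 0-based position of the first Q, -1 if none
    if first_quit == -1:
        # Never quit: at risk in every wave, no event.
        return [(wave, 0) for wave in (1, 2, 3)]

    last_wave = first_quit + 1  # wave in which the quit is first observed
    return [(wave, int(wave == last_wave)) for wave in range(1, last_wave + 1)]


for p in sorted(VALID_PATTERNS):
    print(p, person_periods(p), person_periods(p, relapse_is_failure=True))
```

Only the SQS rows differ between the two codings, so rerunning the model under both is a cheap way to see how much the result depends on treating relapsers as quitters.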

-----Original Message-----
From: [] On Behalf Of Lili Yan
Sent: Thursday, October 18, 2007 2:25 PM
Subject: Re: st: help needed on discrete-time hazard model

Hi Thomas,

Thank you very much for helping out!
I know little about this model, so I thought the two zeros indicate
something wrong in the data. The e(N) is correct, which I am sure.

Here is some of the code for setting up the data. I need to explain first that
smok_stat = 1 for SSS, 2 for SSQ, 3 for SQS and 4 for SQQ.

..............start here................

gen smk_time=3 if smok_stat==1 | smok_stat==2;
replace smk_time=2 if smok_stat==3 | smok_stat==4;

gen cessyear=2004 if smok_stat==1;
replace cessyear=2004 if smok_stat==2;
replace cessyear=2003 if (smok_stat==3 | smok_stat==4);

expand smk_time;
bysort uniqid: gen seqvar=_n;
bysort uniqid: gen qtsmok=smok_stat>1 & _n==_N;

bysort uniqid: gen evntyear=cessyear;
replace evntyear=2002 if seqvar==1;
replace evntyear=2003 if seqvar==2;
drop cessyear;
rename evntyear cessyear;

gen wave=1 if cessyear==2002;
replace wave=2 if cessyear==2003;
replace wave=3 if cessyear==2004;

gen wave1=wave==1;
gen wave2=wave==2;
gen wave3=wave==3;

svy: logit qtsmok male age married white mdrt_educ high_educ incm_mdrt
incm_high canada rPSPPPi wave2 wave3, noconstant;
..............end here..........

Here is the output:

..............output starts here................
Survey: Logistic regression

Number of strata   =        26                  Number of obs      =      5642
Number of PSUs     =      5642                  Population size    = 5773.9291
                                                Design df          =      5616
                                                F(  12,   5605)    =    166.35
                                                Prob > F           =    0.0000

             |             Linearized
      qtsmok |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        male |  -.1715913   .1273081    -1.35   0.178    -.4211643    .0779817
         age |  -.0326805   .0053098    -6.15   0.000    -.0430898   -.0222713
     married |   .0156776   .1427494     0.11   0.913    -.2641663    .2955215
       white |  -.5607068   .1443603    -3.88   0.000    -.8437088   -.2777048
   mdrt_educ |  -.0291425   .1441877    -0.20   0.840    -.3118061    .2535212
   high_educ |   .5113156   .1800797     2.84   0.005     .1582899    .8643414
   incm_mdrt |  -.0339146   .1557743    -0.22   0.828    -.3392925    .2714632
   incm_high |   .1405313   .1766122     0.80   0.426    -.2056968    .4867595
      canada |   1.802811   .2552666     7.06   0.000      1.30239    2.303233
     rPSPPPi |  -.0083975    .000842    -9.97   0.000    -.0100481   -.0067468
       wave2 |   2.111112   .1326945    15.91   0.000     1.850979    2.371244
       wave3 |   2.411039   .1389374    17.35   0.000     2.138668     2.68341
....................output ends here..............

The rPSPPPi is our price variable. We have more price variables but
logit results with them are similar to what reported here.

Thank you very much!


On 10/18/07, Steichen, Thomas J. <> wrote:
> Why do you consider this an indication of something wrong?
> Having zero completely determined successes e(N_cds) and failures
> e(N_cdf) is what you prefer.
> Is your overall # of  records e(N) wrong?
> Show us some sample commands and output so we can see what you are doing.
> -----Original Message-----
> I checked the data just now. After running logit model with our
> dependent variable, the stored results show:
> e(N) = 5463
> e(N_cds) = 0
> e(N_cdf) = 0
> So seems there is something wrong in the data setup. Could anyone
> please give me some help?