
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



st: Interval-censored analysis vs grouped analysis was: -cloglog


From   Steve Samuels <[email protected]>
To   [email protected]
Subject   st: Interval-censored analysis vs grouped analysis was: -cloglog
Date   Sun, 18 Aug 2013 14:34:10 -0400

Just to give some background for those new to the conversation:

Pradip is analyzing public use files from the linked mortality file for
the US National Health Interview Survey (NHIS). Adults in the 1997-2004
(or 2006 - his description has varied) surveys were matched to deaths
recorded in the US National Death Index through December 31, 2006. To
ensure privacy, only year and quarter of death are given in the Public
Use File. See:
http://www.cdc.gov/nchs/data_access/data_linkage/mortality/nhis_linkage_public_use.htm


He is using svy: -cloglog- to do a proportional hazards analysis of
grouped data. There are some problems if one uses the first year and
quarter as the "zero" interval, because deaths that take place in that
interval will be dropped by -stset-. (This is a well-known problem of
grouping; see Bergström and Edin, 1992.) It's possible to retain these
deaths by assuming that the interview took place at the start of the
quarter: just add a "zero" period for all people.

Ref: Bergström, Reinhold, and P-A Edin. 1992. Time aggregation and the
distributional shape of unemployment duration. Journal of Applied
Econometrics 7, no. 1: 5-30.

Interval-censored regression: It turns out that the data set also
contains the "week" of the quarter in which interviewing was to start,
with completion required within ~15 days (in fact, most interviews take
place within 7 days). Thus one can bound the time to death by the times
from day 1 of the interview week to the start and to the end of the
quarter of death. The length of the censoring interval will be 1
quarter, or 12 weeks. In contrast, in an analysis based on quarters of
interview and death, the time to death has a possible range of 24
weeks, with an average of 12. Thus interval-censored analysis should be
considered.

Attained age: Age at interview is known. Pradip wants to use attained
age as a time scale for the survival models, as this is a natural
classifier for mortality rates. However, because birthday is unknown,
there is no way to assign age to any year and quarter with certainty.
Assigned ages will be off by 1 year in about half the people.

Now, read on:

Steve

> On Aug 18, 2013, at 7:44 AM, Muhuri, Pradip (SAMHSA/CBHSQ) wrote:
> 
> Dear Statalist,
> 
> This is in response to Steve's recommendation to use -stpm2- (SSC) by Patrick Royston for my analysis of the complex survey design-based National Health Interview Survey Linked Mortality file (NHIS-LMF) data 
> (http://www.cdc.gov/nchs/data_access/data_linkage/mortality/nhis_linkage.htm), which include the "quarter and year" for the time of interview and the time of death (e.g. quarter 1, 2001) and the "exact date" for censored cases. 
> 
> I appreciated receiving Steve's comments and wanted to thank him for his time.
> 
> Below are my specific responses and observations.
> 
> 1) I am not sure whether the -svy- prefix accepts the -stpm2- command. If it does not, and given that the event date is expressed in terms of quarter/year in the NHIS-LMF data, the complementary log-log (CLL) regression analysis (-cloglog-) on "spell" data using Professor Jenkins's resources (https://www.iser.essex.ac.uk/files/teaching/stephenj/ec968/pdfs/ec968st6.pdf), which Steve initially recommended (http://www.stata.com/statalist/archive/2013-07/msg01158.html), could still be a better alternative to the proportional hazards model approach.
> 

I agree that -cloglog- is a good way to go, though you still have to add
the zero point.

I recommended -stpm-, not -stpm2-. Both accept probability weights
specified in -stset-, but only -stpm- has a cluster() option and will do
interval censoring. -stpm- is not survey-aware, so it has two
disadvantages: a) It can't take a sampling strata() option. In many
surveys, this would make little difference to the resulting standard
errors. b) It won't properly estimate standard errors for
subpopulations.

> 
> 2) From the NHIS-LMF citation list
> (http://www.cdc.gov/nchs/data/datalinkage/nhis_mort_cit_list.pdf), I
> find several published articles that employed Cox proportional hazards
> models although the time scale included in the data files is not
> continuous (the event date is the quarter/year scale).
> 
Just because the articles were published doesn't mean that bias wasn't
present.

> 3) I agree with Steve: measurement errors associated with "attained age"
> in the NHIS-LMF data are real. However, whether I use "attained age" or
> "age at interview", the death rates by 10-year age group in the -svy
> mean- analysis or the hazard rates from the -margins- command after
> -cloglog- for the same 10-category age variable in the CLL model differ
> little by the age construct used. Results have not been shown here.
> 


It's not surprising that 10-year age groupings would be little affected
by misclassification error. However, your goal is a prediction model.
For prediction purposes, such age groups are too wide to be informative.

You tried to assign "fractional ages" of 51.0, 51.25, 51.50 to each
quarter. These assume that the subject's birthday fell on the date of
interview, which is clearly wrong.

A more serious issue: Some predictors measured at interview
could change subsequently.  Their use as predictors of mortality
at attained age, i.e. current age, is of questionable value.

For example, it's well-known that unmarried or widowed men have higher
risks of death. If a man who was married at interview later lost his
spouse, his marital status would be misclassified for all ages
subsequent to the loss. For this predictor, I would expect that
misclassification will be differential.


> 4) Some comparison: Model-free death rates (-svy means-) and CLL
> model-based death rates (-margins-)
> 
> 	a) The death rates by 10-year attained-age group, calculated from
> the 1997-2006 NHIS-LMF person-year "spells" data (shown below), seem
> to be close to the death rates published for 2011 (Table 1, Page 8,
> http://www.cdc.gov/nchs/data/nvsr/nvsr61/nvsr61_06.pdf , also shown
> below).
> 
> 	b) From the 1997-2006 NHIS-LMF data, the CLL model-based death rates
> are about the same as the rates from -svy- means presented below.
> 
See my remark above about grouping.  Means for younger age groups were not that
close.

> 5) Steve's idea of using  the "split validation sample approach" is a good one.
> 
> 6) Constructing the date of NHIS interview from the assignment week could be very different from the date constructed from quarter and year of interview.  Some interviews assigned in the last week of a given quarter might have actually taken place in the following quarter.

Two misunderstandings here:

1. I'm not sure what constructed dates you refer to. The interval-censoring analysis
gets a good lower bound on the date of interview.

2. To get a valid censoring interval, all that is necessary is to identify the earliest date on which the interview could have occurred. The interview
week is identified by its first day. The fact that an interview continued into the next quarter is irrelevant.

> 
> Any further comments and thoughts are welcome.
> 
> Thanks,
> 
> Pradip K. Muhuri
> 
> *************************  LOG BEGINS - Details for 4a *******************************
> 
> . svyset psu [pweight=wt8], strata (stratum) vce(linearized) singleunit(missing)
> 
>      pweight: wt8
>          VCE: linearized
>  Single unit: missing
>     Strata 1: stratum
>         SU 1: psu
>        FPC 1: <zero>
> 
> .      svy: mean dead100k, over (xa_age_grp)
> (running mean on estimation sample)
> 
> Survey: Mean estimation
> 
> Number of strata =     339      Number of obs    =     1337154
> Number of PSUs   =     678      Population size  =  1088280599
>                                Design df        =         339
> 
>    _subpop_1: xa_age_grp = 25-34 Yrs
>    _subpop_2: xa_age_grp = 35-44 Yrs
>    _subpop_3: xa_age_grp = 45-54 Yrs
>    _subpop_4: xa_age_grp = 55-64 Yrs
>    _subpop_5: xa_age_grp = 65-74 Yrs
>    _subpop_6: xa_age_grp = 75-84 Yrs
> //This line is my addition to the output: 1997-2006 (NHIS-LMF Data)
> --------------------------------------------------------------
>             |             Linearized
>        Over |       Mean   Std. Err.     [95% Conf. Interval]
> -------------+------------------------------------------------
> dead100k     |
>   _subpop_1 |   78.66841   6.446016      65.98919    91.34764
>   _subpop_2 |    163.239   7.328281      148.8244    177.6536
>   _subpop_3 |   383.6766   12.99239      358.1208    409.2325
>   _subpop_4 |     844.95   24.33145      797.0903    892.8096
>   _subpop_5 |   2056.149   41.04622      1975.411    2136.886
>   _subpop_6 |   4644.522   71.03923      4504.788    4784.255
> 
> NCHS, Death Rates per 100,000 for 2011 (url cited above)
> 
> 25-34 years    	104.4 
> 35-44 years    	171.7
> 45-54 years   	409.2
> 55-64 years   	848.7
> 65-74 years     1,845.0
> 75-84 years     4,750.3
> 
> ********************** LOG CONTINUES - Details for 4b *******************************
> 
> . capture noisily svy, subpop(if a_age>=25 & a_age<=84): cloglog dead i.xa_age_grp , eform nolog
> 
> margins, at(xa_age_grp= (1 2 3 4 5 6))
> 
> Adjusted predictions                              Number of obs   =    1337154
> Model VCE    : Linearized
> 
> Expression   : Pr(dead), predict()
> 
> 1._at        : xa_age_grp      =           1
> 2._at        : xa_age_grp      =           2
> 3._at        : xa_age_grp      =           3
> 4._at        : xa_age_grp      =           4
> 5._at        : xa_age_grp      =           5
> 6._at        : xa_age_grp      =           6
> 
> ------------------------------------------------------------------------------
>             |            Delta-method
>             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>         _at |
>          1  |   .0007867   .0000645    12.20   0.000     .0006603     .000913
>          2  |   .0016324   .0000733    22.28   0.000     .0014888     .001776
>          3  |   .0038368   .0001299    29.53   0.000     .0035821    .0040914
>          4  |   .0084495   .0002433    34.73   0.000     .0079726    .0089264
>          5  |   .0205615   .0004105    50.09   0.000      .019757     .021366
>          6  |   .0464452   .0007104    65.38   0.000     .0450529    .0478376
> 
> ************************** LOG ENDS HERE ******************************************
> 
> 


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Steve Samuels
Sent: Saturday, August 17, 2013 4:50 PM
To: [email protected]
Subject: Re: st: -svy:cloglog-

Pradip:

I've now concluded that the grouped data approach has serious flaws for your data and that interval-censored estimation is better. 
Here are my best recommendations:

* Use a computer with more memory. 

* Abandon "attained age" as the time variable. Use time from interview instead. Because you don't know the birth date, you introduce serious measurement error: you really do not know how to assign age to each interval after followup. Someone who is 54 at interview could turn 55 the next day, or 55 in 364 days. The absolute average error in assigning attained age is > 6 months, which is about 4 times the average error for the proposal below. Use age at interview as a baseline covariate.

* Use a split validation sample approach: develop a model on a test sample of 50%, and evaluate it in the "validation" sample. 

* Use -stpm- (SSC) by Patrick Royston. It can accommodate interval-censored, weighted, clustered data. It is documented in Royston, Patrick. 2001. Flexible parametric alternatives to the Cox model, and more. Stata Journal 1, no. 1: 1-28, which is available for free download at: http://www.stata-journal.com/article.html?article=st0001

* Before you post, reread the Statalist FAQ Section 3. If you have a question, try to find the answer yourself (-help-, manual, -search-,
-google-) before posting.

Now for the details:

1. Define the earliest possible date of interview from the assignment week variable in your data, as I suggested in http://www.stata.com/statalist/archive/2013-08/msg00253.html.
Call it idate.

2. For deaths, define the earliest possible date of death as the first day of the quarter of death. Define latest possible date of death as the last day of the quarter. Call these ddate1 and ddate2 respectively.

You'll have to create Stata "quarterly dates" first, then get the first and last day of each. If you can't figure out how to do this from the Manual, ask the list.
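A minimal sketch of this step using Stata's quarterly-date functions. The variable names deathyr and deathqtr (year and quarter of death) are placeholders; the actual NHIS-LMF names will differ. idate is the earliest possible interview date from step 1.

```stata
* Sketch only: deathyr/deathqtr stand in for the public use file's
* year and quarter of death variables.
gen int  dq     = yq(deathyr, deathqtr)   // Stata quarterly date of death
gen long ddate1 = dofq(dq)                // first day of that quarter
gen long ddate2 = dofq(dq + 1) - 1        // last day of that quarter
format ddate1 ddate2 %td

* times to death in days (step 3), given idate from step 1:
gen t1 = ddate1 - idate
gen t2 = ddate2 - idate
```

dofq() returns the daily date of the first day of a quarterly date, so the last day of a quarter is one day before the first day of the next quarter.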

3. Create the earliest and latest possible times to death, in days, as:
t1 = ddate1 - idate, t2 = ddate2 - idate

4. For uncensored observations, define t2 = last day of the quarter.

This puts the units for your outcome in days.

Here is an example of -stpm- in action. Note that every failure is interval-censored. In your own use, you might have to try different knots, degrees of freedom, or technique() options.


Steve

/*********CODE BEGINS**********************/
set more off
sysuse auto, clear
gen t1 = price                          // lower end point of censoring interval
set seed 0314
gen t2 = t1 + 1 + ceil(20*runiform())   // upper end point
stset t2 [pw = rep78], failure(foreign)

gen str2 mkr = substr(make,1,2)
egen psu = group(mkr)
stpm  price ,scale(hazard) cluster(psu) ///
 left(t1) df(3)
/***********CODE ENDS*********************/ 


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
