Re: st: identifying age-matched controls in a cohort study

From	Phil Clayton <[email protected]>
To	[email protected]
Subject	Re: st: identifying age-matched controls in a cohort study
Date	Fri, 23 Aug 2013 18:52:41 +1000
More efficient version:
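
* (added note, not part of the original post) -psmatch2- is a community-contributed
* command distributed via SSC, not part of official Stata; a minimal sketch that
* installs it only if it cannot be found on this machine
capture which psmatch2
if _rc ssc install psmatch2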

* example cohort data
* 1000 exposed and 20000 unexposed people
* study period is 01jan2005 to 31dec2009
clear
set obs 21000
gen id=_n

* random date of exposure during the study period for first 1000 people
set seed 12345
gen expdate=td(01jan2005) + trunc(runiform()*5*365.25) if id<=1000
format %td expdate

* in this simulation we'll assume that the exposed patients are 1.2x as likely
* to die as the unexposed
* we'll make 20% of the unexposed die, so 240 of the exposed patients will get a
* death date - but if it's before their expdate we'll not count it
gen deathdate=td(01jan2005) + trunc(runiform()*5*365.25) if id<=240
replace deathdate=. if deathdate<=expdate

* random death date for 20% of the other 20000 people
* (assume the rest are alive at the end of 2009)
replace deathdate=td(01jan2005) + trunc(runiform()*5*365.25) if inrange(id, 1001, 5000)
format %td deathdate

* age at start of study
gen age=rnormal(50, 5)

**** we have now finished constructing our dummy dataset ****

* now we classify patients as ever-exposed vs never-exposed
gen byte exposed=!missing(expdate)

* we need to know who's available for each match run
* at the start everyone is available
gen byte available=1

* pair is a new variable indicating each matched pair
* initially it's the patient id, but for matched cases it will be replaced with
* the control's id
gen pair=id

* now iteratively match until everyone's been matched to 1 neighbour
local finished=0
while `finished'==0 {
	* randomly sort the data before running -psmatch2- (in case of ties)
	gen double rsort=runiform()
	sort rsort
	drop rsort
	
	psmatch2 exposed if available, pscore(age) n(1) noreplace
	
	* -psmatch2- creates an ID variable for everyone called _id
	* for matched treated patients, the untreated match's ID is stored in
	* the treated patient's _n1 variable
	* anyone matched has a _weight of 1
	sort _id
	
	* the pair becomes the control's ID for exposed cases
	* and then these cases are no longer "available" for future matching
	quietly replace pair=id[_n1] if exposed & _weight==1 & deathdate[_n1]>expdate
	quietly replace available=0 if exposed & _weight==1 & deathdate[_n1]>expdate
	
	* anyone who was a matched control in this run should be made unavailable for further
	* matching
	* (to prevent endlessly matching exposed patients with dead ones)
	quietly replace available=0 if !exposed & _weight==1
	
	* see if we need to do any more match runs
	quietly count if exposed & available
	display "Patients still needing a match: " r(N)
	if r(N)==0 local finished=1
}

* any pair with 2 observations is now a matched pair
* the others are unmatched
bysort pair: gen byte touse=_N==2

* confirm that we now have 1000 pairs
tab exposed if touse
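* (added sketch, not part of the original post) the same check written as
* assertions, so a do-file stops if the counts are off; the figure of 1000 assumes
* the simulated cohort built above
quietly count if exposed & touse
assert r(N)==1000
quietly count if !exposed & touse
assert r(N)==1000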

* confirm that the ages are well matched
tabstat age if touse, by(exposed) s(n mean sd q)
bysort pair (exposed): gen agediff=age[1] - age[2] if touse
sum agediff, d

* confirm that the death dates of the controls are after the exposure dates
bysort pair (exposed): assert deathdate[1]>expdate[2] if touse
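* (added sketch, not part of the original post) the [1] and [2] subscripts above
* rely on each pair holding one unexposed control (sorted first) and one exposed
* patient (sorted second); this makes that assumption explicit
bysort pair (exposed): assert exposed[1]==0 & exposed[2]==1 if touse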

* now we can set up a survival analysis
* start date is the date of exposure and end date is death or 31dec2009
bysort pair (exposed): gen start=expdate[2] if touse
gen end=deathdate
replace end=td(31dec2009) if missing(end)
gen byte died=!missing(deathdate)
stset end, fail(died) origin(time start) scale(365.25) if(touse)
sts graph, by(exposed)
stcox exposed
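
* (added sketch, not part of the original post) to match on a propensity score
* rather than on age alone, -psmatch2- can estimate the score itself from several
* covariates; "female" is a hypothetical covariate simulated here purely for
* illustration, and inside the loop above a call like this (with "if available")
* would take the place of the age-only call
* note that running it here overwrites _id, _n1 and _weight from the last match run
gen byte female=runiform()<0.5
psmatch2 exposed age female, logit n(1) noreplace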


On 23/08/2013, at 5:40 PM, Phil Clayton <[email protected]> wrote:

> There are different approaches. I use -psmatch2- (SSC) because it's quite convenient and fast. It's also trivial to extend your matching to use a propensity score rather than a single variable.
> 
> You don't need to calculate the age at exposure - you can just match on age at the start of the study (or even date of birth).
> 
> If someone was exposed part-way through the study, do you want to allow them to be a non-exposed control for someone who was exposed earlier?
> 
> If not, you could use an iterative loop to match exposed patients until they've all been matched to a living unexposed control. Here is an example. It's probably not the most efficient way of doing it but it still doesn't take too long.
> 
> Phil
> 
> * example cohort data
> * 1000 exposed and 20000 unexposed people
> * study period is 01jan2005 to 31dec2009
> clear
> set obs 21000
> gen id=_n
> 
> * random date of exposure during the study period for first 1000 people
> set seed 12345
> gen expdate=td(01jan2005) + trunc(runiform()*5*365.25) if id<=1000
> format %td expdate
> 
> * in this simulation we'll assume that the exposed patients are 1.2x as likely
> * to die as the unexposed
> * we'll make 20% of the unexposed die, so 240 of the exposed patients will get a
> * death date - but if it's before their expdate we'll not count it
> gen deathdate=td(01jan2005) + trunc(runiform()*5*365.25) if id<=240
> replace deathdate=. if deathdate<=expdate
> 
> * random death date for 20% of the other 20000 people
> * (assume the rest are alive at the end of 2009)
> replace deathdate=td(01jan2005) + trunc(runiform()*5*365.25) if inrange(id, 1001, 5000)
> format %td deathdate
> 
> * age at start of study
> gen age=rnormal(50, 5)
> 
> **** we have now finished constructing our dummy dataset ****
> 
> * now we classify patients as ever-exposed vs never-exposed
> gen byte exposed=!missing(expdate)
> 
> * we need to know who's available for each match run
> * at the start everyone is available
> gen byte available=1
> 
> * pair is a new variable indicating each matched pair
> * initially it's missing for everyone
> gen pair=.
> 
> * now iteratively match until everyone's been matched to 1 neighbour
> local finished=0
> while `finished'==0 {
> 	* randomly sort the data before running -psmatch2- (in case of ties)
> 	gen double rsort=runiform()
> 	sort rsort
> 	drop rsort
> 	
> 	psmatch2 exposed if available, pscore(age) n(1) noreplace
> 	
> 	* -psmatch2- creates an ID variable for everyone called _id
> 	* for matched treated patients, the untreated match's ID is stored in
> 	* the treated patient's _n1 variable
> 	* anyone matched has a _weight of 1
> 	sort _id
> 	
> 	* the pair becomes the control's ID for exposed cases
> 	* and then these cases are no longer "available" for future matching
> 	quietly replace pair=id[_n1] if exposed & _weight==1 & deathdate[_n1]>expdate
> 	quietly replace available=0 if exposed & _weight==1 & deathdate[_n1]>expdate
> 	
> 	* now loop through the matched controls and update their pair variable to
> 	* become their ID (there is probably a more efficient way to do this)
> 	qui levelsof _n1 if exposed & _weight==1 & deathdate[_n1]>expdate, local(matches)
> 	qui foreach id of local matches {
> 		quietly replace pair=id if _id==`id'
> 	}
> 	
> 	* anyone who was a matched control in this run should be made unavailable for further
> 	* matching
> 	* (to prevent endlessly matching exposed patients with dead ones)
> 	quietly replace available=0 if !exposed & _weight==1
> 	
> 	* see if we need to do any more match runs
> 	quietly count if exposed & available
> 	display "Patients still needing a match: " r(N)
> 	if r(N)==0 local finished=1
> }
> 
> * pairs now have a "pair" variable; these are the observations we want to use
> gen byte touse=!missing(pair)
> 
> * confirm that we now have 1000 pairs
> tab exposed if touse
> 
> * confirm that the ages are well matched
> tabstat age if touse, by(exposed) s(n mean sd q)
> bysort pair (exposed): gen agediff=age[1] - age[2] if touse
> sum agediff, d
> 
> * confirm that the death dates of the controls are after the exposure dates
> bysort pair (exposed): assert deathdate[1]>expdate[2] if touse
> 
> * now we can set up a survival analysis
> * start date is the date of exposure and end date is death or 31dec2009
> bysort pair (exposed): gen start=expdate[2] if touse
> gen end=deathdate
> replace end=td(31dec2009) if missing(end)
> gen byte died=!missing(deathdate)
> stset end, fail(died) origin(time start) scale(365.25) if(touse)
> sts graph, by(exposed)
> stcox exposed
> 
> 
> 
> 
> On 21/08/2013, at 6:49 AM, "Smit, Menno" <[email protected]> wrote:
> 
>> Dear all,
>> 
>> I am analysing data from a large cohort study in which some individuals become exposed during the 5 year observation period. For each exposed individual, how can I identify the nearest age-matched, unexposed individual that is alive on the date that the exposed become exposed?
>> 
>> Many thanks,
>> Menno
>> 
>> MD in Tropical Medicine “Mother & Child Health”
>> Research Assistant in Malaria Epidemiology
>> KEMRI/CDC, P.O.Box 1578, Kisumu 40100, Kenya.
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/