Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Handling pharmacy data with multiple entries per subject

From   Phil Schumm <>
Subject   Re: st: Handling pharmacy data with multiple entries per subject
Date   Mon, 13 Jun 2011 11:15:25 -0500

On Jun 10, 2011, at 5:05 PM, Doernberg, Sarah wrote:
Thank you for your responses. I'm attempting to be more specific about my question below. My data is currently formatted as follows (with an example entry):

ID     drug_name     start_date      stop_date        outcome
1       ceftriaxone      5/15/2001     5/17/2001       5/31/2001
1       ceftriaxone      5/19/2001     5/20/2001       5/31/2001
1       ceftriaxone      5/20/2001    5/24/2001        5/31/2001
1       ceftriaxone      7/24/2001     7/27/2001       .
This one person had 3 prescriptions for ceftriaxone during a hospitalization in May, 2001, including one day where the person was not given this drug (5/15-5/17 and 5/19-5/24). There was another hospitalization in July, 2001, where another prescription was given. The patient did not experience the outcome during the second hospitalization. The dataset only contains information about two different drugs.

My ultimate goal is to figure out the number of days each person received each drug during a 30-day period from the first day of receipt or before the date of the outcome (if <30 days from the start of antibiotic) to allow for a logistic regression with exposure = antibiotic days. In addition, I may also do a survival analysis using start of antibiotic as the start date, development of the outcome as failure, and censoring at 30 days in those without the outcome.

The first thing to do with a dataset like this is to run some checks to get a sense of what you're dealing with. For example (assuming that start_date, stop_date and outcome are all Stata-format dates):

    ass start_date <= stop_date
    isid id drug_name start_date
bys id drug_name (start_date): ass stop_date[_n-1] <= start_date if 1<_n

Taken together, these will verify that your start/stop dates are properly ordered, and that the periods of administration for a given drug do not overlap. There may be other things that you'll want to check too. Whether you "fix" any problems that show up (or ask for an updated data file) first or attempt to code around them will depend on the situation.

Often with problems like this, generating summaries is easier (or at least more readable) if you transform the data so that you have one record for each day. You may not be able to do this if you have a really big dataset, but don't be afraid to use your RAM if you've got it. For example, the following code will expand out your periods so that you have one record for each day of drug administration (separately by drug):

    expand stop_date - start_date + 1
    bys id drug_name start_date: gen date = start_date if _n==1
bys id drug_name start_date (date): replace date = date[_n-1] + 1 if 1<_n

Now, you probably want a record corresponding to the date of the outcome (if you don't already have one), which you can generate (for a single drug, using the data in your example) with:

    gen fail = (outcome == date)
bys id outcome (fail): gen byte n = (_n==1 & !fail[_N] & ! mi(outcome)) + 1
    expand n, gen(added)
    replace date = outcome if added
    replace drug_name = "" if added
    replace fail = 1 if added

Finally, you may need to clean up any duplicates that were generated above:

    keep id drug_name date fail
    duplicates drop
    isid id date

From the resulting dataset, you can easily get what you need for a survival analysis (using cumulative drug exposure as a time-varying covariate), or generate the 30-day periods you describe. You'll probably need information on the dates of hospital admission and discharge to do this.

Note that for simplicity, I have essentially ignored the fact that you have multiple drugs here (except for the obvious stratification by drug_name in the first two code blocks). Ultimately, you'll probably want to have one record per date on which *any* drug was administered (plus additional records, if necessary, for dates of hospital admission and discharge, and date of outcome), with a separate 0/1 column for each drug indicating whether that drug was administered on that day. One way to achieve this would be to do what I have shown above separately for each of your two drugs, and then merge the results together (by id and date). The outcome data (as you have shown them) are in an odd format, and how you handle them ultimately will depend on what they look like for the other drug (i.e., are the same outcome dates repeated)? It may well be easiest just to peel those off and deal with them separately (like the two drugs), and then merge the outcome dates back on at the end.

Hopefully this'll give you some ideas/strategies that you can use.

-- Phil

*   For searches and help try:

© Copyright 1996–2017 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index