Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Counting Number of Program Days [was: RE: RE: statalist-digest V4 #4323]


From   Nick Cox <[email protected]>
To   "'[email protected]'" <[email protected]>
Subject   Re: st: Counting Number of Program Days [was: RE: RE: statalist-digest V4 #4323]
Date   Tue, 1 Nov 2011 23:21:22 +0000

Phil has cracked the problem nicely. 

But there's a small refinement that will appeal to some tastes. Consider the lines 

egen startdate=min(date) if attended, by(First Last)
egen enddate=max(date) if attended, by(First Last)
bysort First Last (startdate): replace startdate=startdate[1]
bysort First Last (enddate): replace enddate=enddate[1]

Here we want to calculate minimum and maximum over the attended dates and then "spread" them to the other observations for each case. (Sometime ago on the list someone gave database jargon for this operation, but I've forgotten what it was. It may have been "spreading"; if not that is just the way I think of it.)

Note that -attended- is an indicator, 1 or 0. 

A shorter way of doing it is then 

egen startdate=min(date/attended), by(First Last)
egen enddate=max(date/attended), by(First Last)

What's going on? 

1. The -egen- functions -min()- and -max()- will take expressions, which can be more complicated than variable names. 

2. If you divide by 1 or 0 the result is the numerator on division by 1 and missing on division by 0. But -- in this context and given how Stata behaves -- that is fine because that result is exactly what you want. That is because Stata ends up ignoring the missings in calculating the min and max (unless all values are missing, in which case the min and max can also only be returned as missing, which is fine too) but the min and max themselves are assigned to all pertinent observations. 

My guess at why this is not (apparently) better known is that we have all been taught that division by zero makes no sense and is a thoroughly bad idea all round, but here it is an operation that gets the right result directly. 

A more prosaic way of getting to the same place in two lines is  

egen startdate=min(cond(attended, date, .)), by(First Last)
egen enddate=max(cond(attended, date, .)), by(First Last)

See also for discussion in context 

SJ-11-2 dm0055  . . . . . . . . . . . . . .  Speaking Stata: Compared with ...
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q2/11   SJ 11(2):305--314                                (no commands)
        reviews techniques for relating values to values in other
        observations

Nick 
[email protected] 

Nicole Johnson

Hi Phil - 

Thanks so much for the reply - this definitely seems to have done it! I am still getting used to STATA dates so this was very helpful. 

From: Phil Clayton <[email protected]>

I would do this by reshaping the data to long format.

Phil


. * enter data
. clear

. input str10 First str10 Last A10_01_10 A10_05_10 A10_010_10 A10_11_10

          First        Last  A10_01_10  A10_05_10  A10_010~0  A10_11_10
  1. "Jane" "Doe" 1 1 . .
  2. "John" "Doe" . 1 0 1
  3. end

. 
. * clean up variable name/s
. * (alternatively you could clean these up after reshaping)
. rename A10_010_10 A10_10_10

. list, clean noobs

    First   Last   A10_01~0   A10_05~0   A10_10~0   A10_11~0  
     Jane    Doe          1          1          .          .  
     John    Doe          .          1          0          1  

. 
. * reshape to long format
. reshape long A, i(First Last) j(datestr) string
(note: j = 10_01_10 10_05_10 10_10_10 10_11_10)

Data                               wide   ->   long
- -----------------------------------------------------------------------------
Number of obs.                        2   ->       8
Number of variables                   6   ->       4
j variable (4 values)                     ->   datestr
xij variables:
      A10_01_10 A10_05_10 ... A10_11_10   ->   A
- -----------------------------------------------------------------------------

. rename A attended

. replace attended=0 if missing(attended)
(3 real changes made)

. list, clean noobs

    First   Last    datestr   attended  
     Jane    Doe   10_01_10          1  
     Jane    Doe   10_05_10          1  
     Jane    Doe   10_10_10          0  
     Jane    Doe   10_11_10          0  
     John    Doe   10_01_10          0  
     John    Doe   10_05_10          1  
     John    Doe   10_10_10          0  
     John    Doe   10_11_10          1  

. 
. * calculate first and final attendance dates for each person
. gen date=date(datestr, "MD20Y")

. egen startdate=min(date) if attended, by(First Last)
(4 missing values generated)

. egen enddate=max(date) if attended, by(First Last)
(4 missing values generated)

. bysort First Last (startdate): replace startdate=startdate[1]
(4 real changes made)

. bysort First Last (enddate): replace enddate=enddate[1]
(4 real changes made)

. format %td date startdate enddate 

. list, clean noobs

    First   Last    datestr   attended        date   startdate     enddate  
     Jane    Doe   10_01_10          1   01oct2010   01oct2010   05oct2010  
     Jane    Doe   10_05_10          1   05oct2010   01oct2010   05oct2010  
     Jane    Doe   10_11_10          0   11oct2010   01oct2010   05oct2010  
     Jane    Doe   10_10_10          0   10oct2010   01oct2010   05oct2010  
     John    Doe   10_05_10          1   05oct2010   05oct2010   11oct2010  
     John    Doe   10_11_10          1   11oct2010   05oct2010   11oct2010  
     John    Doe   10_10_10          0   10oct2010   05oct2010   11oct2010  
     John    Doe   10_01_10          0   01oct2010   05oct2010   11oct2010  

. 
. * for each date, could that person have attended?
. gen byte couldattend=date>=startdate & date<=enddate

. 
. * sum up the possible attendances per person
. egen maxpossible=sum(couldattend), by(First Last)

. 
. list, clean noobs

    First   Last    datestr   attended        date   startdate     enddate   coulda~d   maxpos~e  
     Jane    Doe   10_01_10          1   01oct2010   01oct2010   05oct2010          1          2  
     Jane    Doe   10_05_10          1   05oct2010   01oct2010   05oct2010          1          2  
     Jane    Doe   10_11_10          0   11oct2010   01oct2010   05oct2010          0          2  
     Jane    Doe   10_10_10          0   10oct2010   01oct2010   05oct2010          0          2  
     John    Doe   10_05_10          1   05oct2010   05oct2010   11oct2010          1          3  
     John    Doe   10_11_10          1   11oct2010   05oct2010   11oct2010          1          3  
     John    Doe   10_10_10          0   10oct2010   05oct2010   11oct2010          1          3  
     John    Doe   10_01_10          0   01oct2010   05oct2010   11oct2010          0          3  

. 
. * or instead of the last egen you could just collapse the dataset
. collapse (sum) couldattend, by(First Last)

. list, clean noobs

    First   Last   coulda~d  
     Jane    Doe          2  
     John    Doe          3  

. 



On 01/11/2011, at 1:45 PM, Nicole Johnson wrote:

> Hi all,
> 
> I have a dataset that is basically set up like an attendance roll book. It has the person's name and then each variable is a date that the program was held. The person has a 1 if they attended that day. It looks like this:
> 
> First                       Last                        A10_01_10          A10_05_10          A10_010_10       A10_11_10
> Jane                       Doe                        1                             1                              .                               .
> John                      Doe                        .                               1                              0                              1
> 
> The records go from October through June, but the program did not meet every day. As noted above, the variable names indicate the date. I was able to use a loop to extract the date of first attendance and last attendance, but I need to now calculate the total number of days the person 'could' have attended the program between their date of first attendance and date of last attendance.  SO in the above example I would be able to say that John Doe attended 2 out of 3 possible program days. Of course since the data in my dataset has many more dates, this is much harder! Any help is appreciated. 
> 
> I guess I should mention I used the following to calculate some additional variables that may be of use which include string values for date first attended that match the variable names and date values, also the total number of program days. 
> 
> Any help is much appreciated - thank you!
> Nikki
> 
> ***Macro to find first date of attendance and create string variable 'firstfound'
> local first 1
> gen firstfound = ""
> foreach v of varlist A10_01_2008-A06_20_2009 {
>                replace firstfound = "`v'" if `v' == `first' & missing(firstfound)
> }
> 
> ***Macro to find last date of attendance and create string variable 'lastfound'
> local last 1
> gen lastfound = ""
> foreach v of varlist A10_01_2008-A06_20_2009 {
>                replace lastfound = "`v'" if `v' == `last'
> }
> 
> ***Transforming string 'firstfound' into date value first_attend_0809
> . gen firstfound1=substr(firstfound, 2, 10)
> . generate first_attend_0809=date(firstfound1,"MDY")
> . format first_attend_0809 %td
> 
> ***Transforming string 'lastfound' into date value last_attend_0809
> . gen lastfound1=substr(lastfound, 2, 10)
> . generate last_attend_0809=date(lastfound1,"MDY")
> . format last_attend_0809 %td
> 
> local start firstfound
> gen days_possible = 0
> foreach v of varlist A10_01_2008-A06_20_2009 {
>                replace days_possible = days_possible+1
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index