Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Nick Cox <n.j.cox@durham.ac.uk> |

To |
"'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu> |

Subject |
Re: st: Counting Number of Program Days [was: RE: RE: statalist-digest V4 #4323] |

Date |
Tue, 1 Nov 2011 23:21:22 +0000 |

Phil has cracked the problem nicely. But there's a small refinement that will appeal to some tastes. Consider the lines egen startdate=min(date) if attended, by(First Last) egen enddate=max(date) if attended, by(First Last) bysort First Last (startdate): replace startdate=startdate[1] bysort First Last (enddate): replace enddate=enddate[1] Here we want to calculate minimum and maximum over the attended dates and then "spread" them to the other observations for each case. (Sometime ago on the list someone gave database jargon for this operation, but I've forgotten what it was. It may have been "spreading"; if not that is just the way I think of it.) Note that -attended- is an indicator, 1 or 0. A shorter way of doing it is then egen startdate=min(date/attended), by(First Last) egen enddate=max(date/attended), by(First Last) What's going on? 1. The -egen- functions -min()- and -max()- will take expressions, which can be more complicated than variable names. 2. If you divide by 1 or 0 the result is the numerator on division by 1 and missing on division by 0. But -- in this context and given how Stata behaves -- that is fine because that result is exactly what you want. That is because Stata ends up ignoring the missings in calculating the min and max (unless all values are missing, in which case the min and max can also only be returned as missing, which is fine too) but the min and max themselves are assigned to all pertinent observations. My guess at why this is not (apparently) better known is that we have all been taught that division by zero makes no sense and is a thoroughly bad idea all round, but here it is an operation that gets the right result directly. A more prosaic way of getting to the same place in two lines is egen startdate=min(cond(attended, date, .)), by(First Last) egen enddate=max(cond(attended, date, .)), by(First Last) See also for discussion in context SJ-11-2 dm0055 . . . . . . . . . . . . . . Speaking Stata: Compared with ... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox Q2/11 SJ 11(2):305--314 (no commands) reviews techniques for relating values to values in other observations Nick n.j.cox@durham.ac.uk Nicole Johnson Hi Phil - Thanks so much for the reply - this definitely seems to have done it! I am still getting used to STATA dates so this was very helpful. From: Phil Clayton <philclayton@internode.on.net> I would do this by reshaping the data to long format. Phil . * enter data . clear . input str10 First str10 Last A10_01_10 A10_05_10 A10_010_10 A10_11_10 First Last A10_01_10 A10_05_10 A10_010~0 A10_11_10 1. "Jane" "Doe" 1 1 . . 2. "John" "Doe" . 1 0 1 3. end . . * clean up variable name/s . * (alternatively you could clean these up after reshaping) . rename A10_010_10 A10_10_10 . list, clean noobs First Last A10_01~0 A10_05~0 A10_10~0 A10_11~0 Jane Doe 1 1 . . John Doe . 1 0 1 . . * reshape to long format . reshape long A, i(First Last) j(datestr) string (note: j = 10_01_10 10_05_10 10_10_10 10_11_10) Data wide -> long - ----------------------------------------------------------------------------- Number of obs. 2 -> 8 Number of variables 6 -> 4 j variable (4 values) -> datestr xij variables: A10_01_10 A10_05_10 ... A10_11_10 -> A - ----------------------------------------------------------------------------- . rename A attended . replace attended=0 if missing(attended) (3 real changes made) . list, clean noobs First Last datestr attended Jane Doe 10_01_10 1 Jane Doe 10_05_10 1 Jane Doe 10_10_10 0 Jane Doe 10_11_10 0 John Doe 10_01_10 0 John Doe 10_05_10 1 John Doe 10_10_10 0 John Doe 10_11_10 1 . . * calculate first and final attendance dates for each person . gen date=date(datestr, "MD20Y") . egen startdate=min(date) if attended, by(First Last) (4 missing values generated) . egen enddate=max(date) if attended, by(First Last) (4 missing values generated) . bysort First Last (startdate): replace startdate=startdate[1] (4 real changes made) . bysort First Last (enddate): replace enddate=enddate[1] (4 real changes made) . format %td date startdate enddate . list, clean noobs First Last datestr attended date startdate enddate Jane Doe 10_01_10 1 01oct2010 01oct2010 05oct2010 Jane Doe 10_05_10 1 05oct2010 01oct2010 05oct2010 Jane Doe 10_11_10 0 11oct2010 01oct2010 05oct2010 Jane Doe 10_10_10 0 10oct2010 01oct2010 05oct2010 John Doe 10_05_10 1 05oct2010 05oct2010 11oct2010 John Doe 10_11_10 1 11oct2010 05oct2010 11oct2010 John Doe 10_10_10 0 10oct2010 05oct2010 11oct2010 John Doe 10_01_10 0 01oct2010 05oct2010 11oct2010 . . * for each date, could that person have attended? . gen byte couldattend=date>=startdate & date<=enddate . . * sum up the possible attendances per person . egen maxpossible=sum(couldattend), by(First Last) . . list, clean noobs First Last datestr attended date startdate enddate coulda~d maxpos~e Jane Doe 10_01_10 1 01oct2010 01oct2010 05oct2010 1 2 Jane Doe 10_05_10 1 05oct2010 01oct2010 05oct2010 1 2 Jane Doe 10_11_10 0 11oct2010 01oct2010 05oct2010 0 2 Jane Doe 10_10_10 0 10oct2010 01oct2010 05oct2010 0 2 John Doe 10_05_10 1 05oct2010 05oct2010 11oct2010 1 3 John Doe 10_11_10 1 11oct2010 05oct2010 11oct2010 1 3 John Doe 10_10_10 0 10oct2010 05oct2010 11oct2010 1 3 John Doe 10_01_10 0 01oct2010 05oct2010 11oct2010 0 3 . . * or instead of the last egen you could just collapse the dataset . collapse (sum) couldattend, by(First Last) . list, clean noobs First Last coulda~d Jane Doe 2 John Doe 3 . On 01/11/2011, at 1:45 PM, Nicole Johnson wrote: > Hi all, > > I have a dataset that is basically set up like an attendance roll book. It has the person's name and then each variable is a date that the program was held. The person has a 1 if they attended that day. It looks like this: > > First Last A10_01_10 A10_05_10 A10_010_10 A10_11_10 > Jane Doe 1 1 . . > John Doe . 1 0 1 > > The records go from October through June, but the program did not meet every day. As noted above, the variable names indicate the date. I was able to use a loop to extract the date of first attendance and last attendance, but I need to now calculate the total number of days the person 'could' have attended the program between their date of first attendance and date of last attendance. SO in the above example I would be able to say that John Doe attended 2 out of 3 possible program days. Of course since the data in my dataset has many more dates, this is much harder! Any help is appreciated. > > I guess I should mention I used the following to calculate some additional variables that may be of use which include string values for date first attended that match the variable names and date values, also the total number of program days. > > Any help is much appreciated - thank you! > Nikki > > ***Macro to find first date of attendance and create string variable 'firstfound' > local first 1 > gen firstfound = "" > foreach v of varlist A10_01_2008-A06_20_2009 { > replace firstfound = "`v'" if `v' == `first' & missing(firstfound) > } > > ***Macro to find last date of attendance and create string variable 'lastfound' > local last 1 > gen lastfound = "" > foreach v of varlist A10_01_2008-A06_20_2009 { > replace lastfound = "`v'" if `v' == `last' > } > > ***Transforming string 'firstfound' into date value first_attend_0809 > . gen firstfound1=substr(firstfound, 2, 10) > . generate first_attend_0809=date(firstfound1,"MDY") > . format first_attend_0809 %td > > ***Transforming string 'lastfound' into date value last_attend_0809 > . gen lastfound1=substr(lastfound, 2, 10) > . generate last_attend_0809=date(lastfound1,"MDY") > . format last_attend_0809 %td > > local start firstfound > gen days_possible = 0 > foreach v of varlist A10_01_2008-A06_20_2009 { > replace days_possible = days_possible+1 > * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**References**:**st: RE: statalist-digest V4 #4323***From:*Nicole Johnson <njohnson@researchforaction.org>

- Prev by Date:
**st: RE: statalist-digest V4 #4323** - Next by Date:
**st: Difference Model** - Previous by thread:
**st: RE: statalist-digest V4 #4323** - Next by thread:
**st: Difference Model** - Index(es):