The following material is based on postings
to Statalist.
How can I generate a variable containing the last of several dates?
|
Title
|
|
Generating the last date
|
|
Author
|
Nicholas J. Cox, Durham University, UK
|
|
Date
|
June 2002; updated April 2005; minor revisions July 2011
|
Question:
I have data with multiple dates. I would like to generate a new date
variable containing the last date for each record. There are several
missing values for these dates.
Answer:
1. The last date is the maximum date
Questions like this arise in various forms, but when discussing possible
answers, it is assumed that your dates are held as Stata dates. For
pertinent online help within Stata, start at
dates and
times. The background on dates in Stata is explained in more detail
in [U] 24 Dealing with dates and times.
The fact that you are dealing with dates does not, for once, complicate this
question. The last date is simply the maximum date. Although you should
always be careful when missing values are present, you can rely on Stata's
maximum functions to do the smart thing about missings. Even though in
Stata the numeric missing is treated as higher than any other numeric value, the
maximum is reported as missing if and only if all values are missing.
First, suppose various dates are held as distinct variables
and you want to get a row-wise maximum; that is, for each observation
across variables. Use
egen
. egen lastdate = rowmax(date variables)
or just the maximum function (see
functions
for more functions) with
generate
. gen lastdate = max(date variables separated by commas)
The rowmax() function is the same as rmax() in Stata 8.
In terms of the result, the choice here is immaterial. Occasionally, when
using the max() function, spelling out all variable
names and separating them by commas can be a little inconvenient. You might
also note for future reference that egen has other functions for
manipulations across variables.
Next, suppose each record in the dataset consists of one or more
observations and the record for each person (company, site, whatever)
is uniquely coded by an identifier, say, id. To get a maximum across
groups of observations, use another egen function:
. by id: egen lastdate = max(date variable)
Here the specification by id ensures that the calculation is carried
out separately for each identifier. Note also that the max() function
of egen differs from the general max() function used with
generate or replace.
2. The new variable needs to be formatted
One detail you need to take care of yourself is attaching a date
format to the new
last date variable. That is, the new date variable does not inherit the
format of the variables from which it is calculated. Thus for a format
%tdd_m_y, type
. format lastdate %tdd_m_y
3. The first date is the minimum date
The first date of several is equally easy in Stata. The
first date is, as will be clear by now, the minimum date. If anything, it
is easier because, in addition to an egen way with min()
or rowmin(), there are approaches from first principles, most notably for
groups of observations on one variable date:
. by id (date), sort: gen firstdate = date[1]
To unpack this command line, Stata sorts on id first and then within
id on date. Then, for each distinct id, the new
variable firstdate is calculated as the first value, which is the
minimum.
The equivalent for last date calculations,
. by id (date), sort: gen lastdate = date[_N]
will not be what you want: any missing values will end up as the last date
for the ids for which they occur. (Recall that any numeric missing
is treated as higher than any other numeric value, and so gets sorted to the
end.) That can be fixed, but most users will find egen with
max() more convenient here.
P.S. There is a tutorial on by:, _n, and _N
manipulations in (Cox 2002), which explains, among other things, the special
interpretation of [1] and [_N] under by:.
Reference
- Cox, N. J. 2002.
- Speaking Stata: How to move step by: step.
Stata Journal
2: 86–102.
|