help egen dialog: egen
-------------------------------------------------------------------------------
Title
[D] egen -- Extensions to generate
Syntax
egen [type] newvar = fcn(arguments) [if] [in] [, options]
by is allowed with some of the egen functions, as noted below.
where depending on the fcn, arguments refers to an expression, varlist,
or numlist, and the options are also fcn dependent, and where fcn is
anycount(varlist), values(integer numlist)
may not be combined with by. It returns the number of variables
in varlist for which values are equal to any integer value in a
supplied numlist. Values for any observations excluded by either
if or in are set to 0 (not missing). Also see anyvalue(varname)
and anymatch(varlist).
anymatch(varlist), values(integer numlist)
may not be combined with by. It is 1 if any variable in varlist
is equal to any integer value in a supplied numlist and 0
otherwise. Values for any observations excluded by either if or
in are set to 0 (not missing). Also see anyvalue(varname) and
anycount(varlist).
anyvalue(varname), values(integer numlist)
may not be combined with by. It takes the value of varname if
varname is equal to any integer value in a supplied numlist and
is missing otherwise. Also see anymatch(varlist) and
anycount(varlist).
concat(varlist) [, format(%fmt) decode maxlength(#) punct(pchars)]
may not be combined with by. It concatenates varlist to produce
a string variable. Values of string variables are unchanged.
Values of numeric variables are converted to string, as is, or
are converted using a numeric format under the format(%fmt)
option or decoded under the decode option, in which case
maxlength() may also be used to control the maximum label length
used. By default, variables are added end to end: punct(pchars)
may be used to specify punctuation, such as a space, punct(" "),
or a comma, punct(,).
count(exp) (allows by varlist:)
creates a constant (within varlist) containing the number of
nonmissing observations of exp. Also see rownonmiss() and
rowmiss().
cut(varname), {at(#,#,...,#)|group(#)} [icodes label]
may not be combined with by. It creates a new categorical
variable coded with the left-hand ends of the grouping intervals
specified in the at() option, which expects an ascending numlist.
at(#,#,...,#) supplies the breaks for the groups, in ascending
order. The list of breakpoints may be simply a list of numbers
separated by commas but can also include the syntax a(b)c,
meaning from a to c in steps of size b. If no breaks are
specified, the command expects the group() option.
group(#) specifies the number of equal frequency grouping
intervals to be used in the absence of breaks. Specifying this
option automatically invokes icodes.
icodes requests that the codes 0, 1, 2, etc., be used in place of
the left-hand ends of the intervals.
label requests that the integer-coded values of the grouped
variable be labeled with the left-hand ends of the grouping
intervals. Specifying this option automatically invokes icodes.
diff(varlist)
may not be combined with by. It creates an indicator variable
equal to 1 if the variables in varlist are not equal and 0
otherwise.
ends(strvar) [, punct(pchars) trim [head|tail|last]]
may not be combined with by. It gives the first "word" or head
(with the head option), the last "word" (with the last option),
or the remainder or tail (with the tail option) from string
variable strvar.
head, last, and tail are determined by the occurrence of pchars,
which is by default one space (" ").
The head is whatever precedes the first occurrence of pchars, or
the whole of the string if it does not occur. For example, the
head of "frog toad" is "frog" and that of "frog" is "frog". With
punct(,), the head of "frog,toad" is "frog".
The last word is whatever follows the last occurrence of pchars
or is the whole of the string if a space does not occur. The
last word of "frog toad newt" is "newt" and that of "frog" is
"frog". With punct(,), the last word of "frog,toad" is "toad".
The remainder or tail is whatever follows the first occurrence of
pchars, which will be the empty string "" if pchars does not
occur. The tail of "frog toad newt" is "toad newt" and that of
"frog" is "". With punct(,), the tail of "frog,toad" is "toad".
The trim option trims any leading or trailing spaces.
fill(numlist)
may not be combined with by. It creates a variable of ascending
or descending numbers or complex repeating patterns. numlist
must contain at least two numbers and may be specified using
standard numlist notation; see numlist. if and in are not
allowed with fill().
group(varlist) [, missing label lname(name) truncate(num)]
may not be combined with by. It creates one variable taking on
values 1, 2, ... for the groups formed by varlist. varlist may
contain numeric variables, string variables, or a combination of
the two. The order of the groups is that of the sort order of
varlist. missing indicates that missing values in varlist
(either . or "") are to be treated like any other value when
assigning groups, instead of as missing values being assigned to
the group missing. The label option returns integers from 1 up
according to the distinct groups of varlist in sorted order. The
integers are labeled with the values of varlist or the value
labels, if they exist. lname() specifies the name to be given to
the value label created to hold the labels; lname() implies
label. The truncate() option truncates the values contributed to
the label from each variable in varlist to the length specified
by the integer argument num. The truncate option cannot be used
without specifying the label option. The truncate option does
not change the groups that are formed; it changes only their
labels.
iqr(exp) (allows by varlist:)
creates a constant (within varlist) containing the interquartile
range of exp. See also pctile().
kurt(varname) (allows by varlist:)
returns the kurtosis (within varlist) varname.
mad(exp) (allows by varlist:)
returns the median absolute deviation from the median (within
varlist) of exp.
max(exp) (allows by varlist:)
creates a constant (within varlist) containing the maximum value
of exp.
mdev(exp) (allows by varlist:)
returns the mean absolute deviation from the mean (within
varlist) of exp.
mean(exp) (allows by varlist:)
creates a constant (within varlist) containing the mean of exp.
median(exp) (allows by varlist:)
creates a constant (within varlist) containing the median of exp.
Also see pctile().
min(exp) (allows by varlist:)
creates a constant (within varlist) containing the minimum value
of exp.
mode(varname) [, minmode maxmode nummode(integer) missing]
(allows by varlist:)
produces the mode (within varlist) for varname, which may be
numeric or string. The mode is the value occurring most
frequently. If two or more modes exist or if varname contains
all missing values, the mode produced will be a missing value.
To avoid this, the minmode, maxmode, or nummode() option may be
used to specify choices for selecting among the multiple modes,
and the missing option will treat missing values as categories.
minmode returns the lowest value, and maxmode returns the highest
value. nummode(#) will return the #th mode, counting from the
lowest up. Missing values are excluded from determination of the
mode unless missing is specified. Even so, the value of the mode
is recorded for observations for which the values of varname are
missing unless they are explicitly excluded, that is, by
if varname < . or if varname != "".
mtr(year income)
may not be combined with by. It returns the U.S. marginal income
tax rate for a married couple with taxable income income in year
year, where 1930 < year < 2009. year and income may be specified
as variable names or constants; e.g., mtr(1993 faminc),
mtr(surveyyr 28000), or mtr(surveyyr faminc). A blank or comma
may be used to separate income from year.
pc(exp) [, prop] (allows by varlist:)
returns exp (within varlist) scaled to be a percentage of the
total, between 0 and 100. The prop option returns exp scaled to
be a proportion of the total, between 0 and 1.
pctile(exp) [, p(#)] (allows by varlist:)
creates a constant (within varlist) containing the #th percentile
of exp. If p(#) is not specified, 50 is assumed, meaning
medians. Also see median().
rank(exp) [, field|track|unique] (allows by varlist:)
creates ranks (within varlist) of exp; by default, equal
observations are assigned the average rank. The field option
calculates the field rank of exp: the highest value is ranked 1,
and there is no correction for ties. That is, the field rank is
1 + the number of values that are higher. The track option
calculates the track rank of exp: the lowest value is ranked 1,
and there is no correction for ties. That is, the track rank is
1 + the number of values that are lower. The unique option
calculates the unique rank of exp: values are ranked 1,...,#, and
values and ties are broken arbitrarily. Two values that are tied
for second are ranked 2 and 3.
rowfirst(varlist)
may not be combined with by. It gives the first nonmissing value
in varlist for each observation (row). If all values in varlist
are missing for an observation, newvar is set to missing.
rowlast(varlist)
may not be combined with by. It gives the last nonmissing value
in varlist for each observation (row). If all values in varlist
are missing for an observation, newvar is set to missing.
rowmax(varlist)
may not be combined with by. It gives the maximum value
(ignoring missing values) in varlist for each observation (row).
If all values in varlist are missing for an observation, newvar
is set to missing.
rowmean(varlist)
may not be combined with by. It creates the (row) means of the
variables in varlist, ignoring missing values; for example, if
three variables are specified and, in some observations, one of
the variables is missing, in those observations newvar will
contain the mean of the two variables that do exist. Other
observations will contain the mean of all three variables. Where
none of the variables exist, newvar is set to missing.
rowmedian(varlist)
may not be combined with by. It gives the (row) median of the
variables in varlist, ignoring missing values. If all variables
in varlist are missing for an observation, newvar is set to
missing in that observation. Also see rowpctile().
rowmin(varlist)
may not be combined with by. It gives the minimum value in
varlist for each observation (row). If all values in varlist are
missing for an observation, newvar is set to missing.
rowmiss(varlist)
may not be combined with by. It gives the number of missing
variables in varlist for each observation (row).
rownonmiss(varlist) [, strok]
may not be combined with by. It gives the number of nonmissing
values in varlist for each observation (row) -- this is the value
used by rowmean() for the denominator in the mean calculation.
String variables may not be specified unless the strok option is
also specified. If strok is specified, string variables will be
counted as containing missing values when they contain "".
Numeric variables will be counted as containing missing when
their value is ">.".
rowpctile(varlist) [, p(#)]
may not be combined with by. It gives the #th percentile of the
variables in varlist, ignoring missing values. If all variables
in varlist are missing for an observation, newvar is set to
missing in that observation. If p() is not specified, p(50) is
assumed, meaning medians. Also see rowmedian().
rowsd(varlist)
may not be combined with by. It creates the (row) standard
deviations of the variables in varlist, ignoring missing values.
rowtotal(varlist) [, missing]
may not be combined with by. It creates the (row) sum of the
variables in varlist, treating missing as 0. If missing is
specified and all values in varlist are missing for an
observation, newvar is set to missing.
sd(exp) (allows by varlist:)
creates a constant (within varlist) containing the standard
deviation of exp. Also see mean().
seq() [, from(#) to(#) block(#)] (allows by varlist:)
returns integer sequences. Values start from from() (default 1)
and increase to to() (the default is the maximum number of
values) in blocks (default size 1). If to() is less than the
maximum number, sequences restart at from(). Numbering may also
be separate within groups defined by varlist or decreasing if
to() is less than from(). Sequences depend on the sort order of
observations, following three rules: 1) observations excluded by
if or in are not counted; 2) observations are sorted by varlist,
if specified; and 3) otherwise, the order is that when called.
No arguments are specified.
skew(varname) (allows by varlist:)
returns the skewness (within varlist) of varname.
std(exp) [, mean(#) std(#)]
may not be combined with by. It creates the standardized values
of exp. The options specify the desired mean and standard
deviation. The default is mean(0) and std(1), producing a
variable with mean 0 and standard deviation 1.
tag(varlist) [, missing]
may not be combined with by. It tags just 1 observation in each
distinct group defined by varlist. When all observations in a
group have the same value for a summary variable calculated for
the group, it will be sufficient to use just one value for many
purposes. The result will be 1 if the observation is tagged and
never missing, and 0 otherwise. Values for any observations
excluded by either if or in are set to 0 (not missing). Hence,
if tag is the variable produced by egen tag = tag(varlist), the
idiom if tag is always safe. missing specifies that missing
values of varlist may be included.
total(exp) [, missing] (allows by varlist:)
creates a constant (within varlist) containing the sum of exp
treating missing as 0. If missing is specified and all values in
exp are missing, newvar is set to missing. Also see mean().
Menu
Data > Create or change data > Create new variable (extended)
Description
egen creates newvar of the optionally specified storage type equal to
fcn(arguments). Here fcn() is a function specifically written for egen,
as documented below or as written by users. Only egen functions may be
used with egen, and conversely, only egen may be used to run egen
functions.
Depending on fcn(), arguments, if present, refers to an expression,
varlist, or a numlist, and the options are similarly fcn dependent.
Explicit subscripting (using _N and _n), which is commonly used with
generate, should not be used with egen; see subscripting.
Examples
---------------------------------------------------------------------------
Setup
. webuse egenxmpl
Describe the data
. describe
Create variable containing the mean value of cholesterol
. egen avg = mean(cholesterol)
Create variable containing the deviation from the mean cholesterol level
. generate deviation = chol - avg
---------------------------------------------------------------------------
Setup
. webuse egenxmpl2, clear
Describe the data
. describe
Create variable containing the median length of stay for each diagnostic
code
. by dcode, sort: egen medstay = median(los)
Create variable containing the deviation from the median length of stay
. generate deltalos = los - medstay
---------------------------------------------------------------------------
Setup
. clear
. set obs 5
. generate x = _n if _n != 3
Create variable containing the running sum of x
. generate runsum = sum(x)
Create variable containing a constant equal to the overall sum of x
. egen totalsum = total(x)
List the results
. list
---------------------------------------------------------------------------
Setup
. webuse egenxmpl3, clear
Describe the data
. describe
Create differ containing 1 if inc1, inc2, and inc3 are not all equal, and
0 otherwise
. egen byte differ = diff(inc1 inc2 inc3)
List the observations where incomes differ
. list if differ == 1
---------------------------------------------------------------------------
Setup
. sysuse auto, clear
Create variable containing the ranks of mpg
. egen rank = rank(mpg)
Sort the data on rank
. sort rank
List the results
. list mpg rank
---------------------------------------------------------------------------
Setup
. webuse states1, clear
Describe the data
. describe
Create stdage containing the standardized value of age
. egen stdage = std(age)
Summarize the results
. summarize age stdage
Display the correlation between age and stdage
. correlate age stdage
---------------------------------------------------------------------------
Setup
. webuse egenxmpl4, clear
Create hsum containing the row sum of a, b, and c for each row
. egen hsum = rowtotal(a b c)
Create havg containing the row mean of a, b, and c for each row
. egen havg = rowmean(a b c)
Create hstd containing the row standard deviation of a, b, and c for each
row
. egen hsd = rowsd(a b c)
Create hnonmiss containing the number of nonmissing observations of a, b,
and c for each row
. egen hnonmiss = rownonmiss(a b c)
Create hmiss containing the number of missing observations of a, b, and c
for each row
. egen hmiss = rowmiss(a b c)
List the results
. list
---------------------------------------------------------------------------
Setup
. webuse egenxmpl5, clear
Create rmin containing the minimum within an observation (row) for x, y,
and z
. egen rmin = rowmin(x y z)
Create rmax containing the maximum within an observation (row) for x, y,
and z
. egen rmax = rowmax(x y z)
Create rfirst containing the first nonmissing value within an observation
(row) for x, y, and z
. egen rfirst = rowfirst(x y z)
Create rlast containing the last nonmissing value within an observation
(row) for x, y, and z
. egen rlast = rowlast(x y z)
List the results
. list
---------------------------------------------------------------------------
Setup
. sysuse auto, clear
Create highrep78 containing the value of rep78 if rep78 is equal to 3, 4,
or 5, otherwise highrep78 contains missing (.)
. egen highrep78 = anyvalue(rep78), v(3/5)
List the result
. list rep78 highrep78
---------------------------------------------------------------------------
Setup
. webuse egenxmpl6, clear
Create racesex containing values 1, 2, ..., for the groups formed by race
and sex and containing missing if race or sex are missing
. egen racesex = group(race sex)
List the result
. list race sex racesex in 1/7
Create rs2 containing values 1, 2, ..., for the groups formed by race and
sex, treating missing like any other value
. egen rs2 = group(race sex), missing
List the result
. list race sex rs2 in 1/7
---------------------------------------------------------------------------
Also see
Manual: [D] egen
Help: [D] collapse, [D] generate