**[D] egen** -- Extensions to generate

__Syntax__

**egen** [*type*] *newvar* **=** *fcn***(***arguments***)** [*if*] [*in*] [**,** *options*]

**by** is allowed with some of the **egen** functions, as noted below.

where depending on the *fcn*, *arguments* refers to an expression, *varlist*,
or *numlist*, and the *options* are also *fcn* dependent, and where *fcn* is

**anycount(***varlist***),** __v__**alues(***integer* *numlist***)**
may not be combined with **by**. It returns the number of variables
in *varlist* for which values are equal to any integer value in a
supplied *numlist*. Values for any observations excluded by either
**if** or **in** are set to 0 (not missing). Also see **anyvalue(***varname***)**
and **anymatch(***varlist***)**.

**anymatch(***varlist***),** __v__**alues(***integer* *numlist***)**
may not be combined with **by**. It is 1 if any variable in *varlist*
is equal to any integer value in a supplied *numlist* and 0
otherwise. Values for any observations excluded by either **if** or
**in** are set to 0 (not missing). Also see **anyvalue(***varname***)** and
**anycount(***varlist***)**.

**anyvalue(***varname***),** __v__**alues(***integer* *numlist***)**
may not be combined with **by**. It takes the value of *varname* if
*varname* is equal to any integer value in a supplied *numlist* and
is missing otherwise. Also see **anymatch(***varlist***)** and
**anycount(***varlist***)**.

**concat(***varlist***)** [**,** __f__**ormat(***%fmt***)** __d__**ecode** __maxl__**ength(***#***)** __p__**unct(***pchars***)**]
may not be combined with **by**. It concatenates *varlist* to produce
a string variable. Values of string variables are unchanged.
Values of numeric variables are converted to string, as is, or
are converted using a numeric format under the **format(%***fmt***)**
option or decoded under the **decode** option, in which case
**maxlength()** may also be used to control the maximum label length
used. By default, variables are added end to end: **punct(***pchars***)**
may be used to specify punctuation, such as a space, **punct(" ")**,
or a comma, **punct(,)**.

**count(***exp***)** (allows **by** *varlist***:**)
creates a constant (within *varlist*) containing the number of
nonmissing observations of *exp*. Also see **rownonmiss()** and
**rowmiss()**.

**cut(***varname***),** {**at(***#***,***#***,***...***,***#***)**|__g__**roup(***#***)**} [__ic__**odes** __lab__**el**]
may not be combined with **by**. It creates a new categorical
variable coded with the left-hand ends of the grouping intervals
specified in the **at()** option, which expects an ascending numlist.

**at(***#***,***#***,***...***,***#***)** supplies the breaks for the groups, in ascending
order. The list of breakpoints may be simply a list of numbers
separated by commas but can also include the syntax **a(b)c**,
meaning from **a** to **c** in steps of size **b**. *newvar* is set to missing
for observations with *varname* less than the first number
specified in **at()** and for observations with *varname* greater than
or equal to the last number specified in **at()**. If no breaks are
specified, the command expects the **group()** option.

**group(***#***)** specifies the number of equal frequency grouping
intervals to be used in the absence of breaks. Specifying this
option automatically invokes **icodes**.

**icodes** requests that the codes 0, 1, 2, etc., be used in place of
the left-hand ends of the intervals.

**label** requests that the integer-coded values of the grouped
variable be labeled with the left-hand ends of the grouping
intervals. Specifying this option automatically invokes **icodes**.

**diff(***varlist***)**
may not be combined with **by**. It creates an indicator variable
equal to 1 if the variables in *varlist* are not equal and 0
otherwise.

**ends(***strvar***)** [**,** __p__**unct(***pchars***)** __tr__**im** [__h__**ead**|__l__**ast**|__t__**ail**]]
may not be combined with **by**. It gives the first "word" or head
(with the **head** option), the last "word" (with the **last** option),
or the remainder or tail (with the **tail** option) from string
variable *strvar*.

**head**, **last**, and **tail** are determined by the occurrence of *pchars*,
which is by default one space (" ").

The head is whatever precedes the first occurrence of *pchars*, or
the whole of the string if it does not occur. For example, the
head of "frog toad" is "frog" and that of "frog" is "frog". With
**punct(,)**, the head of "frog,toad" is "frog".

The last word is whatever follows the last occurrence of *pchars*
or is the whole of the string if a space does not occur. The
last word of "frog toad newt" is "newt" and that of "frog" is
"frog". With **punct(,)**, the last word of "frog,toad" is "toad".

The remainder or tail is whatever follows the first occurrence of
*pchars*, which will be the empty string "" if *pchars* does not
occur. The tail of "frog toad newt" is "toad newt" and that of
"frog" is "". With **punct(,)**, the tail of "frog,toad" is "toad".

The **trim** option trims any leading or trailing spaces.

**fill(***numlist***)**
may not be combined with **by**. It creates a variable of ascending
or descending numbers or complex repeating patterns. *numlist*
must contain at least two numbers and may be specified using
standard *numlist* notation; see numlist. **if** and **in** are not
allowed with **fill()**.

**group(***varlist***)** [**,** __m__**issing** __l__**abel** **lname(***name***)** __t__**runcate(***num***)**]
may not be combined with **by**. It creates one variable taking on
values 1, 2, ... for the groups formed by *varlist*. *varlist* may
contain numeric variables, string variables, or a combination of
the two. The order of the groups is that of the sort order of
*varlist*. **missing** indicates that missing values in *varlist*
(either **.** or **""**) are to be treated like any other value when
assigning groups, instead of as missing values being assigned to
the group missing. The **label** option returns integers from 1 up
according to the distinct groups of *varlist* in sorted order. The
integers are labeled with the values of *varlist* or the value
labels, if they exist. **lname()** specifies the name to be given to
the value label created to hold the labels; **lname()** implies
**label**. The **truncate()** option truncates the values contributed to
the label from each variable in *varlist* to the length specified
by the integer argument *num*. The **truncate** option cannot be used
without specifying the **label** option. The **truncate** option does
not change the groups that are formed; it changes only their
labels.

**iqr(***exp***)** (allows **by** *varlist***:**)
creates a constant (within *varlist*) containing the interquartile
range of *exp*. Also see **pctile()**.

**kurt(***varname***)** (allows **by** *varlist***:**)
returns the kurtosis (within *varlist*) of *varname*.

**mad(***exp***)** (allows **by** *varlist***:**)
returns the median absolute deviation from the median (within
*varlist*) of *exp*.

**max(***exp***)** (allows **by** *varlist***:**)
creates a constant (within *varlist*) containing the maximum value
of *exp*.

**mdev(***exp***)** (allows **by** *varlist***:**)
returns the mean absolute deviation from the mean (within
*varlist*) of *exp*.

**mean(***exp***)** (allows **by** *varlist***:**)
creates a constant (within *varlist*) containing the mean of *exp*.

**median(***exp***)** (allows **by** *varlist***:**)
creates a constant (within *varlist*) containing the median of *exp*.
Also see **pctile()**.

**min(***exp***)** (allows **by** *varlist***:**)
creates a constant (within *varlist*) containing the minimum value
of *exp*.

**mode(***varname***)** [**,** __min__**mode** __max__**mode** __num__**mode(***integer***)** __miss__**ing**]
(allows **by** *varlist***:**)
produces the mode (within *varlist*) for *varname*, which may be
numeric or string. The mode is the value occurring most
frequently. If two or more modes exist or if *varname* contains
all missing values, the mode produced will be a missing value.
To avoid this, the **minmode**, **maxmode**, or **nummode()** option may be
used to specify choices for selecting among the multiple modes,
and the **missing** option will treat missing values as categories.
**minmode** returns the lowest value, and **maxmode** returns the highest
value. **nummode(***#***)** will return the *#*th mode, counting from the
lowest up. Missing values are excluded from determination of the
mode unless **missing** is specified. Even so, the value of the mode
is recorded for observations for which the values of *varname* are
missing unless they are explicitly excluded, that is, by
**if** *varname* **< .** or **if** *varname* **!= ""**.

**mtr(***year income***)**
may not be combined with **by**. It returns the U.S. marginal income
tax rate for a married couple with taxable income *income* in year
*year*, where 1930 __<__ *year* __<__ 2017. *year* and *income* may be specified
as variable names or constants; for example, **mtr(1993 faminc)**,
**mtr(surveyyr 28000)**, or **mtr(surveyyr faminc)**. A blank or comma
may be used to separate *income* from *year*.

**pc(***exp***)** [**, prop**] (allows **by** *varlist***:**)
returns *exp* (within *varlist*) scaled to be a percentage of the
total, between 0 and 100. The **prop** option returns *exp* scaled to
be a proportion of the total, between 0 and 1.

**pctile(***exp***)** [**, p(***#***)**] (allows **by** *varlist***:**)
creates a constant (within *varlist*) containing the *#*th percentile
of *exp*. If **p(***#***)** is not specified, 50 is assumed, meaning
medians. Also see **median()**.

**rank(***exp***)** [**,** __f__**ield**|__t__**rack**|__u__**nique**] (allows **by** *varlist***:**)
creates ranks (within *varlist*) of *exp*; by default, equal
observations are assigned the average rank. The **field** option
calculates the field rank of *exp*: the highest value is ranked 1,
and there is no correction for ties. That is, the field rank is
1 + the number of values that are higher. The **track** option
calculates the track rank of *exp*: the lowest value is ranked 1,
and there is no correction for ties. That is, the track rank is
1 + the number of values that are lower. The **unique** option
calculates the unique rank of *exp*: values are ranked 1,...,*#*, and
values and ties are broken arbitrarily. Two values that are tied
for second are ranked 2 and 3.

**rowfirst(***varlist***)**
may not be combined with **by**. It gives the first nonmissing value
in *varlist* for each observation (row). If all values in *varlist*
are missing for an observation, *newvar* is set to missing.

**rowlast(***varlist***)**
may not be combined with **by**. It gives the last nonmissing value
in *varlist* for each observation (row). If all values in *varlist*
are missing for an observation, *newvar* is set to missing.

**rowmax(***varlist***)**
may not be combined with **by**. It gives the maximum value
(ignoring missing values) in *varlist* for each observation (row).
If all values in *varlist* are missing for an observation, *newvar*
is set to missing.

**rowmean(***varlist***)**
may not be combined with **by**. It creates the (row) means of the
variables in *varlist*, ignoring missing values; for example, if
three variables are specified and, in some observations, one of
the variables is missing, in those observations *newvar* will
contain the mean of the two variables that do exist. Other
observations will contain the mean of all three variables. Where
none of the variables exist, *newvar* is set to missing.

**rowmedian(***varlist***)**
may not be combined with **by**. It gives the (row) median of the
variables in *varlist*, ignoring missing values. If all variables
in *varlist* are missing for an observation, *newvar* is set to
missing in that observation. Also see **rowpctile()**.

**rowmin(***varlist***)**
may not be combined with **by**. It gives the minimum value in
*varlist* for each observation (row). If all values in *varlist* are
missing for an observation, *newvar* is set to missing.

**rowmiss(***varlist***)**
may not be combined with **by**. It gives the number of missing
values in *varlist* for each observation (row).

**rownonmiss(***varlist***)** [**,** __s__**trok**]
may not be combined with **by**. It gives the number of nonmissing
values in *varlist* for each observation (row) -- this is the value
used by **rowmean()** for the denominator in the mean calculation.

String variables may not be specified unless the **strok** option is
also specified. If **strok** is specified, string variables will be
counted as containing missing values when they contain "".
Numeric variables will be counted as containing missing when
their value is "__>__**.**".

**rowpctile(***varlist***)** [**, p(***#***)**]
may not be combined with **by**. It gives the *#*th percentile of the
variables in *varlist*, ignoring missing values. If all variables
in *varlist* are missing for an observation, *newvar* is set to
missing in that observation. If **p()** is not specified, **p(50)** is
assumed, meaning medians. Also see **rowmedian()**.

**rowsd(***varlist***)**
may not be combined with **by**. It creates the (row) standard
deviations of the variables in *varlist*, ignoring missing values.

**rowtotal(***varlist***)** [**,** __m__**issing**]
may not be combined with **by**. It creates the (row) sum of the
variables in *varlist*, treating missing as 0. If **missing** is
specified and all values in *varlist* are missing for an
observation, *newvar* is set to missing.

**sd(***exp***)** (allows **by** *varlist***:**)
creates a constant (within *varlist*) containing the standard
deviation of *exp*. Also see **mean()**.

**seq()** [**,** __f__**rom(***#***)** __t__**o(***#***)** __b__**lock(***#***)**] (allows **by** *varlist***:**)
returns integer sequences. Values start from **from()** (default 1)
and increase to **to()** (the default is the maximum number of
values) in blocks (default size 1). If **to()** is less than the
maximum number, sequences restart at **from()**. Numbering may also
be separate within groups defined by *varlist* or decreasing if
**to()** is less than **from()**. Sequences depend on the sort order of
observations, following three rules: 1) observations excluded by
**if** or **in** are not counted; 2) observations are sorted by *varlist*,
if specified; and 3) otherwise, the order is that when called.
No *arguments* are specified.

**skew(***varname***)** (allows **by** *varlist***:**)
returns the skewness (within *varlist*) of *varname*.

**std(***exp***)** [**,** __m__**ean(***#***)** __s__**td(***#***)**]
may not be combined with **by**. It creates the standardized values
of *exp*. The options specify the desired mean and standard
deviation. The default is **mean(0)** and **std(1)**, producing a
variable with mean 0 and standard deviation 1.

**tag(***varlist***)** [**,** __m__**issing**]
may not be combined with **by**. It tags just one observation in
each distinct group defined by *varlist*. When all observations in
a group have the same value for a summary variable calculated for
the group, it will be sufficient to use just one value for many
purposes. The result will be 1 if the observation is tagged and
never missing, and 0 otherwise. Values for any observations
excluded by either **if** or **in** are set to 0 (not missing). Hence,
if **tag** is the variable produced by **egen tag =** **tag(***varlist***)**, the
idiom **if tag** is always safe. **missing** specifies that missing
values of *varlist* may be included.

**total(***exp***)** [**,** __m__**issing**] (allows **by** *varlist***:**)
creates a constant (within *varlist*) containing the sum of *exp*
treating missing as 0. If **missing** is specified and all values in
*exp* are missing, *newvar* is set to missing. Also see **mean()**.

__Menu__

**Data > Create or change data > Create new variable (extended)**

__Description__

**egen** creates *newvar* of the optionally specified storage type equal to
*fcn***(***arguments***)**. Here *fcn***()** is a function specifically written for **egen**,
as documented below or as written by users. Only **egen** functions may be
used with **egen**, and conversely, only **egen** may be used to run **egen**
functions.

Depending on *fcn***()**, *arguments*, if present, refers to an expression,
*varlist*, or a *numlist*, and the *options* are similarly *fcn* dependent.
Explicit subscripting (using **_N** and **_n**), which is commonly used with
**generate**, should not be used with **egen**; see subscripting.

__Examples__

---------------------------------------------------------------------------
Setup
**. webuse egenxmpl**

Describe the data
**. describe**

Create variable containing the mean value of **cholesterol**
**. egen avg = mean(cholesterol)**

Create variable containing the deviation from the mean cholesterol level
**. generate deviation = chol - avg**

---------------------------------------------------------------------------
Setup
**. webuse egenxmpl2, clear**

Describe the data
**. describe**

Create variable containing the median length of stay for each diagnostic
code
**. by dcode, sort: egen medstay = median(los)**

Create variable containing the deviation from the median length of stay
**. generate deltalos = los - medstay**

---------------------------------------------------------------------------
Setup
**. clear**
**. set obs 5**
**. generate x = _n if _n != 3**

Create variable containing the running sum of **x**
**. generate runsum = sum(x)**

Create variable containing a constant equal to the overall sum of **x**
**. egen totalsum = total(x)**

List the results
**. list**

---------------------------------------------------------------------------
Setup
**. webuse egenxmpl3, clear**

Describe the data
**. describe**

Create **differ** containing 1 if **inc1**, **inc2**, and **inc3** are not all equal, and
0 otherwise
**. egen byte differ = diff(inc1 inc2 inc3)**

List the observations where incomes differ
**. list if differ == 1**

---------------------------------------------------------------------------
Setup
**. sysuse auto, clear**

Create variable containing the ranks of **mpg**
**. egen rank = rank(mpg)**

Sort the data on **rank**
**. sort rank**

List the results
**. list mpg rank**

---------------------------------------------------------------------------
Setup
**. webuse states1, clear**

Describe the data
**. describe**

Create **stdage** containing the standardized value of **age**
**. egen stdage = std(age)**

Summarize the results
**. summarize age stdage**

Display the correlation between **age** and **stdage**
**. correlate age stdage**

---------------------------------------------------------------------------
Setup
**. webuse egenxmpl4, clear**

Create **hsum** containing the row sum of **a**, **b**, and **c** for each row
**. egen hsum = rowtotal(a b c)**

Create **havg** containing the row mean of **a**, **b**, and **c** for each row
**. egen havg = rowmean(a b c)**

Create **hstd** containing the row standard deviation of **a**, **b**, and **c** for each
row
**. egen hsd = rowsd(a b c)**

Create **hnonmiss** containing the number of nonmissing observations of **a**, **b**,
and **c** for each row
**. egen hnonmiss = rownonmiss(a b c)**

Create **hmiss** containing the number of missing observations of **a**, **b**, and **c**
for each row
**. egen hmiss = rowmiss(a b c)**

List the results
**. list**

---------------------------------------------------------------------------
Setup
**. webuse egenxmpl5, clear**

Create **rmin** containing the minimum within an observation (row) for **x**, **y**,
and **z**
**. egen rmin = rowmin(x y z)**

Create **rmax** containing the maximum within an observation (row) for **x**, **y**,
and **z**
**. egen rmax = rowmax(x y z)**

Create **rfirst** containing the first nonmissing value within an observation
(row) for **x**, **y**, and **z**
**. egen rfirst = rowfirst(x y z)**

Create **rlast** containing the last nonmissing value within an observation
(row) for **x**, **y**, and **z**
**. egen rlast = rowlast(x y z)**

List the results
**. list**

---------------------------------------------------------------------------
Setup
**. sysuse auto, clear**

Create **highrep78** containing the value of **rep78** if **rep78** is equal to 3, 4,
or 5, otherwise **highrep78** contains missing (**.**)
**. egen highrep78 = anyvalue(rep78), v(3/5)**

List the result
**. list rep78 highrep78**

---------------------------------------------------------------------------
Setup
**. webuse egenxmpl6, clear**

Create **racesex** containing values 1, 2, ..., for the groups formed by **race**
and **sex** and containing missing if **race** or **sex** are missing
**. egen racesex = group(race sex)**

List the result
**. list race sex racesex in 1/7**

Create **rs2** containing values 1, 2, ..., for the groups formed by **race** and
**sex**, treating missing like any other value
**. egen rs2 = group(race sex), missing**

List the result
**. list race sex rs2 in 1/7**
---------------------------------------------------------------------------