**[D] collapse** -- Make dataset of summary statistics

__Syntax__

**collapse** *clist* [*if*] [*in*] [*weight*] [**,** *options*]

where *clist* is either

[**(***stat***)**] *varlist* [ [**(***stat***)**] *...* ]

[**(***stat***)**] *target_var***=***varname* [*target_var***=***varname* *...*] [ [**(***stat***)**] *...*]

or any combination of the *varlist* or *target_var* forms, and *stat* is one of

**mean** means (default)
**median** medians
**p1** 1st percentile
**p2** 2nd percentile
*...* 3rd-49th percentiles
**p50** 50th percentile (same as **median**)
*...* 51st-97th percentiles
**p98** 98th percentile
**p99** 99th percentile
**sd** standard deviations
__sem__**ean** standard error of the mean (**sd/sqrt(n)**)
__seb__**inomial** standard error of the mean, binomial (**sqrt(p(1-p)/n)**)
__sep__**oisson** standard error of the mean, Poisson (**sqrt(mean/n)**)
**sum** sums
**rawsum** sums, ignoring any specified weight, except that observations with a weight of zero are excluded
**count** number of nonmissing observations
**percent** percentage of nonmissing observations
**max** maximums
**min** minimums
**iqr** interquartile range
**first** first value
**last** last value
**firstnm** first nonmissing value
**lastnm** last nonmissing value

If *stat* is not specified, **mean** is assumed.
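
As an illustration of how the two *clist* forms combine, the following sketch uses the **college** dataset from the Examples section. The target names **medgpa** and **sdgpa** are chosen here for illustration only.

```stata
* Load the example dataset used in the Examples section
webuse college, clear

* gpa has no (stat), so its mean is computed;
* medgpa and sdgpa use the target_var=varname form
collapse gpa (median) medgpa=gpa (sd) sdgpa=gpa, by(year)
list
```

Each result needs its own variable name, so requesting several statistics of the same variable calls for the *target_var***=***varname* form for all but one of them.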

*options* Description
-------------------------------------------------------------------------
Options
**by(***varlist***)** groups over which *stat* is to be calculated
**cw** casewise deletion instead of all possible observations

**fast** do not restore the original dataset should the user press **Break**; programmer's command
-------------------------------------------------------------------------
*varlist* and *varname* in *clist* may contain time-series operators; see
tsvarlist.
**aweight**s, **fweight**s, **iweight**s, and **pweight**s are allowed; see weight, and
see Weights below. **pweight**s may not be used with **sd**, **semean**,
**sebinomial**, or **sepoisson**. **iweight**s may not be used with **semean**,
**sebinomial**, or **sepoisson**. **aweight**s may not be used with **sebinomial** or
**sepoisson**.
**fast** does not appear in the dialog box.

__Menu__

**Data > Create or change data > Other variable-transformation commands > Make dataset of means, medians, etc.**

__Description__

**collapse** converts the dataset in memory into a dataset of means, sums,
medians, etc. *clist* must refer to numeric variables exclusively.

Note: See **[D] contract** if you want to collapse to a dataset of
frequencies.

__Options__

**by(***varlist***)** specifies the groups over which the means, etc., are to be
calculated. If this option is not specified, the resulting dataset
will contain 1 observation. If it is specified, *varlist* may refer to
either string or numeric variables.

**cw** specifies casewise deletion. If **cw** is not specified, all possible
observations are used for each calculated statistic.

The following option is available with **collapse** but is not shown in the
dialog box:

**fast** specifies that **collapse** not restore the original dataset should the
user press **Break**. **fast** is intended for use by programmers.
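
A minimal sketch of how a programmer might use **fast**, assuming the calling code has already saved its own copy of the data with **preserve**, so that it does not depend on **collapse** restoring the data after **Break**:

```stata
webuse college, clear

* Keep our own copy of the data; restore brings it back later
preserve

* fast: collapse will not restore the original data on Break
collapse (mean) gpa, by(year) fast
list

* Return to the original, uncollapsed data
restore
```

Without the surrounding **preserve** and **restore**, pressing **Break** during a **collapse** run with **fast** could leave the dataset in memory in a partially processed state.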

__Weights__

**collapse** allows all four weight types; the default is **aweight**s. Weight
normalization affects only the **sum**, **count**, **sd**, **semean**, and **sebinomial**
statistics.

Let j index observations and i index by-groups. Here are the definitions
for **count** and **sum** with weights:

**count**:
unweighted: N_i, the number of observations in group i
**aweight**: N_i, the number of observations in group i
**fweight, iweight, pweight**: sum(w_j), the sum of the weights over observations in group i

**sum**:
unweighted: sum(x_j), the sum of x_j over observations in group i
**aweight**: sum(v_j*x_j) over observations in group i; v_j = weights normalized to sum to N_i
**fweight, iweight, pweight**: sum(w_j*x_j) over observations in group i

When the **by()** option is not specified, the entire dataset is treated as
one group.
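
The definitions above can be checked by hand on a small dataset. The sketch below uses hypothetical data (three observations of **x** with weights **w**); the values in the comments are worked out from the definitions, not copied from Stata output.

```stata
* Hypothetical data: three observations of x with weights w (sum of w = 8)
clear
input x w
1 2
2 2
3 4
end

* aweights are rescaled to sum to N = 3: v = (.75, .75, 1.5)
preserve
collapse (count) n=x (sum) s=x [aweight=w]
* By the definitions above: n = 3, s = .75*1 + .75*2 + 1.5*3 = 6.75
list
restore

* fweights are used as given
collapse (count) n=x (sum) s=x [fweight=w]
* By the definitions above: n = 2 + 2 + 4 = 8, s = 2*1 + 2*2 + 4*3 = 18
list
```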

The **sd** statistic with weights returns the bias-corrected standard
deviation, which is based on the factor sqrt(N_i/(N_i-1)), where N_i is
the number of observations. Statistics **sd**, **semean**, **sebinomial**, and
**sepoisson** are not allowed with **pweight**ed data. For the other weight
types, the weights change these statistics through the computation of
the weighted count, as outlined above.

For instance, consider a case in which there are 25 observations in the
dataset and a weighting variable that sums to 57. In the unweighted
case, the weight is not specified, and the count is 25. In the
analytically weighted case, the count is still 25; the scale of the
weight is irrelevant. In the frequency-weighted case, however, the count
is 57, the sum of the weights.
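
The case just described can be reproduced with a constructed dataset. In this sketch the variables **x** and **w** and the scalar names are hypothetical; the weights are chosen only so that they sum to 57.

```stata
clear
set obs 25
generate x = _n
* 18 observations with weight 2 and 7 with weight 3: 18*2 + 7*3 = 57
generate w = cond(_n <= 18, 2, 3)

* Unweighted count
preserve
collapse (count) n=x
scalar n_unweighted = n[1]
restore

* Analytically weighted count
preserve
collapse (count) n=x [aweight=w]
scalar n_aweight = n[1]
restore

* Frequency-weighted count
preserve
collapse (count) n=x [fweight=w]
scalar n_fweight = n[1]
restore

* Expected from the text above: 25, 25, and 57
display scalar(n_unweighted), scalar(n_aweight), scalar(n_fweight)
```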

The **rawsum** statistic with **aweight**s ignores the weight, with one
exception: observations with zero weight will not be included in the
sum.
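
A small sketch of the zero-weight exception, using hypothetical data in which the second observation has a weight of zero:

```stata
* Hypothetical data: the observation with x = 20 has zero weight
clear
input x w
10 1
20 0
30 3
end

collapse (rawsum) raw=x [aweight=w]
* By the description above, the weights are ignored but the
* zero-weight observation is excluded: raw = 10 + 30 = 40
list
```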

__Examples__

---------------------------------------------------------------------------
Setup
**. webuse college**
**. describe**
**. list**

Create dataset containing the 25th percentile of **gpa** for each **year**
**. collapse (p25) gpa [fw=number], by(year)**

List the result
**. list**

---------------------------------------------------------------------------
Setup
**. webuse college, clear**

Create dataset containing the mean and median of **gpa** and **hour** for each
**year**, and store median of **gpa** and **hour** in **medgpa** and **medhour**,
respectively
**. collapse (mean) gpa hour (median) medgpa=gpa medhour=hour [fw=number], by(year)**

List the result
**. list**

---------------------------------------------------------------------------
Setup
**. webuse college, clear**

Create dataset containing the count of **gpa** and **hour** and the minimums of
**gpa** and **hour**, and store the minimums in **mingpa** and **minhour**, respectively
**. collapse (count) gpa hour (min) mingpa=gpa minhour=hour [fw=number], by(year)**

List the result
**. list**

---------------------------------------------------------------------------
Setup
**. webuse college, clear**
**. replace gpa = . in 3**

Create dataset containing, for each **year**, the percentage of observations,
where the totals are weighted counts of nonmissing **gpa** and **hour**
**. collapse (percent) gpa hour [fw=number], by(year)**

List the result
**. list**

---------------------------------------------------------------------------
Setup
**. webuse college, clear**
**. replace gpa = . in 2/4**

Create dataset containing the mean of **gpa** and **hour** for each **year**, but
ignore all observations that have missing values when calculating the
means
**. collapse (mean) gpa hour [fw=number], by(year) cw**

List the result
**. list**

---------------------------------------------------------------------------