**[MV] pca** -- Principal component analysis

__Syntax__

Principal component analysis of data

**pca** *varlist* [*if*] [*in*] [*weight*] [**,** *options*]

Principal component analysis of a correlation or covariance matrix

**pcamat** *matname* **,** **n(***#***)** [*options* *pcamat_options*]

*matname* is a k x k symmetric matrix or a k(k+1)/2 long row or column
vector containing the upper or lower triangle of the correlation or
covariance matrix.

*options* Description
-------------------------------------------------------------------------
Model 2
__com__**ponents(***#***)** retain maximum of *#* principal components; __fa__**ctors()**
is a synonym
__mine__**igen(***#***)** retain eigenvalues larger than *#*; default is **1e-5**
__cor__**relation** perform PCA of the correlation matrix; the default
__cov__**ariance** perform PCA of the covariance matrix
**vce(**__non__**e)** do not compute VCE of the eigenvalues and vectors;
the default
**vce(**__nor__**mal)** compute VCE of the eigenvalues and vectors assuming
multivariate normality

Reporting
__l__**evel(***#***)** set confidence level; default is **level(95)**
__bl__**anks(***#***)** display loadings as blank when |loadings| < *#*
**novce** suppress display of SEs even though calculated
# __me__**ans** display summary statistics of variables

Advanced
**tol(***#***)** advanced option; see *Options* for details
**ignore** advanced option; see *Options* for details

__norot__**ated** display unrotated results, even if rotated results
are available (replay only)
-------------------------------------------------------------------------
# **means** is not allowed with **pcamat**.
**norotated** is not available in the dialog box.

*pcamat_options* Description
-------------------------------------------------------------------------
Model
__sh__**ape(**__f__**ull)** *matname* is a square symmetric matrix; the default
__sh__**ape(**__l__**ower)** *matname* is a vector with the rowwise lower triangle
(with diagonal)
__sh__**ape(**__u__**pper)** *matname* is a vector with the rowwise upper triangle
(with diagonal)
__nam__**es(***namelist***)** variable names; required if *matname* is triangular
**forcepsd** modifies *matname* to be positive semidefinite
* **n(***#***)** number of observations
**sds(***matname2***)** vector with standard deviations of variables
**means(***matname3***)** vector with means of variables
-------------------------------------------------------------------------
* **n()** is required for **pcamat**.

**bootstrap**, **by**, **jackknife**, **rolling**, **statsby**, and **xi** are allowed with **pca**;
see prefix. However, **bootstrap** and **jackknife** results should be
interpreted with caution; identification of the **pca** parameters involves
data-dependent restrictions, possibly leading to badly biased and
overdispersed estimates (Milan and Whittaker 1995).
Weights are not allowed with the **bootstrap** prefix.
**aweight**s are not allowed with the **jackknife** prefix.
**aweight**s and **fweight**s are allowed with **pca**; see weight.
See **[MV] pca postestimation** for features available after estimation.

__Menu__

__pca__

**Statistics > Multivariate analysis >** **Factor and principal component**
**analysis >** **Principal component analysis (PCA)**

__pcamat__

**Statistics > Multivariate analysis >** **Factor and principal component**
**analysis >** **PCA of a correlation or covariance matrix**

__Description__

**pca** and **pcamat** display the eigenvalues and eigenvectors from the
principal component analysis (PCA) eigen decomposition. The eigenvectors
are returned in orthonormal form, that is, uncorrelated and normalized.

**pca** can be used to reduce the number of variables or to learn about the
underlying structure of the data. For **pca**, the correlation or covariance
matrix is computed from the variables in *varlist*. With **pcamat**, you
supply the correlation or covariance matrix directly.
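
As an informal check of this description, the following sketch compares the
eigenvalues stored by **pca** with those obtained by applying **matrix symeigen**
to the correlation matrix of the same variables, and confirms that the stored
eigenvectors are orthonormal. The matrix names (Ev, L, R, X, lambda, LtL, Id)
are arbitrary choices made for this illustration.

```stata
sysuse auto, clear
quietly pca trunk weight length headroom
matrix Ev = e(Ev)          // eigenvalues stored by pca (1 x p)
matrix L  = e(L)           // eigenvectors of the retained components (p x f)

* Eigen decomposition of the correlation matrix, computed directly
quietly correlate trunk weight length headroom
matrix R = r(C)
matrix symeigen X lambda = R

* Both relative differences should be zero up to rounding error
display mreldif(Ev, lambda)
matrix LtL = L' * L
matrix Id  = I(4)
display mreldif(LtL, Id)
```

Here all four components are retained by default, so L'L is compared with a
4 x 4 identity matrix.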

__Options__

+---------+
----+ Model 2 +----------------------------------------------------------

**components(***#***)** and **mineigen(***#***)** specify the maximum number of components
(eigenvectors or factors) to be retained. **components()** specifies the
number directly, and **mineigen()** specifies it indirectly, keeping all
components with eigenvalues greater than the indicated value. The
options can be specified individually, together, or not at all.
**factors()** is a synonym for **components()**.

**components(***#***)** sets the maximum number of components (factors) to be
retained. **pca** and **pcamat** always display the full set of eigenvalues
but display eigenvectors only for retained components. Specifying a
number larger than the number of variables in *varlist* is equivalent
to specifying the number of variables in *varlist* and is the default.

**mineigen(***#***)** sets the minimum value of eigenvalues to be retained.
The default is **1e-5** or the value of **tol()** if specified.

Specifying **components()** and **mineigen()** affects only the number of
components to be displayed and stored in **e()**; it does not enforce the
assumption that the other eigenvalues are 0. In particular, the
standard errors reported when **vce(normal)** is specified do not depend
on the number of retained components.
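
The interaction of the two options can be seen in a short sketch using the
auto data. It assumes only the stored results documented below (**e(f)** and
**e(Ev)**); the matrix name Ev is an arbitrary choice.

```stata
sysuse auto, clear

* Cap the number of retained components at 2
quietly pca trunk weight length headroom, components(2)
matrix Ev = e(Ev)
display "retained: " e(f) "   eigenvalues stored: " colsof(Ev)

* Keep only components with eigenvalues greater than 1
quietly pca trunk weight length headroom, mineigen(1)
display "retained: " e(f)

* Both rules together: at most 3 components, each with eigenvalue > 1
quietly pca trunk weight length headroom, components(3) mineigen(1)
display "retained: " e(f)
```

In each case **e(Ev)** still holds all eigenvalues; only the number of stored
eigenvectors changes.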

**correlation** and **covariance** specify that principal components be
calculated for the correlation matrix and covariance matrix,
respectively. The default is **correlation**. Unlike factor analysis,
PCA is not scale invariant; the eigenvalues and eigenvectors of a
covariance matrix differ from those of the associated correlation
matrix. Usually, a PCA of a covariance matrix is meaningful only if
the variables are expressed in the same units.
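
One way to see the relationship between the two analyses is to standardize
the variables first: a PCA of the covariance matrix of standardized variables
should reproduce the PCA of the correlation matrix of the original variables.
The sketch below assumes the auto data; the z_ variable names and the matrix
names Ecorr and Ecov are made up for this illustration.

```stata
sysuse auto, clear

* PCA of the correlation matrix (the default)
quietly pca trunk weight length headroom
matrix Ecorr = e(Ev)

* Standardize each variable to mean 0 and standard deviation 1
foreach v of varlist trunk weight length headroom {
    egen double z_`v' = std(`v')
}

* PCA of the covariance matrix of the standardized variables
quietly pca z_trunk z_weight z_length z_headroom, covariance
matrix Ecov = e(Ev)

* The two sets of eigenvalues should agree up to rounding error
display mreldif(Ecorr, Ecov)
```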

For **pcamat**, do not confuse the type of the matrix to be analyzed with
the type of *matname*. Obviously, if *matname* is a correlation matrix
and the option **sds()** is not specified, it is not possible to perform
a PCA of the covariance matrix.
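
The following sketch illustrates the distinction under stated assumptions: a
correlation matrix is passed as *matname*, and **sds()** supplies the standard
deviations so that a PCA of the covariance matrix can still be requested. The
use of **correlate** and **tabstat** to build the inputs, and the matrix names R,
sd, Ecov, and Edata, are choices made for this example only.

```stata
sysuse auto, clear

* Correlation matrix and standard deviations of the same four variables
quietly correlate trunk weight length headroom
matrix R = r(C)
quietly tabstat trunk weight length headroom, statistics(sd) save
matrix sd = r(StatTotal)           // 1 x 4, columns named after the variables

* matname is a correlation matrix, but the covariance matrix is analyzed
pcamat R, n(74) sds(sd) covariance
matrix Ecov = e(Ev)

* For comparison: PCA of the covariance matrix computed from the data
quietly pca trunk weight length headroom, covariance
matrix Edata = e(Ev)
display mreldif(Ecov, Edata)
```

If the two analyses agree, the displayed relative difference is close to zero.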

**vce(none**|**normal)** specifies whether standard errors are to be computed for
the eigenvalues, the eigenvectors, and the (cumulative) percentage of
explained variance (confirmatory PCA). These standard errors are
obtained assuming multivariate normality of the data and are valid
only for a PCA of a covariance matrix. Be cautious if applying these
to correlation matrices.

+-----------+
----+ Reporting +--------------------------------------------------------

**level(***#***)** specifies the confidence level, as a percentage, for confidence
intervals. The default is **level(95)** or as set by **set level**. **level()**
is allowed only with **vce(normal)**.

**blanks(***#***)** shows blanks for loadings with absolute value smaller than *#*.
This option is ignored when specified with **vce(normal)**.

**novce** suppresses the display of standard errors, even though they are
computed, and displays the PCA results in a matrix/table style. You
can specify **novce** during estimation in combination with **vce(normal)**.
More likely, you will want to use **novce** during replay.
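
A typical sequence, sketched here with the bg2 data used in the examples
below, is to estimate once with **vce(normal)** and then replay the results in
the compact matrix style. The 90% level is an arbitrary choice for this
illustration.

```stata
webuse bg2, clear

* Estimate with standard errors and 90% confidence intervals
pca bg2cost*, vce(normal) level(90)

* Replay the same results without standard errors
pca, novce

* The VCE remains stored after the replay
matrix V = e(V)
display "rows of e(V): " rowsof(V)
```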

**means** displays summary statistics of the variables over the estimation
sample. This option is not available with **pcamat**.

+----------+
----+ Advanced +---------------------------------------------------------

**tol(***#***)** is an advanced, rarely used option and is available only with
**vce(normal)**. An eigenvalue, *ev_i*, is classified as being close to
zero if *ev_i* < *tol* * max(*ev*). Two eigenvalues, *ev_1* and *ev_2*, are
"close" if abs(*ev_1* - *ev_2*) < *tol* * max(*ev*). The default is **tol(1e-5)**.
See option **ignore** and *Technical note* below.
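
The two closeness rules can be applied by hand to a set of stored
eigenvalues. The loop below is only an illustration of the rules as stated
above, not the code that **pca** runs internally; the scalar name tolcut and the
matrix name Ev are hypothetical. Because **e(Ev)** is sorted, it is enough to
compare neighboring eigenvalues.

```stata
sysuse auto, clear
quietly pca trunk weight length headroom, covariance
matrix Ev = e(Ev)                  // sorted, so Ev[1,1] is the largest
local p = colsof(Ev)
scalar tolcut = 1e-5 * Ev[1,1]     // tol * max(ev) with the default tol

forvalues i = 1/`p' {
    * Rule 1: eigenvalue close to zero
    if Ev[1,`i'] < tolcut {
        display "eigenvalue `i' is close to zero"
    }
    * Rule 2: eigenvalue close to its neighbor
    if `i' < `p' {
        local j = `i' + 1
        if abs(Ev[1,`i'] - Ev[1,`j']) < tolcut {
            display "eigenvalues `i' and `j' are close"
        }
    }
}
```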

**ignore** is an advanced, rarely used option and is available only with
**vce(normal)**. It continues the computation of standard errors and
tests, even if some eigenvalues are suspiciously close to zero or
suspiciously close to other eigenvalues, violating crucial
assumptions of the asymptotic theory used to estimate standard errors
and tests. See *Technical note* below.

The following option is available with **pca** and **pcamat** but is not shown in
the dialog box:

**norotated** displays the unrotated principal components, even if rotated
components are available. This option may be specified only when
replaying results.

__Options unique to pcamat__

+-------+
----+ Model +------------------------------------------------------------

**shape(***shape_arg***)** specifies the shape (storage mode) for the covariance or
correlation matrix *matname*. The following shapes are supported:

**full** specifies that the correlation or covariance structure of k
variables is stored as a symmetric k x k matrix. Specifying
**shape(full)** is optional in this case.

**lower** specifies that the correlation or covariance structure of k
variables is stored as a vector with k(k+1)/2 elements in rowwise
lower-triangular order:

C(11) C(21) C(22) C(31) C(32) C(33) ... C(k1) C(k2) ... C(kk)

**upper** specifies that the correlation or covariance structure of k
variables is stored as a vector with k(k+1)/2 elements in rowwise
upper-triangular order:

C(11) C(12) C(13) ... C(1k) C(22) C(23) ... C(2k) ...
C(k-1 k-1) C(k-1 k) C(kk)
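
As a sketch of the lower-triangular storage order, the example below enters
the same 3 x 3 covariance matrix once in full form and once as a rowwise
lower-triangular vector, and compares the resulting eigenvalues. The matrix
names S, Slow, Efull, and Elow are arbitrary.

```stata
* Full 3 x 3 covariance matrix with matching row and column names
matrix S = (10.167, 22.690, 2.040 \ 22.690, 56.949, 3.788 \ 2.040, 3.788, 0.688)
matrix rownames S = visual hearing taste
matrix colnames S = visual hearing taste
quietly pcamat S, n(979)
matrix Efull = e(Ev)

* Same matrix as a rowwise lower-triangular vector:
* C(11) C(21) C(22) C(31) C(32) C(33)
matrix Slow = (10.167, 22.690, 56.949, 2.040, 3.788, 0.688)
quietly pcamat Slow, n(979) shape(lower) names(visual hearing taste)
matrix Elow = e(Ev)

* The two storage modes should give the same eigenvalues
display mreldif(Efull, Elow)
```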

**names(***namelist***)** specifies a list of k different names, which are used to
document output and to label estimation results and are used as
variable names by **predict**. By default, **pcamat** verifies that the row
and column names of *matname* and the column or row names of *matname2*
and *matname3* from the **sds()** and **means()** options are in agreement.
Using the **names()** option turns off this check.

**forcepsd** modifies the matrix *matname* to be positive semidefinite (psd)
and so to be a proper covariance matrix. If *matname* is not positive
semidefinite, it will have negative eigenvalues. By setting negative
eigenvalues to 0 and reconstructing, we obtain the least-squares
positive-semidefinite approximation to *matname*. This approximation
is a singular covariance matrix.
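
The adjustment described above can be reproduced by hand with
**matrix symeigen**. This is a sketch of the idea only, not the code used by
**pcamat**; the matrix A is a made-up example of a symmetric matrix that is not
positive semidefinite, and all matrix names are arbitrary.

```stata
* A symmetric matrix with one negative eigenvalue (made-up example)
matrix A = (1, 0.9, -0.9 \ 0.9, 1, 0.9 \ -0.9, 0.9, 1)
matrix symeigen X lambda = A
matrix list lambda

* Set negative eigenvalues to 0 and reconstruct
forvalues i = 1/3 {
    if lambda[1,`i'] < 0 {
        matrix lambda[1,`i'] = 0
    }
}
matrix Apsd = X * diag(lambda) * X'
matrix Apsd = (Apsd + Apsd') / 2    // remove rounding asymmetry

* The smallest eigenvalue of the result should be 0 up to rounding error
matrix symeigen X2 lambda2 = Apsd
matrix list lambda2
```

Because one eigenvalue is set to 0, the reconstructed matrix is singular.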

**n(***#***)** is required and specifies the number of observations.

**sds(***matname2***)** specifies a k x 1 or 1 x k matrix with the standard
deviations of the variables. The row or column names should match
the variable names, unless the **names()** option is specified. **sds()**
may be specified only if *matname* is a correlation matrix.

**means(***matname3***)** specifies a k x 1 or 1 x k matrix with the means of the
variables. The row or column names should match the variable names,
unless the **names()** option is specified. Specify **means()** if you have
variables in your dataset and want to use **predict** after **pcamat**.

__Technical note__

**pca** and **pcamat** with the **vce(normal)** option assume that

(A1) the variables are multivariate normally distributed and

(A2) the variance-covariance matrix of the observations has all
distinct and strictly positive eigenvalues.

Under assumptions A1 and A2, the eigenvalues and eigenvectors of the
sample covariance matrix can be seen as maximum likelihood estimates for
the population analogues that are asymptotically (multivariate) normally
distributed (Anderson 1963; Jackson 2003). See Tyler (1981) for related
results for elliptic distributions. Interpret the results with caution
because the asymptotic variances are rather sensitive to violations of
assumption A1 (and A2). Wald tests of hypotheses that are in conflict
with assumption A2 (for example, testing that the first and second
eigenvalues are the same) produce incorrect p-values.

Because the statistical theory for a PCA of a correlation matrix is much
more complicated, **pca** and **pcamat** compute standard errors and tests of a
correlation matrix as if it were a covariance matrix. This practice is
in line with the application of asymptotic theory in Jackson (2003).
This will usually lead to some underestimation of standard errors, but we
believe that this problem is smaller than the consequences of deviations
from normality.

You may conduct tests for multivariate normality using the **mvtest**
**normality** command (see **[MV] mvtest normality**).
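
For instance, before relying on **vce(normal)** with the bg2 data used in the
examples below, one might run the following. The choice of **stats(all)** is
only an illustration; see **[MV] mvtest normality** for the available tests.

```stata
webuse bg2, clear

* Tests of multivariate normality for the variables to be analyzed
mvtest normality bg2cost*, stats(all)
```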

__Examples__

Standard PCA for descriptive use
**. sysuse auto**
**. pca trunk weight length headroom**
**. pca trunk weight length headroom, comp(2) covariance**

PCA assuming multivariate normality of the data
**. webuse bg2**
**. pca bg2cost*, vce(normal)**

PCA of a covariance or correlation matrix
**. matrix S = ( 10.167, 22.690, 2.040 \ ///**
**22.690, 56.949, 3.788 \ ///**
**2.040, 3.788, 0.688 )**
**. matrix rownames S = visual hearing taste**
**. matrix colnames S = visual hearing taste**
**. pcamat S, n(979) comp(2)**

Same as above
**. matrix S = ( 10.167, 22.690, 2.040, ///**
**56.949, 3.788, ///**
**0.688 )**
**. pcamat S, n(979) shape(upper) comp(2) names(visual hearing taste)**

__Stored results__

**pca** and **pcamat** without the **vce(normal)** option store the following in **e()**:

Scalars
**e(N)** number of observations
**e(f)** number of retained components
**e(rho)** fraction of explained variance
**e(trace)** trace of **e(C)**
**e(lndet)** ln of the determinant of **e(C)**
**e(cond)** condition number of **e(C)**

Macros
**e(cmd)** **pca** (even for **pcamat**)
**e(cmdline)** command as typed
**e(Ctype)** **correlation** or **covariance**
**e(wtype)** weight type
**e(wexp)** weight expression
**e(title)** title in output
**e(properties)** **nob noV eigen**
**e(rotate_cmd)** program used to implement **rotate**
**e(estat_cmd)** program used to implement **estat**
**e(predict)** program used to implement **predict**
**e(marginsnotok)** predictions disallowed by **margins**

Matrices
**e(C)** p x p correlation or covariance matrix
**e(means)** 1 x p matrix of means
**e(sds)** 1 x p matrix of standard deviations
**e(Ev)** 1 x p matrix of eigenvalues (sorted)
**e(L)** p x f matrix of eigenvectors = components
**e(Psi)** 1 x p matrix of unexplained variance

Functions
**e(sample)** marks estimation sample

**pca** and **pcamat** with the **vce(normal)** option store the above, as well as
the following:

Scalars
**e(v_rho)** variance of **e(rho)**
**e(chi2_i)** chi-squared statistic for test of independence
**e(df_i)** degrees of freedom for test of independence
**e(p_i)** p-value for test of independence
**e(chi2_s)** chi-squared statistic for test of sphericity
**e(df_s)** degrees of freedom for test of sphericity
**e(p_s)** p-value for test of sphericity
**e(rank)** rank of **e(V)**

Macros
**e(vce)** **multivariate normality**
**e(properties)** **b V eigen**

Matrices
**e(b)** 1 x p+fp coefficient vector (all eigenvalues and
retained eigenvectors)
**e(Ev_bias)** 1 x p matrix: bias of eigenvalues
**e(Ev_stats)** p x 5 matrix with statistics on explained variance
**e(V)** variance-covariance matrix of the estimates **e(b)**
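
The stored results can be retrieved after estimation in the usual way. The
sketch below assumes the auto data and recomputes **e(rho)** from the stored
eigenvalues, on the understanding that the fraction of explained variance is
the sum of the retained eigenvalues divided by **e(trace)**. The names L, Ev,
and rho_check are arbitrary.

```stata
sysuse auto, clear
quietly pca trunk weight length headroom, components(2)

display "retained components: " e(f)
display "explained variance:  " e(rho)

* Copy stored matrices before working with them
matrix L  = e(L)                   // p x f eigenvectors
matrix Ev = e(Ev)                  // 1 x p eigenvalues, sorted

* Sum of the two retained eigenvalues relative to the trace of e(C)
scalar rho_check = (Ev[1,1] + Ev[1,2]) / e(trace)
display "recomputed fraction: " rho_check
```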

__References__

Anderson, T. W. 1963. Asymptotic theory for principal component analysis.
*Annals of Mathematical Statistics* 34: 122-148.

Jackson, J. E. 2003. *A User's Guide to Principal Components*. New York:
Wiley.

Milan, L., and J. Whittaker. 1995. Application of the parametric
bootstrap to models that incorporate a singular value decomposition.
*Applied Statistics* 44: 31-49.

Tyler, D. E. 1981. Asymptotic inference for eigenvectors. *Annals of*
*Statistics* 9: 725-736.