Stata 15 help for pca

[MV] pca -- Principal component analysis

Syntax

Principal component analysis of data

pca varlist [if] [in] [weight] [, options]

Principal component analysis of a correlation or covariance matrix

pcamat matname , n(#) [options pcamat_options]

matname is a k x k symmetric matrix or a k(k+1)/2 long row or column vector containing the upper or lower triangle of the correlation or covariance matrix.

options              Description
-------------------------------------------------------------------------
Model 2
  components(#)      retain maximum of # principal components; factors()
                       is a synonym
  mineigen(#)        retain eigenvalues larger than #; default is 1e-5
  correlation        perform PCA of the correlation matrix; the default
  covariance         perform PCA of the covariance matrix
  vce(none)          do not compute VCE of the eigenvalues and vectors;
                       the default
  vce(normal)        compute VCE of the eigenvalues and vectors assuming
                       multivariate normality

Reporting
  level(#)           set confidence level; default is level(95)
  blanks(#)          display loadings as blank when |loadings| < #
  novce              suppress display of SEs even though calculated
# means              display summary statistics of variables

Advanced
  tol(#)             advanced option; see Options for details
  ignore             advanced option; see Options for details

  norotated          display unrotated results, even if rotated results
                       are available (replay only)
-------------------------------------------------------------------------
# means is not allowed with pcamat.
norotated is not available in the dialog box.

pcamat_options       Description
-------------------------------------------------------------------------
Model
  shape(full)        matname is a square symmetric matrix; the default
  shape(lower)       matname is a vector with the rowwise lower triangle
                       (with diagonal)
  shape(upper)       matname is a vector with the rowwise upper triangle
                       (with diagonal)
  names(namelist)    variable names; required if matname is triangular
  forcepsd           modifies matname to be positive semidefinite
* n(#)               number of observations
  sds(matname2)      vector with standard deviations of variables
  means(matname3)    vector with means of variables
-------------------------------------------------------------------------
* n() is required for pcamat.

bootstrap, by, jackknife, rolling, statsby, and xi are allowed with pca; see prefix. However, bootstrap and jackknife results should be interpreted with caution; identification of the pca parameters involves data-dependent restrictions, possibly leading to badly biased and overdispersed estimates (Milan and Whittaker 1995).

Weights are not allowed with the bootstrap prefix.

aweights are not allowed with the jackknife prefix.

aweights and fweights are allowed with pca; see weight.

See [MV] pca postestimation for features available after estimation.

Menu

pca

Statistics > Multivariate analysis > Factor and principal component analysis > Principal component analysis (PCA)

pcamat

Statistics > Multivariate analysis > Factor and principal component analysis > PCA of a correlation or covariance matrix

Description

pca and pcamat display the eigenvalues and eigenvectors from the principal component analysis (PCA) eigen decomposition. The eigenvectors are returned in orthonormal form, that is, uncorrelated and normalized.

pca can be used to reduce the number of variables or to learn about the underlying structure of the data. With pcamat, you supply the correlation or covariance matrix directly. For pca, the correlation or covariance matrix is computed from the variables in varlist.

Options

    +---------+
----+ Model 2 +----------------------------------------------------------

components(#) and mineigen(#) specify the maximum number of components (eigenvectors or factors) to be retained. components() specifies the number directly, and mineigen() specifies it indirectly, keeping all components with eigenvalues greater than the indicated value. The options can be specified individually, together, or not at all. factors() is a synonym for components().

components(#) sets the maximum number of components (factors) to be retained. pca and pcamat always display the full set of eigenvalues but display eigenvectors only for retained components. Specifying a number larger than the number of variables in varlist is equivalent to specifying the number of variables in varlist and is the default.

mineigen(#) sets the minimum value of eigenvalues to be retained. The default is 1e-5 or the value of tol() if specified.

Specifying components() and mineigen() affects only the number of components to be displayed and stored in e(); it does not enforce the assumption that the other eigenvalues are 0. In particular, the standard errors reported when vce(normal) is specified do not depend on the number of retained components.
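As a minimal sketch of how the two retention options combine, the following assumes the auto dataset shipped with Stata (also used in Examples) and reads the number of retained components from e(f). The thresholds 3 and 1 are arbitrary illustration values.

```stata
* Sketch only; assumes the auto dataset shipped with Stata.
sysuse auto, clear

* Keep at most 3 components, and of those only the ones whose
* eigenvalue exceeds 1. All eigenvalues are still displayed.
pca trunk weight length headroom, components(3) mineigen(1)

* e(f) holds the number of components actually retained.
display "components retained: " e(f)
```

Because components() is only an upper bound, e(f) can be smaller than 3 when mineigen() excludes further components.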

correlation and covariance specify that principal components be calculated for the correlation matrix and covariance matrix, respectively. The default is correlation. Unlike factor analysis, PCA is not scale invariant; the eigenvalues and eigenvectors of a covariance matrix differ from those of the associated correlation matrix. Usually, a PCA of a covariance matrix is meaningful only if the variables are expressed in the same units.
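The lack of scale invariance can be checked directly. The sketch below, again assuming the auto dataset, standardizes each variable (the z_ prefix is a hypothetical naming choice) and shows that a covariance PCA of the standardized variables should reproduce the eigenvalues of the correlation PCA, whereas a covariance PCA of the raw variables does not.

```stata
* Sketch only; assumes the auto dataset shipped with Stata.
sysuse auto, clear
local vars trunk weight length headroom

* PCA of the correlation matrix (the default).
quietly pca `vars', correlation
matrix Ecorr = e(Ev)

* Standardize each variable to mean 0 and standard deviation 1.
foreach v of local vars {
    egen double z_`v' = std(`v')
}

* Covariance PCA of standardized variables: the covariance matrix of
* standardized variables is the correlation matrix of the originals.
quietly pca z_trunk z_weight z_length z_headroom, covariance
matrix Estd = e(Ev)
display "relative difference in eigenvalues: " mreldif(Ecorr, Estd)

* Covariance PCA of the raw variables, for comparison; here the
* variable with the largest variance dominates the first component.
quietly pca `vars', covariance
matrix list e(Ev)
```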

For pcamat, do not confuse the type of the matrix to be analyzed with the type of matname. Obviously, if matname is a correlation matrix and the option sds() is not specified, it is not possible to perform a PCA of the covariance matrix.

vce(none|normal) specifies whether standard errors are to be computed for the eigenvalues, the eigenvectors, and the (cumulative) percentage of explained variance (confirmatory PCA). These standard errors are obtained assuming multivariate normality of the data and are valid only for a PCA of a covariance matrix. Be cautious if applying these to correlation matrices.

    +-----------+
----+ Reporting +--------------------------------------------------------

level(#) specifies the confidence level, as a percentage, for confidence intervals. The default is level(95) or as set by set level. level() is allowed only with vce(normal).

blanks(#) shows blanks for loadings with absolute value smaller than #. This option is ignored when specified with vce(normal).

novce suppresses the display of standard errors, even though they are computed, and displays the PCA results in a matrix/table style. You can specify novce during estimation in combination with vce(normal). More likely, you will want to use novce during replay.

means displays summary statistics of the variables over the estimation sample. This option is not available with pcamat.
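The reporting options might be combined as in the following sketch, which assumes the auto dataset; the cutoff 0.3 and the 90% level are arbitrary illustration values.

```stata
* Sketch only; assumes the auto dataset shipped with Stata.
sysuse auto, clear

* Descriptive PCA: blank out loadings smaller than 0.3 in absolute
* value and show summary statistics over the estimation sample.
pca trunk weight length headroom, components(2) blanks(0.3) means

* Confirmatory PCA: standard errors with 90% confidence intervals.
pca trunk weight length headroom, components(2) vce(normal) level(90)

* Replay the same results in matrix/table style, without the
* standard errors (they remain stored in e()).
pca, novce
```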

    +----------+
----+ Advanced +---------------------------------------------------------

tol(#) is an advanced, rarely used option and is available only with vce(normal). An eigenvalue, ev_i, is classified as being close to zero if ev_i < tol * max(ev). Two eigenvalues, ev_1 and ev_2, are "close" if abs(ev_1-ev_2) < tol*max(ev). The default is tol(1e-5). See option ignore and Technical note below.

ignore is an advanced, rarely used option and is available only with vce(normal). It continues the computation of standard errors and tests, even if some eigenvalues are suspiciously close to zero or suspiciously close to other eigenvalues, violating crucial assumptions of the asymptotic theory used to estimate standard errors and tests. See Technical note below.
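The two closeness rules given for tol() can be applied by hand to the stored eigenvalues. This is a sketch of the stated inequalities, not of pca's internal code. It assumes the auto dataset and assumes that e(Ev) is sorted from largest to smallest, so that only adjacent eigenvalues need to be compared; the local macro flagged is a hypothetical counter.

```stata
* Sketch only; assumes the auto dataset shipped with Stata.
sysuse auto, clear
quietly pca trunk weight length headroom

matrix Ev = e(Ev)
local p   = colsof(Ev)
local tol = 1e-5
* Assumption: e(Ev) is sorted in descending order, so the first
* entry is max(ev).
scalar maxev = Ev[1, 1]
local flagged = 0

forvalues i = 1/`p' {
    * Rule 1: ev_i < tol * max(ev)
    if Ev[1, `i'] < `tol' * maxev {
        display "eigenvalue `i' is close to zero"
        local ++flagged
    }
    * Rule 2: abs(ev_i - ev_j) < tol * max(ev), adjacent pairs only
    if `i' < `p' {
        local j = `i' + 1
        if abs(Ev[1, `i'] - Ev[1, `j']) < `tol' * maxev {
            display "eigenvalues `i' and `j' are close to each other"
            local ++flagged
        }
    }
}
display "conditions flagged: `flagged'"
```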

The following option is available with pca and pcamat but is not shown in the dialog box:

norotated displays the unrotated principal components, even if rotated components are available. This option may be specified only when replaying results.

Options unique to pcamat

    +-------+
----+ Model +------------------------------------------------------------

shape(shape_arg) specifies the shape (storage mode) for the covariance or correlation matrix matname. The following shapes are supported:

full specifies that the correlation or covariance structure of k variables is stored as a symmetric k x k matrix. Specifying shape(full) is optional in this case.

lower specifies that the correlation or covariance structure of k variables is stored as a vector with k(k+1)/2 elements in rowwise lower-triangular order:

C(11) C(21) C(22) C(31) C(32) C(33) ... C(k1) C(k2) ... C(kk)

upper specifies that the correlation or covariance structure of k variables is stored as a vector with k(k+1)/2 elements in rowwise upper-triangular order:

C(11) C(12) C(13) ... C(1k) C(22) C(23) ... C(2k) ... C(k-1 k-1) C(k-1 k) C(kk)
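For comparison with the shape(upper) example in Examples, the same 3 x 3 covariance matrix might be entered in rowwise lower-triangular order as sketched below. The values and variable names are those used in Examples.

```stata
* Rowwise lower triangle: C(11) C(21) C(22) C(31) C(32) C(33)
matrix S = ( 10.167,                 ///
             22.690, 56.949,         ///
              2.040,  3.788, 0.688 )

* names() is required because S is a vector and carries no
* variable names of its own.
pcamat S, n(979) shape(lower) comp(2) names(visual hearing taste)
```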

names(namelist) specifies a list of k different names, which are used to document output and to label estimation results and are used as variable names by predict. By default, pcamat verifies that the row and column names of matname and the column or row names of matname2 and matname3 from the sds() and means() options are in agreement. Using the names() option turns off this check.

forcepsd modifies the matrix matname to be positive semidefinite (psd) and so to be a proper covariance matrix. If matname is not positive semidefinite, it will have negative eigenvalues. By setting negative eigenvalues to 0 and reconstructing, we obtain the least-squares positive-semidefinite approximation to matname. This approximation is a singular covariance matrix.
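The reconstruction described above can be sketched by hand with matrix symeigen. The matrix R below is a made-up example that is symmetric but not positive semidefinite, and x1, x2, and x3 are hypothetical variable names. The manual steps illustrate the description in the text and are not claimed to match pcamat's internal computation in every detail.

```stata
* A symmetric matrix that is not positive semidefinite (made-up values).
matrix R = ( 1, .9, -.9 \  ///
            .9,  1,  .9 \  ///
           -.9, .9,   1 )
matrix rownames R = x1 x2 x3
matrix colnames R = x1 x2 x3

* Eigen decomposition: X holds the eigenvectors, v the eigenvalues.
matrix symeigen X v = R
matrix list v

* Set negative eigenvalues to 0 and reconstruct.
forvalues j = 1/3 {
    if v[1, `j'] < 0 {
        matrix v[1, `j'] = 0
    }
}
matrix Rpsd = X * diag(v) * X'
matrix list Rpsd

* Let pcamat perform the adjustment itself.
pcamat R, n(100) forcepsd
```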

n(#) is required and specifies the number of observations.

sds(matname2) specifies a k x 1 or 1 x k matrix with the standard deviations of the variables. The row or column names should match the variable names, unless the names() option is specified. sds() may be specified only if matname is a correlation matrix.

means(matname3) specifies a k x 1 or 1 x k matrix with the means of the variables. The row or column names should match the variable names, unless the names() option is specified. Specify means() if you have variables in your dataset and want to use predict after pcamat.
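The following sketch shows sds() and means() in use: it builds a correlation matrix, a vector of standard deviations, and a vector of means from the auto dataset, and then asks pcamat for a PCA of the covariance matrix. The matrix names R, T, m, and sd are hypothetical. The result should agree with pca run on the data with the covariance option.

```stata
* Sketch only; assumes the auto dataset shipped with Stata.
sysuse auto, clear
local vars trunk weight length headroom

* Correlation matrix and number of observations.
quietly correlate `vars'
matrix R = r(C)
local n = r(N)

* Means (row 1) and standard deviations (row 2); column names are
* the variable names, so names() is not needed.
quietly tabstat `vars', statistics(mean sd) save
matrix T  = r(StatTotal)
matrix m  = T[1, 1...]
matrix sd = T[2, 1...]

* R is a correlation matrix, but sds() makes a PCA of the
* covariance matrix possible.
pcamat R, n(`n') sds(sd) means(m) covariance comp(2)
matrix Emat = e(Ev)

* Same analysis computed from the data.
quietly pca `vars', covariance comp(2)
matrix Edat = e(Ev)
display "relative difference in eigenvalues: " mreldif(Emat, Edat)
```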

Technical note

pca and pcamat with the vce(normal) option assume that

(A1) the variables follow a multivariate normal distribution and

(A2) the variance-covariance matrix of the observations has all distinct and strictly positive eigenvalues.

Under assumptions A1 and A2, the eigenvalues and eigenvectors of the sample covariance matrix can be seen as maximum likelihood estimates for the population analogues that are asymptotically (multivariate) normally distributed (Anderson 1963; Jackson 2003). See Tyler (1981) for related results for elliptic distributions. Be cautious when interpreting because the asymptotic variances are rather sensitive to violations of assumption A1 (and A2). Wald tests of hypotheses that are in conflict with assumption A2 (for example, testing that the first and second eigenvalue are the same) produce incorrect p-values.

Because the statistical theory for a PCA of a correlation matrix is much more complicated, pca and pcamat compute standard errors and tests of a correlation matrix as if it were a covariance matrix. This practice is in line with the application of asymptotic theory in Jackson (2003). This will usually lead to some underestimation of standard errors, but we believe that this problem is smaller than the consequences of deviations from normality.

You may conduct tests for multivariate normality using the mvtest normality command (see [MV] mvtest normality).
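A minimal sketch of such a check before relying on vce(normal), assuming the auto dataset and the same variables as in Examples:

```stata
* Sketch only; assumes the auto dataset shipped with Stata.
sysuse auto, clear

* Request all available multivariate normality statistics.
capture noisily mvtest normality trunk weight length headroom, stats(all)
display "return code: " _rc
```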

Examples

Standard PCA for descriptive use
    . sysuse auto
    . pca trunk weight length headroom
    . pca trunk weight length headroom, comp(2) covariance

PCA assuming multivariate normality of the data
    . webuse bg2
    . pca bg2cost*, vce(normal)

PCA of a covariance or correlation matrix
    . matrix S = ( 10.167, 22.690, 2.040 \  ///
                   22.690, 56.949, 3.788 \  ///
                    2.040,  3.788, 0.688 )
    . matrix rownames S = visual hearing taste
    . matrix colnames S = visual hearing taste
    . pcamat S, n(979) comp(2)

Same as above
    . matrix S = ( 10.167, 22.690, 2.040,  ///
                           56.949, 3.788,  ///
                                   0.688 )
    . pcamat S, n(979) shape(upper) comp(2) names(visual hearing taste)

Stored results

pca and pcamat without the vce(normal) option store the following in e():

Scalars
  e(N)              number of observations
  e(f)              number of retained components
  e(rho)            fraction of explained variance
  e(trace)          trace of e(C)
  e(lndet)          ln of the determinant of e(C)
  e(cond)           condition number of e(C)

Macros
  e(cmd)            pca (even for pcamat)
  e(cmdline)        command as typed
  e(Ctype)          correlation or covariance
  e(wtype)          weight type
  e(wexp)           weight expression
  e(title)          title in output
  e(properties)     nob noV eigen
  e(rotate_cmd)     program used to implement rotate
  e(estat_cmd)      program used to implement estat
  e(predict)        program used to implement predict
  e(marginsnotok)   predictions disallowed by margins

Matrices
  e(C)              p x p correlation or covariance matrix
  e(means)          1 x p matrix of means
  e(sds)            1 x p matrix of standard deviations
  e(Ev)             1 x p matrix of eigenvalues (sorted)
  e(L)              p x f matrix of eigenvectors = components
  e(Psi)            1 x p matrix of unexplained variance

Functions
  e(sample)         marks estimation sample
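As a sketch of how these results can be used, the following recomputes the fraction of explained variance from e(Ev) and e(trace) and compares it with e(rho). It assumes the auto dataset and assumes that e(rho) is the share of total variance accounted for by the retained components; the scalar name share is hypothetical.

```stata
* Sketch only; assumes the auto dataset shipped with Stata.
sysuse auto, clear
quietly pca trunk weight length headroom, comp(2)

display "retained components: " e(f)
display "type of matrix:      `e(Ctype)'"

* Share of total variance carried by the two retained components.
matrix Ev = e(Ev)
scalar share = (Ev[1, 1] + Ev[1, 2]) / e(trace)
display "share from e(Ev):    " share
display "e(rho):              " e(rho)
```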

pca and pcamat with the vce(normal) option store the above, as well as the following:

Scalars
  e(v_rho)          variance of e(rho)
  e(chi2_i)         chi-squared statistic for test of independence
  e(df_i)           degrees of freedom for test of independence
  e(p_i)            p-value for test of independence
  e(chi2_s)         chi-squared statistic for test of sphericity
  e(df_s)           degrees of freedom for test of sphericity
  e(p_s)            p-value for test of sphericity
  e(rank)           rank of e(V)

Macros
  e(vce)            multivariate normality
  e(properties)     b V eigen

Matrices
  e(b)              1 x p+fp coefficient vector (all eigenvalues and
                      retained eigenvectors)
  e(Ev_bias)        1 x p matrix: bias of eigenvalues
  e(Ev_stats)       p x 5 matrix with statistics on explained variance
  e(V)              variance-covariance matrix of the estimates e(b)

References

Anderson, T. W. 1963. Asymptotic theory for principal component analysis. Annals of Mathematical Statistics 34: 122-148.

Jackson, J. E. 2003. A User's Guide to Principal Components. New York: Wiley.

Milan, L., and J. Whittaker. 1995. Application of the parametric bootstrap to models that incorporate a singular value decomposition. Applied Statistics 44: 31-49.

Tyler, D. E. 1981. Asymptotic inference for eigenvectors. Annals of Statistics 9: 725-736.


© Copyright 1996–2018 StataCorp LLC