Stata 15 help for camat

[MV] ca -- Simple correspondence analysis

Syntax

Simple correspondence analysis of two categorical variables

ca rowvar colvar [if] [in] [weight] [, options]

Simple correspondence analysis with crossed (stacked) variables

ca row_spec col_spec [if] [in] [weight] [, options]

Simple correspondence analysis of an n_r x n_c matrix

camat matname [, options]

where spec = varname | (newvar : varlist)

and matname is an n_r x n_c matrix with n_r, n_c >= 2.

options                Description
-------------------------------------------------------------------------
Model 2
  dimensions(#)        number of dimensions (factors, axes); default is
                         dim(2)
  normalize(nopts)     normalization of row and column coordinates
  rowsupp(matname_r)   matrix of supplementary rows
  colsupp(matname_c)   matrix of supplementary columns
  rowname(string)      label for rows
  colname(string)      label for columns
  missing              treat missing values as ordinary values (ca only)

Codes (ca only)
  report(variables)    report coding of crossing variables
  report(crossed)      report coding of crossed variables
  report(all)          report coding of crossing and crossed variables
  length(min)          use minimal length unique codes of crossing
                         variables
  length(#)            use # as coding length of crossing variables

Reporting
  ddimensions(#)       number of singular values to be displayed; default
                         is ddim(.)
  norowpoints          suppress table with row category statistics
  nocolpoints          suppress table with column category statistics
  compact              display tables in a compact format
  plot                 plot the row and column coordinates
  maxlength(#)         maximum number of characters for labels; default is
                         maxlength(12)
-------------------------------------------------------------------------

nopts                  Description
-------------------------------------------------------------------------
  symmetric            symmetric coordinates (canonical); the default
  standard             row and column standard coordinates
  row                  row principal, column standard coordinates
  column               column principal, row standard coordinates
  principal            row and column principal coordinates
  #                    power 0 <= # <= 1 for row coordinates; seldom used
-------------------------------------------------------------------------

bootstrap, by, jackknife, rolling, and statsby are allowed with ca; see prefix. However, bootstrap and jackknife results should be interpreted with caution; identification of the ca parameters involves data-dependent restrictions, possibly leading to badly biased and overdispersed estimates (Milan and Whittaker 1995).
Weights are not allowed with the bootstrap prefix.
aweights are not allowed with the jackknife prefix.
fweights, aweights, and iweights are allowed with ca; see weight.
See [MV] ca postestimation for features available after estimation.
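Because fweights are allowed with ca, a table that is already aggregated can be analyzed from cell counts stored in long form. The following is a minimal sketch, not one of the official examples: it types in the 5 x 4 smoking table shown under Example with camat and passes the counts as a frequency weight. The variable name freq and the integer codes for rank and smoking are assumptions made for illustration.

```stata
* Cell counts in long form; freq is a hypothetical variable name.
clear
input rank smoking freq
1 1  4
1 2  2
1 3  3
1 4  2
2 1  4
2 2  3
2 3  7
2 4  4
3 1 25
3 2 10
3 3 12
3 4  4
4 1 18
4 2 24
4 3 33
4 4 13
5 1 10
5 2  6
5 3  7
5 4  2
end

* Each data row counts as freq observations.
ca rank smoking [fweight=freq]
```

The results should agree with a camat run on the same table stored as a matrix.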

Menu

ca

Statistics > Multivariate analysis > Correspondence analysis > Two-way correspondence analysis (CA)

camat

Statistics > Multivariate analysis > Correspondence analysis > Two-way correspondence analysis of a matrix

Description

ca performs a simple correspondence analysis (CA) and optionally creates a biplot of two categorical variables or multiple crossed variables. camat is similar to ca but is for use with a matrix containing cross-tabulations or other nonnegative values with strictly positive margins.

Options

    +---------+
----+ Model 2 +----------------------------------------------------------

dimensions(#) specifies the number of dimensions (= factors = axes) to be extracted. The default is dimensions(2). If you specify dimensions(1), the row and column categories are placed on one dimension. # should be strictly smaller than the number of rows and the number of columns, counting only the active rows and columns, excluding supplementary rows and columns (see options rowsupp() and colsupp()).

CA is a hierarchical method, so extracting more dimensions does not affect the coordinates and decomposition of inertia of dimensions already included. The percentages of inertia accounted for by the dimensions are in decreasing order, as indicated by the singular values. The first dimension accounts for the most inertia, followed by the second dimension, then the third dimension, etc.
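The hierarchical property can be checked informally by comparing the first singular value of a one-dimensional and a two-dimensional solution. This sketch assumes the ca_smoking dataset used in the examples and reads the singular values from e(Sv), which is listed under Stored results; the matrix names Sv1 and Sv2 are hypothetical.

```stata
webuse ca_smoking, clear

* One-dimensional solution: keep its singular values.
quietly ca rank smoking, dimensions(1)
matrix Sv1 = e(Sv)

* Two-dimensional solution: its first singular value should match.
quietly ca rank smoking, dimensions(2)
matrix Sv2 = e(Sv)

matrix list Sv1
matrix list Sv2
display "relative difference: " reldif(Sv1[1,1], Sv2[1,1])
```

A relative difference at or near zero is what the paragraph above leads one to expect.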

normalize(nopts) specifies the normalization method, that is, how the row and column coordinates are obtained from the singular vectors and singular values of the matrix of standardized residuals. See Normalization and interpretation of CA under Remarks for a discussion of these different normalization methods.

symmetric, the default, distributes the inertia equally over rows and columns, treating the rows and columns symmetrically. The symmetric normalization is also known as the standard, or canonical, normalization. This is the most common normalization when making a biplot. normalize(symmetric) is equivalent to normalize(0.5). canonical is a synonym for symmetric.

standard specifies that row and column coordinates should be in standard form (singular vectors divided by the square root of mass). This normalization method is not equivalent to normalize(#) for any #.

row specifies principal row coordinates and standard column coordinates. This option should be chosen if you want to compare row categories. Similarity of column categories should not be interpreted. The biplot interpretation of the relationship between row and column categories is appropriate. normalize(row) is equivalent to normalize(1).

column specifies principal column coordinates and standard row coordinates. This option should be chosen if you want to compare column categories. Similarity of row categories should not be interpreted. The biplot interpretation of the relationship between row and column categories is appropriate. normalize(column) is equivalent to normalize(0).

principal is the normalization to choose if you want to make comparisons among the row categories and among the column categories. In this normalization, comparing row and column points is not appropriate. Thus a biplot in this normalization is best avoided. In the principal normalization, the row and column coordinates are obtained from the left and right singular vectors, multiplied by the singular values. This normalization method is not equivalent to normalize(#) for any #.

#, 0 <= # <= 1, is seldom used; it specifies that the row coordinates are obtained as the left singular vectors multiplied by the singular values to the power #, whereas the column coordinates equal the right singular vectors multiplied by the singular values to the power 1-#.
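The stated equivalence between a named normalization and a power can be checked by comparing stored coordinates. The sketch below assumes the ca_smoking dataset and compares normalize(row) with normalize(1) through e(TR) and e(TC); the matrix names are hypothetical.

```stata
webuse ca_smoking, clear

* Named normalization.
quietly ca rank smoking, normalize(row)
matrix TR_row = e(TR)
matrix TC_row = e(TC)

* Power normalization that the text describes as equivalent.
quietly ca rank smoking, normalize(1)
matrix TR_one = e(TR)
matrix TC_one = e(TC)

* mreldif() returns the largest relative difference between two matrices.
display "rows:    " mreldif(TR_row, TR_one)
display "columns: " mreldif(TC_row, TC_one)
```

Both differences should be at or near zero. The same pattern applies to normalize(column) versus normalize(0) and normalize(symmetric) versus normalize(0.5).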

rowsupp(matname_r) specifies a matrix of supplementary rows. matname_r should have n_c columns. The row names of matname_r are used for labeling. Supplementary rows do not affect the computation of the dimensions and the decomposition of inertia. They are, however, included in the plots and in the table with statistics of the row points. Because supplementary points do not contribute to the dimensions, their entries under the column labeled contrib are left blank.

colsupp(matname_c) specifies a matrix of supplementary columns. matname_c should have n_r rows. The column names of matname_c are used for labeling. Supplementary columns do not affect the computation of the dimensions and the decomposition of inertia. They are, however, included in the plots and in the table with statistics of the column points. Because supplementary points do not contribute to the dimensions, their entries under the column labeled contrib are left blank.
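That supplementary points leave the dimensions untouched can be verified by comparing singular values with and without them. This sketch assumes the ca_smoking dataset and reuses the supplementary row SR from the examples; Sv_plain and Sv_supp are hypothetical names.

```stata
webuse ca_smoking, clear

* Analysis of the active rows and columns only.
quietly ca rank smoking
matrix Sv_plain = e(Sv)

* Same analysis with one supplementary row (national smoking distribution).
matrix SR = (42, 29, 20, 9)
matrix rownames SR = national
quietly ca rank smoking, rowsupp(SR)
matrix Sv_supp = e(Sv)

display "relative difference: " mreldif(Sv_plain, Sv_supp)
```

The difference should be at or near zero, because SR is only projected onto the dimensions and does not help define them.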

rowname(string) specifies a label to refer to the rows of the matrix. The default is rowname(rowvar) for ca and rowname(rows) for camat.

colname(string) specifies a label to refer to the columns of the matrix. The default is colname(colvar) for ca and colname(columns) for camat.

missing, allowed only with ca, treats missing values of rowvar and colvar as ordinary categories to be included in the analysis. Observations with missing values are omitted from the analysis by default.

    +-------+
----+ Codes +------------------------------------------------------------

report(opt) displays coding information for the crossing variables, crossed variables, or both. report() is ignored if you do not specify at least one crossed variable.

report(variables) displays the coding schemes of the crossing variables, that is, the variables used to define the crossed variables.

report(crossed) displays a table explaining the value labels of the crossed variables.

report(all) displays the codings of the crossing and crossed variables.

length(opt) specifies the coding length of crossing variables.

length(min) specifies that the minimal-length unique codes of crossing variables be used.

length(#) specifies that the coding length # of crossing variables be used, where # must be between 4 and 32.

    +-----------+
----+ Reporting +--------------------------------------------------------

ddimensions(#) specifies the number of singular values to be displayed. The default is ddimensions(.), meaning all.

norowpoints suppresses the table with row point (category) statistics.

nocolpoints suppresses the table with column point (category) statistics.

compact specifies that the table with point statistics be displayed multiplied by 1,000 as proposed by Greenacre (2007), enabling the display of more columns without wrapping output. The compact tables can be displayed without wrapping for models with two dimensions at line size 79 and with three dimensions at line size 99.

plot displays a plot of the row and column coordinates in two dimensions. With row principal normalization, only the row points are plotted. With column principal normalization, only the column points are plotted. In the other normalizations, both row and column points are plotted. You can use cabiplot directly if you need another selection of points to be plotted or if you want to otherwise refine the plot; see [MV] ca postestimation plots.

maxlength(#) specifies the maximum number of characters for row and column labels in plots. The default is maxlength(12).

Note: The reporting options may be specified during estimation or replay.
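As an illustration of the note above, the following sketch fits the model once and then redisplays it with different reporting options. It assumes the ca_smoking dataset and that typing ca with options but no variables replays the last ca results, as is usual for Stata estimation commands.

```stata
webuse ca_smoking, clear

* Estimation with the default reporting.
ca rank smoking

* Replay: no variables, only reporting options.
ca, compact norowpoints
```

The replay changes only what is displayed; the stored results are those of the original fit.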

Remarks

Normalization and interpretation of CA

The normalization method used in the CA determines whether and how the similarity of the row categories, the similarity of the column categories, and the relationship (association) between the row and column variables can be interpreted in terms of the row and column coordinates and the origin of the plot.

How does one compare row points -- provided that the normalization method allows such a comparison? Formally, the Euclidean distance between the row points approximates the chi-squared distances between the corresponding row profiles. Thus, in the biplot, row categories mapped close together have similar row profiles; that is, the distributions on the column variable are similar. Row categories mapped widely apart have dissimilar row profiles. Moreover, the Euclidean distance between a row point and the origin approximates the chi-squared distance between the row profile and the row centroid, so it indicates how different a category is from the population.

An analogous interpretation applies to column points.

For the association between the row and column variables: In the CA biplot, one should not interpret the distance between a row point r and a column point c as the relationship of r and c. Instead, think in terms of the vectors origin to r (OR) and origin to c (OC). Remember that CA decomposes scaled deviations d(r,c) from independence, and d(r,c) is approximated by the inner product of OR and OC. The larger the absolute value of d(r,c), the stronger the association between r and c. In geometric terms, d(r,c) can be written as the product of the length of OR, the length of OC, and the cosine of the angle between OR and OC.

What does this mean? First, consider the effects of the angle. The association in (r,c) is strongly positive if OR and OC point in roughly the same direction; the frequency of (r,c) is much higher than expected under independence, and so r tends to flock together with c--if the points r and c are close together. Similarly, the association is strongly negative if OR and OC point in opposite directions. Here, the frequency of (r,c) is much lower than expected under independence, and so r and c are unlikely to occur simultaneously. Finally, if OR and OC are roughly orthogonal (angle = +/- 90 degrees), the deviation from independence is small.

Second, the association of r and c increases with the lengths of OR and OC. Points far from the origin tend to have large associations. If a category is mapped close to the origin, all its associations with categories of the other variable are small: its distribution resembles the marginal distribution.
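The inner products described above can be computed directly from the stored coordinates. The sketch below assumes the ca_smoking dataset and the symmetric normalization, and it forms the products of the row coordinates in e(TR) with the column coordinates in e(TC). The matrix names TR, TC, and D are hypothetical, and the exact scaling of d(r,c) relative to these products is not stated here, so only signs and relative sizes should be read from the result.

```stata
webuse ca_smoking, clear
quietly ca rank smoking, normalize(symmetric)

* Copy the normalized coordinates (rows: 5 x 2, columns: 4 x 2).
matrix TR = e(TR)
matrix TC = e(TC)

* Entry (i,j) is the inner product of row point i and column point j.
matrix D = TR * TC'
matrix list D, format(%8.3f)
```

Positive entries correspond to vectors pointing in roughly the same direction, negative entries to vectors pointing in opposite directions, and entries near zero to roughly orthogonal vectors or points near the origin.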

Here are the interpretations enabled by the main normalization methods as specified in the normalize() option.

------------------------------------------------------
              similarity   similarity    association
method        row cat.     column cat.   row vs column
------------------------------------------------------
symmetric     no           no            yes
principal     yes          yes           no
row           yes          no            yes
column        no           yes           yes
------------------------------------------------------

When we say that a comparison between row categories or between column categories is not possible, we mean that the chi-squared distance between row profiles or column profiles is approximated not by the standard Euclidean distance between the respective points but by a weighted Euclidean distance in which the weights depend on the inertia of the dimensions.

You may want to do a CA in principal normalization to study the relationship between the categories of a variable and do a CA in symmetric normalization to study the association of the row and column categories.

Examples with ca

ca creates the two-way frequency table from individual-level data and performs a CA of this table.

. webuse ca_smoking
. ca rank smoking
. ca rank smoking, dim(3)

We want to include the distribution of smoking, estimated in a national sample, in the analysis. The data for supplementary points are entered as a row vector with one row and four columns, one for each smoking category:

. matrix SR = (42, 29, 20, 9)
. matrix rownames SR = national
. ca rank smoking, rowsupp(SR) plot

Example with ca with crossed variables

You want to analyze how gender and education affect response to the statement "We believe too often in science, and not enough in feelings or faith," coded in variable A, which has five categories, with 1 indicating strong agreement and 5 indicating strong disagreement. Variable sex contains information on gender (two categories), and variable edu contains information on education (six categories). We think of the variables sex and edu as a demographic classification with 2x6=12 categories. ca performs a CA of the 5x12 frequency table:

. webuse issp93
. label language short
. ca A (demo : sex edu), dim(2) report(c) length(min)

Example with camat

To conduct a CA of data in tabular format it is convenient to store the data in a Stata matrix and to use camat instead of ca. Consider this table:

------------------------------------------------
                |            smoking
personnel       | none   light   medium   heavy
----------------+-------------------------------
senior manager  |    4       2        3       2
junior manager  |    4       3        7       4
senior employee |   25      10       12       4
junior employee |   18      24       33      13
secretary       |   10       6        7       2
------------------------------------------------

The following code creates a Stata matrix F with the frequencies and with the appropriate row and column names.

. matrix F = ( 4,2,3,2 \ 4,3,7,4 \ 25,10,12,4 \ 18,24,33,13 \ 10,6,7,2 )
. matrix colnames F = none light medium heavy
. matrix rownames F = sen_mngr jun_mngr sen_empl jun_employ secr

To conduct the CA with two dimensions (the default) and produce a plot, invoke camat on F.

. camat F, rowname(rank) colname(smoking) plot
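When the table comes from data already in memory, the matrix need not be typed by hand. As a sketch, assuming the ca_smoking dataset, the cell counts can be saved with the matcell() option of tabulate and then passed to camat; the matrix name F2 is hypothetical, and its rows and columns carry the default names r1, r2, ... and c1, c2, ... unless renamed.

```stata
webuse ca_smoking, clear

* Save the cell counts of the two-way table in matrix F2.
quietly tabulate rank smoking, matcell(F2)
matrix list F2

* Analyze the stored table.
camat F2, rowname(rank) colname(smoking)
```

The results should match those of ca rank smoking on the same data, apart from the row and column labels.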

We add two supplementary columns with the distributions among drinking and nondrinking subjects. We create a matrix with five rows (one for each staff category) and two columns.

. matrix SC = ( 0,11 \ 1,17 \ 5,46 \ 10,78 \ 7,18)
. matrix colnames SC = nondrink drink

. camat F, rowsupp(SR) colsupp(SC) plot

Stored results

Let r be the number of rows, c be the number of columns, and f be the number of retained dimensions. ca and camat store the following in e():

Scalars
  e(N)              number of observations
  e(f)              number of dimensions (factors, axes); maximum of
                      min(r - 1, c - 1)
  e(inertia)        total inertia = e(X2)/e(N)
  e(pinertia)       inertia explained by e(f) dimensions
  e(X2)             chi-squared statistic
  e(X2_df)          degrees of freedom (r - 1)(c - 1)
  e(X2_p)           p-value for e(X2)

Macros
  e(cmd)            ca (even for camat)
  e(cmdline)        command as typed
  e(Rcrossvars)     row crossing variable names (ca only)
  e(Ccrossvars)     column crossing variable names (ca only)
  e(varlist)        the row and column variable names (ca only)
  e(wtype)          weight type (ca only)
  e(wexp)           weight expression (ca only)
  e(title)          title in estimation output
  e(ca_data)        variables or crossed
  e(Cname)          name for columns
  e(Rname)          name for rows
  e(norm)           normalization method
  e(sv_unique)      1 if the singular values are unique, 0 otherwise
  e(properties)     nob noV eigen
  e(estat_cmd)      program used to implement estat
  e(predict)        program used to implement predict
  e(marginsnotok)   predictions disallowed by margins

Matrices
  e(Ccoding)        column categories (1 x c) (ca only)
  e(Rcoding)        row categories (1 x r) (ca only)
  e(GSC)            column statistics (c x 3(1 + f))
  e(GSR)            row statistics (r x 3(1 + f))
  e(TC)             normalized column coordinates (c x f)
  e(TR)             normalized row coordinates (r x f)
  e(Sv)             singular values (1 x f)
  e(C)              column coordinates (c x f)
  e(R)              row coordinates (r x f)
  e(c)              column mass (margin) (c x 1)
  e(r)              row mass (margin) (r x 1)
  e(P)              analyzed matrix (r x c)
  e(GSC_supp)       supplementary column statistics
  e(GSR_supp)       supplementary row statistics
  e(PC_supp)        principal coordinates supplementary column points
  e(PR_supp)        principal coordinates supplementary row points
  e(TC_supp)        normalized coordinates supplementary column points
  e(TR_supp)        normalized coordinates supplementary row points

Functions
  e(sample)         marks estimation sample (ca only)
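The relationships among the stored scalars can be checked after any fit. The sketch below assumes the ca_smoking dataset, whose table has 5 rows and 4 columns, and uses only results listed above together with Stata's chi2tail() function.

```stata
webuse ca_smoking, clear
quietly ca rank smoking

* Total inertia is the chi-squared statistic divided by the sample size.
display "e(inertia)  = " e(inertia)
display "e(X2)/e(N)  = " e(X2)/e(N)

* Degrees of freedom are (r - 1)(c - 1), here (5 - 1)(4 - 1).
display "e(X2_df)    = " e(X2_df)

* The stored p-value should equal the upper tail of the chi-squared
* distribution evaluated at e(X2).
display "e(X2_p)     = " e(X2_p)
display "chi2tail()  = " chi2tail(e(X2_df), e(X2))
```

The first two displayed values should agree, as should the last two.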

References

Greenacre, M. J. 1984. Theory and Applications of Correspondence Analysis. London: Academic Press.

------. 2007. Correspondence Analysis in Practice. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC.

Milan, L., and J. Whittaker. 1995. Application of the parametric bootstrap to models that incorporate a singular value decomposition. Applied Statistics 44: 31-49.

