help ca, help camat dialogs: ca camat
also see: ca postestimation
-------------------------------------------------------------------------------
Title
[MV] ca -- Simple correspondence analysis
Syntax
Simple correspondence analysis of two categorical variables
ca rowvar colvar [if] [in] [weight] [, options]
Simple correspondence analysis with crossed (stacked) variables
ca row_spec col_spec [if] [in] [weight] [, options]
Simple correspondence analysis of an n_r x n_c matrix
camat matname [, options]
where
spec = varname | (newvar : varlist)
options               description
-------------------------------------------------------------------------
Model 2
  dimensions(#)       number of dimensions (factors, axes); default is
                        dim(2)
  normalize(nopts)    normalization of row and column coordinates
  rowsupp(matname_r)  matrix of supplementary rows
  colsupp(matname_c)  matrix of supplementary columns
  missing             treat missing values as ordinary values (ca only)
  rowname(string)     label for rows
  colname(string)     label for columns
Codes (ca only)
  report(variables)   report coding of crossing variables
  report(crossed)     report coding of crossed variables
  report(all)         report coding of crossing and crossed variables
  length(min)         use minimal length unique codes of crossing
                        variables
  length(#)           use # as coding length of crossing variables
Reporting
  ddimensions(#)      number of singular values to be displayed;
                        default is ddim(.)
  norowpoints         suppress table with row category statistics
  nocolpoints         suppress table with column category statistics
  compact             display tables in a compact format
  plot                plot the row and column coordinates
  maxlength(#)        maximum number of characters for labels; default
                        is maxlength(12)
-------------------------------------------------------------------------
nopts                 description
-------------------------------------------------------------------------
  symmetric           symmetric coordinates (canonical); the default
  standard            row and column standard coordinates
  row                 row principal, column standard coordinates
  column              column principal, row standard coordinates
  principal           row and column principal coordinates
  #                   power 0 <= # <= 1 for row coordinates; seldom
                        used
-------------------------------------------------------------------------
bootstrap, by, jackknife, rolling, and statsby are allowed with ca; see
prefix. However, bootstrap and jackknife results should be interpreted
with caution; identification of the ca parameters involves
data-dependent restrictions, possibly leading to badly biased and
overdispersed estimates.
Weights are not allowed with the bootstrap prefix.
aweights are not allowed with the jackknife prefix.
fweights, aweights, and iweights are allowed with ca; see weight.
See [MV] ca postestimation for features available after estimation.
Menu
ca
Statistics > Multivariate analysis > Correspondence analysis >
Two-way correspondence analysis (CA)
camat
Statistics > Multivariate analysis > Correspondence analysis >
Two-way correspondence analysis of a matrix
Description
ca performs a simple correspondence analysis (CA) of the cross-tabulation
of the integer-valued variables rowvar and colvar with n_r and n_c
categories, where n_r, n_c >= 2. CA is formally equivalent to various other
geometric approaches, including dual scaling, reciprocal averaging, and
canonical correlation analysis of contingency tables.
camat performs a simple CA of an n_r x n_c matrix matname having
nonnegative entries and strictly positive margins. The correspondence
table need not contain frequencies. The labels for the row and column
categories are obtained from the matrix row and column names.
Optionally, a CA biplot may be produced. The biplot displays the row and
column coordinates within the same two-dimensional graph.
Results may be replayed using ca or camat; there is no difference.
Options
+---------+
----+ Model 2 +----------------------------------------------------------
dimensions(#) specifies the number of dimensions (= factors = axes) to be
extracted. The default is dimensions(2). If you specify
dimensions(1), the row and column categories are placed on one
dimension. # should be strictly smaller than the number of rows and
the number of columns, counting only the active rows and columns,
excluding supplementary rows and columns (see options rowsupp() and
colsupp()).
CA is a hierarchical method, so extracting more dimensions does
not affect the coordinates and decomposition of inertia of dimensions
already included. The percentages of inertia accounted for by the
dimensions are in decreasing order, as indicated by the singular values.
The first dimension accounts for the most inertia, followed by the
second dimension, and then the third dimension, etc.
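As a minimal sketch of this hierarchical property, one could fit the model
with two and then three dimensions and compare the saved row coordinates.
The matrix names R2 and R3 are illustrative, and the sketch assumes that
e(R) holds the row coordinates as listed under Saved results. The first
two columns of R3 should then agree with R2.
. webuse ca_smoking
. * fit with two dimensions and keep the row coordinates
. quietly ca rank smoking, dimensions(2)
. matrix R2 = e(R)
. * refit with three dimensions and keep the row coordinates
. quietly ca rank smoking, dimensions(3)
. matrix R3 = e(R)
. matrix list R2
. matrix list R3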
normalize(nopts) specifies the normalization method, i.e., how the row
and column coordinates are obtained from the singular vectors and
singular values of the matrix of standardized residuals. See
Normalization and interpretation of correspondence analysis for a
discussion of these different normalization methods.
symmetric, the default, distributes the inertia equally over rows and
columns, treating the rows and columns symmetrically. The
symmetric normalization is also known as the standard, or
canonical, normalization. This is the most common normalization
when making a biplot. normalize(symmetric) is equivalent to
normalize(0.5). canonical is a synonym for symmetric.
standard specifies that row and column coordinates should be in
standard form (singular vectors divided by the square root of
mass). This normalization method is not equivalent to
normalize(#) for any #.
row specifies principal row coordinates and standard column
coordinates. This option should be chosen if you want to compare
row categories. Similarity of column categories should not be
interpreted. The biplot interpretation of the relationship
between row and column categories is appropriate. normalize(row)
is equivalent to normalize(1).
column specifies principal column coordinates and standard row
coordinates. This option should be chosen if you want to compare
column categories. Similarity of row categories should not be
interpreted. The biplot interpretation of the relationship
between row and column categories is appropriate.
normalize(column) is equivalent to normalize(0).
principal is the normalization to choose if you want to make
comparisons among the row categories and among the column
categories. In this normalization, comparing row and column
points is not appropriate. Thus a biplot in this normalization
is best avoided. In the principal normalization, the row and
column coordinates are obtained from the left and right singular
vectors, multiplied by the singular values. This normalization
method is not equivalent to normalize(#) for any #.
#, 0 <= # <= 1, is seldom used; it specifies that the row coordinates
are obtained as the left singular vectors multiplied by the
singular values to the power #, whereas the column coordinates
equal the right singular vectors multiplied by the singular
values to the power 1-#.
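A brief sketch of how the named and numeric forms relate, assuming the
ca_smoking dataset used in the examples below. The matrix names A and B
are illustrative, and e(TR) is assumed to hold the normalized row
coordinates as listed under Saved results. If normalize(row) and
normalize(1) are equivalent as stated, the relative difference reported
by mreldif() should be numerically zero.
. webuse ca_smoking
. * row principal normalization by name
. quietly ca rank smoking, normalize(row)
. matrix A = e(TR)
. * the same normalization requested as a power
. quietly ca rank smoking, normalize(1)
. matrix B = e(TR)
. display mreldif(A, B)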
rowsupp(matname_r) specifies a matrix of supplementary rows. matname_r
should have n_c columns. The row names of matname_r are used for
labeling. Supplementary rows do not affect the computation of the
dimensions and the decomposition of inertia. They are, however,
included in the plots and in the table with statistics of the row
points. Because supplementary points do not contribute to the
dimensions, their entries under the column labeled contrib are left
blank.
colsupp(matname_c) specifies a matrix of supplementary columns.
matname_c should have n_r rows. The column names of matname_c are
used for labeling. Supplementary columns do not affect the
computation of the dimensions and the decomposition of inertia. They
are, however, included in the plots and in the table with statistics
of the column points. Because supplementary points do not contribute
to the dimensions, the contrib entries are left blank.
missing, allowed only with ca, treats missing values of rowvar and colvar
as ordinary categories to be included in the analysis. Observations
with missing values are omitted from the analysis by default.
rowname(string) specifies a label to refer to the rows of the matrix.
The default is rowname(rowvar) for ca and rowname(rows) for camat.
colname(string) specifies a label to refer to the columns of the matrix.
The default is colname(colvar) for ca and colname(columns) for camat.
+-------+
----+ Codes +------------------------------------------------------------
report(opt) displays coding information for the crossing variables,
crossed variables, or both. report() is ignored if you do not
specify at least one crossed variable.
report(variables) displays the coding schemes of the crossing
variables, i.e., the variables used to define the crossed
variables.
report(crossed) displays a table explaining the value labels of the
crossed variables.
report(all) displays the codings of the crossing and crossed
variables.
length(opt) specifies the coding length of crossing variables.
length(min) specifies that the minimal-length unique codes of
crossing variables be used.
length(#) specifies that the coding length # of crossing variables be
used, where # must be between 4 and 32.
+-----------+
----+ Reporting +--------------------------------------------------------
ddimensions(#) specifies the number of singular values to be displayed.
The default is ddimensions(.), meaning all.
norowpoints suppresses the table with row point (category) statistics.
nocolpoints suppresses the table with column point (category) statistics.
compact specifies that the table with point statistics be displayed
multiplied by 1,000, enabling the display of more columns without
wrapping output. The compact tables can be displayed without
wrapping for models with two dimensions at line size 79 and with
three dimensions at line size 99.
plot displays a plot of the row and column coordinates in two dimensions.
With row principal normalization, only the row points are plotted.
With column principal normalization, only the column points are
plotted. In the other normalizations, both row and column points are
plotted. You can use cabiplot directly if you need another selection
of points to be plotted or if you want to otherwise refine the plot;
see [MV] ca postestimation.
maxlength(#) specifies the maximum number of characters for row and
column labels in plots. The default is maxlength(12).
Note: the reporting options may be specified during estimation or replay.
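Because the reporting options are accepted on replay, a model can be fit
once and redisplayed in different formats. The following sketch assumes
the ca_smoking dataset. The cabiplot option nocolumn is quoted from memory
of [MV] ca postestimation and should be checked there before use.
. webuse ca_smoking
. quietly ca rank smoking
. * replay the results in compact form
. ca, compact
. * replay with one singular value and no row point table
. ca, ddimensions(1) norowpoints
. * draw a biplot showing only the row points
. cabiplot, nocolumn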
Remarks
Normalization and interpretation of CA
The normalization method used in the CA determines whether and how the
similarity of the row categories, the similarity of the column
categories, and the relationship (association) between the row and column
variables can be interpreted in terms of the row and column coordinates
and the origin of the plot.
How does one compare row points -- provided that the normalization method
allows such a comparison? Formally, the Euclidean distance between the
row points approximates the chi-squared distances between the
corresponding row profiles. Thus, in the biplot, row categories mapped
close together have similar row profiles; i.e., the distributions on the
column variable are similar. Row categories mapped widely apart have
dissimilar row profiles. Moreover, the Euclidean distance between a row
point and the origin approximates the chi-squared distance between the row
profile and the row centroid, so it indicates how different a category is
from the population.
An analogous interpretation applies to column points.
For the association between the row and column variables: In the CA
biplot, one should not interpret the distance between a row point r and a
column point c as the relationship of r and c. Instead, think in terms
of the vectors origin to r (OR) and origin to c (OC). Remember that CA
decomposes scaled deviations d(r,c) from independence, and d(r,c) is
approximated by the inner product of OR and OC. The larger the absolute
value of d(r,c), the stronger the association between r and c. In
geometric terms, d(r,c) can be written as the product of the length of
OR, the length of OC, and the cosine of the angle between OR and OC.
What does this mean? First, consider the effects of the angle. The
association in (r,c) is strongly positive if OR and OC point in roughly
the same direction; the frequency of (r,c) is much higher than expected
under independence, and so r tends to flock together with c--if the
points r and c are close together. Similarly, the association is
strongly negative if OR and OC point in opposite directions. Here, the
frequency of (r,c) is much lower than expected under independence, and so
r and c are unlikely to occur simultaneously. Finally, if OR and OC are
roughly orthogonal (angle = +/- 90 degrees), the deviation from independence is
small.
Second, the association of r and c increases with the lengths of OR and
OC. Points far from the origin tend to have large associations. If a
category is mapped close to the origin, all its associations with
categories of the other variable are small: its distribution resembles
the marginal distribution.
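The inner-product reading described above can be inspected numerically.
This is only a sketch: the matrix names TR, TC, and D are illustrative,
and it assumes that e(TR) and e(TC) hold the normalized row and column
coordinates as listed under Saved results. Each entry of D is the inner
product of one row vector OR with one column vector OC, so large positive
entries suggest positive association and large negative entries suggest
negative association.
. webuse ca_smoking
. * keep all three dimensions available for the 5 x 4 table
. quietly ca rank smoking, dimensions(3) normalize(symmetric)
. matrix TR = e(TR)
. matrix TC = e(TC)
. * inner products of every row point with every column point
. matrix D = TR * TC'
. matrix list D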
Here are the interpretations enabled by the main normalization methods as
specified in the normalize() option.
------------------------------------------------------
             similarity   similarity    association
method       row cat.     column cat.   row vs column
------------------------------------------------------
symmetric    no           no            yes
principal    yes          yes           no
row          yes          no            yes
column       no           yes           yes
------------------------------------------------------
If we say that a comparison between row categories or between column
categories is not possible, we really mean that the chi-squared distance
between row profiles or column profiles is approximated by a weighted
Euclidean distance between the respective points, in which the weights
depend on the inertia of the dimensions, rather than by the standard
Euclidean distance.
You may want to do a CA in principal normalization to study the
relationship between the categories of a variable and do a CA in
symmetric normalization to study the association of the row and column
categories.
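A minimal sketch of this two-step workflow, assuming the ca_smoking
dataset used in the examples below:
. webuse ca_smoking
. * compare rows with rows and columns with columns
. ca rank smoking, normalize(principal)
. * study the row-by-column association and draw the biplot
. ca rank smoking, normalize(symmetric) plot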
Examples with ca
ca creates the two-way frequency table from individual-level data and
performs a CA of this table.
. webuse ca_smoking
. ca rank smoking
. ca rank smoking, dim(3)
We want to include the distribution of smoking, estimated in a national
sample, in the analysis. The data for the supplementary point are entered
as a matrix with one row and four columns, one column for each smoking
category:
. matrix SR = (42, 29, 20, 9)
. matrix rownames SR = national
. ca rank smoking, rowsupp(SR) plot
Example with ca with crossed variables
We want to analyze how gender and education affect response to the
statement "We believe too often in science, and not enough in feelings or
faith," coded in variable A, which has five categories, with 1 indicating
strong agreement and 5 indicating strong disagreement. Variable sex
contains information on gender (two categories), and variable edu
contains information on education (six categories). We think of the
variables sex and edu as a demographic classification with 2x6=12
categories. ca performs a CA of the 5x12 frequency table:
. webuse issp93
. label language short
. ca A (demo : sex edu), dim(2) report(c) length(min)
Example with camat
To conduct a CA of data in tabular format, it is convenient to store the
data in a Stata matrix and to use camat instead of ca. Consider this
table:
------------------------------------------------
                |           smoking
personnel       |  none   light   medium   heavy
----------------+-------------------------------
senior manager  |     4       2        3       2
junior manager  |     4       3        7       4
senior employee |    25      10       12       4
junior employee |    18      24       33      13
secretary       |    10       6        7       2
------------------------------------------------
The following code creates a Stata matrix F with the frequencies and with
the appropriate row and column names.
. matrix F = ( 4,2,3,2 \ 4,3,7,4 \ 25,10,12,4 \ 18,24,33,13 \ 10,6,7,2 )
. matrix colnames F = none light medium heavy
. matrix rownames F = sen_mngr jun_mngr sen_empl jun_empl secr
To conduct the CA with two dimensions (the default) and produce a plot,
invoke camat on F.
. camat F, rowname(rank) colname(smoking) plot
We add two supplementary columns with the distributions among drinking
and nondrinking subjects. We create a matrix with five rows (one for
each staff category) and two columns. The supplementary row SR defined
in the ca example above is included as well.
. matrix SC = ( 0,11 \ 1,17 \ 5,46 \ 10,78 \ 7,18)
. matrix colnames SC = nondrink drink
. camat F, rowsupp(SR) colsupp(SC) plot
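Because the correspondence table need not contain frequencies, the same
table can also be analyzed as proportions. In this sketch the matrix name
P is illustrative, and 193 is the grand total of the frequency table
above. Since CA works with the table of proportions, the coordinates are
expected to match those from F, whereas statistics that depend on the
total, such as e(N), will differ.
. matrix F = ( 4,2,3,2 \ 4,3,7,4 \ 25,10,12,4 \ 18,24,33,13 \ 10,6,7,2 )
. matrix colnames F = none light medium heavy
. matrix rownames F = sen_mngr jun_mngr sen_empl jun_empl secr
. * convert frequencies to proportions of the grand total
. matrix P = F / 193
. camat P, rowname(rank) colname(smoking)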
Saved results
Let r be the number of rows, c be the number of columns, and f be the
number of retained dimensions. ca and camat save the following in e():
Scalars
e(N) number of observations
e(f) number of dimensions (factors, axes); maximum of
min(r - 1,c - 1)
e(inertia) total inertia = e(X2)/e(N)
e(pinertia) inertia explained by e(f) dimensions
e(X2) chi-squared statistic
e(X2_df) degrees of freedom (r - 1)(c - 1)
e(X2_p) p-value for e(X2)
Macros
e(cmd) ca (even for camat)
e(cmdline) command as typed
e(Rcrossvars) row crossing variable names (ca only)
e(Ccrossvars) column crossing variable names (ca only)
e(varlist) the row and column variable names (ca only)
e(wtype) weight type (ca only)
e(wexp) weight expression (ca only)
e(title) title in estimation output
e(ca_data) variables or crossed
e(Cname) name for columns
e(Rname) name for rows
e(norm) normalization method
e(sv_unique) 1 if the singular values are unique, 0 otherwise
e(properties) nob noV eigen
e(estat_cmd) program used to implement estat
e(predict) program used to implement predict
e(marginsnotok) predictions disallowed by margins
Matrices
e(Ccoding) column categories (1 x c) (ca only)
e(Rcoding) row categories (1 x r) (ca only)
e(GSC) column statistics (c x 3(1 + f))
e(GSR) row statistics (r x 3(1 + f))
e(TC) normalized column coordinates (c x f)
e(TR) normalized row coordinates (r x f)
e(Sv) singular values (1 x f)
e(C) column coordinates (c x f)
e(R) row coordinates (r x f)
e(c) column mass (margin) (c x 1)
e(r) row mass (margin) (r x 1)
e(P) analyzed matrix (r x c)
e(GSC_supp) supplementary column statistics
e(GSR_supp) supplementary row statistics
e(PC_supp) principal coordinates supplementary column points
e(PR_supp) principal coordinates supplementary row points
e(TC_supp) normalized coordinates supplementary column points
e(TR_supp) normalized coordinates supplementary row points
Functions
e(sample) marks estimation sample (ca only)
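As a short sketch of how these results might be inspected after
estimation, assuming the ca_smoking dataset: the two display commands
should print the same value if e(inertia) equals e(X2)/e(N) as documented
above.
. webuse ca_smoking
. quietly ca rank smoking
. * list everything saved in e()
. ereturn list
. * total inertia, directly and from the chi-squared statistic
. display e(inertia)
. display e(X2) / e(N)
. * singular values and row masses
. matrix list e(Sv)
. matrix list e(r)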
Also see
Manual: [MV] ca
Help: [MV] ca postestimation;
[MV] mca, [R] tabulate oneway, [R] tabulate twoway, [R]
tabulate, summarize()