**[MV] ca** -- Simple correspondence analysis

__Syntax__

Simple correspondence analysis of two categorical variables

**ca** *rowvar* *colvar* [*if*] [*in*] [*weight*] [**,** *options*]

Simple correspondence analysis with crossed (stacked) variables

**ca** *row_spec* *col_spec* [*if*] [*in*] [*weight*] [**,** *options*]

Simple correspondence analysis of an *n_r* x *n_c* matrix

**camat** *matname* [**,** *options*]

where
*spec* = *varname* | **(***newvar* **:** *varlist***)**

and *matname* is an *n_r* x *n_c* matrix with *n_r*, *n_c* >= 2.

*options* Description
-------------------------------------------------------------------------
Model 2
__dim__**ensions(***#***)** number of dimensions (factors, axes); default is **dim(2)**
__norm__**alize(***nopts***)** normalization of row and column coordinates
__rows__**upp(***matname_r***)** matrix of supplementary rows
__cols__**upp(***matname_c***)** matrix of supplementary columns
__rown__**ame(***string***)** label for rows
__coln__**ame(***string***)** label for columns
__mis__**sing** treat missing values as ordinary values (**ca** only)

Codes (**ca** only)
__rep__**ort(**__v__**ariables)** report coding of crossing variables
__rep__**ort(**__c__**rossed)** report coding of crossed variables
__rep__**ort(**__a__**ll)** report coding of crossing and crossed variables
__len__**gth(**__m__**in)** use minimal length unique codes of crossing variables
__len__**gth(***#***)** use *#* as coding length of crossing variables

Reporting
__ddim__**ensions(***#***)** number of singular values to be displayed; default is **ddim(.)**
__norowp__**oints** suppress table with row category statistics
__nocolp__**oints** suppress table with column category statistics
__comp__**act** display tables in a compact format
**plot** plot the row and column coordinates
__max__**length(***#***)** maximum number of characters for labels; default is **maxlength(12)**
-------------------------------------------------------------------------

*nopts* Description
-------------------------------------------------------------------------
__sy__**mmetric** symmetric coordinates (__ca__**nonical**); the default
__st__**andard** row and column standard coordinates
__ro__**w** row principal, column standard coordinates
__co__**lumn** column principal, row standard coordinates
__pr__**incipal** row and column principal coordinates
*#* power **0** <= *#* <= **1** for row coordinates; seldom used
-------------------------------------------------------------------------

**bootstrap**, **by**, **jackknife**, **rolling**, and **statsby** are allowed with **ca**; see
prefix. However, **bootstrap** and **jackknife** results should be interpreted
with caution; identification of the **ca** parameters involves
data-dependent restrictions, possibly leading to badly biased and
overdispersed estimates (Milan and Whittaker 1995).
Weights are not allowed with the **bootstrap** prefix.
**aweight**s are not allowed with the **jackknife** prefix.
**fweight**s, **aweight**s, and **iweight**s are allowed with **ca**; see weight.
See **[MV] ca postestimation** for features available after estimation.

__Menu__

__ca__

**Statistics > Multivariate analysis > Correspondence analysis >**
**Two-way correspondence analysis (CA)**

__camat__

**Statistics > Multivariate analysis > Correspondence analysis >**
**Two-way correspondence analysis of a matrix**

__Description__

**ca** performs a simple correspondence analysis (CA) and optionally creates
a biplot of two categorical variables or multiple crossed variables.
**camat** is similar to **ca** but is for use with a matrix containing
cross-tabulations or other nonnegative values with strictly positive
margins.

__Options__

+---------+
----+ Model 2 +----------------------------------------------------------

**dimensions(***#***)** specifies the number of dimensions (= factors = axes) to be
extracted. The default is **dimensions(2)**. If you specify
**dimensions(1)**, the row and column categories are placed on one
dimension. *#* should be strictly smaller than the number of rows and
the number of columns, counting only the active rows and columns,
excluding supplementary rows and columns (see options **rowsupp()** and
**colsupp()**).

CA is a hierarchical method, so extracting more dimensions does not
affect the coordinates and decomposition of inertia of dimensions
already included. The percentages of inertia accounted for by the
dimensions are in decreasing order, as indicated by the singular values.
The first dimension accounts for the most inertia, followed by the
second dimension, then the third dimension, and so on.
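
A minimal sketch of the hierarchical property, using the **ca_smoking** dataset from the examples in this entry. Fitting first with one dimension and then with three should leave the leading singular value unchanged. Listing **e(Sv)** after each fit is only one suggested way to check this and is not part of the original text.

```stata
* Sketch: adding dimensions should not change those already extracted
webuse ca_smoking, clear
ca rank smoking, dimensions(1)
matrix list e(Sv)
ca rank smoking, dimensions(3)
matrix list e(Sv)
```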

**normalize(***nopts***)** specifies the normalization method, that is, how the row
and column coordinates are obtained from the singular vectors and
singular values of the matrix of standardized residuals. See
*Normalization and interpretation of correspondence analysis* for a
discussion of these different normalization methods.

**symmetric**, the default, distributes the inertia equally over rows and
columns, treating the rows and columns symmetrically. The
symmetric normalization is also known as the standard, or
canonical, normalization. This is the most common normalization
when making a biplot. **normalize(symmetric)** is equivalent to
**normalize(0.5)**. __ca__**nonical** is a synonym for **symmetric**.

**standard** specifies that row and column coordinates should be in
standard form (singular vectors divided by the square root of
mass). This normalization method is not equivalent to
**normalize(***#***)** for any *#*.

**row** specifies principal row coordinates and standard column
coordinates. This option should be chosen if you want to compare
row categories. Similarity of column categories should not be
interpreted. The biplot interpretation of the relationship
between row and column categories is appropriate. **normalize(row)**
is equivalent to **normalize(1)**.

**column** specifies principal column coordinates and standard row
coordinates. This option should be chosen if you want to compare
column categories. Similarity of row categories should not be
interpreted. The biplot interpretation of the relationship
between row and column categories is appropriate.
**normalize(column)** is equivalent to **normalize(0)**.

**principal** is the normalization to choose if you want to make
comparisons among the row categories and among the column
categories. In this normalization, comparing row and column
points is not appropriate. Thus a biplot in this normalization
is best avoided. In the principal normalization, the row and
column coordinates are obtained from the left and right singular
vectors, multiplied by the singular values. This normalization
method is not equivalent to **normalize(***#***)** for any *#*.

*#*, **0** <= *#* <= **1**, is seldom used; it specifies that the row coordinates
are obtained as the left singular vectors multiplied by the
singular values to the power *#*, whereas the column coordinates
equal the right singular vectors multiplied by the singular
values to the power 1-*#*.
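
As an illustration of the equivalences stated above, the following sketch fits the same table under named normalizations and under their documented power forms. The choice of the **ca_smoking** dataset is an assumption borrowed from the examples in this entry. Each pair should report the same coordinates.

```stata
* Sketch: named normalizations and their documented normalize(#) equivalents
webuse ca_smoking, clear
ca rank smoking, normalize(row)
ca rank smoking, normalize(1)
ca rank smoking, normalize(symmetric)
ca rank smoking, normalize(0.5)
* standard and principal have no normalize(#) equivalent
ca rank smoking, normalize(principal)
```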

**rowsupp(***matname_r***)** specifies a matrix of supplementary rows. *matname_r*
should have *n_c* columns. The row names of *matname_r* are used for
labeling. Supplementary rows do not affect the computation of the
dimensions and the decomposition of inertia. They are, however,
included in the plots and in the table with statistics of the row
points. Because supplementary points do not contribute to the
dimensions, their entries under the column labeled **contrib** are left
blank.

**colsupp(***matname_c***)** specifies a matrix of supplementary columns.
*matname_c* should have *n_r* rows. The column names of *matname_c* are
used for labeling. Supplementary columns do not affect the
computation of the dimensions and the decomposition of inertia. They
are, however, included in the plots and in the table with statistics
of the column points. Because supplementary points do not contribute
to the dimensions, their entries under the column labeled **contrib** are
left blank.

**rowname(***string***)** specifies a label to refer to the rows of the matrix.
The default is **rowname(rowvar)** for **ca** and **rowname(rows)** for **camat**.

**colname(***string***)** specifies a label to refer to the columns of the matrix.
The default is **colname(colvar)** for **ca** and **colname(columns)** for **camat**.

**missing**, allowed only with **ca**, treats missing values of *rowvar* and *colvar*
as ordinary categories to be included in the analysis. Observations
with missing values are omitted from the analysis by default.

+-------+
----+ Codes +------------------------------------------------------------

**report(***opt***)** displays coding information for the crossing variables,
crossed variables, or both. **report()** is ignored if you do not
specify at least one crossed variable.

**report(variables)** displays the coding schemes of the crossing
variables, that is, the variables used to define the crossed
variables.

**report(crossed)** displays a table explaining the value labels of the
crossed variables.

**report(all)** displays the codings of the crossing and crossed
variables.

**length(***opt***)** specifies the coding length of crossing variables.

**length(min)** specifies that the minimal-length unique codes of
crossing variables be used.

**length(***#***)** specifies that the coding length *#* of crossing variables be
used, where *#* must be between 4 and 32.
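
A short sketch combining the two coding options, assuming the **issp93** dataset and the crossed variable **demo** from the crossed-variables example in this entry. The value 4 is simply the smallest length the text allows.

```stata
* Sketch: report all codings and use codes of length 4 for crossing variables
webuse issp93, clear
label language short
ca A (demo : sex edu), report(all) length(4)
```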

+-----------+
----+ Reporting +--------------------------------------------------------

**ddimensions(***#***)** specifies the number of singular values to be displayed.
The default is **ddimensions(.)**, meaning all.

**norowpoints** suppresses the table with row point (category) statistics.

**nocolpoints** suppresses the table with column point (category) statistics.

**compact** specifies that the table with point statistics be displayed
multiplied by 1,000 as proposed by Greenacre (2007), enabling the
display of more columns without wrapping output. The compact tables
can be displayed without wrapping for models with two dimensions at
line size 79 and with three dimensions at line size 99.

**plot** displays a plot of the row and column coordinates in two dimensions.
With row principal normalization, only the row points are plotted.
With column principal normalization, only the column points are
plotted. In the other normalizations, both row and column points are
plotted. You can use **cabiplot** directly if you need another selection
of points to be plotted or if you want to otherwise refine the plot;
see **[MV] ca postestimation plots**.

**maxlength(***#***)** specifies the maximum number of characters for row and
column labels in plots. The default is **maxlength(12)**.

Note: The reporting options may be specified during estimation or replay.
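
The note above can be sketched as follows. The replay line assumes the usual Stata convention that typing the command name without variables redisplays the stored results. The dataset is the one used in the examples in this entry.

```stata
webuse ca_smoking, clear
* Estimate, suppressing both point tables
ca rank smoking, norowpoints nocolpoints
* Replay the same results with compact point tables
ca, compact
```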

__Remarks__

__Normalization and interpretation of CA__

The normalization method used in the CA determines whether and how the
similarity of the row categories, the similarity of the column
categories, and the relationship (association) between the row and column
variables can be interpreted in terms of the row and column coordinates
and the origin of the plot.

How does one compare row points -- provided that the normalization method
allows such a comparison? Formally, the Euclidean distance between two
row points approximates the chi-squared distance between the
corresponding row profiles. Thus, in the biplot, row categories mapped
close together have similar row profiles; that is, the distributions on
the column variable are similar. Row categories mapped widely apart have
dissimilar row profiles. Moreover, the Euclidean distance between a row
point and the origin approximates the chi-squared distance between the row
profile and the row centroid, so it indicates how different a category is
from the population.
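
For reference, the chi-squared distances mentioned above have the standard textbook form below. The notation is an assumption introduced here, not taken from this entry: p_ij is the proportion of the table total in cell (i,j), r_i is the mass of row i, and c_j is the mass of column j. The row centroid is the vector of column masses.

```latex
% Chi-squared distance between the profiles of rows i and i'
d_{\chi^2}^2(i, i') = \sum_{j} \frac{1}{c_j}
    \left( \frac{p_{ij}}{r_i} - \frac{p_{i'j}}{r_{i'}} \right)^2

% Chi-squared distance between the profile of row i and the row centroid
d_{\chi^2}^2(i, \text{centroid}) = \sum_{j} \frac{1}{c_j}
    \left( \frac{p_{ij}}{r_i} - c_j \right)^2
```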

An analogous interpretation applies to column points.

For the association between the row and column variables: In the CA
biplot, one should not interpret the distance between a row point r and a
column point c as the relationship of r and c. Instead, think in terms
of the vectors origin to r (OR) and origin to c (OC). Remember that CA
decomposes scaled deviations d(r,c) from independence, and d(r,c) is
approximated by the inner product of OR and OC. The larger the absolute
value of d(r,c), the stronger the association between r and c. In
geometric terms, d(r,c) can be written as the product of the length of
OR, the length of OC, and the cosine of the angle between OR and OC.
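
In symbols, the geometric statement above reads as follows. The angle symbol theta_rc is introduced here for illustration only.

```latex
% Scaled deviation from independence as an inner product of the two vectors
d(r,c) \approx \langle \mathrm{OR}, \mathrm{OC} \rangle
       = \lVert \mathrm{OR} \rVert \, \lVert \mathrm{OC} \rVert \, \cos\theta_{rc}
```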

What does this mean? First, consider the effects of the angle. The
association in (r,c) is strongly positive if OR and OC point in roughly
the same direction; the frequency of (r,c) is much higher than expected
under independence, and so r tends to flock together with c--if the
points r and c are close together. Similarly, the association is
strongly negative if OR and OC point in opposite directions. Here, the
frequency of (r,c) is much lower than expected under independence, and so
r and c are unlikely to occur simultaneously. Finally, if OR and OC are
roughly orthogonal (angle = +/- 90 degrees), the deviation from independence is
small.

Second, the association of r and c increases with the lengths of OR and
OC. Points far from the origin tend to have large associations. If a
category is mapped close to the origin, all its associations with
categories of the other variable are small: its distribution resembles
the marginal distribution.

Here are the interpretations enabled by the main normalization methods as
specified in the **normalize()** option.

--------------------------------------------------------
               similarity    similarity    association
method         row cat.      column cat.   row vs column
--------------------------------------------------------
**symmetric**  no            no            yes
**principal**  yes           yes           no
**row**        yes           no            yes
**column**     no            yes           yes
--------------------------------------------------------

When we say that a comparison between row categories or between column
categories is not possible, we mean that the chi-squared distance between
row profiles or column profiles is approximated not by the standard
Euclidean distance between the respective points but by a weighted
Euclidean distance in which the weights depend on the inertia of the
dimensions.

You may want to do a CA in principal normalization to study the
relationship between the categories of a variable and do a CA in
symmetric normalization to study the association of the row and column
categories.

__Examples with ca__

**ca** creates the two-way frequency table from individual-level data and
performs a CA of this table.

**. webuse ca_smoking**
**. ca rank smoking**
**. ca rank smoking, dim(3)**

We want to include the distribution of smoking, estimated in a national
sample, in the analysis. The data for supplementary points are entered
as a row vector with one row and four columns, one for each smoking
category:

**. matrix SR = (42, 29, 20, 9)**
**. matrix rownames SR = national**
**. ca rank smoking, rowsupp(SR) plot**

__Example with ca with crossed variables__

You want to analyze how gender and education affect response to the
statement "We believe too often in science, and not enough in feelings or
faith," coded in variable **A**, which has five categories, with 1 indicating
strong agreement and 5 indicating strong disagreement. Variable **sex**
contains information on gender (two categories), and variable **edu**
contains information on education (six categories). We think of the
variables **sex** and **edu** as a demographic classification with 2x6=12
categories. **ca** performs a CA of the 5x12 frequency table:

**. webuse issp93**
**. label language short**
**. ca A (demo : sex edu), dim(2) report(c) length(min)**

__Example with camat__

To conduct a CA of data in tabular format, it is convenient to store the
data in a Stata matrix and to use **camat** instead of **ca**. Consider this
table:

------------------------------------------------
                |           smoking
personnel       |  none   light  medium  heavy
----------------+-------------------------------
senior manager  |    4      2       3      2
junior manager  |    4      3       7      4
senior employee |   25     10      12      4
junior employee |   18     24      33     13
secretary       |   10      6       7      2
------------------------------------------------

The following code creates a Stata matrix **F** with the frequencies and with
the appropriate row and column names.

**. matrix F = ( 4,2,3,2 \ 4,3,7,4 \ 25,10,12,4 \ 18,24,33,13 \ 10,6,7,2 )**
**. matrix colnames F = none light medium heavy**
**. matrix rownames F = sen_mngr jun_mngr sen_empl jun_employ secr**

To conduct the CA with two dimensions (the default) and produce a plot,
invoke **camat** on **F**.

**. camat F, rowname(rank) colname(smoking) plot**

We add two supplementary columns with the distributions among drinking
and nondrinking subjects. We create a matrix with five rows (one for
each staff category) and two columns.

**. matrix SC = ( 0,11 \ 1,17 \ 5,46 \ 10,78 \ 7,18)**
**. matrix colnames SC = nondrink drink**

**. camat F, rowsupp(SR) colsupp(SC) plot**

__Stored results__

Let *r* be the number of rows, *c* be the number of columns, and *f* be the
number of retained dimensions. **ca** and **camat** store the following in **e()**:

Scalars
**e(N)** number of observations
**e(f)** number of dimensions (factors, axes); maximum of min(*r* - 1,*c* - 1)
**e(inertia)** total inertia = **e(X2)**/**e(N)**
**e(pinertia)** inertia explained by **e(f)** dimensions
**e(X2)** chi-squared statistic
**e(X2_df)** degrees of freedom (*r* - 1)(*c* - 1)
**e(X2_p)** *p*-value for **e(X2)**

Macros
**e(cmd)** **ca** (even for **camat**)
**e(cmdline)** command as typed
**e(Rcrossvars)** row crossing variable names (**ca** only)
**e(Ccrossvars)** column crossing variable names (**ca** only)
**e(varlist)** the row and column variable names (**ca** only)
**e(wtype)** weight type (**ca** only)
**e(wexp)** weight expression (**ca** only)
**e(title)** title in estimation output
**e(ca_data)** **variables** or **crossed**
**e(Cname)** name for columns
**e(Rname)** name for rows
**e(norm)** normalization method
**e(sv_unique)** **1** if the singular values are unique, **0** otherwise
**e(properties)** **nob noV eigen**
**e(estat_cmd)** program used to implement **estat**
**e(predict)** program used to implement **predict**
**e(marginsnotok)** predictions disallowed by **margins**

Matrices
**e(Ccoding)** column categories (1 x *c*) (**ca** only)
**e(Rcoding)** row categories (1 x *r*) (**ca** only)
**e(GSC)** column statistics (*c* x 3(1 + *f*))
**e(GSR)** row statistics (*r* x 3(1 + *f*))
**e(TC)** normalized column coordinates (*c* x *f*)
**e(TR)** normalized row coordinates (*r* x *f*)
**e(Sv)** singular values (1 x *f*)
**e(C)** column coordinates (*c* x *f*)
**e(R)** row coordinates (*r* x *f*)
**e(c)** column mass (margin) (*c* x 1)
**e(r)** row mass (margin) (*r* x 1)
**e(P)** analyzed matrix (*r* x *c*)
**e(GSC_supp)** supplementary column statistics
**e(GSR_supp)** supplementary row statistics
**e(PC_supp)** principal coordinates supplementary column points
**e(PR_supp)** principal coordinates supplementary row points
**e(TC_supp)** normalized coordinates supplementary column points
**e(TR_supp)** normalized coordinates supplementary row points

Functions
**e(sample)** marks estimation sample (**ca** only)
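
A brief sketch of how the stored results might be inspected after estimation, again assuming the **ca_smoking** dataset from the examples. The two **display** lines should agree, because total inertia is documented above as **e(X2)**/**e(N)**.

```stata
webuse ca_smoking, clear
ca rank smoking
* List everything stored in e()
ereturn list
* Total inertia, reported and recomputed from the chi-squared statistic
display e(inertia)
display e(X2)/e(N)
* Singular values and row coordinates
matrix list e(Sv)
matrix list e(R)
```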

__References__

Greenacre, M. J. 1984. *Theory and Applications of Correspondence*
*Analysis*. London: Academic Press.

------. 2007. *Correspondence Analysis in Practice*. 2nd ed. Boca Raton,
FL: Chapman & Hall/CRC.

Milan, L., and J. Whittaker. 1995. Application of the parametric
bootstrap to models that incorporate a singular value decomposition.
*Applied Statistics* 44: 31-49.