## Stata 15 help for mv_glossary

```
[MV] Glossary -- Glossary of terms

Description

Commonly used terms are defined here.

Glossary

agglomerative hierarchical clustering methods.  Agglomerative
hierarchical clustering methods are bottom-up methods for
hierarchical clustering.  Each observation begins in a separate
group.  The closest pair of groups is agglomerated or merged in each
iteration until all the data are in one cluster.  This process
creates a hierarchy of clusters.  Contrast to divisive hierarchical
clustering methods.

anti-image correlation matrix or anti-image covariance matrix.  The image
of a variable is defined as that part which is predictable by
regressing each variable on all the other variables; hence, the
anti-image is the part of the variable that cannot be predicted.  The
anti-image correlation matrix A is a matrix of the negatives of the
partial correlations among variables.  Partial correlations represent
the degree to which the factors explain each other in the results.
The diagonal of the anti-image correlation matrix is the
Kaiser-Meyer-Olkin measure of sampling adequacy for the individual
variables.  Variables with small values should be eliminated from the
analysis.  The anti-image covariance matrix C contains the negatives
of the partial covariances and has one minus the squared multiple
correlations in the principal diagonal.  Most of the off-diagonal
elements should be small in both anti-image matrices in a good factor
model.  Both anti-image matrices can be calculated from the inverse
of the correlation matrix R via

A = {diag(R^{-1})}^{-1/2} R^{-1} {diag(R^{-1})}^{-1/2}
C = {diag(R^{-1})}^{-1} R^{-1} {diag(R^{-1})}^{-1}

Also see Kaiser-Meyer-Olkin measure of sampling adequacy.

average-linkage clustering.  Average-linkage clustering is a hierarchical
clustering method that uses the average proximity of observations
between groups as the proximity measure between the two groups.

Bayes's theorem.  Bayes's theorem states that the probability of an
event, A, conditional on another event, B, is generally different
from the probability of B conditional on A, although the two are
related.  Bayes's theorem is that

P(A|B) = {P(B|A) P(A)}/{P(B)}

where P(A) is the marginal probability of A, and P(A|B) is the
conditional probability of A given B, and likewise for P(B) and
P(B|A).

Bentler's invariant pattern simplicity rotation.  Bentler's (1977)
rotation maximizes the invariant pattern simplicity.  It is an
oblique rotation that minimizes the criterion function

c(Lambda) = -log[|(Lambda^2)'Lambda^2|] +
log[|diag{(Lambda^2)'Lambda^2}|]

See Crawford-Ferguson rotation for a definition of Lambda.  Also see
oblique rotation.

between matrix and within matrix.  The between and within matrices are
SSCP matrices that measure the spread between groups and within
groups, respectively.  These matrices are used in multivariate
analysis of variance and related hypothesis tests:  Wilks's lambda,
Roy's largest root, Lawley-Hotelling trace, and Pillai's trace.

Here we have k independent random samples of size n.  The between
matrix H is given by

H = n sum_{i=1}^k (y_{i.}[bar] - y_{..}[bar])(y_{i.}[bar] - y_{..}[bar])'
  = sum_{i=1}^k (1/n) y_{i.}y_{i.}' - {1/(kn)} y_{..}y_{..}'

The within matrix E is defined as

E = sum_{i=1}^k sum_{j=1}^n (y_{ij} - y_{i.}[bar])(y_{ij} - y_{i.}[bar])'
  = sum_{i=1}^k sum_{j=1}^n y_{ij}y_{ij}' - sum_{i=1}^k (1/n) y_{i.}y_{i.}'

Also see SSCP matrix.

biplot.  A biplot is a scatterplot which represents both observations and
variables simultaneously.  There are many different biplots;
variables in biplots are usually represented by arrows and
observations are usually represented by points.

biquartimax rotation or biquartimin rotation.  Biquartimax rotation and
biquartimin rotation are synonyms.  They put equal weight on the
varimax and quartimax criteria, simplifying the columns and rows of
the matrix.  This is an oblique rotation equivalent to an oblimin
rotation with gamma = 0.5.  Also see varimax rotation, quartimax
rotation, and oblimin rotation.

boundary solution or Heywood solution.  See Heywood case.

CA.  See correspondence analysis.

canonical correlation analysis.  Canonical correlation analysis attempts
to describe the relationships between two sets of variables by
finding linear combinations of each so that the correlation between
the linear combinations is maximized.

canonical discriminant analysis.  Canonical linear discriminant analysis
is LDA where describing how groups are separated is of primary
interest.  Also see linear discriminant analysis.

canonical loadings.  The canonical loadings are the coefficients of the
canonical linear discriminant functions.  Also see canonical discriminant
analysis and loading.
canonical variate set.  The canonical variate set is a linear combination
or weighted sum of variables obtained from canonical correlation
analysis.  Two sets of variables are analyzed in canonical
correlation analysis.  The first canonical variate of the first
variable set is the linear combination in standardized form that has
maximal correlation with the first canonical variate from the second
variable set.  The subsequent canonical variates are uncorrelated to
the previous and have maximal correlation under that constraint.

centered data.  Centered data has zero mean.  You can center data x by
taking x - x[bar].

centroid-linkage clustering.  Centroid-linkage clustering is a
hierarchical clustering method that computes the proximity between
two groups as the proximity between the group means.

classical scaling.  Classical scaling is a method of performing MDS via
an eigen decomposition.  This is contrasted to modern MDS, which is
achieved via the minimization of a loss function.  Also see
multidimensional scaling and modern scaling.

classification.  Classification is the act of allocating or classifying
observations to groups as part of discriminant analysis.  In some
sources, classification is synonymous with cluster analysis.

classification function.  Classification functions can be obtained after
LDA or QDA.  They are functions based on Mahalanobis distance for
classifying observations to the groups.  See discriminant function
for an alternative.  Also see linear discriminant analysis and
quadratic discriminant analysis.

classification table.  A classification table, also known as a confusion
matrix, gives the count of observations from each group that are
classified into each of the groups as part of a discriminant
analysis.  The element at (i,j) gives the number of observations that
belong to the ith group but were classified into the jth group.  High
counts are expected on the diagonal of the table where observations
are correctly classified, and small values are expected off the
diagonal.  The columns of the matrix are categories of the predicted
classification; the rows represent the actual group membership.

cluster analysis.  Cluster analysis is a method for determining natural
groupings or clusters of observations.

cluster tree.  See dendrogram.

clustering.  See cluster analysis.

common factors.  Common factors are found by factor analysis.  They
linearly reconstruct the original variables.  In factor analysis,
reconstruction is defined in terms of prediction of the correlation
matrix of the original variables.

communality.  Communality is the proportion of a variable's variance
explained by the common factors in factor analysis.  It is also "1 -
uniqueness".  Also see uniqueness.

complete-linkage clustering.  Complete-linkage clustering is a
hierarchical clustering method that uses the farthest pair of
observations between two groups to determine the proximity of the two
groups.

component scores.  Component scores are calculated after PCA.  Component
scores are the coordinates of the original variables in the space of
principal components.

Comrey's tandem 1 and 2 rotations.  Comrey (1967) describes two
rotations, the first (tandem 1) to judge which "small" factors should
be dropped, the second (tandem 2) for "polishing".

Tandem principle 1 minimizes the criterion

c(Lambda) = <Lambda^2, (Lambda Lambda')^2 Lambda^2>

Tandem principle 2 minimizes the criterion

c(Lambda) = <Lambda^2, {11' - (Lambda Lambda')^2} Lambda^2>

See Crawford-Ferguson rotation for a definition of Lambda.

configuration.  The configuration in MDS is a representation in a
low-dimensional (usually 2-dimensional) space with distances in the
low-dimensional space approximating the dissimilarities or
disparities in high-dimensional space.  Also see multidimensional
scaling, dissimilarity, and disparity.

configuration plot.  A configuration plot after MDS is a (usually
2-dimensional) plot of labeled points showing the low-dimensional
approximation to the dissimilarities or disparities in
high-dimensional space.  Also see multidimensional scaling,
dissimilarity, and disparity.

confusion matrix.  A confusion matrix is a synonym for a classification
table after discriminant analysis.  See classification table.

contrast or contrasts.  In ANOVA, a contrast in k population means is
defined as a linear combination

delta = c_1 mu_1 + c_2 mu_2 + ... + c_k mu_k

where the coefficients satisfy

sum_{i=1}^k c_i = 0

In the multivariate setting (MANOVA), a contrast in k population mean
vectors is defined as

delta = c_1 mu_1 + c_2 mu_2 + ... + c_k mu_k

where the coefficients again satisfy

sum_{i=1}^k c_i = 0

The univariate hypothesis delta = 0 may be tested with contrast (or
test) after ANOVA.  The multivariate hypothesis delta = 0 may be
tested with manovatest after MANOVA.

correspondence analysis.  Correspondence analysis (CA) gives a geometric
representation of the rows and columns of a two-way frequency table.
The geometric representation is helpful in understanding the
similarities between the categories of variables and associations
between variables. CA is calculated by singular value decomposition.
Also see singular value decomposition.

correspondence analysis projection.  A correspondence analysis projection
is a line plot of the row and column coordinates after CA.  The goal
of this graph is to show the ordering of row and column categories on
each principal dimension of the analysis.  Each principal dimension
is represented by a vertical line; markers are plotted on the lines
where the row and column categories project onto the dimensions.
Also see correspondence analysis.

costs.  Costs in discriminant analysis are the cost of misclassifying
observations.

covarimin rotation.  Covarimin rotation is an oblique rotation
equivalent to oblimin rotation with gamma = 1.  Also see oblimin
rotation and varimax rotation.

Crawford-Ferguson rotation.  Crawford-Ferguson (1970) rotation is a
general oblique rotation with several interesting special cases.

Special cases of the Crawford-Ferguson rotation include

kappa          Special case
------------------------------------
0              quartimax / quartimin
1/p            varimax / covarimin
f/(2p)         equamax
(f-1)/(p+f-2)  parsimax
1              factor parsimony
------------------------------------
p = number of rows of A.
f = number of columns of A.

Here A is the matrix to be rotated, T is the rotation, and Lambda =
AT.  The Crawford-Ferguson rotation is achieved by minimizing the
criterion

c(Lambda) = (1-kappa)/4 <Lambda^2, Lambda^2(1 1' - I)> + kappa/4
< Lambda^2, (1 1' - I)Lambda^2>

Also see oblique rotation.

crossed variables or stacked variables.  In CA and MCA crossed
categorical variables may be formed from the interactions of two or
more existing categorical variables.  Variables that contain these
interactions are called crossed or stacked variables.

crossing variables or stacking variables.  In CA and MCA, crossing or
stacking variables are the existing categorical variables whose
interactions make up a crossed or stacked variable.

curse of dimensionality.  The curse of dimensionality is a term coined by
Richard Bellman (1961) to describe the problem caused by the
exponential increase in size associated with adding extra dimensions
to a mathematical space.  On the unit interval, 10 evenly spaced
points suffice to sample with no more distance than 0.1 between them;
however a unit square requires 100 points, and a unit cube requires
1000 points.  Many multivariate statistical procedures suffer from
the curse of dimensionality.  Adding variables to an analysis without
adding sufficient observations can lead to imprecision.

dendrogram or cluster tree.  A dendrogram or cluster tree graphically
presents information about how observations are grouped together at
various levels of (dis)similarity in hierarchical cluster analysis.
At the bottom of the dendrogram, each observation is considered its
own cluster.  Vertical lines extend up for each observation, and at
various (dis)similarity values, these lines are connected to the
lines from other observations with a horizontal line.  The
observations continue to combine until, at the top of the dendrogram,
all observations are grouped together. Also see hierarchical
clustering.

dilation.  A dilation stretches or shrinks distances in Procrustes
rotation.

dimension.  A dimension is a parameter or measurement required to define
a characteristic of an object or observation.  Dimensions are the
variables in the dataset.  Weight, height, age, blood pressure, and
drug dose are examples of dimensions in health data.  Number of
employees, gross income, net income, tax, and year are examples of
dimensions in company data.

discriminant analysis.  Discriminant analysis is used to describe the
differences between groups and to exploit those differences when
allocating (classifying) observations of unknown group membership.
Discriminant analysis is also called classification in many
references.

discriminant function.  Discriminant functions are formed from the
eigenvectors from Fisher's approach to LDA.  See linear discriminant
analysis.  See classification function for an alternative.

discriminating variables.  Discriminating variables in a discriminant
analysis are analyzed to determine differences between groups where
group membership is known.  These differences between groups are then
exploited when classifying observations to the groups.

disparity.  Disparities are transformed dissimilarities, that is,
dissimilarity values transformed by some function.  The class of
functions to transform dissimilarities to disparities may either be
(1) a class of metric, or known functions such as linear functions or
power functions that can be parameterized by real scalars or (2) a
class of more general (nonmetric) functions, such as any monotonic
function.  Disparities are used in MDS.  Also see dissimilarity,
multidimensional scaling, metric scaling, and nonmetric scaling.

dissimilarity, dissimilarity matrix, and dissimilarity measure.
Dissimilarity or a dissimilarity measure is a quantification of the
difference between two things, such as observations or variables or
groups of observations or a method for quantifying that difference.
A dissimilarity matrix is a matrix containing dissimilarity
measurements.  Euclidean distance is one example of a dissimilarity
measure.  Contrast to similarity.  Also see proximity and Euclidean
distance.

divisive hierarchical clustering methods.  Divisive hierarchical
clustering methods are top-down methods for hierarchical clustering.
All the data begin as a part of one large cluster; with each
iteration, a cluster is broken into two to create two new clusters.
At the first iteration there are two clusters, then three, and so on.
Divisive methods are very computationally expensive.  Contrast to
agglomerative hierarchical clustering methods.

eigenvalue.  An eigenvalue is the scale factor by which an eigenvector is
multiplied.  For many multivariate techniques, the size of an
eigenvalue indicates the importance of the corresponding eigenvector.
Also see eigenvector.

eigenvector.  An eigenvector of a linear transformation is a nonzero
vector that is either left unaffected or simply multiplied by a scale
factor after the transformation.

Here x is an eigenvector of linear transformation A with eigenvalue
lambda:

Ax = lambda x

For many multivariate techniques, eigenvectors form the basis for
analysis and interpretation.  Also see eigenvalue.

equamax rotation.  Equamax rotation is an orthogonal rotation whose
criterion is a weighted sum of the varimax and quartimax criteria.
Equamax reflects a concern for simple structure within the rows and
columns of the matrix.  It is equivalent to oblimin with gamma = p/2,
or to the Crawford-Ferguson family with kappa = f/2p, where p is the
number of rows of the matrix to be rotated, and f is the number of
columns.  Also see orthogonal rotation, varimax rotation, quartimax
rotation, oblimin rotation, and Crawford-Ferguson rotation.

Euclidean distance.  The Euclidean distance between two observations is
the distance one would measure with a ruler.  The distance between
vector P = (P_1, P_2, ..., P_n) and Q = (Q_1, Q_2, ..., Q_n) is given
by

D(P, Q) = sqrt{(P_1 - Q_1)^2 + (P_2 - Q_2)^2 + ... + (P_n -
Q_n)^2} = sqrt{sum_{i=1}^n (P_i - Q_i)^2}

factor.  A factor is an unobserved random variable that is thought to
explain variability among observed random variables.

factor analysis.  Factor analysis is a statistical technique used to
explain variability among observed random variables in terms of fewer
unobserved random variables called factors.  The observed variables
are then linear combinations of the factors plus error terms.

If the correlation matrix of the observed variables is R, then R is
decomposed by factor analysis as

R = Lambda Phi Lambda' + Psi

where Lambda is the matrix of factor loadings, Phi is the correlation
matrix of the common factors, and Psi is the diagonal matrix of specific
variances, for example, the variance specific to the variable not
explained by the factors.  The default unrotated form assumes
uncorrelated common factors, Phi = I.

factor loadings.  Factor loadings are the regression coefficients which
multiply the factors to produce the observed variables in the factor
analysis.

factor parsimony.  Factor parsimony is an oblique rotation, which
maximizes the column simplicity of the matrix.  It is equivalent to a
Crawford-Ferguson rotation with kappa = 1.  Also see oblique rotation
and Crawford-Ferguson rotation.

factor scores.  Factor scores are computed after factor analysis.  Factor
scores are the coordinates of the original variables, x, in the space
of the factors.  The two types of scoring are regression scoring
(Thomson 1951) and Bartlett (1937, 1938) scoring.

Using the symbols defined in factor analysis, the formula for
regression scoring is

f[hat] = Lambda'R^{-1}x

In the case of oblique rotation the formula becomes

f[hat] = Phi Lambda'R^{-1}x

The formula for Bartlett scoring is

f[hat] = Gamma^{-1}Lambda'Psi^{-1}x

where

Gamma = Lambda'Psi^{-1}Lambda

Also see factor analysis.

Heywood case or Heywood solution.  A Heywood case can appear in factor
analysis output; this indicates that a boundary solution, called a
Heywood solution, was produced.  The geometric assumptions underlying
the likelihood-ratio test are violated, though the test may be useful
if interpreted cautiously.

hierarchical clustering and hierarchical clustering methods.  In
hierarchical clustering, the data are placed into clusters via
iterative steps. Contrast to partition clustering.  Also see
agglomerative hierarchical clustering methods and divisive
hierarchical clustering methods.

Hotelling's T-squared generalized means test.  Hotelling's T-squared
generalized means test is a multivariate test that reduces to a
standard t test if only one variable is specified.  It tests whether
one set of means is zero or if two sets of means are equal.

inertia.  In CA, the inertia is related to the definition in applied
mathematics of "moment of inertia", which is the integral of the mass
times the squared distance to the centroid.  Inertia is defined as
the total Pearson chi-squared for the two-way table divided by the
total number of observations, or the sum of the squared singular
values found in the singular value decomposition.

total inertia = 1/n chi^2 = sum_k lambda^2_k

In MCA, the inertia is defined analogously.  In the case of the
indicator or Burt matrix approach, it is given by the formula

total inertia = {q/(q-1)} sum_t phi_t^2 - (J-q)/q^2

where q is the number of active variables, J is the number of
categories and phi_t is the tth (unadjusted) eigenvalue of the eigen
decomposition.  In JCA the total inertia of the modified Burt matrix
is defined as the sum of the inertias of the off-diagonal blocks.
Also see correspondence analysis and multiple correspondence
analysis.

iterated principal-factor method.  The iterated principal-factor method
is a method for performing factor analysis in which the communalities
h_i^2[hat] are estimated iteratively from the loadings in
Lambda[hat] using

h_i^2[hat] = sum_{j=1}^m lambda_{ij}^2[hat]

Also see factor analysis and communality.

JCA.  An acronym for joint correspondence analysis; see multiple
correspondence analysis.

joint correspondence analysis.  See multiple correspondence analysis.

Kaiser-Meyer-Olkin measure of sampling adequacy.  The Kaiser-Meyer-Olkin
(KMO) measure of sampling adequacy takes values between 0 and 1, with
small values meaning that the variables have too little in common to
warrant a factor analysis or PCA.  Historically, the following labels
have been given to values of KMO (Kaiser 1974):

0.00 to 0.49    unacceptable
0.50 to 0.59    miserable
0.60 to 0.69    mediocre
0.70 to 0.79    middling
0.80 to 0.89    meritorious
0.90 to 1.00    marvelous

kmeans.  Kmeans is a method for performing partition cluster analysis.
The user specifies the number of clusters, k, to create using an
iterative process.  Each observation is assigned to the group whose
mean is closest, and then based on that categorization, new group
means are determined.  These steps continue until no observations
change groups.  The algorithm begins with k seed values, which act as
the k group means.  There are many ways to specify the beginning seed
values.  Also see partition clustering.

kmedians.  Kmedians is a variation of kmeans.  The same process is
performed, except that medians instead of means are computed to
represent the group centers at each step.  Also see kmeans and
partition clustering.

KMO.  See Kaiser-Meyer-Olkin measure of sampling adequacy.

KNN.  See kth nearest neighbor.

Kruskal stress.  The Kruskal stress measure (Kruskal 1964; Cox and Cox
2001, 63) used in MDS is given by

Kruskal(D[hat], E) = {(sum (E_{ij} - D_{ij}[hat])^2)/sum
E_{ij}^2}^{1/2}

where D_{ij} is the dissimilarity between objects i and j, 1 <= i,j <=
n, and D_{ij}[hat] is the disparity, that is, the transformed
dissimilarity, and E_{ij} is the Euclidean distance between rows i
and j of the matching configuration.  Kruskal stress is an example of
a loss function in modern MDS.  After classical MDS, estat stress
gives the Kruskal stress.  Also see classical scaling,
multidimensional scaling, and stress.

kth nearest neighbor.  kth-nearest-neighbor (KNN) discriminant analysis
is a nonparametric discrimination method based on the k nearest
neighbors of each observation.  Both continuous and binary data can
be handled through the different similarity and dissimilarity
measures.  KNN analysis can distinguish irregular-shaped groups,
including groups with multiple modes.  Also see discriminant analysis
and nonparametric methods.

Lawley-Hotelling trace.  The Lawley-Hotelling trace is a test statistic
for the hypothesis test H_0: mu_1 = mu_2 = ... = mu_k based on the
eigenvalues lambda_1, lambda_2, ..., lambda_s of E^{-1}H.  It is
defined as

U^{(s)} = trace(E^{-1}H) = sum_{i=1}^s lambda_i

where H is the between matrix and E is the within matrix, see between
matrix.

LDA.  See linear discriminant analysis.

leave one out.  In discriminant analysis, classification of an
observation while leaving it out of the estimation sample is done to
check the robustness of the analysis; thus the phrase "leave one out"
(LOO).  Also see discriminant analysis.

linear discriminant analysis.  Linear discriminant analysis (LDA) is a
parametric form of discriminant analysis.  In Fisher's (1936)
approach to LDA, linear combinations of the discriminating variables
provide maximal separation between the groups.  The Mahalanobis
(1936) formulation of LDA assumes that the observations come from
multivariate normal distributions with equal covariance matrices.
Also see discriminant analysis and parametric methods.

linkage.  In cluster analysis, the linkage refers to the measure of
proximity between groups or clusters.

loading.  A loading is a coefficient or weight in a linear
transformation.  Loadings play an important role in many multivariate
techniques, including factor analysis, PCA, MANOVA, LDA, and
canonical correlations.  In some settings, the loadings are of
primary interest and are examined for interpretability.  For many
multivariate techniques, loadings are based on an eigenanalysis of a
correlation or covariance matrix.  Also see eigenvector.

loading plot.  A loading plot is a scatter (or line) plot of the
loadings after LDA, factor analysis or PCA.

logistic discriminant analysis.  Logistic discriminant analysis is a form
of discriminant analysis based on the assumption that the likelihood
ratios of the groups have an exponential form.  Multinomial logistic
regression provides the basis for logistic discriminant analysis.
Because multinomial logistic regression can handle binary and
continuous regressors, logistic discriminant analysis is also
appropriate for binary and continuous discriminating variables.  Also
see discriminant analysis.

LOO.  See leave one out.

loss.  Modern MDS is performed by minimizing a loss function, also called
a loss criterion.  The loss quantifies the difference between the
disparities and the Euclidean distances.

Loss functions include Kruskal's stress and its square, both
normalized with either disparities or distances, the strain criterion
which is equivalent to classical metric scaling when the disparities
equal the dissimilarities, and the Sammon (1969) mapping criterion
which is the sum of the scaled, squared differences between the
distances and the disparities, normalized by the sum of the
disparities.

Also see multidimensional scaling, Kruskal stress, classical scaling,
and disparity.

Mahalanobis distance.  The Mahalanobis distance measure is a
scale-invariant way of measuring distance.  It takes into account the
correlations of the dataset.

Mahalanobis transformation.  The Mahalanobis transformation takes a
Cholesky factorization of the inverse of the covariance matrix S^{-1}
in the formula for Mahalanobis distance and uses it to transform the
data.  If we have the Cholesky factorization S^{-1} = L'L, then the
Mahalanobis transformation of x is z = Lx, and z'z = D_{M}^2(x).

MANCOVA.  MANCOVA is multivariate analysis of covariance.  See
multivariate analysis of variance.

MANOVA.  See multivariate analysis of variance.

mass.  In CA and MCA, the mass is the marginal probability.  The sum of
the mass over the active row or column categories equals 1.

matching coefficient.  The matching similarity coefficient is used to
compare two binary variables.  If a is the number of observations
that both have value 1, and d is the number of observations that both
have value 0, and b, c are the number of (1,0) and (0,1)
observations, respectively, then the matching coefficient is given by

(a + d)/(a + b + c + d)

Also see similarity measure.

matching configuration.  In MDS, the matching configuration is the low
dimensional configuration whose distances approximate the
high-dimensional dissimilarities or disparities.  Also see
multidimensional scaling, dissimilarity, and disparity.

matching configuration plot.  After MDS, this is a scatter plot of the
matching configuration.

maximum likelihood factor method.  The maximum likelihood factor method
is a method for performing factor analysis that assumes multivariate
normal observations.  It maximizes the determinant of the partial
correlation matrix; thus, this solution is also meaningful as a
descriptive method for nonnormal data.  Also see factor analysis.

MCA.  See multiple correspondence analysis.

MDS.  See multidimensional scaling.

MDS configuration plot.  See configuration plot.

measure.  A measure is a quantity representing the proximity between
objects or method for determining the proximity between objects.
Also see proximity.

median-linkage clustering.  Median-linkage clustering is a hierarchical
clustering method that uses the distance between the medians of two
groups to determine the similarity or dissimilarity of the two
groups.  Also see cluster analysis and agglomerative hierarchical
clustering methods.

metric scaling.  Metric scaling is a type of MDS, in which the
dissimilarities are transformed to disparities via a class of known
functions.  This is contrasted to nonmetric scaling.  Also see
multidimensional scaling.

minimum entropy rotation.  The minimum entropy rotation is an orthogonal
rotation achieved by minimizing the deviation from uniformity
(entropy).  The minimum entropy criterion (Jennrich 2004) is

c(Lambda) = - 1/2 <Lambda^2, log Lambda^2>

See Crawford-Ferguson rotation for a definition of Lambda.  Also see
orthogonal rotation.

misclassification rate.  The misclassification rate calculated after
discriminant analysis is, in its simplest form, the fraction of
observations incorrectly classified. See discriminant analysis.

modern scaling.  Modern scaling is a form of MDS that is achieved via the
minimization of a loss function that compares the disparities
(transformed dissimilarities) in the higher-dimensional space and the
distances in the lower-dimensional space.  Contrast to classical
scaling.  Also see dissimilarity, disparity, multidimensional
scaling, and loss.

multidimensional scaling.  Multidimensional scaling (MDS) is a
dimension-reduction and visualization technique.  Dissimilarities
(for instance, Euclidean distances) between observations in a
high-dimensional space are represented in a lower-dimensional space
which is typically two dimensions so that the Euclidean distance in
the lower-dimensional space approximates in some sense the
dissimilarities in the higher-dimensional space.  Often the
higher-dimensional dissimilarities are first transformed to
disparities, and the disparities are then approximated by the
distances in the lower-dimensional space.  Also see dissimilarity,
disparity, classical scaling, loss, modern scaling, metric scaling,
and nonmetric scaling.

multiple correspondence analysis.  Multiple correspondence analysis (MCA)
and joint correspondence analysis (JCA) are methods for analyzing
observations on categorical variables.  MCA and JCA analyze a
multiway table and are usually viewed as an extension of CA.  Also
see correspondence analysis.

multivariate analysis of covariance.  See multivariate analysis of
variance.

multivariate analysis of variance.  Multivariate analysis of variance
(MANOVA) is used to test hypotheses about means.  Four multivariate
statistics are commonly computed in MANOVA: Wilks's lambda, Pillai's
trace, Lawley-Hotelling trace, and Roy's largest root.  Also see
Wilks's lambda, Pillai's trace, Lawley-Hotelling trace, and Roy's
largest root.

multivariate regression.  Multivariate regression is a method of
estimating a linear (matrix) model

Y = XB + Xi

Multivariate regression is estimated by least-squares regression, and
it can be used to test hypotheses, much like MANOVA.

nearest neighbor.  See kth nearest neighbor.

nonmetric scaling.  Nonmetric scaling is a type of modern MDS in which
the dissimilarities may be transformed to disparities via any
monotonic function as opposed to a class of known functions.
Contrast to metric scaling.  Also see multidimensional scaling,
dissimilarity, disparity, and modern scaling.

nonparametric methods.  Nonparametric statistical methods, such as KNN
discriminant analysis, do not assume the population fits any
parameterized distribution.

normalization.  Normalization presents information in a standard form for
interpretation.  In CA the row and column coordinates can be
normalized in different ways depending on how one wishes to interpret
the data.  Normalization is also used in rotation, MDS, and MCA.

oblimax rotation.  Oblimax rotation is a method for oblique rotation.
When restricted to orthogonal rotation, oblimax is equivalent to
quartimax rotation.
Oblimax minimizes the oblimax criterion

c(Lambda) = - log(<Lambda^2, Lambda^2>) + 2 log(<Lambda, Lambda>)

See Crawford-Ferguson rotation for a definition of Lambda.  Also see
oblique rotation, orthogonal rotation, and quartimax rotation.

oblimin rotation.  Oblimin rotation is a general method for oblique
rotation, achieved by minimizing the oblimin criterion

c(Lambda) = 1/4 <Lambda^2, {I - (gamma/p) 1 1'} Lambda^2(1 1' -
I)>

Oblimin has several interesting special cases:

gamma          Special case
-----------------------------------------
0              quartimax / quartimin
1/2            biquartimax / biquartimin
1              varimax / covarimin
p/2            equamax
-----------------------------------------
p = number of rows of A.

See Crawford-Ferguson rotation for a definition of Lambda and A.
Also see oblique rotation.
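The table of special cases can be checked numerically; this Python
sketch (not Stata) evaluates the oblimin criterion for an illustrative
loading matrix and confirms that gamma = 0 reproduces the quartimin
form.

```python
import numpy as np

def oblimin(L, gamma):
    # c(Lambda) = 1/4 <Lambda^2, {I - (gamma/p) 11'} Lambda^2 (11' - I)>
    p, f = L.shape
    L2 = L ** 2
    left = np.eye(p) - (gamma / p) * np.ones((p, p))
    right = np.ones((f, f)) - np.eye(f)
    return 0.25 * np.sum(L2 * (left @ L2 @ right))

L = np.array([[0.8, 0.1],      # illustrative 3 x 2 loading matrix
              [0.7, 0.2],
              [0.1, 0.9]])

# Quartimin: 1/4 * sum over rows of cross-products of squared loadings
# in distinct columns (for f = 2, that is 2 * L2[i,0] * L2[i,1]).
quartimin = 0.25 * sum(2 * L[i, 0] ** 2 * L[i, 1] ** 2 for i in range(3))
print(np.isclose(oblimin(L, 0.0), quartimin))   # True
```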

oblique rotation or oblique transformation.  An oblique rotation
maintains the norms of the rows of the matrix but not their inner
products.  In geometric terms, this maintains the lengths of vectors,
but not the angles between them.  In contrast, in orthogonal
rotation, both are preserved.

ordination.  Ordination is the ordering of a set of data points with
respect to one or more axes.  MDS is a form of ordination.

orthogonal rotation or orthogonal transformation.  Orthogonal rotation
maintains both the norms of the rows of the matrix and also inner
products of the rows of the matrix.  In geometric terms, this
maintains both the lengths of vectors and the angles between them.
In contrast, oblique rotation maintains only the norms, that is, the
lengths of vectors.

parametric methods.  Parametric statistical methods, such as LDA and QDA,
assume the population fits a parameterized distribution.  For
example, for LDA we assume the groups are multivariate normal with
equal covariance matrices.

parsimax rotation.  Parsimax rotation is an orthogonal rotation that
balances complexity between the rows and the columns.  It is
equivalent to the Crawford-Ferguson family with kappa = (f-1)/(p+f
-2), where p is the number of rows of the original matrix, and f is
the number of columns.  See orthogonal rotation and Crawford-Ferguson
rotation.

partially specified target rotation.  Partially specified target rotation
minimizes the criterion

c(Lambda) = |W otimes (Lambda - H)|^2

for a given target matrix H and a nonnegative weighting matrix W
(usually zero-one valued).  See Crawford-Ferguson rotation for a
definition of Lambda.

partition clustering and partition cluster-analysis methods.  Partition
clustering methods break the observations into a distinct number of
nonoverlapping groups.  This is accomplished in one step, unlike
hierarchical cluster-analysis methods, in which an iterative
procedure is used.  Consequently, this method is quicker and will
allow larger datasets than the hierarchical clustering methods.
Contrast to hierarchical clustering.  Also see kmeans and kmedians.

PCA.  See principal component analysis.

Pillai's trace.  Pillai's trace is a test statistic for the hypothesis
test H_0: mu_1 = mu_2 = ... = mu_k based on the eigenvalues lambda_1,
..., lambda_s of E^{-1}H.  It is defined as

V^{(s)} = trace[(E + H)^{-1}H] = sum_{i=1}^s lambda_i/(1 + lambda_i)

where H is the between matrix and E is the within matrix.  See
between matrix.
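Both forms of the statistic can be checked numerically; in this Python
sketch the small H and E matrices are made up for illustration.

```python
import numpy as np

H = np.array([[4.0, 1.0], [1.0, 3.0]])   # illustrative between matrix
E = np.array([[5.0, 0.5], [0.5, 6.0]])   # illustrative within matrix

# Direct form: V = trace[(E + H)^{-1} H]
V_trace = np.trace(np.linalg.solve(E + H, H))

# Eigenvalue form: V = sum lambda_i/(1 + lambda_i) over eig of E^{-1}H
lams = np.linalg.eigvals(np.linalg.solve(E, H)).real
V_eig = np.sum(lams / (1 + lams))
print(np.isclose(V_trace, V_eig))        # True
```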

posterior probabilities.  After discriminant analysis, the posterior
probabilities are the probabilities of a given observation being
assigned to each of the groups based on the prior probabilities, the
training data, and the particular discriminant model.  Contrast to
prior probabilities.

principal component analysis.  Principal component analysis (PCA) is a
statistical technique used for data reduction.  The leading
eigenvectors from the eigen decomposition of the correlation or the
covariance matrix of the variables describe a series of uncorrelated
linear combinations of the variables that contain most of the
variance.  In addition to data reduction, the eigenvectors from a PCA
are often inspected to learn more about the underlying structure of
the data.
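A compact Python sketch (not Stata) of the computation just described,
using made-up data:

```python
import numpy as np

# PCA via eigen decomposition of the correlation matrix; the data are
# illustrative (200 observations on 4 variables, one pair correlated).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[:, 1] += X[:, 0]                     # induce correlation

R = np.corrcoef(X, rowvar=False)
evals, evecs = np.linalg.eigh(R)
order = np.argsort(evals)[::-1]        # sort components by variance
evals, evecs = evals[order], evecs[:, order]

# Scores are uncorrelated linear combinations of standardized variables.
Z = (X - X.mean(0)) / X.std(0)
scores = Z @ evecs
print(round(evals.sum(), 6))           # eigenvalues of R sum to 4.0
```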

principal factor method.  The principal factor method is a method for
factor analysis in which the factor loadings, sometimes called factor
patterns, are computed using the squared multiple correlations as
estimates of the communality.  Also see factor analysis and
communality.

prior probabilities.  Prior probabilities in discriminant analysis are the
probabilities of an observation belonging to a group before the
discriminant analysis is performed.  Prior probabilities are often
based on the prevalence of the groups in the population as a whole.
Contrast to posterior probabilities.

Procrustes rotation.  A Procrustes rotation is an orthogonal or oblique
transformation, that is, a restricted Procrustes transformation
without translation or dilation (uniform scaling).

Procrustes transformation.  The goal of Procrustes transformation is to
transform the source matrix X to be as close as possible to the
target Y.  The permitted transformations are any combination of
dilation (uniform scaling), rotation and reflection (that is,
orthogonal or oblique transformations), and translation.  Closeness
is measured by residual sum of squares.  In some cases, unrestricted
Procrustes transformation is desired; this allows the data to be
transformed not just by orthogonal or oblique rotations, but by all
conformable regular matrices A.  Unrestricted Procrustes
transformation is equivalent to a multivariate regression.

The name comes from Procrustes of Greek mythology; Procrustes invited
guests to try his iron bed.  If the guest was too tall for the bed,
Procrustes would amputate the guest's feet, and if the guest was too
short, he would stretch the guest out on a rack.

Also see orthogonal rotation, oblique rotation, dilation, and
multivariate regression.
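For the restricted (orthogonal) case, the fit can be computed in closed
form from an SVD.  This Python sketch (not Stata) recovers a known
rotation, dilation, and translation from made-up data:

```python
import numpy as np

# Build a target Y from a known transformation of an illustrative source X.
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 2))
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
Y = 1.5 * X @ R_true + np.array([2.0, -1.0])

# Center both matrices; the rotation comes from the SVD of Xc'Yc.
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
U, s, Vt = np.linalg.svd(Xc.T @ Yc)
R = U @ Vt                              # optimal rotation/reflection
c = s.sum() / np.trace(Xc.T @ Xc)       # optimal dilation
t = Y.mean(0) - c * X.mean(0) @ R       # translation
resid = np.sum((c * X @ R + t - Y) ** 2)
print(round(c, 6), round(resid, 6))     # 1.5 0.0
```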

promax power rotation.  Promax power rotation is an oblique rotation.  It
does not fit in the minimizing-a-criterion framework that is at the
core of most other rotations.  The promax method (Hendrickson and
White 1964) was proposed before computing power became widely
available.  The promax rotation consists of three steps:

1.  Perform an orthogonal rotation.

2. Raise the elements of the rotated matrix to some power,
preserving the sign of the elements.  Typically the power is
in the range 2 < power < 4.  This operation is meant to
distinguish clearly between small and large values.

3. The matrix from step two is used as the target for an oblique
Procrustean rotation from the original matrix.

proximity, proximity matrix, and proximity measure.  Proximity or a
proximity measure means the nearness or farness of two things, such
as observations or variables or groups of observations, or a method
for quantifying the nearness or farness between two things.  A
proximity is measured by a similarity or dissimilarity.  A proximity
matrix is a matrix of proximities.  Also see similarity and
dissimilarity.

quadratic discriminant analysis.  Quadratic discriminant analysis (QDA)
is a parametric form of discriminant analysis and is a generalization
of LDA.  Like LDA, QDA assumes that the observations come from a
multivariate normal distribution, but unlike LDA, the groups are not
assumed to have equal covariance matrices.  Also see discriminant
analysis, linear discriminant analysis, and parametric methods.

quartimax rotation.  Quartimax rotation maximizes the variance of the
squared loadings within the rows of the matrix.  It is an orthogonal
rotation that is equivalent to minimizing the criterion

c(Lambda) = -1/4 sum_i sum_r lambda^4_{ir} = -1/4 <Lambda^2, Lambda^2>

See Crawford-Ferguson rotation for a definition of Lambda.

quartimin rotation.  Quartimin rotation is an oblique rotation that is
equivalent to quartimax rotation when quartimin is restricted to
orthogonal rotations.  Quartimin is equivalent to oblimin rotation
with gamma = 0.  Also see quartimax rotation, oblique rotation,
orthogonal rotation, and oblimin rotation.

reflection.  A reflection is an orientation reversing orthogonal
transformation, that is, a transformation that involves negating
coordinates in one or more dimensions.  A reflection is a Procrustes
transformation.

repeated measures.  Repeated measures data have repeated measurements for
the subjects over some dimension, such as time -- for example, test
scores at the start, midway, and end of the class.  The repeated
observations are typically not independent.  Repeated-measures ANOVA
is one approach for analyzing repeated measures data, and MANOVA is
another.  Also see sphericity.

rotation.  A rotation is an orientation preserving orthogonal
transformation.  A rotation is a Procrustes transformation.

Roy's largest root.  Roy's largest root test is a test statistic for the
hypothesis test H_0: mu_1 = ... = mu_k based on the largest
eigenvalue of E^{-1}H.  It is defined as

theta = lambda_1/(1+lambda_1)

Here H is the between matrix, and E is the within matrix. See between
matrix.

Sammon mapping criterion.  The Sammon (1969) mapping criterion is a loss
criterion used with MDS; it is the sum of the scaled, squared
differences between the distances and the disparities, normalized by
the sum of the disparities.  Also see multidimensional scaling,
modern scaling, and loss.

score.  A score for an observation after factor analysis, PCA, or LDA is
derived from a column of the loading matrix and is obtained as the
linear combination of that observation's data, using the coefficients
found in the loading.

score plot.  A score plot produces scatterplots of the score variables
after factor analysis, PCA, or LDA.

scree plot.  A scree plot is a plot of eigenvalues or singular values
ordered from greatest to least after an eigen decomposition or
singular value decomposition.  Scree plots help determine the number
of factors or components in an eigen analysis.  Scree is the
accumulation of loose stones or rocky debris lying on a slope or at
the base of a hill or cliff; this plot is called a scree plot because
it looks like a scree slope.  The goal is to determine the point
where the mountain gives way to the fallen rock.

Shepard diagram.  A Shepard diagram after MDS is a 2-dimensional plot of
high-dimensional dissimilarities or disparities versus the resulting
low-dimensional distances.  Also see multidimensional scaling.

similarity, similarity matrix, and similarity measure.  A similarity or a
similarity measure is a quantification of how alike two things are,
such as observations or variables or groups of observations, or a
method for quantifying that alikeness.  A similarity matrix is a
matrix containing similarity measurements.  The matching coefficient
is one example of a similarity measure.  Contrast to dissimilarity.
Also see proximity and matching coefficient.

single-linkage clustering.  Single-linkage clustering is a hierarchical
clustering method that computes the proximity between two groups as
the proximity between the closest pair of observations between the
two groups.

singular value decomposition.  A singular value decomposition (SVD) is a
factorization of a rectangular matrix.  It says that if M is an m*n
matrix, there exists a factorization of the form

M = U Sigma V^*

where U is an m*m unitary matrix, Sigma is an m*n matrix with
nonnegative numbers on the diagonal and zeros off the diagonal, and
V^* is the conjugate transpose of V, an n*n unitary matrix.  If M is
a real matrix, then so is V, and V^* = V'.
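A short Python sketch with an illustrative 3 x 2 matrix, confirming the
factorization and the orthogonality of U:

```python
import numpy as np

M = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # illustrative matrix
U, s, Vt = np.linalg.svd(M)            # full SVD: U is 3x3, Vt is 2x2

Sigma = np.zeros_like(M)               # 3x2: singular values on diagonal
Sigma[:2, :2] = np.diag(s)

# For a real M, V^* = V', so M = U Sigma V'.
print(np.allclose(U @ Sigma @ Vt, M))  # True
```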

sphericity.  Sphericity is the state or condition of being a sphere.  In
repeated measures ANOVA, sphericity concerns the equality of variance
in the difference between successive levels of the repeated measure.
The multivariate alternative to ANOVA, called MANOVA, does not
require the assumption of sphericity.  Also see repeated measures.

SSCP matrix.  SSCP is an acronym for the sums of squares and cross
products.  Also see between matrix.

stacked variables.  See crossed variables.

stacking variables.  See crossing variables.

standardized data.  Standardized data has a mean of zero and a standard
deviation of one.  You can standardize data x by taking (x -
x[bar])/sigma, where sigma is the standard deviation of the data.
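In Python (not Stata), with made-up numbers:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data
z = (x - x.mean()) / x.std()           # subtract mean, divide by std
print(z.mean(), z.std())               # 0.0 1.0
```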

stopping rules.  Stopping rules for hierarchical cluster analysis are
used to determine the number of clusters.  A stopping-rule value
(also called an index) is computed for each cluster solution, that
is, at each level of the hierarchy in hierarchical cluster analysis.
Also see hierarchical clustering.

stress.  See Kruskal stress and loss.

structure.  Structure, as in factor structure, is the correlations
between the variables and the common factors after factor analysis.
Structure matrices are available after factor analysis and LDA.  Also
see factor analysis and linear discriminant analysis.

supplementary rows or columns or supplementary variables.  Supplementary
rows or columns can be included in CA, and supplementary variables
can be included in MCA.  They do not affect the CA or MCA solution,
but they are included in plots and tables with statistics of the
corresponding row or column points.  Also see correspondence analysis
and multiple correspondence analysis.

SVD.  See singular value decomposition.

target rotation.  Target rotation minimizes the criterion

c(Lambda) = 1/2|Lambda - H|^2

for a given target matrix H.

See Crawford-Ferguson rotation for a definition of Lambda.

taxonomy.  Taxonomy is the study of the general principles of scientific
classification.  It also denotes classification, especially the
classification of plants and animals according to their natural
relationships.  Cluster analysis is a tool used in creating a
taxonomy and is synonymous with numerical taxonomy.  Also see cluster
analysis.

tetrachoric correlation.  A tetrachoric correlation estimates the
correlation coefficients of binary variables by assuming a latent
bivariate normal distribution for each pair of variables, with a
threshold model for manifest variables.

ties.  After discriminant analysis, ties in classification occur when two
or more posterior probabilities are equal for an observation.  They
are most common with KNN discriminant analysis.

total inertia or total principal inertia.  The total (principal) inertia
in CA and MCA is the sum of the principal inertias.  In CA, total
inertia is the Pearson chi^2/n.  In CA, the principal inertias are
the singular values; in MCA the principal inertias are the
eigenvalues.  Also see correspondence analysis and multiple
correspondence analysis.

uniqueness.  In factor analysis, the uniqueness is the percentage of a
variable's variance that is not explained by the common factors.  It
is also "1 - communality".  Also see communality.

unrestricted transformation.  An unrestricted transformation is a
Procrustes transformation that allows the data to be transformed, not
just by orthogonal and oblique rotations, but by all conformable
regular matrices.  This is equivalent to a multivariate regression.
Also see Procrustes transformation and multivariate regression.

varimax rotation.  Varimax rotation maximizes the variance of the squared
loadings within the columns of the matrix.  It is an orthogonal
rotation equivalent to oblimin with gamma = 1 or to the
Crawford-Ferguson family with kappa = 1/p, where p is the number of
rows of the matrix to be rotated.  Also see orthogonal rotation,
oblimin rotation, and Crawford-Ferguson rotation.

Ward's linkage clustering.  Ward's-linkage clustering is a hierarchical
clustering method that joins the two groups resulting in the minimum
increase in the error sum of squares.

weighted-average linkage clustering.  Weighted-average linkage clustering
is a hierarchical clustering method that uses the weighted average
similarity or dissimilarity of the two groups as the measure between
the two groups.

Wilks's lambda.  Wilks's lambda is a test statistic for the hypothesis
test H_0: mu_1 = mu_2 = ... = mu_k based on the eigenvalues lambda_1,
..., lambda_s of E^{-1}H.  It is defined as

Lambda = |E|/|E + H| = prod_{i=1}^s 1/(1 + lambda_i)

where H is the between matrix and E is the within matrix. See between
matrix.
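Both forms can be verified numerically; this Python sketch uses
illustrative H and E matrices (not from any real analysis):

```python
import numpy as np

H = np.array([[4.0, 1.0], [1.0, 3.0]])   # illustrative between matrix
E = np.array([[5.0, 0.5], [0.5, 6.0]])   # illustrative within matrix

# Determinant form: |E| / |E + H|
Lam_det = np.linalg.det(E) / np.linalg.det(E + H)

# Product form over the eigenvalues of E^{-1}H
lams = np.linalg.eigvals(np.linalg.solve(E, H)).real
Lam_prod = np.prod(1.0 / (1.0 + lams))
print(np.isclose(Lam_det, Lam_prod))     # True
```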

Wishart distribution.  The Wishart distribution is a family of
probability distributions for nonnegative-definite matrix-valued
random variables ("random matrices").  These distributions are of
great importance in the estimation of covariance matrices in
multivariate statistics.

within matrix.  See between matrix.

References

Bartlett, M. S. 1937. The statistical conception of mental factors.
British Journal of Psychology 28: 97-104.

------. 1938. Methods of estimating mental factors.  Nature, London 141:
609-610.

Bellman, R. E. 1961. Adaptive Control Processes. Princeton, NJ:
Princeton University Press.

Bentler, P. M. 1977. Factor simplicity index and transformations.
Psychometrika 42: 277-295.

Comrey, A. L. 1967. Tandem criteria for analytic rotation in factor
analysis.  Psychometrika 32: 277-295.

Cox, T. F., and M. A. A. Cox. 2001.  Multidimensional Scaling. 2nd ed.
Boca Raton, FL: Chapman & Hall/CRC.

Crawford, C. B., and G. A. Ferguson. 1970. A general rotation criterion
and its use in orthogonal rotation. Psychometrika 35: 321-332.

Fisher, R. A. 1936. The use of multiple measurements in taxonomic
problems.  Annals of Eugenics 7: 179-188.

Hendrickson, A. E., and P. O. White. 1964. Promax: A quick method for
rotation to oblique simple structure.  British Journal of Statistical
Psychology 17: 65-70.

Jennrich, R. I. 2004. Rotation to simple loadings using component loss
functions: The orthogonal case.  Psychometrika 69: 257-273.

Kaiser, H. F. 1974. An index of factor simplicity. Psychometrika 39:
31-36.

Kruskal, J. B. 1964. Multidimensional scaling by optimizing goodness of
fit to a nonmetric hypothesis. Psychometrika 29: 1-27.

Mahalanobis, P. C. 1936. On the generalized distance in statistics.
National Institute of Science in India 12: 49-55.

Sammon, J. W., Jr. 1969. A nonlinear mapping for data structure analysis.
IEEE Transactions on Computers 18: 401-409.

Thomson, G. H. 1951. The Factorial Analysis of Human Ability.  London:
University of London Press.

```