Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# st: PCA: Principal Components as weighted sums of standardized variables (error in MV Ref. Manual?)

From: Partho Sarkar
To: statalist@hsphsun2.harvard.edu
Subject: st: PCA: Principal Components as weighted sums of standardized variables (error in MV Ref. Manual?)
Date: Fri, 17 May 2013 17:54:31 +0530

I have just started using Stata for PCA, and am puzzled by what seems
to be an error in the Multivariate Statistics Reference Manual.
In the chapter "Postestimation tools for pca and pcamat" [Stata
Multivariate Statistics Reference Manual, Release 11, p. 580], after
working through the example with the audiometric data
(use http://www.stata-press.com/data/r11/audiometric) and calculating
the principal components, the manual says (long quote begins):

[BEGIN QUOTE, with comments in square brackets]  "Predicting the
component scores

After deciding on the number of components..., you may want to
estimate the component scores for all respondents. To estimate only
the first component scores, which here is called pc1:
[enter command]
predict pc1
[output]
------------------------------------------------------
    Variable |    Comp1     Comp2     Comp3     Comp4
-------------+----------------------------------------
      lft500 |   0.4011   -0.3170    0.1582   -0.3278
     lft1000 |   0.4210   -0.2255   -0.0520   -0.4816
     lft2000 |   0.3664    0.2386   -0.4703   -0.2824
     lft4000 |   0.2809    0.4742    0.4295   -0.1611
     rght500 |   0.3433   -0.3860    0.2593    0.4876
    rght1000 |   0.4114   -0.2318   -0.0289    0.3723
    rght2000 |   0.3115    0.3171   -0.5629    0.3914
    rght4000 |   0.2542    0.5135    0.4262    0.1591
------------------------------------------------------

[This is just the Principal components (eigenvectors) matrix in the PC
computations]

The table is informing you that pc1 could be obtained as a weighted
sum of standardized variables,
. egen std_lft500 = std(lft500)
. egen std_lft1000 = std(lft1000)
. egen std_rght4000 = std(rght4000)
[etc. etc.]
. gen pc1 = 0.4011*std_lft500 + 0.4210*std_lft500 [TYPO] + ... +
0.2542*std_rght4000

[END QUOTE]
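
For reference, the manual's recipe can be sketched without retyping the rounded coefficients. This is only a minimal sketch, not the manual's own code. It assumes the model was fit with pca lft* rght*, components(4), which is a guess chosen to match the four-component table quoted above. It reads the first eigenvector from e(L) and compares the result with a tolerance, because predict stores its result as a float by default. The name pc1check is hypothetical.

```stata
* Sketch only: the pca specification below is an assumption,
* since the quoted passage does not show it.
use http://www.stata-press.com/data/r11/audiometric, clear
pca lft* rght*, components(4)
predict pc1                        // first component score (float by default)

matrix L = e(L)                    // eigenvectors at full precision
generate double pc1check = 0       // hypothetical name for the hand-built score
foreach v of varlist lft* rght* {
    egen double std_`v' = std(`v') // standardize each variable, as in the manual
    replace pc1check = pc1check + L[rownumb(L, "`v'"), colnumb(L, "Comp1")] * std_`v'
}

* Compare with a tolerance rather than exact equality
assert reldif(pc1, pc1check) < 1e-5
```

Every term in the sum uses the std_ version of its variable, and the comparison uses reldif() rather than ==, since four-decimal coefficients and float storage would make an exact match unlikely even when the formula is right.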

Accordingly, after standardizing all the variables, I tried this
corrected version of the equation above:

gen pc1try = 0.4011*std_lft500 + 0.4210*std_lft1000
+0.3664*std_lft2000+0.2809*lft4000+0.3433*rght500+0.4114*rght1000+0.3115*
rght2000+0.2542*std_rght4000

But

assert pc1==pc1try

produces:

100 contradictions in 100 observations
assertion is false
r(9);

And sure enough, here is what the first few lines of data look like:

list pc1 pc1try in 1/4

     +-----------------------+
     |       pc1      pc1try |
     |-----------------------|
  1. |  1.180442    8.493489 |
  2. | -.2950325    3.019228 |
  3. |  .7345378    7.701978 |
  4. | -2.132017   -12.07502 |
     +-----------------------+

What is going on here? I noticed, of course, that the formula given
above (gen pc1, gen pc1try, ...) is not a weighted sum properly speaking,
since the weights do not sum to 1. I tried a modification, dividing by
the sum of the weights, but this too does not give the correct pc1:

g pc2try=pc1try/2.7898

. assert pc1==pc2try
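
A quick arithmetic check on the Comp1 column may help frame the question. This sketch uses only the four-decimal coefficients printed in the table above, and the scalar names are hypothetical. The coefficients sum to about 2.7898, while their squares sum to about 1. That would be consistent with the eigenvector being reported at unit length (sum of squares equal to 1) rather than scaled so that the weights sum to 1.

```stata
* Comp1 coefficients as printed in the table (rounded to four decimals)
scalar sumw  = 0.4011 + 0.4210 + 0.3664 + 0.2809 + 0.3433 + 0.4114 + 0.3115 + 0.2542
scalar sumsq = 0.4011^2 + 0.4210^2 + 0.3664^2 + 0.2809^2 + 0.3433^2 + 0.4114^2 + 0.3115^2 + 0.2542^2
display "sum of coefficients: " sumw
display "sum of squares:      " sumsq
```

If that reading is right, dividing by 2.7898 only rescales the score and would not be expected to recover pc1.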

I am sorry if this is too obvious, or if I have misunderstood something.

Thanks and regards,

Partho Sarkar