From: Partho Sarkar <partho.ss+lists@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: st: PCA: Principal Components as weighted sums of standardized variables (error in MV Ref. Manual?)
Date: Fri, 17 May 2013 17:54:31 +0530

I have just started using Stata for PCA, and am puzzled by a seeming error in the Multivariate Statistics Reference Manual.

In the chapter "Postestimation tools for pca and pcamat" [Stata Multivariate Statistics Reference Manual, Release 11, p. 580], after having worked through the example audiometry data and calculated the principal components (use http://www.stata-press.com/data/r11/audiometric (Audiometric measures)), the manual says (long quote begins):

[BEGIN QUOTE, with comments in square brackets]

"Predicting the component scores

After deciding on the number of components..., you may want to estimate the component scores for all respondents. To estimate only the first component scores, which here is called pc1:

[enter command]

    predict pc1

[output]

    ------------------------------------------------------
        Variable |    Comp1     Comp2     Comp3     Comp4
    -------------+----------------------------------------
          lft500 |   0.4011   -0.3170    0.1582   -0.3278
         lft1000 |   0.4210   -0.2255   -0.0520   -0.4816
         lft2000 |   0.3664    0.2386   -0.4703   -0.2824
         lft4000 |   0.2809    0.4742    0.4295   -0.1611
         rght500 |   0.3433   -0.3860    0.2593    0.4876
        rght1000 |   0.4114   -0.2318   -0.0289    0.3723
        rght2000 |   0.3115    0.3171   -0.5629    0.3914
        rght4000 |   0.2542    0.5135    0.4262    0.1591
    ------------------------------------------------------

[This is just the principal components (eigenvectors) matrix from the PC computations]

The table is informing you that pc1 could be obtained as a weighted sum of standardized variables,

    . egen std_lft500 = std(lft500)
    . egen std_lft1000 = std(lft1000)
    . egen std_rght4000 = std(rght4000)
    [etc. etc.]
    . gen pc1 = 0.4011*std_lft500 + 0.4210*std_lft500 [TYPO] + ...
        + 0.2542*std_rght4000

[END QUOTE]

Accordingly, after standardizing all the variables, I tried this corrected version of the equation above:

    gen pc1try = 0.4011*std_lft500 + 0.4210*std_lft1000 + 0.3664*std_lft2000
        + 0.2809*lft4000 + 0.3433*rght500 + 0.4114*rght1000 + 0.3115*rght2000
        + 0.2542*std_rght4000

But

    assert pc1==pc1try

produces:

    100 contradictions in 100 observations
    assertion is false
    r(9);

And sure enough, here is what the first few lines of data look like:

    list pc1 pc1try in 1/4

         +-----------------------+
         |       pc1      pc1try |
         |-----------------------|
      1. |  1.180442    8.493489 |
      2. | -.2950325    3.019228 |
      3. |  .7345378    7.701978 |
      4. | -2.132017   -12.07502 |
         +-----------------------+

What is going on here?

I noticed, of course, that the formula given above (gen pc1, gen pc1try, ...) is not a weighted sum properly speaking, since the weights do not sum to 1. I tried a modification, dividing by the sum of the weights, but this too does not give the correct pc1:

    g pc2try = pc1try/2.7898
    . assert pc1==pc2try
    100 contradictions in 100 observations

I am sorry if this is too obvious, or if I have misunderstood something.

Thanks and regards,
Partho Sarkar
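
For reference, here is a minimal sketch of the manual's recipe written out in full, meant to be run as a do-file. It rests on several assumptions: that pca is run on the correlation matrix (its default), that the variable list is lft* rght*, and that every one of the eight variables enters the sum in standardized form. Note that the pc1try line in the message above uses lft4000, rght500, rght1000 and rght2000 without the std_ prefix. Because the loadings are copied from a table rounded to four decimals, an exact test with == would probably fail even for a correct sum, so the sketch compares within a tolerance. The value 0.01 is an arbitrary illustrative choice, not something taken from the manual.

```stata
* Sketch only: assumes the Release 11 audiometric data and pca's default
* (correlation-matrix) analysis.
use http://www.stata-press.com/data/r11/audiometric, clear
pca lft* rght*
predict pc1

* Standardize every variable that enters the weighted sum.
foreach v of varlist lft500 lft1000 lft2000 lft4000 ///
                     rght500 rght1000 rght2000 rght4000 {
    egen std_`v' = std(`v')
}

* Weighted sum of the standardized variables, using the rounded Comp1
* loadings from the table quoted in the message.
gen pc1try = 0.4011*std_lft500  + 0.4210*std_lft1000  ///
           + 0.3664*std_lft2000 + 0.2809*std_lft4000  ///
           + 0.3433*std_rght500 + 0.4114*std_rght1000 ///
           + 0.3115*std_rght2000 + 0.2542*std_rght4000

* Compare within a tolerance rather than with ==, since the loadings are
* rounded and the variables are stored as floats (0.01 is an assumed threshold).
assert abs(pc1 - pc1try) < 0.01
```

The weights of a principal component are the elements of a unit-length eigenvector, so their squares sum to 1 while the weights themselves need not; dividing by the sum of the weights is therefore not part of the recipe.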

**Follow-Ups**:
**Re: st: PCA: Principal Components as weighted sums of standardized variables (error in MV Ref. Manual?)** *From:* Partho Sarkar <partho.ss+lists@gmail.com>

