💡 Learning objectives:
  ■ Basic concepts of dimension reduction
  ■ Principal Component Analysis (PCA)
    ■ What are principal components?
    ■ Eigenvalues & variance decomposition
  ■ Applications of PCA
  ■ Combining PCA with cluster analysis


pacman::p_load(dplyr, FactoMineR, factoextra)
The decathlon dataset
D = decathlon2
head(D)
          X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
SEBRLE    11.04      7.58    14.83      2.07 49.81        14.69  43.75
CLAY      10.76      7.40    14.26      1.86 49.37        14.05  50.72
BERNARD   11.02      7.23    14.25      1.92 48.93        14.99  40.87
YURKOV    11.34      7.09    15.19      2.10 50.42        15.31  46.26
ZSIVOCZKY 11.13      7.30    13.48      2.01 48.62        14.17  45.67
McMULLEN  10.83      7.31    13.76      2.13 49.91        14.38  44.41
          Pole.vault Javeline X1500m Rank Points Competition
SEBRLE          5.02    63.19  291.7    1   8217    Decastar
CLAY            4.92    60.15  301.5    2   8122    Decastar
BERNARD         5.32    62.77  280.1    4   8067    Decastar
YURKOV          4.72    63.44  276.4    5   8036    Decastar
ZSIVOCZKY       4.42    55.37  268.0    7   8004    Decastar
McMULLEN        4.42    56.37  285.1    8   7995    Decastar


【A】Principal Component Analysis

We use PCA(), the enhanced PCA function from the FactoMineR package; the default arguments are usually sufficient.

pca = PCA(D[,1:10])

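By default (graph = TRUE), PCA() also plots all "individuals" and "variables" projected onto the plane of the first two principal components once the analysis is done.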
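To make the default behaviour more concrete: PCA() standardizes the variables by default (scale.unit = TRUE), which amounts to an eigendecomposition of the correlation matrix. The sketch below checks this with base R only. The matrix X is simulated stand-in data (an assumption, not decathlon2), and prcomp() is swapped in for PCA() so the block runs without extra packages.

```r
# Stand-in data: 27 rows, 4 numeric columns (hypothetical, not decathlon2)
set.seed(1)
n <- 27
z <- rnorm(n)
X <- cbind(a = z + rnorm(n), b = z + rnorm(n), c = rnorm(n), d = -z + rnorm(n))

# PCA on centered, standardized variables with base R
p <- prcomp(X, center = TRUE, scale. = TRUE)

# Component variances from prcomp() versus eigenvalues of the correlation matrix
ev_prcomp <- p$sdev^2
ev_eigen  <- eigen(cor(X))$values

print(round(cbind(ev_prcomp, ev_eigen), 4))
```

The two columns should agree up to floating-point error. Replacing X with D[,1:10] should give the eigenvalues that PCA() reports for the decathlon data.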

Contents of the pca object

PCA() returns a PCA object, which we name pca.

pca
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 27 individuals, described by 10 variables
*The results are available in the following objects:

   name               description                          
1  "$eig"             "eigenvalues"                        
2  "$var"             "results for the variables"          
3  "$var$coord"       "coord. for the variables"           
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"             
6  "$var$contrib"     "contributions of the variables"     
7  "$ind"             "results for the individuals"        
8  "$ind$coord"       "coord. for the individuals"         
9  "$ind$cos2"        "cos2 for the individuals"           
10 "$ind$contrib"     "contributions of the individuals"   
11 "$call"            "summary statistics"                 
12 "$call$centre"     "mean of the variables"              
13 "$call$ecart.type" "standard error of the variables"    
14 "$call$row.w"      "weights for the individuals"        
15 "$call$col.w"      "weights for the variables"          


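pca$eig: information carried by each principal component
  • A PCA on 10 variables yields 10 principal components (mutually orthogonal dimensions)
  • Each eigenvalue is the amount of information (variance) carried by its principal component
  • The first principal component has the largest eigenvalue; the rest follow in decreasing order
  • Because the variables are standardized (the PCA() default), the eigenvalues sum to exactly the number of variables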
get_eigenvalue(pca)
       eigenvalue variance.percent cumulative.variance.percent
Dim.1     3.74997          37.4997                      37.500
Dim.2     1.74517          17.4517                      54.951
Dim.3     1.51783          15.1783                      70.130
Dim.4     1.03220          10.3220                      80.452
Dim.5     0.61784           6.1784                      86.630
Dim.6     0.42829           4.2829                      90.913
Dim.7     0.32591           3.2591                      94.172
Dim.8     0.27938           2.7938                      96.966
Dim.9     0.19111           1.9111                      98.877
Dim.10    0.11230           1.1230                     100.000
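The properties listed above can be checked directly from the printed table. This is a minimal sketch in base R: the eigenvalues are typed in from the output above, and the 80% cut-off is only an illustrative threshold (an assumption, not a rule from the analysis).

```r
# Eigenvalues copied from the get_eigenvalue(pca) output above
eig <- c(3.74997, 1.74517, 1.51783, 1.03220, 0.61784,
         0.42829, 0.32591, 0.27938, 0.19111, 0.11230)

total   <- sum(eig)            # about 10, the number of variables
pct     <- 100 * eig / total   # reproduces variance.percent
cum_pct <- cumsum(pct)         # reproduces cumulative.variance.percent

# Smallest number of components whose cumulative share reaches 80%
# (80 is a hypothetical cut-off chosen for illustration)
k80 <- which(cum_pct >= 80)[1]

print(round(total, 3))
print(round(cbind(eig, pct, cum_pct), 3))
print(k80)
```

With these values, four components are needed to pass the 80% mark, which matches the cumulative column in the table.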


【B】Variables in the reduced space

pca$var$coord: coordinates of each variable on each dimension
pca$var$coord
                 Dim.1     Dim.2      Dim.3     Dim.4    Dim.5
X100m        -0.818952  0.342779  0.1008645  0.101342 -0.21981
Long.jump     0.758899 -0.381493 -0.0062613 -0.185424  0.26371
Shot.put      0.715078  0.282117  0.4738546  0.036104 -0.27864
High.jump     0.608493  0.611354  0.0046060  0.071244  0.30059
X400m        -0.643848  0.148422  0.5157594  0.269785  0.19924
X110m.hurdle -0.716420  0.297552  0.4164510 -0.159781  0.16102
Discus        0.716888  0.204398  0.2703222  0.397623 -0.33949
Pole.vault   -0.221417 -0.737548  0.4030836 -0.251549 -0.26259
Javeline      0.355176  0.098531  0.6954337 -0.485559  0.13342
X1500m        0.069712 -0.568120  0.3527578  0.652461  0.25368
pca$var$cos2: share of each variable's information represented on each dimension
pca$var$cos2
                 Dim.1     Dim.2       Dim.3     Dim.4    Dim.5
X100m        0.6706825 0.1174973 0.010173655 0.0102702 0.048315
Long.jump    0.5759270 0.1455370 0.000039203 0.0343821 0.069545
Shot.put     0.5113370 0.0795898 0.224538173 0.0013035 0.077639
High.jump    0.3702640 0.3737540 0.000021215 0.0050756 0.090352
X400m        0.4145404 0.0220292 0.266007740 0.0727841 0.039696
X110m.hurdle 0.5132580 0.0885371 0.173431450 0.0255299 0.025928
Discus       0.5139286 0.0417785 0.073074093 0.1581041 0.115255
Pole.vault   0.0490256 0.5439768 0.162476407 0.0632771 0.068955
Javeline     0.1261497 0.0097083 0.483627983 0.2357675 0.017802
X1500m       0.0048598 0.3227600 0.124438033 0.4257058 0.064353
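The two tables are closely related. In a standardized PCA, a variable's cos2 on a dimension is the square of its coordinate, and the squared coordinates of all variables on one dimension add up to that dimension's eigenvalue. The sketch below checks both on Dim.1 using numbers typed in from the output above; the names coord1, cos2_1, max_gap and eig1 are introduced here for illustration only.

```r
# Dim.1 values copied from the pca$var$coord and pca$var$cos2 output above
vars <- c("X100m", "Long.jump", "Shot.put", "High.jump", "X400m",
          "X110m.hurdle", "Discus", "Pole.vault", "Javeline", "X1500m")
coord1 <- c(-0.818952, 0.758899, 0.715078, 0.608493, -0.643848,
            -0.716420, 0.716888, -0.221417, 0.355176, 0.069712)
cos2_1 <- c(0.6706825, 0.5759270, 0.5113370, 0.3702640, 0.4145404,
            0.5132580, 0.5139286, 0.0490256, 0.1261497, 0.0048598)
names(coord1) <- names(cos2_1) <- vars

# Largest difference between squared coordinate and printed cos2
max_gap <- max(abs(coord1^2 - cos2_1))

# Sum of squared coordinates over the 10 variables: the Dim.1 eigenvalue
eig1 <- sum(coord1^2)

print(max_gap)
print(round(eig1, 4))
```

The gap is only rounding noise from the printed digits, and eig1 comes out close to the 3.74997 shown for Dim.1 in the eigenvalue table.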
Project the variables onto the principal component space
fviz_pca_var(pca)



【C】Individuals in the reduced space

pca$ind$coord: coordinates of each individual on each dimension
pca$ind$coord
                Dim.1     Dim.2     Dim.3     Dim.4     Dim.5
SEBRLE       0.277958 -0.536434  1.585239  0.105823  1.074623
CLAY         0.904854 -2.094280  0.840685  1.850718 -0.408645
BERNARD     -1.372266 -1.348116  0.961932 -1.493072 -0.182667
YURKOV      -0.928205  2.281744  1.942688  0.096823  0.190927
ZSIVOCZKY   -0.103817  1.089822 -2.098908  0.071906 -0.032938
McMULLEN     0.239858  0.939092 -0.818136  1.201893  1.830199
MARTINEAU   -2.537291  1.801094  0.051975  0.374306 -2.285411
HERNU       -1.902843 -0.330277  1.288682  0.766505  0.239465
BARRAS      -1.805625  0.302590 -0.592810  0.656526 -0.244039
NOOL        -2.881737  0.863854 -1.402448 -1.491195  1.358726
BOURGUIGNON -4.505530 -0.485422  1.202704  0.951363  0.508865
Sebrle       3.567756  0.068007  1.911216 -1.042363 -0.300596
Clay         3.472177 -0.705599  1.607029 -0.696108  0.741815
Karpov       4.328761  0.160789 -1.152529  0.407689 -0.772485
Macey        1.944475  2.523948 -0.260304 -0.079809 -0.025024
Warners      1.552082 -1.488634 -1.414196 -0.549665  0.102190
Zsivoczky    0.475153  1.971763  0.900183 -0.725288  0.171696
Hernu        0.280841  0.822696 -0.905794 -0.782389 -0.771389
Bernard      1.533280  1.085832 -1.245717  0.534722  1.042879
Schwarzl    -0.677974 -1.134257 -0.422180 -0.609851 -0.100686
Pogorelov   -0.077879 -0.333658  0.607951  1.446999  0.204623
Schoenbeck  -0.487405 -0.860688  0.866712 -0.173226 -0.432985
Barras      -0.413081  1.366893  0.227296 -0.753324 -0.861778
KARPOV       0.967748 -0.995599 -0.476019  2.501069 -0.661349
WARNERS     -0.280043 -0.912158 -1.416982 -0.147161  0.036162
Nool        -0.535389 -2.135638  0.616053 -1.808079 -0.319954
Drews       -1.035856 -1.917364 -2.404320 -0.614812 -0.102226
pca$ind$cos2: share of each individual's information represented on each dimension
pca$ind$cos2
                Dim.1      Dim.2      Dim.3      Dim.4       Dim.5
SEBRLE      0.0154465 0.05753138 0.50241310 0.00223886 0.230878821
CLAY        0.0655742 0.35127414 0.05660346 0.27431969 0.013374224
BERNARD     0.2322366 0.22413434 0.11411498 0.27492584 0.004115041
YURKOV      0.0848122 0.51251263 0.37151534 0.00092284 0.003588447
ZSIVOCZKY   0.0015087 0.16625497 0.61666612 0.00072377 0.000151861
McMULLEN    0.0082682 0.12674108 0.09619490 0.20760230 0.481390643
MARTINEAU   0.4048095 0.20397763 0.00016986 0.00880975 0.328426736
HERNU       0.3626087 0.01092418 0.16631227 0.05883861 0.005742729
BARRAS      0.6179591 0.01735462 0.06660942 0.08169748 0.011288151
NOOL        0.5144396 0.04622816 0.12184266 0.13775107 0.114364103
BOURGUIGNON 0.8414496 0.00976733 0.05995894 0.03751710 0.010733484
Sebrle      0.6742810 0.00024499 0.19349512 0.05755578 0.004786484
Clay        0.6864011 0.02834588 0.14703531 0.02758845 0.031330374
Karpov      0.8267406 0.00114065 0.05860654 0.00733332 0.026328251
Macey       0.3497653 0.58929505 0.00626806 0.00058922 0.000057928
Warners     0.3365404 0.30958768 0.27940052 0.04220890 0.001458908
Zsivoczky   0.0354688 0.61078646 0.12730376 0.08264196 0.004631289
Hernu       0.0247329 0.21224209 0.25728350 0.19195436 0.186594955
Bernard     0.3368420 0.16893075 0.22234247 0.04096756 0.155830032
Schwarzl    0.1319041 0.36919405 0.05114794 0.10672826 0.002909200
Pogorelov   0.0010328 0.01895803 0.06294013 0.35655465 0.007130132
Schoenbeck  0.0643656 0.20070808 0.20352752 0.00813015 0.050794887
Barras      0.0272368 0.29823283 0.00824647 0.09058360 0.118543191
KARPOV      0.0868773 0.09194984 0.02101993 0.58027427 0.040573584
WARNERS     0.0166843 0.17700968 0.42715467 0.00460727 0.000278197
Nool        0.0312477 0.49720414 0.04137291 0.35638092 0.011159741
Drews       0.0963567 0.33013525 0.51911955 0.03394434 0.000938440
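For individuals, cos2 on a dimension is the squared coordinate divided by the individual's squared distance to the origin, so each row adds up to 1 when all dimensions are included (the table above shows only the first five, so its rows sum to less than 1). A minimal sketch with base R follows; X is simulated stand-in data and prcomp() stands in for PCA(), both assumptions made to keep the block self-contained.

```r
# Stand-in data: 27 individuals, 4 variables (hypothetical, not decathlon2)
set.seed(2)
X <- matrix(rnorm(27 * 4), nrow = 27,
            dimnames = list(paste0("ind", 1:27), paste0("v", 1:4)))

scores <- prcomp(X, scale. = TRUE)$x   # coordinates on all 4 components
d2     <- rowSums(scores^2)            # squared distance of each individual to the origin
cos2   <- scores^2 / d2                # divides row i by d2[i] (recycling down columns)

print(round(head(cos2, 3), 4))
print(range(rowSums(cos2)))            # every row sums to 1
```

An individual with a high cos2 on Dim.1 and Dim.2 is well represented on the two-dimensional plot; a low value means its position on the plot should be read with caution.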
Project the individuals onto the principal component space
fviz_pca_ind(pca)



【D】Projecting individuals and variables together (Biplot)

Project individuals and variables onto the principal component space
fviz_pca_biplot(
  pca, pointsize="cos2", repel=TRUE,
  col.var="red", col.ind="#E7B800", alpha.ind=0.3)

Cluster the individuals
kmg = kmeans(D[,1:10],3)$cluster %>% factor
table(kmg)
kmg
 1  2  3 
 9  5 13 
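Two things are worth keeping in mind about the clustering step. kmeans() starts from random centers, so cluster labels and sizes can differ between runs, and it works on raw Euclidean distances, so a column with a wide numeric range (here X1500m, measured in seconds) can dominate the result. The sketch below shows one possible remedy: fix the seed, standardize the columns and use several random starts. X is simulated stand-in data, and nstart = 25 is an arbitrary choice.

```r
set.seed(42)
# Stand-in data: two columns on very different scales (hypothetical)
X <- cbind(seconds = rnorm(27, mean = 280, sd = 12),
           metres  = rnorm(27, mean = 2,   sd = 0.1))

# Raw values: distances are driven almost entirely by the "seconds" column
km_raw <- kmeans(X, centers = 3, nstart = 25)

# Standardized values: each column contributes on the same scale
km_scaled <- kmeans(scale(X), centers = 3, nstart = 25)

grp <- factor(km_scaled$cluster)
print(table(grp))
```

Applied to the notebook's data, the same idea would read kmeans(scale(D[,1:10]), 3, nstart = 25), which also matches the standardization that PCA() applies by default.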
Project individuals and variables onto the principal component space, colored by cluster
fviz_pca_biplot(
  pca, repel=TRUE, col.var="black", 
  col.ind=kmg, alpha.ind=0.6, pointshape=16, 
  addEllipses = TRUE, ellipse.level = 0.6, mean.point = FALSE)


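💡 FactoMineR and factoextra are powerful packages. Beyond continuous variables, they also support the corresponding analyses for categorical variables (MCA) and even mixed data (FAMD). Their plotting functions are flexible as well: besides the variables and individuals used in the analysis, supplementary continuous or categorical variables that were left out of the analysis, and new data points that are not in the original data, can all be projected onto the principal component space.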
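As a sketch of the supplementary elements mentioned above, PCA() accepts the arguments ind.sup, quanti.sup and quali.sup. The split used here (the last four rows as "new" individuals, Rank and Points as supplementary continuous variables, Competition as a supplementary categorical variable) is an illustrative choice, not something fixed by the analysis above.

```r
library(FactoMineR)
library(factoextra)

data(decathlon2)   # the dataset ships with factoextra

res <- PCA(decathlon2,
           ind.sup    = 24:27,   # rows treated as new points (illustrative choice)
           quanti.sup = 11:12,   # Rank and Points, kept out of the axes
           quali.sup  = 13,      # Competition, kept out of the axes
           graph      = FALSE)

res$ind.sup$coord      # coordinates of the supplementary individuals
res$quanti.sup$coord   # coordinates of the supplementary continuous variables

# Color the individuals by the supplementary categorical variable
fviz_pca_ind(res, habillage = 13, repel = TRUE)
```

Supplementary elements do not influence the principal components; they are only positioned afterwards in the space built from the active rows and columns.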