💡 Learning objectives:
  ■ Basic concepts of cluster analysis
  ■ Distance matrix
  ■ Hierarchical cluster analysis
  ■ Reading a dendrogram
  ■ Choosing the number of clusters from the dendrogram
  ■ Profiling each cluster by its group means


pacman::p_load(dplyr, ggplot2)


【A】 Airline customer dataset

A = read.csv('data/AirlinesCluster.csv')
summary(A)
    Balance          QualMiles       BonusMiles       BonusTrans  
 Min.   :      0   Min.   :    0   Min.   :     0   Min.   : 0.0  
 1st Qu.:  18528   1st Qu.:    0   1st Qu.:  1250   1st Qu.: 3.0  
 Median :  43097   Median :    0   Median :  7171   Median :12.0  
 Mean   :  73601   Mean   :  144   Mean   : 17145   Mean   :11.6  
 3rd Qu.:  92404   3rd Qu.:    0   3rd Qu.: 23800   3rd Qu.:17.0  
 Max.   :1704838   Max.   :11148   Max.   :263685   Max.   :86.0  
  FlightMiles     FlightTrans    DaysSinceEnroll
 Min.   :    0   Min.   : 0.00   Min.   :   2   
 1st Qu.:    0   1st Qu.: 0.00   1st Qu.:2330   
 Median :    0   Median : 0.00   Median :4096   
 Mean   :  460   Mean   : 1.37   Mean   :4119   
 3rd Qu.:  311   3rd Qu.: 1.00   3rd Qu.:5790   
 Max.   :30817   Max.   :53.00   Max.   :8296   



【B】 Data normalization (standardization)

🗿 Why normalize the data?

colMeans(A) %>% sort
    FlightTrans      BonusTrans       QualMiles     FlightMiles DaysSinceEnroll 
         1.3736         11.6019        144.1145        460.0558       4118.5594 
     BonusMiles         Balance 
     17144.8462      73601.3276 
AN = scale(A) %>% data.frame
sapply(AN, mean)
                 Balance                QualMiles               BonusMiles 
 0.000000000000000027654  0.000000000000000026507 -0.000000000000000042736 
              BonusTrans              FlightMiles              FlightTrans 
-0.000000000000000071911  0.000000000000000014818  0.000000000000000010741 
         DaysSinceEnroll 
 0.000000000000000055637 
sapply(AN, sd)
        Balance       QualMiles      BonusMiles      BonusTrans     FlightMiles 
              1               1               1               1               1 
    FlightTrans DaysSinceEnroll 
              1               1 
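
To make the motivation for scaling concrete, here is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical, not taken from AirlinesCluster.csv). It compares Euclidean distances before and after `scale()` for three invented customers.

```r
# Toy data (invented values): customer b differs from a mainly in trips,
# customer c differs from a only in balance.
toy = data.frame(
  Balance     = c(50000, 50500, 50800),
  FlightTrans = c(0, 20, 0)
)
rownames(toy) = c("a", "b", "c")

# Raw distances: Balance dominates because its units are much larger
d_raw = as.matrix(dist(toy))

# Standardized distances: both columns contribute on a comparable scale
d_std = as.matrix(dist(scale(toy)))

print(round(d_raw, 1))
print(round(d_std, 2))
```

On the raw scale, a's nearest neighbor is b, because the 20-trip difference is tiny next to the balance difference. After standardization, a's nearest neighbor becomes c. Clustering on unscaled columns would therefore be driven almost entirely by the variables with the largest units.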



【C】 Hierarchical clustering

1. Distance matrix

d = dist(AN, method="euclidean")
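
As a small illustration of what `dist()` returns, the sketch below uses three made-up points (`pts` is hypothetical) and checks one entry by hand.

```r
# Three invented points in two dimensions
pts = rbind(p1 = c(0, 0), p2 = c(3, 4), p3 = c(6, 8))

# dist() returns the lower triangle of the pairwise distance matrix
d_toy = dist(pts, method = "euclidean")
print(d_toy)

# The p1-p2 distance computed by hand: sqrt(3^2 + 4^2)
manual = sqrt(sum((pts["p1", ] - pts["p2", ])^2))

# A "dist" object for n rows stores n * (n - 1) / 2 values
n = nrow(pts)
n_pairs = n * (n - 1) / 2
print(length(d_toy) == n_pairs)
```

The same counting applies to the full data: with the 3,999 customers tallied in the cluster table further down, the distance object holds 3999 * 3998 / 2 = 7,994,001 pairwise distances, which is why hierarchical clustering becomes memory-hungry on large datasets.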

2. Hierarchical clustering

hc = hclust(d, method='ward.D')
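
A side note on the linkage choice, offered as a sketch on simulated data rather than a correction of the analysis here: according to the `hclust` documentation, `"ward.D"` implements Ward's criterion only when it is given squared distances, while `"ward.D2"` squares the distances internally. The variable names below (`x`, `d_sim`, `hc_d`, `hc_d2`, `hc_sq`) are hypothetical.

```r
set.seed(1)                                    # arbitrary seed
x = scale(matrix(rnorm(60), ncol = 2))         # 30 simulated points
d_sim = dist(x, method = "euclidean")

hc_d  = hclust(d_sim,   method = "ward.D")     # same call pattern as in this section
hc_d2 = hclust(d_sim,   method = "ward.D2")    # Ward's criterion on unsquared input
hc_sq = hclust(d_sim^2, method = "ward.D")     # ward.D on squared distances

# ward.D on squared distances matches ward.D2 up to squared merge heights
same_heights = isTRUE(all.equal(sqrt(hc_sq$height), hc_d2$height))
print(same_heights)

# Cross-tabulate the 3-cluster partitions from ward.D and ward.D2
agree = table(ward.D = cutree(hc_d, k = 3), ward.D2 = cutree(hc_d2, k = 3))
print(agree)
```

Whether the two linkages produce the same groups depends on the data, so the cross-table is worth checking before swapping one for the other.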

3. Plot the dendrogram

plot(hc)


🗿 How do we decide the number of clusters from the dendrogram?
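
One common heuristic, shown here as a sketch and not as the only answer, is to look for the largest vertical gap in the dendrogram, that is, the largest jump between successive merge heights. The example uses simulated data with three well-separated blobs. The names `centers`, `hc_sim`, `k_vals`, `gap` and `best_k` are hypothetical.

```r
set.seed(2)                                    # arbitrary seed
# Three simulated blobs of 20 points each, centered far apart
centers = rbind(c(0, 0), c(10, 0), c(0, 10))
x = do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(20, mean = centers[i, 1]), rnorm(20, mean = centers[i, 2]))
}))

hc_sim = hclust(dist(x), method = "ward.D")

h = hc_sim$height          # merge heights, in increasing order
n = nrow(x)                # n observations give n - 1 merges

# Cutting into k clusters means cutting between h[n - k] and h[n - k + 1];
# the wider that gap, the more clearly separated the k clusters are.
k_vals = 2:8
gap = h[n - k_vals + 1] - h[n - k_vals]
names(gap) = k_vals
print(round(gap, 1))

best_k = k_vals[which.max(gap)]
print(best_k)
print(table(cutree(hc_sim, k = best_k)))
```

On this simulated data the widest gap falls at three clusters, matching how the data were generated. On real data the gaps are rarely this clear cut, and the choice of k usually also depends on whether the resulting groups are interpretable and actionable.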

4. Cut the tree into groups

kg = cutree(hc, k=5)
table(kg)
kg
   1    2    3    4    5 
 776  519  494  868 1342 



【D】 Examine cluster characteristics

sapply(split(A,kg), colMeans) %>% round(2) 
                       1         2         3        4        5
Balance         57866.90 110669.27 198191.57 52335.91 36255.91
QualMiles           0.64   1065.98     30.35     4.85     2.51
BonusMiles      10360.12  22881.76  55795.86 20788.77  2264.79
BonusTrans         10.82     18.23     19.66    17.09     2.97
FlightMiles        83.18   2613.42    327.68   111.57   119.32
FlightTrans         0.30      7.40      1.07     0.34     0.44
DaysSinceEnroll  6235.36   4402.41   5615.71  2840.82  3060.08
par(cex=0.8)
split(AN,kg) %>% sapply(colMeans) %>% barplot(beside=TRUE,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))
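
As an alternative layout for the same kind of profile, the sketch below uses base R `aggregate()` to produce one row per cluster and appends the cluster sizes. The data frame `toy` and the labels `grp` are made-up values for illustration only.

```r
# Invented values and cluster labels
toy = data.frame(Balance    = c(100, 200, 300, 400),
                 BonusTrans = c(1, 3, 10, 20))
grp = c(1, 1, 2, 2)

# One row per cluster, one column per variable
means = aggregate(toy, by = list(cluster = grp), FUN = mean)

# Append the number of members in each cluster
means$size = as.vector(table(grp))
print(means)

# The same means via the split/sapply idiom, transposed to rows
means_alt = t(sapply(split(toy, grp), colMeans))
print(means_alt)
```

Keeping the size column next to the means helps when naming clusters, since a striking profile matters less if it describes only a handful of customers.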



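🗿 Discussion questions:
  ■ Give each of the five clusters a name
  ■ Design a marketing strategy for each of the five clusters
  ■ Is the statistically best clustering also the best clustering in practice?
  ■ Besides within-cluster and between-cluster distances, what other factors usually need to be considered when clustering in practice?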