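💡 Learning points:
■ Basic concepts of cluster analysis
■ Distance matrix
■ Hierarchical cluster analysis
■ Reading a dendrogram
■ Deciding the number of clusters from the dendrogram
■ Using group means to examine the attributes of each cluster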
pacman::p_load(dplyr, ggplot2)
A = read.csv('data/AirlinesCluster.csv')
summary(A)
    Balance          QualMiles       BonusMiles       BonusTrans
 Min.   :      0   Min.   :    0   Min.   :     0   Min.   : 0.0
 1st Qu.:  18528   1st Qu.:    0   1st Qu.:  1250   1st Qu.: 3.0
 Median :  43097   Median :    0   Median :  7171   Median :12.0
 Mean   :  73601   Mean   :  144   Mean   : 17145   Mean   :11.6
 3rd Qu.:  92404   3rd Qu.:    0   3rd Qu.: 23800   3rd Qu.:17.0
 Max.   :1704838   Max.   :11148   Max.   :263685   Max.   :86.0
  FlightMiles     FlightTrans     DaysSinceEnroll
 Min.   :    0   Min.   : 0.00   Min.   :   2
 1st Qu.:    0   1st Qu.: 0.00   1st Qu.:2330
 Median :    0   Median : 0.00   Median :4096
 Mean   :  460   Mean   : 1.37   Mean   :4119
 3rd Qu.:  311   3rd Qu.: 1.00   3rd Qu.:5790
 Max.   :30817   Max.   :53.00   Max.   :8296
🗿 Why do we need to normalize (standardize) the data?
colMeans(A) %>% sort
    FlightTrans      BonusTrans       QualMiles     FlightMiles DaysSinceEnroll
         1.3736         11.6019        144.1145        460.0558       4118.5594
     BonusMiles         Balance
     17144.8462      73601.3276
AN = scale(A) %>% data.frame
sapply(AN, mean)
                  Balance                 QualMiles                BonusMiles
  0.000000000000000027654   0.000000000000000026507  -0.000000000000000042736
               BonusTrans               FlightMiles               FlightTrans
 -0.000000000000000071911   0.000000000000000014818   0.000000000000000010741
          DaysSinceEnroll
  0.000000000000000055637
sapply(AN, sd)
        Balance       QualMiles      BonusMiles      BonusTrans     FlightMiles
              1               1               1               1               1
    FlightTrans DaysSinceEnroll
              1               1
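The column means above differ by several orders of magnitude, so an unscaled Euclidean distance would be driven almost entirely by the largest columns. The sketch below illustrates this on a hypothetical three-row toy table (the values are made up for illustration and are not taken from `AirlinesCluster.csv`); only base R is assumed.

```r
# Hypothetical toy data: two variables on very different scales
toy = data.frame(
  Balance     = c(20000, 21000, 90000),
  FlightTrans = c(0, 20, 1)
)

# Distances on the raw values: the Balance column dominates
d_raw = as.matrix(dist(toy))

# Distances after standardizing each column to mean 0 and sd 1
d_std = as.matrix(dist(scale(toy)))

round(d_raw, 1)
round(d_std, 2)
```

On the raw scale, row 1 looks far closer to row 2 than to row 3, because a gap of 20 flight transactions is negligible next to a gap of 70,000 miles. After `scale()`, the two gaps carry comparable weight.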
1. Distance matrix
d = dist(AN, method="euclidean")
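As a rough illustration of what `dist()` returns, the sketch below computes the Euclidean distances for a hypothetical three-point matrix (not part of the airline data) and checks one pair by hand.

```r
# Hypothetical 3-point example, not taken from the airline data
m = matrix(c(0, 0,
             3, 4,
             6, 8), ncol = 2, byrow = TRUE)

d_toy = dist(m, method = "euclidean")
d_toy

# Manual check for the pair (1, 2): square root of the summed squared differences
manual_12 = sqrt(sum((m[1, ] - m[2, ])^2))
manual_12

# A dist object stores only the n*(n-1)/2 pairwise distances (lower triangle)
length(d_toy)
```

The number of stored distances grows quadratically with the number of rows: for the 3,999 members in this data set, `d` holds 3999 × 3998 / 2 = 7,994,001 values.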
2. Hierarchical cluster analysis
hc = hclust(d, method='ward.D')
3. Plot the dendrogram
plot(hc)
🗿 How do we decide the number of clusters from the dendrogram?
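One common heuristic (offered here as a sketch, not as the rule used in this notebook) is to look for a large jump in merge height: cutting the tree inside a tall gap separates groups that were merged only at a much greater distance. The example below uses the built-in `USArrests` data set as a stand-in so that it runs on its own; `k = 4` is an arbitrary choice for illustration.

```r
# Stand-in data (assumption): the built-in USArrests data set, standardized
X = scale(USArrests)
h = hclust(dist(X, method = "euclidean"), method = "ward.D")

# Heights of the last 6 merges, largest first
top = rev(tail(h$height, 6))

# Drop in height between consecutive merges; cutting inside the j-th gap
# leaves j + 1 clusters, so the gaps correspond to k = 2, ..., 6
gaps = -diff(top)
names(gaps) = paste0("k=", 2:6)
round(gaps, 2)

# Draw the tree and outline a candidate cut (k = 4 is arbitrary here)
plot(h, labels = FALSE)
rect.hclust(h, k = 4, border = "red")
grp = cutree(h, k = 4)
table(grp)
```

The height gaps only describe the geometry of the tree. Whether the resulting groups are usable still depends on their sizes and on how interpretable they are.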
4. Cut the tree into groups
kg = cutree(hc, k=5)
table(kg)
kg
   1    2    3    4    5
 776  519  494  868 1342
sapply(split(A,kg), colMeans) %>% round(2)
                        1         2         3        4        5
Balance          57866.90 110669.27 198191.57 52335.91 36255.91
QualMiles            0.64   1065.98     30.35     4.85     2.51
BonusMiles       10360.12  22881.76  55795.86 20788.77  2264.79
BonusTrans          10.82     18.23     19.66    17.09     2.97
FlightMiles         83.18   2613.42    327.68   111.57   119.32
FlightTrans          0.30      7.40      1.07     0.34     0.44
DaysSinceEnroll   6235.36   4402.41   5615.71  2840.82  3060.08
par(cex=0.8)
split(AN,kg) %>% sapply(colMeans) %>% barplot(beside=TRUE,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))
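🗿 Discussion questions:
■ Give each of the five clusters a name
■ Design a marketing strategy for each of the five clusters
■ Is the statistically best clustering also the best clustering in practice?
■ Besides within-cluster and between-cluster distances, what other factors usually need to be considered when clustering in practice?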