批發商資料集
W = read.csv('data/wholesales.csv')
W$Channel = factor( paste0("Ch",W$Channel) )
W$Region = factor( paste0("Reg",W$Region) )
W[3:8] = lapply(W[3:8], log, base=10)
lm
: 方法md
: 模型Milk
: 目標變數 Response, Dependent Variable (DV)Grocery
: 預測變數 Predictor, Independent Variable (IV)W
: 資料 [1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "xlevels" "call" "terms" "model"
md$coefficient
: \(b_0\), \(b_1\) - 係數估計值md$fitted.value
: \(\hat{y}_i\) - 目標變數估計值md$residuals
: \(e_i = y - \hat{y}\) - 殘差[1] -0.0000000000000015543 0.0000000000000257572
[1] -0.0000000000000255906 0.0000000000000015543
\[ Milk_i = b_0 + b_1 Grocery_i\]
par(cex=0.8, mar=c(4,4,1,1))
plot(W$Grocery, W$Milk, pch=20, col="#80808080")
abline(b0, b1, col='red')
ggplot(aes(Grocery, Milk), data=W) +
geom_point(alpha=0.4, size=0.8) +
geom_smooth(method="lm", level=0.95, col="red", lwd=0.2) -> p
ggplotly(p)
🗿 : 為什麼大部分的資料點都沒有落在95%信心區間呢?
💡 : 模型估計的是\(y\)的平均值(\(\bar{Y}|x\))、而不是\(y\)本身!
Call:
lm(formula = Milk ~ Grocery, data = W)
Residuals:
Min 1Q Median 3Q Max
-1.1371 -0.1875 0.0219 0.1593 1.8732
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8318 0.1115 7.46 0.00000000000047 ***
Grocery 0.7352 0.0301 24.39 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.306 on 438 degrees of freedom
Multiple R-squared: 0.576, Adjusted R-squared: 0.575
F-statistic: 595 on 1 and 438 DF, p-value: <0.0000000000000002
💡 : \(b_1\) 係數代表平均而言, \(x\) 每增加一單位時,\(y\) 會增加的數量
參數:平均值向量與變異(共變)矩陣
[,1] [,2]
[1,] 1.0 0.5
[2,] 0.5 1.0
隨機抽樣 A Random Sample of 20 Points
使用樣本建立模型: Models with 1 and 2 Parameters
Model-0: \(y_i = b_0 + \epsilon_i\)
Model-1: \(y_i = b_0 + b_1 x_i + \epsilon_i\)
m0 = lm(y ~ 1, A)
m2 = lm(y ~ x, A)
plot(A,pch=20,xlim=c(0,5.2),ylim=c(0,5.2))
abline(h=coef(m0), lty=3)
abline(coef(m2)[1], coef(m2)[2], col="blue")
SST (Total Sum of Sq.) = \(\Sigma_i(y_i - \bar{y}_i)^2\)
SSE (Error Sum of Sq.) = \(\Sigma_i(y_i - \hat{y_i})^2 = \Sigma_i e_i^2\)
SSR (Regression Sum of Sq.) = SST - SSE
\(R^2\) (判定係數 Determination Coef.) = SSR/SST,模型所能解釋目標變數的變異的能力
SST = sum( (y - mean(y))^2 ) # Total Sum of Sq.
SSE = sum( (y - md$fitted.values)^2 ) # Error Sum of Sq.
SSR = SST - SSE # Regression Sum of sql
R2 = SSR/SST # The Propotion of Variance explained by Regressor
c(SST=SST, SSE=SSE, SSR=SSR, R2=R2)
SST SSE SSR R2
96.82289 41.06697 55.75591 0.57585
[1] -0.00000000000000012156
lm()
裡面類別和連續預測變數的寫法是一樣的,做模型時,一個有k個分類的類別變數會產生(k-1)個虛擬變數(Dummy Variables) …
Call:
lm(formula = Grocery ~ Region, data = W)
Residuals:
Min 1Q Median 3Q Max
-3.181 -0.327 0.008 0.356 1.310
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.6297 0.0552 65.79 <0.0000000000000002 ***
RegionReg2 0.1499 0.0896 1.67 0.095 .
RegionReg3 0.0282 0.0615 0.46 0.647
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.484 on 437 degrees of freedom
Multiple R-squared: 0.00707, Adjusted R-squared: 0.00252
F-statistic: 1.56 on 2 and 437 DF, p-value: 0.212
💡 : The idea of Dummy Variables
\(\hat{y}_i = b_0 + b_1 Reg2 + b_2 Reg3\)
\(\hat{y}_i = 3.6297 + 0.1499 \times Reg2 + 0.0282 \times Reg3\)
\(\hat{y}_{reg1} = 3.6297\)
\(\hat{y}_{reg2} = 3.6297 + 0.1499\)
\(\hat{y}_{reg3} = 3.6297 + 0.0282\)
Df Sum Sq Mean Sq F value Pr(>F)
Region 2 0.7 0.365 1.56 0.21
Residuals 437 102.4 0.234
\(p=0.21 > 0.05\) 不能拒絕各區域(Region
)雜貨購貨量(Grocery
)的平均值之間沒有差異的虛無假設
💡 : 其實做Simple Regression和ANOVA檢定之前分別都有一些前提假設和殘差分析需要確認, 詳情請看教科書
當某些前提假設不能成立時,我們可以借助無母數(non-parametric)的檢定方法
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
ggpubr
套件 裡面有自動化工具可以做整體和個別(事後Post-Hoc)檢定並畫出圖形
compare <- list( c("0.5", "1"), c("1", "2"), c("0.5", "2") )
ggboxplot(
df, x = "dose", y = "len",
color = "dose", palette =c("#00AFBB", "#E7B800", "#FC4E07"),
add = "jitter", shape = "dose") +
stat_compare_means(comparisons = compare) +
stat_compare_means(label.y = 50)