pacman::p_load(dplyr, ggplot2, plotly, ggpubr)

Wholesale customers dataset

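# Load the data and relabel the coded columns as factors
W <- read.csv('data/wholesales.csv')
W$Channel <- factor(paste0("Ch", W$Channel))
W$Region <- factor(paste0("Reg", W$Region))
# Put columns 3 to 8 (the numeric purchase columns) on a log10 scale
W[3:8] <- lapply(W[3:8], log, base = 10)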


[A] Linear regression in R

 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        
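
The list above matches what `names()` returns for a fitted `lm` object. The call that produced it is not shown in this section, so the following is only a minimal sketch that assumes the `Milk ~ Grocery` model summarised in [D]. `fit_milk` is a hypothetical helper name, and the `exists("W")` guard simply keeps the block runnable before the data has been loaded.

```r
# Assumption: the model is the Milk ~ Grocery fit shown in section [D].
# fit_milk is a hypothetical helper, not part of the original notebook.
fit_milk <- function(data) lm(Milk ~ Grocery, data = data)

if (exists("W")) {
  md <- fit_milk(W)
  print(names(md))         # component names of the fitted "lm" object
  print(coef(md))          # intercept b0 and slope b1
  print(head(fitted(md)))  # first few fitted values
}
```

Accessor functions such as `coef()`, `fitted()` and `resid()` are usually preferable to reaching into the list with `md$coefficients`.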


[B] Theoretical vs. empirical model

[1] -0.0000000000000015543  0.0000000000000257572
[1] -0.0000000000000255906  0.0000000000000015543
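
The two ranges above are on the order of 1e-14, which is what floating-point rounding looks like when two computations of the same quantity are compared. The code behind them is not included here. One plausible reading, shown as a sketch under that assumption, is a comparison of `lm()` against the closed-form least-squares estimates \(b_1 = Cov(x,y)/Var(x)\) and \(b_0 = \bar{y} - b_1\bar{x}\). `ols_by_hand` is a hypothetical helper name.

```r
# Hypothetical helper: closed-form simple OLS (an assumption about what [B] computed).
ols_by_hand <- function(x, y) {
  b1 <- cov(x, y) / var(x)        # slope
  b0 <- mean(y) - b1 * mean(x)    # intercept
  y_hat <- b0 + b1 * x
  list(coef = c(b0 = b0, b1 = b1), fitted = y_hat, resid = y - y_hat)
}

if (exists("W")) {
  md <- lm(Milk ~ Grocery, data = W)
  hand <- ols_by_hand(W$Grocery, W$Milk)
  print(range(fitted(md) - hand$fitted))  # differences in fitted values
  print(range(resid(md) - hand$resid))    # differences in residuals
}
```

If the reading is right, both ranges should be numerically indistinguishable from zero, and the second should mirror the first with the sign flipped, since residual = observed minus fitted.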


[C] Plotting the regression line

\[ \widehat{Milk}_i = b_0 + b_1 \, Grocery_i \]
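
The figure for this section did not survive, so here is a minimal sketch of how such a plot could be drawn with ggplot2, which the notebook already loads. The styling choices and the helper name `plot_fit` are assumptions, not the original chart.

```r
library(ggplot2)

# Hypothetical helper: scatter plot with the fitted line and its 95% confidence band.
plot_fit <- function(data) {
  ggplot(data, aes(x = Grocery, y = Milk)) +
    geom_point(alpha = 0.4) +
    geom_smooth(method = "lm", formula = y ~ x, level = 0.95)
}

if (exists("W")) print(plot_fit(W))
```

The shaded band drawn by `geom_smooth()` is the confidence interval for the fitted mean, not a band expected to contain the individual points.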


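🗿 : Why do most of the data points fall outside the 95% confidence interval?

💡 : The model estimates the mean of \(y\) given \(x\) (\(\bar{y}|x\)), not \(y\) itself! The confidence band therefore describes uncertainty about that conditional mean, not the spread of individual observations.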
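
To make the distinction concrete, the sketch below compares the share of observations inside the confidence interval (for the mean) with the share inside the prediction interval (for individual values), using `predict()` on the fitted model. `interval_coverage` is a hypothetical helper name, and the model is assumed to be `Milk ~ Grocery` as in [D].

```r
# Hypothetical helper: share of observed Milk values inside each kind of interval.
interval_coverage <- function(data, level = 0.95) {
  md <- lm(Milk ~ Grocery, data = data)
  ci <- predict(md, interval = "confidence", level = level)
  # predict() warns that in-sample prediction intervals refer to future responses
  pr <- suppressWarnings(predict(md, interval = "prediction", level = level))
  share_inside <- function(b) mean(data$Milk >= b[, "lwr"] & data$Milk <= b[, "upr"])
  c(confidence = share_inside(ci), prediction = share_inside(pr))
}

if (exists("W")) print(interval_coverage(W))
```

The prediction interval is the one that should cover roughly 95% of the points; the confidence interval narrows as the sample grows and will usually cover far fewer.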



[D] The model summary function


Call:
lm(formula = Milk ~ Grocery, data = W)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1371 -0.1875  0.0219  0.1593  1.8732 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   0.8318     0.1115    7.46     0.00000000000047 ***
Grocery       0.7352     0.0301   24.39 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.306 on 438 degrees of freedom
Multiple R-squared:  0.576, Adjusted R-squared:  0.575 
F-statistic:  595 on 1 and 438 DF,  p-value: <0.0000000000000002

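💡 : The coefficient \(b_1\) is the average change in \(y\) when \(x\) increases by one unit. Because both variables were log10-transformed above, one unit of \(x\) here corresponds to a tenfold increase in Grocery.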
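
The reading of the slope can be checked directly: predicting at two values of Grocery one unit apart should give a difference equal to \(b_1\). This is a small sketch, with `unit_effect` as a hypothetical helper name and the starting value `x0 = 3` chosen arbitrarily.

```r
# Hypothetical helper: change in predicted Milk when Grocery goes from x0 to x0 + 1.
unit_effect <- function(md, x0) {
  nd <- data.frame(Grocery = c(x0, x0 + 1))
  unname(diff(predict(md, newdata = nd)))
}

if (exists("W")) {
  md <- lm(Milk ~ Grocery, data = W)
  print(coef(md)["Grocery"])       # the slope b1
  print(unit_effect(md, x0 = 3))   # same number, whatever x0 is
  print(10^coef(md)["Grocery"])    # factor by which Milk changes on the original scale
}
```

The last line converts the slope back from the log10 scale: a tenfold increase in Grocery multiplies the predicted Milk by \(10^{b_1}\).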


[E] Decomposition of Variance

     SST      SSE      SSR       R2 
96.82289 41.06697 55.75591  0.57585 
[1] -0.00000000000000012156


💡 : Because Cov(\(\hat{y}\), \(e\)) = 0, we have Var(\(y\)) = Var(\(\hat{y}+e\)) = Var(\(\hat{y}\)) + Var(\(e\))
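
The numbers above can be reproduced along the following lines. This is a sketch rather than the original chunk: `decompose_variance` is a hypothetical helper name, and the labels follow the output shown, where SSE is the residual sum of squares and SSR the regression sum of squares.

```r
# Hypothetical helper: split the total sum of squares of a fitted lm.
decompose_variance <- function(md) {
  y <- fitted(md) + resid(md)             # recover the observed response
  SST <- sum((y - mean(y))^2)             # total
  SSE <- sum(resid(md)^2)                 # residual (unexplained)
  SSR <- sum((fitted(md) - mean(y))^2)    # regression (explained)
  c(SST = SST, SSE = SSE, SSR = SSR, R2 = SSR / SST)
}

if (exists("W")) {
  md <- lm(Milk ~ Grocery, data = W)
  print(decompose_variance(md))
  print(cov(fitted(md), resid(md)))       # zero up to rounding error
}
```

Since SST = SSE + SSR, the ratio SSR / SST is the same R-squared that `summary()` reports.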



[F] Analysis of Variance (ANOVA)

When the predictor is a categorical variable …


Call:
lm(formula = Grocery ~ Region, data = W)

Residuals:
   Min     1Q Median     3Q    Max 
-3.181 -0.327  0.008  0.356  1.310 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)   3.6297     0.0552   65.79 <0.0000000000000002 ***
RegionReg2    0.1499     0.0896    1.67               0.095 .  
RegionReg3    0.0282     0.0615    0.46               0.647    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.484 on 437 degrees of freedom
Multiple R-squared:  0.00707,   Adjusted R-squared:  0.00252 
F-statistic: 1.56 on 2 and 437 DF,  p-value: 0.212


💡 : The idea of Dummy Variables
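
A minimal sketch of what the dummy coding looks like, using `model.matrix()` to show the columns that `lm()` builds from the `Region` factor. `region_dummies` is a hypothetical helper name.

```r
# Hypothetical helper: the design matrix lm() uses for a Region factor.
region_dummies <- function(data) model.matrix(~ Region, data = data)

if (exists("W")) {
  print(unique(region_dummies(W)))              # one distinct row per region
  md_r <- lm(Grocery ~ Region, data = W)
  print(coef(md_r))                             # baseline mean and differences from it
  print(tapply(W$Grocery, W$Region, mean))      # group means, for comparison
}
```

With the default treatment coding, the first level (`Reg1`) is the baseline: the intercept is its mean, and each `RegionRegk` coefficient is the difference between that region's mean and the baseline.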


ANOVA test
             Df Sum Sq Mean Sq F value Pr(>F)
Region        2    0.7   0.365    1.56   0.21
Residuals   437  102.4   0.234               

Since \(p = 0.21 > 0.05\), we cannot reject the null hypothesis that mean grocery purchases (Grocery) are the same across regions (Region).
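
The table above has the layout printed by `summary(aov(...))`. The call itself is not shown, so the following is a sketch under that assumption, with `region_anova` as a hypothetical helper name that pulls out the F statistic and p-value.

```r
# Hypothetical helper: one-way ANOVA of Grocery by Region.
region_anova <- function(data) {
  tab <- summary(aov(Grocery ~ Region, data = data))[[1]]
  # The first row is the Region term; the second row is the residuals.
  c(F = tab[["F value"]][1], p = tab[["Pr(>F)"]][1])
}

if (exists("W")) print(region_anova(W))
```

The F statistic and p-value are the same ones reported on the last line of `summary(lm(Grocery ~ Region, data = W))`, because the one-way ANOVA and the dummy-variable regression are the same model.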

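💡 : Both simple regression and the ANOVA test rest on assumptions that should be checked beforehand, including an analysis of the residuals; see a textbook for details.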
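
As a starting point only, here is a sketch of two common checks from base R: a Shapiro-Wilk test for normality of the residuals and Bartlett's test for equal variances across groups. Which diagnostics the course or textbook actually intends is not stated, so these choices and the helper name `residual_checks` are assumptions.

```r
# Hypothetical helper: quick numeric checks on the residuals of a fitted lm.
residual_checks <- function(md) {
  e <- resid(md)
  c(mean_resid = mean(e),                    # should be zero up to rounding
    shapiro_p = shapiro.test(e)$p.value)     # small p suggests non-normal residuals
}

if (exists("W")) {
  md <- lm(Milk ~ Grocery, data = W)
  print(residual_checks(md))
  # Equal-variance check across regions for the ANOVA in [F]
  print(bartlett.test(Grocery ~ Region, data = W)$p.value)
}
```

Formal tests like these are sensitive to sample size, so they are usually read alongside residual plots such as `plot(md, which = 1:2)`.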