UNIT13A：類別模型、預測機率與商業決策

pacman::p_load(caTools, ggplot2, dplyr)

【A】整理資料、建立模型

資料：糖尿病保戶醫療紀錄

D = read.csv("data/quality.csv")  # Read in dataset
summary(D)

    MemberID     InpatientDays      ERVisits     OfficeVisits    Narcotics    
 Min.   :  1.0   Min.   : 0.00   Min.   : 0.0   Min.   : 0.0   Min.   : 0.00  
 1st Qu.: 33.5   1st Qu.: 0.00   1st Qu.: 0.0   1st Qu.: 7.0   1st Qu.: 0.00  
 Median : 66.0   Median : 0.00   Median : 1.0   Median :12.0   Median : 1.00  
 Mean   : 66.0   Mean   : 2.72   Mean   : 1.5   Mean   :13.2   Mean   : 4.57  
 3rd Qu.: 98.5   3rd Qu.: 3.00   3rd Qu.: 2.0   3rd Qu.:18.5   3rd Qu.: 3.00  
 Max.   :131.0   Max.   :30.00   Max.   :11.0   Max.   :46.0   Max.   :59.00  
 DaysSinceLastERVisit      Pain        TotalVisits   ProviderCount
 Min.   :  6          Min.   :  0.0   Min.   : 0.0   Min.   : 5   
 1st Qu.:207          1st Qu.:  1.0   1st Qu.: 8.0   1st Qu.:15   
 Median :641          Median :  8.0   Median :15.0   Median :20   
 Mean   :481          Mean   : 15.6   Mean   :17.4   Mean   :24   
 3rd Qu.:731          3rd Qu.: 23.0   3rd Qu.:22.5   3rd Qu.:30   
 Max.   :731          Max.   :104.0   Max.   :69.0   Max.   :82   
 MedicalClaims     ClaimLines    StartedOnCombination AcuteDrugGapSmall
 Min.   : 11.0   Min.   : 20.0   Mode :logical        Min.   : 0.00    
 1st Qu.: 25.5   1st Qu.: 83.5   FALSE:125            1st Qu.: 0.00    
 Median : 37.0   Median :120.0   TRUE :6              Median : 1.00    
 Mean   : 43.2   Mean   :142.9                        Mean   : 2.69    
 3rd Qu.: 49.5   3rd Qu.:185.0                        3rd Qu.: 3.00    
 Max.   :194.0   Max.   :577.0                        Max.   :71.00    
    PoorCare    
 Min.   :0.000  
 1st Qu.:0.000  
 Median :0.000  
 Mean   :0.252  
 3rd Qu.:0.500  
 Max.   :1.000

基礎機率

base = table(D$PoorCare) # the base probability

分割資料 (分割向量)

set.seed(88)
split = sample.split(D$PoorCare, SplitRatio = 0.75)  # split vector
table(split) %>% prop.table()

split
  FALSE    TRUE 
0.24427 0.75573

目標變數與分割向量的列聯表

table(D$PoorCare, split) %>% prop.table(2)

   split
      FALSE    TRUE
  0 0.75000 0.74747
  1 0.25000 0.25253

使用分割向量分割資料

TR = subset(D, split == TRUE)
TS = subset(D, split == FALSE)

建立模型

glm1 = glm(PoorCare ~ OfficeVisits + Narcotics, TR, family=binomial)
summary(glm1)


Call:
glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial, 
    data = TR)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0630  -0.6316  -0.5050  -0.0969   2.1669  

Coefficients:
             Estimate Std. Error z value   Pr(>|z|)    
(Intercept)   -2.6461     0.5236   -5.05 0.00000043 ***
OfficeVisits   0.0821     0.0305    2.69     0.0072 ** 
Narcotics      0.0763     0.0321    2.38     0.0173 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 111.888  on 98  degrees of freedom
Residual deviance:  89.127  on 96  degrees of freedom
AIC: 95.13

Number of Fisher Scoring iterations: 4

【B】模型、係數與效果

模型：

Pr[y = 1] = 1/(1+exp(-f(x)))
\(Logit = f(x) = b_0 + b_1 x_1 + b_2 x_2 \;\; (1)\)
\(Logit = f(x) = -2.6461 + 0.0821 \times OfficeVisits + 0.0763 \times Narcotics \;\; (2)\)

係數：

\(Odd_0 = Exp(b_0 + b_1 x_1)\;\;(3)\)
\(Odd_1 = Exp[b_0 + b_1(x_1+1)] = Exp(b_0 + b_1 x_1 + b_1) = Exp(b_0 + b_1 x_1) \times Exp(b_1) \;\;(4)\)
\(Odd_1 = Odd_0 \times Exp(b_1) \:\:(5)\)
\(\frac{Odd_1}{Odd_0} = Exp(b_1) \:\:(6)\)

機率和勝率之間的關係

pop = function(p, k) {o = p/(1-p);  o = k * o; o/(1+o)}
p0 = seq(0.1, 0.9, 0.1); k = 2
p1 = sapply(seq(0.1, 0.9, 0.1), pop, k)
data.frame(k, p0, p1, difference=p1-p0, multiplier=p1/p0) %>% round(2)

  k  p0   p1 difference multiplier
1 2 0.1 0.18       0.08       1.82
2 2 0.2 0.33       0.13       1.67
3 2 0.3 0.46       0.16       1.54
4 2 0.4 0.57       0.17       1.43
5 2 0.5 0.67       0.17       1.33
6 2 0.6 0.75       0.15       1.25
7 2 0.7 0.82       0.12       1.18
8 2 0.8 0.89       0.09       1.11
9 2 0.9 0.95       0.05       1.05

變數的邊際效果

df = data.frame(OfficeVisits = median(D$OfficeVisits), Narcotics=median(D$Narcotics))
predict(glm1, df, type="response")

      1 
0.17018

df = data.frame(OfficeVisits = median(D$OfficeVisits)+1, Narcotics=median(D$Narcotics))
predict(glm1, df, type="response")

      1 
0.18209

df = data.frame(OfficeVisits = median(D$OfficeVisits), Narcotics=median(D$Narcotics)+1)
predict(glm1, df, type="response")

      1 
0.18123

df = data.frame(OfficeVisits = median(D$OfficeVisits)+1, Narcotics=median(D$Narcotics)+1)
predict(glm1, df, type="response")

      1 
0.19373

💡 學習重點：
■ 係數的指數就是勝率比；也就是說，\(x_i\) 每增加一，勝率(\(Odd[y = 1]\))會變成原來的 \(Exp(b_i)\) 倍
■ 各預測變數的(勝率)效果是相乘，而不是相加
■ 機率和勝率之間的關係並不是線性的：
■ 邏輯式回歸裡面各預測變數的勝率效果是固定的
■ 但是他們的機率效果並不是固定的
■ 我們需先推算原先的機率，才能推算變數的機率效果

quantile(D$OfficeVisits)

  0%  25%  50%  75% 100% 
 0.0  7.0 12.0 18.5 46.0

quantile(D$Narcotics)

  0%  25%  50%  75% 100% 
   0    0    1    3   59

🗿 練習：
■ 當OfficeVisits和Narcotic分別等於他們的第一分位(Q1)時：
■ PoorCare = 1的機率是？
■ 兩個自變數的勝率效果分別是？
■ 兩個自變數的機率效果分別是？
■ 當OfficeVisits和Narcotic分別等於他們的第三分位(Q3)時：
■ PoorCare = 1的機率是？
■ 兩個自變數的勝率效果分別是？
■ 兩個自變數的機率效果分別是？
■ 比較以上兩個題目的答案，我們可以觀察到什麼？

UNIT13A：類別模型、預測機率與商業決策

案例一：預防性醫療

中山大學管理學院卓雍然

2019-05-26 18:14:41

【A】整理資料、建立模型

【B】模型、係數與效果

UNIT13A：類別模型、預測機率與商業決策

案例一：預防性醫療

中山大學管理學院 卓雍然

2019-05-26 18:14:41

【A】整理資料、建立模型

【B】模型、係數與效果

中山大學管理學院卓雍然