pacman::p_load(caTools, ggplot2, dplyr)
D = read.csv("data/quality.csv") # Read in dataset
summary(D)
MemberID InpatientDays ERVisits OfficeVisits Narcotics
Min. : 1.0 Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.00
1st Qu.: 33.5 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 7.0 1st Qu.: 0.00
Median : 66.0 Median : 0.00 Median : 1.0 Median :12.0 Median : 1.00
Mean : 66.0 Mean : 2.72 Mean : 1.5 Mean :13.2 Mean : 4.57
3rd Qu.: 98.5 3rd Qu.: 3.00 3rd Qu.: 2.0 3rd Qu.:18.5 3rd Qu.: 3.00
Max. :131.0 Max. :30.00 Max. :11.0 Max. :46.0 Max. :59.00
DaysSinceLastERVisit Pain TotalVisits ProviderCount
Min. : 6 Min. : 0.0 Min. : 0.0 Min. : 5
1st Qu.:207 1st Qu.: 1.0 1st Qu.: 8.0 1st Qu.:15
Median :641 Median : 8.0 Median :15.0 Median :20
Mean :481 Mean : 15.6 Mean :17.4 Mean :24
3rd Qu.:731 3rd Qu.: 23.0 3rd Qu.:22.5 3rd Qu.:30
Max. :731 Max. :104.0 Max. :69.0 Max. :82
MedicalClaims ClaimLines StartedOnCombination AcuteDrugGapSmall
Min. : 11.0 Min. : 20.0 Mode :logical Min. : 0.00
1st Qu.: 25.5 1st Qu.: 83.5 FALSE:125 1st Qu.: 0.00
Median : 37.0 Median :120.0 TRUE :6 Median : 1.00
Mean : 43.2 Mean :142.9 Mean : 2.69
3rd Qu.: 49.5 3rd Qu.:185.0 3rd Qu.: 3.00
Max. :194.0 Max. :577.0 Max. :71.00
PoorCare
Min. :0.000
1st Qu.:0.000
Median :0.000
Mean :0.252
3rd Qu.:0.500
Max. :1.000
base = table(D$PoorCare) # the base probability
set.seed(88)
split = sample.split(D$PoorCare, SplitRatio = 0.75) # split vector
table(split) %>% prop.table()
split
FALSE TRUE
0.24427 0.75573
table(D$PoorCare, split) %>% prop.table(2)
split
FALSE TRUE
0 0.75000 0.74747
1 0.25000 0.25253
TR = subset(D, split == TRUE)
TS = subset(D, split == FALSE)
glm1 = glm(PoorCare ~ OfficeVisits + Narcotics, TR, family=binomial)
summary(glm1)
Call:
glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial,
data = TR)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0630 -0.6316 -0.5050 -0.0969 2.1669
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.6461 0.5236 -5.05 0.00000043 ***
OfficeVisits 0.0821 0.0305 2.69 0.0072 **
Narcotics 0.0763 0.0321 2.38 0.0173 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 111.888 on 98 degrees of freedom
Residual deviance: 89.127 on 96 degrees of freedom
AIC: 95.13
Number of Fisher Scoring iterations: 4
模型:
Pr[y = 1] = 1/(1+exp(-f(x)))
\(Logit = f(x) = b_0 + b_1 x_1 + b_2 x_2 \;\; (1)\)
\(Logit = f(x) = -2.6461 + 0.0821 \times OfficeVisits + 0.0763 \times Narcotics \;\; (2)\)
係數:
\(Odd_0 = Exp(b_0 + b_1 x_1)\;\;(3)\)
\(Odd_1 = Exp[b_0 + b_1(x_1+1)] = Exp(b_0 + b_1 x_1 + b_1) = Exp(b_0 + b_1 x_1) \times Exp(b_1) \;\;(4)\)
\(Odd_1 = Odd_0 \times Exp(b_1) \:\:(5)\)
\(\frac{Odd_1}{Odd_0} = Exp(b_1) \:\:(6)\)
機率和勝率之間的關係
pop = function(p, k) {o = p/(1-p); o = k * o; o/(1+o)}
p0 = seq(0.1, 0.9, 0.1); k = 2
p1 = sapply(seq(0.1, 0.9, 0.1), pop, k)
data.frame(k, p0, p1, difference=p1-p0, multiplier=p1/p0) %>% round(2)
k p0 p1 difference multiplier
1 2 0.1 0.18 0.08 1.82
2 2 0.2 0.33 0.13 1.67
3 2 0.3 0.46 0.16 1.54
4 2 0.4 0.57 0.17 1.43
5 2 0.5 0.67 0.17 1.33
6 2 0.6 0.75 0.15 1.25
7 2 0.7 0.82 0.12 1.18
8 2 0.8 0.89 0.09 1.11
9 2 0.9 0.95 0.05 1.05
變數的邊際效果
df = data.frame(OfficeVisits = median(D$OfficeVisits), Narcotics=median(D$Narcotics))
predict(glm1, df, type="response")
1
0.17018
df = data.frame(OfficeVisits = median(D$OfficeVisits)+1, Narcotics=median(D$Narcotics))
predict(glm1, df, type="response")
1
0.18209
df = data.frame(OfficeVisits = median(D$OfficeVisits), Narcotics=median(D$Narcotics)+1)
predict(glm1, df, type="response")
1
0.18123
df = data.frame(OfficeVisits = median(D$OfficeVisits)+1, Narcotics=median(D$Narcotics)+1)
predict(glm1, df, type="response")
1
0.19373
💡 學習重點:
■ 係數的指數就是勝率比;也就是說,\(x_i\) 每增加一,勝率(\(Odd[y = 1]\))會變成原來的 \(Exp(b_i)\) 倍
■ 各預測變數的(勝率)效果是相乘,而不是相加
■ 機率和勝率之間的關係並不是線性的:
■ 邏輯式回歸裡面各預測變數的勝率效果是固定的
■ 但是他們的機率效果並不是固定的
■ 我們需先推算原先的機率,才能推算變數的機率效果
quantile(D$OfficeVisits)
0% 25% 50% 75% 100%
0.0 7.0 12.0 18.5 46.0
quantile(D$Narcotics)
0% 25% 50% 75% 100%
0 0 1 3 59
🗿 練習:
■ 當OfficeVisits
和Narcotic
分別等於他們的第一分位(Q1
)時:
■ PoorCare = 1
的機率是?
■ 兩個自變數的勝率效果分別是?
■ 兩個自變數的機率效果分別是?
■ 當OfficeVisits
和Narcotic
分別等於他們的第三分位(Q3
)時:
■ PoorCare = 1
的機率是?
■ 兩個自變數的勝率效果分別是?
■ 兩個自變數的機率效果分別是?
■ 比較以上兩個題目的答案,我們可以觀察到什麼?