第六單元：基礎統計 with R

單元摘要：使用R複習並練習操作「基礎統計」

統計量 Statistics
- 基本的統計量(函數)
R內建的機率分布與其功能函數
- 機率分布與功能函數
常見的連續的機率分布 Continuous Probability Distributions
- 常態分布 Normal Distributions
- 卡方分布 Chi-Square Distribution
- F分布 F Distribution
- Student-\(t\) Distribution
- Uniform Distribution
常見的離散的機率分布 Discrete Probability Distributions
- 二項分布 Binomial Dist.
- 超幾何分布 Hypergeometric Dist.
- 布瓦松分布 Poisson Dist.
- 幾何分布 Geometric Dist.
- 負二項分布 Negative Binominal Dist.
- 多項分布 Multinomial Dist.
實證分布應用案例
- 老忠實資料集
理論分布的實證應用案例
- Applications of Poisson Dist.

pacman::p_load(magrittr, vcd)

【A】統計量 Statistics

統計量(statistics)就是一群數值的摘要(summary)，常用的統計量有：

平均值 Mean：mean()
中位數：median()
最大值：max()
最小值：min()
標準差 Std. Deviation：sd() 平均誤差
變異數 Variance：var() 平均誤差的平方
百分位：quantile()

🧙 : 在標準差和變異數的定義裡面，所謂誤差是指對什麼的誤差呢？

我們先隨機在0到100之間產生50個的亂數

X = runif(50, 0, 100)
X

 [1] 67.0389  1.4145 71.2882 49.0196 86.7699 91.6683 20.2658 74.3490  2.5239
[10] 87.5861 60.9242 43.2304 89.3586  8.6413 27.6843 26.5986 86.0888 39.4637
[19] 26.3993 42.4937 71.6155  3.2579 58.9096 99.9434 73.7333 79.2971 69.3592
[28] 69.0265 73.6900 89.8273  9.8699  2.3447  1.4221 16.2632 70.4671 61.6363
[37] 97.3049 31.8472 22.3043 83.5507 79.3931 80.4565 60.7377 36.7453 26.0460
[46] 37.0256 51.9973 20.9974 86.5447 73.9763

summary(X)           # 基本統計量

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.42   26.45   60.83   52.85   78.06   99.94

c(sd(X), var(X))     # 標準差、變異數

[1]  30.282 917.022

quantile(X, seq(0,1,1/10))  # 百分位數

     0%     10%     20%     30%     40%     50%     60%     70%     80%     90% 
 1.4145  8.1029 22.0429 30.5983 42.9357 60.8310 69.8024 73.8062 81.0754 87.7634 
   100% 
99.9434

除了這些統計量之外，我們也可以利用圖形來幫助我們(對於這一群數字)建立一個整體性的認識，直方圖可以顯示變數的值所出現的頻率(次數)或機率(密度)

par(mfrow=c(1,2), cex=0.8)
hist(X, 10, col='aliceblue', border='lightgray')         # frquency
hist(X, 10, col='aliceblue', border='lightgray', freq=F) # density

【B】R內建的機率分布與其功能函數

R有一群內建的機率分布

beta : dbeta()
binomial : dbinom()
Cauchy : dcauchy()
chi-squared : dchisq()
exponential : dexp()
F : df()
gamma : dgamma()
geometric : `dgeom
hypergeometric : dhyper()
log-normal : dlnorm()
multinomial : dmultinom()
negative binomial : dnbinom()
normal : dnorm()
Poisson : dpois()
Student’s t : dt()
uniform : dunif()
Weibull : dweibull()
…其他幾乎所有叫得出名字的機率分布都可以透過套件外掛

?distributions

每一種機率分布都有四個功能函數：

d<name>() : density or probability function
p<name>() : cumulative density function
q<name>() : quantile function
r<name>() : random sampling

B1. 機率分布的參數

par(mfrow=c(2,3), mar=c(2,4,3,2), cex=0.6)
for(n in c(10, 20)) for(p in c(0.2, 0.5, 0.8)) 
  dbinom(0:n, n, p) %>% 
  barplot(names=0:n, main=sprintf("Binom[n=%d, p=%.1f]", n, p), las=2)

B2. `d<name>: d for density` 機率密度函數 (PDF)

d<name>用來求取對應到某一數值的機率密度，給定機率分佈的種類和它所需要的「參數」，我們就能確定「值域」之中每一點的機率密度； d<name>通常用來畫機率密度函數

par(mfrow=c(2,2), mar=c(2,4,3,2), cex=0.7)
curve(dnorm(x,mean=50,sd=15),0,100,main="Normal")
curve(dunif(x,min=0,max=100),0,100,main="Uniform")
curve(dlnorm(x,meanlog=log(50),sdlog=1.5),0,100,main="Log-Normal")
curve(dexp(x,rate=0.05),0,100,main="Exponential")

傳回Normal[mean=40,sd=15]這個機率密度函數在x = 40這一點的值

dnorm(x=40, mean=50, sd=15)

[1] 0.021297

傳回Normal[mean=40,sd=15]這個機率密度函數在x = c(0,20,40,60,80,100)這些點的值

dnorm(seq(0,100,20), mean=50, sd=15)

[1] 0.00010282 0.00359940 0.02129653 0.02129653 0.00359940 0.00010282

畫出機率分布的形狀

par(mfrow=c(1,2), mar=c(2,2,1,2),cex=0.7)
x = seq(0,100,1)
y = dnorm(x, mean=50, sd=15) 
plot(x, y, type='l')
# curve() is a easier way to plot a function (or expresion) of x
curve(dnorm(x,mean=50,sd=15),0,100)

B3. `r<name>: r for random` 隨機抽樣

r<name>常用來從給定的機率分佈之中隨機抽出樣本

norm50 = rnorm(n=50, mean=50, sd=10)
unif50 = runif(n=50, min=0, max=100)

par(mfrow=c(1,2), mar=c(2,2,2,2),cex=0.6)
hist(norm50,breaks=10)
hist(unif50,breaks=10)

par(mfrow=c(3,2), mar=c(3,2,2,2),cex=0.6)
for(n in c(20, 200, 2000)) {
  hist(rnorm(n, mean=50, sd=10),breaks=10,main=paste("Normal",n))
  hist(runif(n, min=0, max=100),breaks=10,main=paste("Uniform",n))
  }

B4. `p<name>: p for probability` 累積機率函數 (CDF)

p<name>求取變數落在某數值區間的機率，給定機率分佈，我們常會想要知道目標變數會落在某一個數值空間的機率，如果隨機變數X ~ Normal[mu=50, sd=10]，pnorm(x=55, 50, 10)會傳回X < 55的機率；

mu=50; sd=10
pnorm(55, mu, sd)

[1] 0.69146

🧙 問題討論：
A. 如果隨機變數X ~ Normal[mu=100, sd=20]，請求出以下機率：
■ P[X < 110]
■ P[130 > X > 110]
■ P[X > 130]
B. 請畫出圖形來解釋你的運算

x1 = 90; x2 = 130
p1 = pnorm(x1,100,20); p2 = pnorm(x2,100,20)
c(p1, p2 - p1, 1 - p2)

[1] 0.308538 0.624655 0.066807

par(mfrow=c(1,2), mar=c(2,2,2,2),cex=0.7)
curve(dnorm(x,100,20),40,160,main="PDF")
abline(v=c(x1,x2),col='gray',lty=3)
x = seq(x1,x2,length=100)
polygon(c(x, x2, x1), c(dnorm(x,100,20), 0, 0), col="#00E9001F", border=NA)
#
curve(pnorm(x,100,20),40,160,main="CDF")
abline(v=c(x1,x2),col='gray',lty=3)
abline(h=c(p1,p2),col='pink')

B5. `q<name>: q for quantile` 百分位函數

q<name>通常用來求取對應到某一機率的臨界值，如果隨機變數X ~ Normal[mu, sd]，qnorm(p=0.8, mu, sd)會傳回Pr[X < x1] = 0.8的臨界值x1；

qnorm(p=0.8, 100, 20)

[1] 116.83

🧙 問題討論：
A. 如果隨機變數X ~ Normal[mu=100, sd=20]，請求X的90%信賴區間
B. 請畫出圖形來解釋你的運算

【C】實證分布的應用 Empirical Distribution

CASE STUDY: THE OLD FAITHFUL

【D】理論分布 Theoretical Distribution

D1. Binomial vs. Normal Distribution

💡 : When \(n\) is large, \(Binom[n, p]\) approaches \(Norm[\mu = n p, \sigma=\sqrt{n p (1-p)}]\)

💡 : \(X \sim Binom[n, p] \, \Rightarrow \, Exp(X) = n \cdot p \, , \, Var(X) = n \cdot p \cdot (1-p)\)

par(mfrow=c(1,1), cex=0.7)
n = 1000; p = 0.2
rbinom(500000, n, p) %>% hist(breaks=80, freq=F)
curve(dnorm(x, mean=n*p, sd=sqrt(n*p*(1-p))), col='red', lwd=2, add=T)

💡 : When \(n \times p\) is not large enough, Binomial degenerates into a discrete distribution.

par(mfrow=c(1,2), cex=0.7)
n = 10; p = 0.2
rbinom(100000, n, p) %>% hist(freq=F, breaks=(0:n)-0.01)
rnorm(100000, n*p, sqrt(n*p*(1-p))) %>% hist(freq=F)

D2. Binomial vs. Poisson Distribution

💡 : When \(n\) is large and \(p\) is small, \(Binom[n, p]\) approaches \(Pois[\lambda = n p]\)

par(mfrow=c(1,2), cex=0.7)
rbinom(100000, 1000, 0.002) %>% table %>% barplot(main="Boinomial")
rpois(100000, 2)  %>% table %>% barplot(main="Poisson")

💡 : \(X \sim Pois[\lambda] \, \Rightarrow \, E(X) = Var(X) = \lambda\)

💡 : \(X \sim Pois[\lambda_1], \, Y \sim Pois[\lambda_2] \, \Rightarrow \, X+Y \sim Pois[\lambda_1 + \lambda_2]\)

par(mfrow=c(1,2), cex=0.7)
(rpois(100000, 1) + rpois(100000, 2)) %>% table %>% barplot(main="Pois[1] + Pois[2]")
rpois(100000, 3)  %>% table %>% barplot(main="Pois[3]")

dpois(0:10, 3) + dpois(0:10, 4) - dpois(0:10, 7)

 [1]  0.0671908  0.2162406  0.3482258  0.3672794  0.2721720  0.1293956
 [7]  0.0056023 -0.0678584 -0.0925057 -0.0854730 -0.0648806

D3. Geometric Dist. - the Dist. of Waiting Time

💡 : We can simulating Geometrics Dis. by Binomial Dist.

par(mfrow=c(2,1), cex=0.7, mar=c(3,3,3,1) )
replicate(100000, which(rbinom(100, 1, .2) == 1)[1] - 1) %>% 
  table %>% barplot(main="Binomial Simulation")
rgeom(100000, 0.2) %>% table %>% barplot(main="Geometric")

💡 : \(X \sim Geom[p] \, \Rightarrow \, E[X] = \frac{1}{p}-1\)

If a machine has a probabaility of being broken in any given day, what is the probability that this machine is still workng in 30 days.

par(mfrow=c(2,1), cex=0.7, mar=c(3,3,3,1))
pgeom(0:29, prob=0.1) %>% 
  barplot(names=1:30, main="broke by the end of the #th day")
(1- pgeom(0:29, prob=0.1)) %>% 
  barplot(names=1:30, main="still working by the end of the #th day")

【E】理論分布的實證應用

E1. Death by Horse Kick

The Data

par(mfrow=c(1,1), cex=0.7)
(HK = HorseKicks)

nDeaths
  0   1   2   3   4 
109  65  22   3   1

Does it comply to Poisson Distribution?

fit = goodfit(HK, type = "poisson")
summary(fit)


     Goodness-of-fit test for poisson distribution

                     X^2 df P(> X^2)
Likelihood Ratio 0.86822  3  0.83309

What is the \(\lambda\)?

fit$par

$lambda
[1] 0.61

What is the probability of nDeath >= 2?

1 - ppois(1, fit$par$lambda)

[1] 0.12521

🧙 問題討論：
如果保險公司想要為國防部設計一個被馬踢死的保險：
■ 如果你只要只靠HorseKick這一份數據，每一軍團每年被馬踢死的次數超過5次的機率是多？
■ 如果我們將數據fit到理論分布上面，根據理論分佈，被馬踢死的次數超過5次的機率是多？
■ 以上哪一種做法才是比較合理的做法的？

What is the probability of nDeath >= 5?

1 - ppois(4, fit$par$lambda)

[1] 0.00042497

E2. “May” in Federalist Papers

(Fed = Federalist)

nMay
  0   1   2   3   4   5   6 
156  63  29   8   4   1   1

Does it comply to Poisson Distribution?

fit <- goodfit(Fed, type = "poisson")
summary(fit)


     Goodness-of-fit test for poisson distribution

                    X^2 df   P(> X^2)
Likelihood Ratio 25.243  5 0.00012505

Does it comply to Negtive Binomial Distribution?

fit = goodfit(Fed, type = "nbinomial")
summary(fit)


     Goodness-of-fit test for nbinomial distribution

                   X^2 df P(> X^2)
Likelihood Ratio 1.964  4  0.74238

What are the parameters?

fit$par

$size
[1] 1.1863

$prob
[1] 0.64376

How does the distribution looks like?

dnbinom(0:10, fit$par$size, fit$par$prob)

 [1] 0.593037483 0.250629638 0.097602955 0.036929829 0.013768785 0.005087807
 [7] 0.001868776 0.000683457 0.000249147 0.000090594 0.000032875

How is the probability that 2 <= nMay <= 6?

pnbinom(6, fit$par$size, fit$par$prob) - pnbinom(1, fit$par$size, fit$par$prob)

[1] 0.15526

🧙 問題討論：
在cup.csv檔案裡面是一千個信徒的擲筊次數，假定每一個人都是擲到3次成功才停止，請問：
■ 這個筊的成功機率大約是？
■ 請畫出用這個筊擲10次成功之前，失敗次數的機率分布？
■ 用這個筊擲15次還不能有5次成功的機率是？

第六單元：基礎統計 with R

中山大學管理學院卓雍然

2019-03-30 15:21:27

【A】統計量 Statistics

【B】R內建的機率分布與其功能函數

B1. 機率分布的參數

B2. `d<name>: d for density` 機率密度函數 (PDF)

B3. `r<name>: r for random` 隨機抽樣

B4. `p<name>: p for probability` 累積機率函數 (CDF)

B5. `q<name>: q for quantile` 百分位函數

【C】實證分布的應用 Empirical Distribution

【D】理論分布 Theoretical Distribution

D1. Binomial vs. Normal Distribution

D2. Binomial vs. Poisson Distribution

D3. Geometric Dist. - the Dist. of Waiting Time

【E】理論分布的實證應用

E1. Death by Horse Kick

E2. “May” in Federalist Papers

第六單元：基礎統計 with R

中山大學管理學院 卓雍然

2019-03-30 15:21:27

【A】統計量 Statistics

【B】R內建的機率分布與其功能函數

B1. 機率分布的參數

B2. d<name>: d for density 機率密度函數 (PDF)

B3. r<name>: r for random 隨機抽樣

B4. p<name>: p for probability 累積機率函數 (CDF)

B5. q<name>: q for quantile 百分位函數

【C】實證分布的應用 Empirical Distribution

【D】理論分布 Theoretical Distribution

D1. Binomial vs. Normal Distribution

D2. Binomial vs. Poisson Distribution

D3. Geometric Dist. - the Dist. of Waiting Time

【E】理論分布的實證應用

E1. Death by Horse Kick

E2. “May” in Federalist Papers

中山大學管理學院卓雍然

B2. `d<name>: d for density` 機率密度函數 (PDF)

B3. `r<name>: r for random` 隨機抽樣

B4. `p<name>: p for probability` 累積機率函數 (CDF)

B5. `q<name>: q for quantile` 百分位函數