單元摘要:使用R複習並練習操作「基礎統計」


pacman::p_load(magrittr, vcd)

【A】統計量 Statistics

統計量(statistics)就是一群數值的摘要(summary),常用的統計量有:

🧙 : 在標準差和變異數的定義裡面,所謂誤差是指對什麼的誤差呢?

我們先隨機在0到100之間產生50個的亂數

X = runif(50, 0, 100)
X
 [1] 67.0389  1.4145 71.2882 49.0196 86.7699 91.6683 20.2658 74.3490  2.5239
[10] 87.5861 60.9242 43.2304 89.3586  8.6413 27.6843 26.5986 86.0888 39.4637
[19] 26.3993 42.4937 71.6155  3.2579 58.9096 99.9434 73.7333 79.2971 69.3592
[28] 69.0265 73.6900 89.8273  9.8699  2.3447  1.4221 16.2632 70.4671 61.6363
[37] 97.3049 31.8472 22.3043 83.5507 79.3931 80.4565 60.7377 36.7453 26.0460
[46] 37.0256 51.9973 20.9974 86.5447 73.9763
summary(X)           # 基本統計量
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.42   26.45   60.83   52.85   78.06   99.94 
c(sd(X), var(X))     # 標準差、變異數
[1]  30.282 917.022
quantile(X, seq(0,1,1/10))  # 百分位數
     0%     10%     20%     30%     40%     50%     60%     70%     80%     90% 
 1.4145  8.1029 22.0429 30.5983 42.9357 60.8310 69.8024 73.8062 81.0754 87.7634 
   100% 
99.9434 

除了這些統計量之外,我們也可以利用圖形來幫助我們(對於這一群數字)建立一個整體性的認識,直方圖可以顯示變數的值所出現的頻率(次數)或機率(密度)

par(mfrow=c(1,2), cex=0.8)
hist(X, 10, col='aliceblue', border='lightgray')         # frquency
hist(X, 10, col='aliceblue', border='lightgray', freq=F) # density


【B】R內建的機率分布與其功能函數

R有一群內建的機率分布

?distributions

每一種機率分布都有四個功能函數:

B1. 機率分布的參數
par(mfrow=c(2,3), mar=c(2,4,3,2), cex=0.6)
for(n in c(10, 20)) for(p in c(0.2, 0.5, 0.8)) 
  dbinom(0:n, n, p) %>% 
  barplot(names=0:n, main=sprintf("Binom[n=%d, p=%.1f]", n, p), las=2)



B2. d<name>: d for density 機率密度函數 (PDF)

d<name>用來求取對應到某一數值的機率密度, 給定機率分佈的種類和它所需要的「參數」,我們就能確定「值域」之中每一點的機率密度; d<name>通常用來畫機率密度函數

par(mfrow=c(2,2), mar=c(2,4,3,2), cex=0.7)
curve(dnorm(x,mean=50,sd=15),0,100,main="Normal")
curve(dunif(x,min=0,max=100),0,100,main="Uniform")
curve(dlnorm(x,meanlog=log(50),sdlog=1.5),0,100,main="Log-Normal")
curve(dexp(x,rate=0.05),0,100,main="Exponential")


傳回Normal[mean=40,sd=15]這個機率密度函數在x = 40這一點的值

dnorm(x=40, mean=50, sd=15)
[1] 0.021297

傳回Normal[mean=40,sd=15]這個機率密度函數在x = c(0,20,40,60,80,100)這些點的值

dnorm(seq(0,100,20), mean=50, sd=15)
[1] 0.00010282 0.00359940 0.02129653 0.02129653 0.00359940 0.00010282

畫出機率分布的形狀

par(mfrow=c(1,2), mar=c(2,2,1,2),cex=0.7)
x = seq(0,100,1)
y = dnorm(x, mean=50, sd=15) 
plot(x, y, type='l')
# curve() is a easier way to plot a function (or expresion) of x
curve(dnorm(x,mean=50,sd=15),0,100)



B3. r<name>: r for random 隨機抽樣

r<name>常用來從給定的機率分佈之中隨機抽出樣本

norm50 = rnorm(n=50, mean=50, sd=10)
unif50 = runif(n=50, min=0, max=100)
par(mfrow=c(1,2), mar=c(2,2,2,2),cex=0.6)
hist(norm50,breaks=10)
hist(unif50,breaks=10)

par(mfrow=c(3,2), mar=c(3,2,2,2),cex=0.6)
for(n in c(20, 200, 2000)) {
  hist(rnorm(n, mean=50, sd=10),breaks=10,main=paste("Normal",n))
  hist(runif(n, min=0, max=100),breaks=10,main=paste("Uniform",n))
  }



B4. p<name>: p for probability 累積機率函數 (CDF)

p<name>求取變數落在某數值區間的機率,給定機率分佈,我們常會想要知道目標變數會落在某一個數值空間的機率,如果隨機變數X ~ Normal[mu=50, sd=10]pnorm(x=55, 50, 10)會傳回X < 55的機率;

mu=50; sd=10
pnorm(55, mu, sd)
[1] 0.69146

🧙 問題討論:
A. 如果隨機變數X ~ Normal[mu=100, sd=20],請求出以下機率:
  ■ P[X < 110]
  ■ P[130 > X > 110]
  ■ P[X > 130]
B. 請畫出圖形來解釋你的運算

x1 = 90; x2 = 130
p1 = pnorm(x1,100,20); p2 = pnorm(x2,100,20)
c(p1, p2 - p1, 1 - p2)
[1] 0.308538 0.624655 0.066807
par(mfrow=c(1,2), mar=c(2,2,2,2),cex=0.7)
curve(dnorm(x,100,20),40,160,main="PDF")
abline(v=c(x1,x2),col='gray',lty=3)
x = seq(x1,x2,length=100)
polygon(c(x, x2, x1), c(dnorm(x,100,20), 0, 0), col="#00E9001F", border=NA)
#
curve(pnorm(x,100,20),40,160,main="CDF")
abline(v=c(x1,x2),col='gray',lty=3)
abline(h=c(p1,p2),col='pink')



B5. q<name>: q for quantile 百分位函數

q<name>通常用來求取對應到某一機率的臨界值,如果隨機變數X ~ Normal[mu, sd]qnorm(p=0.8, mu, sd)會傳回Pr[X < x1] = 0.8的臨界值x1

qnorm(p=0.8, 100, 20)
[1] 116.83

🧙 問題討論:
A. 如果隨機變數X ~ Normal[mu=100, sd=20],請求X的90%信賴區間
B. 請畫出圖形來解釋你的運算




【C】實證分布的應用 Empirical Distribution

CASE STUDY: THE OLD FAITHFUL


【D】理論分布 Theoretical Distribution

D1. Binomial vs. Normal Distribution

💡 : When \(n\) is large, \(Binom[n, p]\) approaches \(Norm[\mu = n p, \sigma=\sqrt{n p (1-p)}]\)

💡 : \(X \sim Binom[n, p] \, \Rightarrow \, Exp(X) = n \cdot p \, , \, Var(X) = n \cdot p \cdot (1-p)\)

par(mfrow=c(1,1), cex=0.7)
n = 1000; p = 0.2
rbinom(500000, n, p) %>% hist(breaks=80, freq=F)
curve(dnorm(x, mean=n*p, sd=sqrt(n*p*(1-p))), col='red', lwd=2, add=T)

💡 : When \(n \times p\) is not large enough, Binomial degenerates into a discrete distribution.

par(mfrow=c(1,2), cex=0.7)
n = 10; p = 0.2
rbinom(100000, n, p) %>% hist(freq=F, breaks=(0:n)-0.01)
rnorm(100000, n*p, sqrt(n*p*(1-p))) %>% hist(freq=F)



D2. Binomial vs. Poisson Distribution

💡 : When \(n\) is large and \(p\) is small, \(Binom[n, p]\) approaches \(Pois[\lambda = n p]\)

par(mfrow=c(1,2), cex=0.7)
rbinom(100000, 1000, 0.002) %>% table %>% barplot(main="Boinomial")
rpois(100000, 2)  %>% table %>% barplot(main="Poisson")

💡 : \(X \sim Pois[\lambda] \, \Rightarrow \, E(X) = Var(X) = \lambda\)

💡 : \(X \sim Pois[\lambda_1], \, Y \sim Pois[\lambda_2] \, \Rightarrow \, X+Y \sim Pois[\lambda_1 + \lambda_2]\)

par(mfrow=c(1,2), cex=0.7)
(rpois(100000, 1) + rpois(100000, 2)) %>% table %>% barplot(main="Pois[1] + Pois[2]")
rpois(100000, 3)  %>% table %>% barplot(main="Pois[3]")

dpois(0:10, 3) + dpois(0:10, 4) - dpois(0:10, 7)
 [1]  0.0671908  0.2162406  0.3482258  0.3672794  0.2721720  0.1293956
 [7]  0.0056023 -0.0678584 -0.0925057 -0.0854730 -0.0648806



D3. Geometric Dist. - the Dist. of Waiting Time

💡 : We can simulating Geometrics Dis. by Binomial Dist.

par(mfrow=c(2,1), cex=0.7, mar=c(3,3,3,1) )
replicate(100000, which(rbinom(100, 1, .2) == 1)[1] - 1) %>% 
  table %>% barplot(main="Binomial Simulation")
rgeom(100000, 0.2) %>% table %>% barplot(main="Geometric")

💡 : \(X \sim Geom[p] \, \Rightarrow \, E[X] = \frac{1}{p}-1\)

If a machine has a probabaility of being broken in any given day, what is the probability that this machine is still workng in 30 days.

par(mfrow=c(2,1), cex=0.7, mar=c(3,3,3,1))
pgeom(0:29, prob=0.1) %>% 
  barplot(names=1:30, main="broke by the end of the #th day")
(1- pgeom(0:29, prob=0.1)) %>% 
  barplot(names=1:30, main="still working by the end of the #th day")



【E】理論分布的實證應用

E1. Death by Horse Kick

The Data

par(mfrow=c(1,1), cex=0.7)
(HK = HorseKicks)
nDeaths
  0   1   2   3   4 
109  65  22   3   1 

Does it comply to Poisson Distribution?

fit = goodfit(HK, type = "poisson")
summary(fit)

     Goodness-of-fit test for poisson distribution

                     X^2 df P(> X^2)
Likelihood Ratio 0.86822  3  0.83309

What is the \(\lambda\)?

fit$par
$lambda
[1] 0.61

What is the probability of nDeath >= 2?

1 - ppois(1, fit$par$lambda)
[1] 0.12521


🧙 問題討論:
如果保險公司想要為國防部設計一個被馬踢死的保險:
  ■ 如果你只要只靠HorseKick這一份數據,每一軍團每年被馬踢死的次數超過5次的機率是多?
  ■ 如果我們將數據fit到理論分布上面,根據理論分佈,被馬踢死的次數超過5次的機率是多?
  ■ 以上哪一種做法才是比較合理的做法的?

What is the probability of nDeath >= 5?

1 - ppois(4, fit$par$lambda)
[1] 0.00042497



E2. “May” in Federalist Papers
(Fed = Federalist)
nMay
  0   1   2   3   4   5   6 
156  63  29   8   4   1   1 

Does it comply to Poisson Distribution?

fit <- goodfit(Fed, type = "poisson")
summary(fit)

     Goodness-of-fit test for poisson distribution

                    X^2 df   P(> X^2)
Likelihood Ratio 25.243  5 0.00012505

Does it comply to Negtive Binomial Distribution?

fit = goodfit(Fed, type = "nbinomial")
summary(fit)

     Goodness-of-fit test for nbinomial distribution

                   X^2 df P(> X^2)
Likelihood Ratio 1.964  4  0.74238

What are the parameters?

fit$par
$size
[1] 1.1863

$prob
[1] 0.64376

How does the distribution looks like?

dnbinom(0:10, fit$par$size, fit$par$prob)
 [1] 0.593037483 0.250629638 0.097602955 0.036929829 0.013768785 0.005087807
 [7] 0.001868776 0.000683457 0.000249147 0.000090594 0.000032875

How is the probability that 2 <= nMay <= 6?

pnbinom(6, fit$par$size, fit$par$prob) - pnbinom(1, fit$par$size, fit$par$prob)
[1] 0.15526


🧙 問題討論:
cup.csv檔案裡面是一千個信徒的擲筊次數,假定每一個人都是擲到3次成功才停止,請問:
  ■ 這個筊的成功機率大約是?
  ■ 請畫出用這個筊擲10次成功之前,失敗次數的機率分布?
  ■ 用這個筊擲15次還不能有5次成功的機率是?