單元摘要:使用R複習並練習操作「基礎統計」
pacman::p_load(magrittr, vcd)
統計量(statistics)就是一群數值的摘要(summary),常用的統計量有:
mean()
median()
max()
min()
sd()
平均誤差var()
平均誤差的平方quantile()
🧙 : 在標準差和變異數的定義裡面,所謂誤差是指對什麼的誤差呢?
我們先隨機在0到100之間產生50個的亂數
X = runif(50, 0, 100)
X
[1] 67.0389 1.4145 71.2882 49.0196 86.7699 91.6683 20.2658 74.3490 2.5239
[10] 87.5861 60.9242 43.2304 89.3586 8.6413 27.6843 26.5986 86.0888 39.4637
[19] 26.3993 42.4937 71.6155 3.2579 58.9096 99.9434 73.7333 79.2971 69.3592
[28] 69.0265 73.6900 89.8273 9.8699 2.3447 1.4221 16.2632 70.4671 61.6363
[37] 97.3049 31.8472 22.3043 83.5507 79.3931 80.4565 60.7377 36.7453 26.0460
[46] 37.0256 51.9973 20.9974 86.5447 73.9763
summary(X) # 基本統計量
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.42 26.45 60.83 52.85 78.06 99.94
c(sd(X), var(X)) # 標準差、變異數
[1] 30.282 917.022
quantile(X, seq(0,1,1/10)) # 百分位數
0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
1.4145 8.1029 22.0429 30.5983 42.9357 60.8310 69.8024 73.8062 81.0754 87.7634
100%
99.9434
除了這些統計量之外,我們也可以利用圖形來幫助我們(對於這一群數字)建立一個整體性的認識,直方圖可以顯示變數的值所出現的頻率(次數)或機率(密度)
par(mfrow=c(1,2), cex=0.8)
hist(X, 10, col='aliceblue', border='lightgray') # frquency
hist(X, 10, col='aliceblue', border='lightgray', freq=F) # density
R有一群內建的機率分布
dbeta()
dbinom()
dcauchy()
dchisq()
dexp()
df()
dgamma()
dhyper()
dlnorm()
dmultinom()
dnbinom()
dnorm()
dpois()
dt()
dunif()
dweibull()
?distributions
每一種機率分布都有四個功能函數:
d<name>()
: density or probability functionp<name>()
: cumulative density functionq<name>()
: quantile functionr<name>()
: random samplingpar(mfrow=c(2,3), mar=c(2,4,3,2), cex=0.6)
for(n in c(10, 20)) for(p in c(0.2, 0.5, 0.8))
dbinom(0:n, n, p) %>%
barplot(names=0:n, main=sprintf("Binom[n=%d, p=%.1f]", n, p), las=2)
d<name>: d for density
機率密度函數 (PDF)d<name>
用來求取對應到某一數值的機率密度, 給定機率分佈的種類和它所需要的「參數」,我們就能確定「值域」之中每一點的機率密度; d<name>
通常用來畫機率密度函數
par(mfrow=c(2,2), mar=c(2,4,3,2), cex=0.7)
curve(dnorm(x,mean=50,sd=15),0,100,main="Normal")
curve(dunif(x,min=0,max=100),0,100,main="Uniform")
curve(dlnorm(x,meanlog=log(50),sdlog=1.5),0,100,main="Log-Normal")
curve(dexp(x,rate=0.05),0,100,main="Exponential")
傳回Normal[mean=40,sd=15]
這個機率密度函數在x = 40
這一點的值
dnorm(x=40, mean=50, sd=15)
[1] 0.021297
傳回Normal[mean=40,sd=15]
這個機率密度函數在x = c(0,20,40,60,80,100)
這些點的值
dnorm(seq(0,100,20), mean=50, sd=15)
[1] 0.00010282 0.00359940 0.02129653 0.02129653 0.00359940 0.00010282
畫出機率分布的形狀
par(mfrow=c(1,2), mar=c(2,2,1,2),cex=0.7)
x = seq(0,100,1)
y = dnorm(x, mean=50, sd=15)
plot(x, y, type='l')
# curve() is a easier way to plot a function (or expresion) of x
curve(dnorm(x,mean=50,sd=15),0,100)
r<name>: r for random
隨機抽樣r<name>
常用來從給定的機率分佈之中隨機抽出樣本
norm50 = rnorm(n=50, mean=50, sd=10)
unif50 = runif(n=50, min=0, max=100)
par(mfrow=c(1,2), mar=c(2,2,2,2),cex=0.6)
hist(norm50,breaks=10)
hist(unif50,breaks=10)
par(mfrow=c(3,2), mar=c(3,2,2,2),cex=0.6)
for(n in c(20, 200, 2000)) {
hist(rnorm(n, mean=50, sd=10),breaks=10,main=paste("Normal",n))
hist(runif(n, min=0, max=100),breaks=10,main=paste("Uniform",n))
}
p<name>: p for probability
累積機率函數 (CDF)p<name>
求取變數落在某數值區間的機率,給定機率分佈,我們常會想要知道目標變數會落在某一個數值空間的機率,如果隨機變數X ~ Normal[mu=50, sd=10]
,pnorm(x=55, 50, 10)
會傳回X < 55
的機率;
mu=50; sd=10
pnorm(55, mu, sd)
[1] 0.69146
🧙 問題討論:
A. 如果隨機變數X ~ Normal[mu=100, sd=20]
,請求出以下機率:
■ P[X < 110]
■ P[130 > X > 110]
■ P[X > 130]
B. 請畫出圖形來解釋你的運算
x1 = 90; x2 = 130
p1 = pnorm(x1,100,20); p2 = pnorm(x2,100,20)
c(p1, p2 - p1, 1 - p2)
[1] 0.308538 0.624655 0.066807
par(mfrow=c(1,2), mar=c(2,2,2,2),cex=0.7)
curve(dnorm(x,100,20),40,160,main="PDF")
abline(v=c(x1,x2),col='gray',lty=3)
x = seq(x1,x2,length=100)
polygon(c(x, x2, x1), c(dnorm(x,100,20), 0, 0), col="#00E9001F", border=NA)
#
curve(pnorm(x,100,20),40,160,main="CDF")
abline(v=c(x1,x2),col='gray',lty=3)
abline(h=c(p1,p2),col='pink')
q<name>: q for quantile
百分位函數q<name>
通常用來求取對應到某一機率的臨界值,如果隨機變數X ~ Normal[mu, sd]
,qnorm(p=0.8, mu, sd)
會傳回Pr[X < x1] = 0.8
的臨界值x1
;
qnorm(p=0.8, 100, 20)
[1] 116.83
🧙 問題討論:
A. 如果隨機變數X ~ Normal[mu=100, sd=20]
,請求X
的90%信賴區間
B. 請畫出圖形來解釋你的運算
💡 : When \(n\) is large, \(Binom[n, p]\) approaches \(Norm[\mu = n p, \sigma=\sqrt{n p (1-p)}]\)
💡 : \(X \sim Binom[n, p] \, \Rightarrow \, Exp(X) = n \cdot p \, , \, Var(X) = n \cdot p \cdot (1-p)\)
par(mfrow=c(1,1), cex=0.7)
n = 1000; p = 0.2
rbinom(500000, n, p) %>% hist(breaks=80, freq=F)
curve(dnorm(x, mean=n*p, sd=sqrt(n*p*(1-p))), col='red', lwd=2, add=T)
💡 : When \(n \times p\) is not large enough, Binomial degenerates into a discrete distribution.
par(mfrow=c(1,2), cex=0.7)
n = 10; p = 0.2
rbinom(100000, n, p) %>% hist(freq=F, breaks=(0:n)-0.01)
rnorm(100000, n*p, sqrt(n*p*(1-p))) %>% hist(freq=F)
💡 : When \(n\) is large and \(p\) is small, \(Binom[n, p]\) approaches \(Pois[\lambda = n p]\)
par(mfrow=c(1,2), cex=0.7)
rbinom(100000, 1000, 0.002) %>% table %>% barplot(main="Boinomial")
rpois(100000, 2) %>% table %>% barplot(main="Poisson")
💡 : \(X \sim Pois[\lambda] \, \Rightarrow \, E(X) = Var(X) = \lambda\)
💡 : \(X \sim Pois[\lambda_1], \, Y \sim Pois[\lambda_2] \, \Rightarrow \, X+Y \sim Pois[\lambda_1 + \lambda_2]\)
par(mfrow=c(1,2), cex=0.7)
(rpois(100000, 1) + rpois(100000, 2)) %>% table %>% barplot(main="Pois[1] + Pois[2]")
rpois(100000, 3) %>% table %>% barplot(main="Pois[3]")
dpois(0:10, 3) + dpois(0:10, 4) - dpois(0:10, 7)
[1] 0.0671908 0.2162406 0.3482258 0.3672794 0.2721720 0.1293956
[7] 0.0056023 -0.0678584 -0.0925057 -0.0854730 -0.0648806
💡 : We can simulating Geometrics Dis. by Binomial Dist.
par(mfrow=c(2,1), cex=0.7, mar=c(3,3,3,1) )
replicate(100000, which(rbinom(100, 1, .2) == 1)[1] - 1) %>%
table %>% barplot(main="Binomial Simulation")
rgeom(100000, 0.2) %>% table %>% barplot(main="Geometric")
💡 : \(X \sim Geom[p] \, \Rightarrow \, E[X] = \frac{1}{p}-1\)
If a machine has a probabaility of being broken in any given day, what is the probability that this machine is still workng in 30 days.
par(mfrow=c(2,1), cex=0.7, mar=c(3,3,3,1))
pgeom(0:29, prob=0.1) %>%
barplot(names=1:30, main="broke by the end of the #th day")
(1- pgeom(0:29, prob=0.1)) %>%
barplot(names=1:30, main="still working by the end of the #th day")
The Data
par(mfrow=c(1,1), cex=0.7)
(HK = HorseKicks)
nDeaths
0 1 2 3 4
109 65 22 3 1
Does it comply to Poisson Distribution?
fit = goodfit(HK, type = "poisson")
summary(fit)
Goodness-of-fit test for poisson distribution
X^2 df P(> X^2)
Likelihood Ratio 0.86822 3 0.83309
What is the \(\lambda\)?
fit$par
$lambda
[1] 0.61
What is the probability of nDeath >= 2
?
1 - ppois(1, fit$par$lambda)
[1] 0.12521
🧙 問題討論:
如果保險公司想要為國防部設計一個被馬踢死的保險:
■ 如果你只要只靠HorseKick
這一份數據,每一軍團每年被馬踢死的次數超過5次的機率是多?
■ 如果我們將數據fit到理論分布上面,根據理論分佈,被馬踢死的次數超過5次的機率是多?
■ 以上哪一種做法才是比較合理的做法的?
What is the probability of nDeath >= 5
?
1 - ppois(4, fit$par$lambda)
[1] 0.00042497
(Fed = Federalist)
nMay
0 1 2 3 4 5 6
156 63 29 8 4 1 1
Does it comply to Poisson Distribution?
fit <- goodfit(Fed, type = "poisson")
summary(fit)
Goodness-of-fit test for poisson distribution
X^2 df P(> X^2)
Likelihood Ratio 25.243 5 0.00012505
Does it comply to Negtive Binomial Distribution?
fit = goodfit(Fed, type = "nbinomial")
summary(fit)
Goodness-of-fit test for nbinomial distribution
X^2 df P(> X^2)
Likelihood Ratio 1.964 4 0.74238
What are the parameters?
fit$par
$size
[1] 1.1863
$prob
[1] 0.64376
How does the distribution looks like?
dnbinom(0:10, fit$par$size, fit$par$prob)
[1] 0.593037483 0.250629638 0.097602955 0.036929829 0.013768785 0.005087807
[7] 0.001868776 0.000683457 0.000249147 0.000090594 0.000032875
How is the probability that 2 <= nMay <= 6
?
pnbinom(6, fit$par$size, fit$par$prob) - pnbinom(1, fit$par$size, fit$par$prob)
[1] 0.15526
🧙 問題討論:
在cup.csv
檔案裡面是一千個信徒的擲筊次數,假定每一個人都是擲到3次成功才停止,請問:
■ 這個筊的成功機率大約是?
■ 請畫出用這個筊擲10次成功之前,失敗次數的機率分布?
■ 用這個筊擲15次還不能有5次成功的機率是?