單元摘要: 統計概論 with R
pacman::p_load(dplyr, ggplot2, ggpubr)
💡 【抽樣分布】:樣本的統計量的機率分布
在R的工作環境裡面,我們用一個向量popX
來代表目標族群中的某一個變數X
set.seed(1234)
popX = runif(50000, 0, 100) # randomly draw 50K points from [0, 100]
用sample()
這個功能做抽樣,抽樣的結果放在另外一個向量samX
裡面
samX = sample(popX, 20)
c(pMU=mean(popX), pSD=sd(popX), sMU=mean(samX), sSD=sd(samX))
pMU pSD sMU sSD
50.034 28.848 56.092 29.553
🗿: 重複執行以上這一個程式區塊,pMU
和pSD
會改變嗎? sMU
和sSD
呢?
SampleMean = replicate(2000, mean( sample(popX, 20)))
SampleSd = replicate(2000, sd( sample(popX, 20)))
par(mfrow=c(1,2), cex=0.7)
hist(SampleMean, 20, freq=F, xlab = "樣本平均值", ylab="機率密度", main=sprintf(
"樣本平均值的分佈 (mu=%.1f, sd=%.2f)\nDistribution of Sample Means",
mean(SampleMean), sd(SampleMean)) )
hist(SampleSd, 20, freq=F, xlab = "樣本標準差", ylab="機率密度", main=sprintf(
"樣本標準差的分佈 (mu=%.1f, sd=%.2f)\nDistribution of Sample Std. Deviation",
mean(SampleMean), sd(SampleMean)) )
💡: 以上這兩個圖形就是抽樣分佈,他們分別是樣本(S
)的統計量(平均值、標準差)的分佈。
💡 【中央極限定理】:反覆從目標族群中抽取樣本,當樣本夠大、抽取次數夠多時,不論原族群呈現何種分佈,這些樣本的平均值會呈現常態分配。而且樣本平均數的平均值會趨近於族群平均值。
💡 【中央極限定理】:反覆從目標族群中抽取樣本,當樣本夠大、抽取次數夠多時,不論原族群呈現何種分佈,這些樣本的平均值會呈現常態分配。而且樣本平均數的平均值會趨近於族群平均值。
par(mfrow=c(2,3), cex=0.6, mar=c(3,3,4,2))
set.seed(2)
S = sapply(4^(1:6), function(n) {
v = replicate(2000, mean(sample(popX, n)))
hist(v, 20, freq=F, ylab="", xlab="", main=sprintf(
"Dist. Sample Means\nSampleSize:%d, mu=%.1f, sd=%.2f", n, mean(v), sd(v)))
})
💡 學習重點:
推論統計的目的:用樣本的統計量,推論族群的統計量
推論的方法:
■ 點估計:用樣本統計量推論族群統計量
■ 區間估計:透過抽樣分佈求取信心區間
假設: The Population is in \(Norm[., \sigma]\)
點估計: \(\bar{X} \rightarrow \mu\)
根據抽樣分佈 Based on Sampling Distribution:
\[ \mu \sim Norm[\bar{x}, \frac{\sigma}{\sqrt{n}}] \quad \quad (1) \]
\[ \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim Norm[0, 1] \; \Rightarrow \; P[ \bar{X}-z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \bar{X}+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}] = 1 - \alpha \quad \quad (2) \]
sample = (iris$Sepal.Width[iris$Species=="setosa"])
n=length(sample); xbar=mean(sample); sd=sd(sample)
c(n, xbar, sd)
[1] 50.00000 3.42800 0.37906
sigma = 0.3 # known population sd
a= 0.05 # 1 minus confidence level
qnorm(c(a/2, 1-a/2), xbar, sigma/sqrt(n)) # CI in 1-a confidence lavel
[1] 3.3448 3.5112
Setosa蘭花的花萼平均寬度的95%信心區間是:CI95=[3.3448, 3.5112]
🗿 討論問題:
以上的推論結果表示:
【A】大約有95%的Setosa的花萼寬度落在CI95裡面
【B】Setosa的花萼寬度落在CI95的機率大約是95%
【C】CI95大約會涵蓋95%的Setosa的花萼寬度
【D】CI95大約有95%機率會涵蓋Setosa的花萼平均寬度
【E】Setosa的花萼的平均寬度落在CI95的機率大約是95%
【F】抽樣十萬次,大約會有5000次計算出來的CI95不涵蓋Setosa的花萼平均寬度
【G】以上皆非
假設: The Population is in \(Norm[., .]\)
點估計: \((\bar{x}, s) \rightarrow (\mu, \sigma)\)
根據抽樣分佈 Based on Sampling Distribution:
\[ \frac{\bar{X}-\mu}{s/\sqrt{n}} \sim t[n-1] \quad \quad (3) \]
sample = (iris$Sepal.Width[iris$Species=="virginica"])
n=length(sample); xbar=mean(sample); sd=sd(sample)
c(n, xbar, sd)
[1] 50.0000 2.9740 0.3225
a= 0.05 # 1 minus confidence level
# CI in 1-a confidence level
xbar + qt(c(a/2, 1-a/2), n-1) * sd/sqrt(n)
[1] 2.8823 3.0657
Virginica蘭花的花萼平均寬度的95%信心區間是:CI95=[2.7751, 3.1729]
par(mfrow=c(3, 3), mar=c(2,2,3,1), cex=0.6)
for(df in seq(2,18,2)) {
curve(dnorm(x), -5, 5, col='blue', main=paste("df =", df)) # blue: Normal
curve(dt(x, df), -5, 5, col='red', add=T) # red: t-Dist.
}
💡 自由度(樣本大小減1)稍微大一點,這兩種分佈之間的差異就很小了。
🗿 如果族群中的目標變數不是常態分佈的話要怎麼辦呢?
x = iris$Sepal.Width
bMean = replicate(2000, mean(sample(x, 5000, T)))
a = 0.05
quantile(bMean, c(a/2, 1-a/2))
2.5% 97.5%
3.0460 3.0692
🌻: Accelerated Introduction to Statistical Methods:
Bootstrap Example, Dept. Statistics, University of Wisconsin-Madison
🌻: Statistical Inference Week-4: Power, Bootstrapping, & Permutation Tests, John Hopkins University
🌻: Choosing Statistical Test, Inst. of Digital Research & Education, UCLA
hsb2 <- within(read.csv("https://stats.idre.ucla.edu/stat/data/hsb2.csv"), {
race <- as.factor(race)
schtyp <- as.factor(schtyp)
prog <- as.factor(prog)
})
attach(hsb2)
t.test(hsb2$write, mu = 50)
One Sample t-test
data: hsb2$write
t = 4.14, df = 199, p-value = 0.000051
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
51.453 54.097
sample estimates:
mean of x
52.775
\(\because\) 如果族群中write
的平均值等於50,我們會抽到『這種』樣本的機會小於 \(p=0.000051\)
\(\therefore\) 給定這個樣本,我們可以拒絕「write
的平均值等於50」的虛無假設
換句話說,我們可以推論:
■ write
的平均值等於50的機會「很小」,但是不能推論「機會小於\(p\)」
■ write
的平均值和50之間有「顯著」的差距
t.test(write ~ female, data=hsb2)
Welch Two Sample t-test
data: write by female
t = -3.66, df = 170, p-value = 0.00034
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-7.4992 -2.2407
sample estimates:
mean in group 0 mean in group 1
50.121 54.991
\(\because\) 如果族群中男生和女生的write
的平均值相等,我們會抽到『這種』樣本的機會小於 \(p=0.00034\)
\(\therefore\) 給定這個樣本,我們可以拒絕「男生和女生的write
的平均值相等」的虛無假設
換句話說,我們可以推論:
■ 男生和女生的write
的平均值相等的機會「很小」,但是不能推論「機會小於\(p\)」
■ 男生和女生的write
的平均值之間有「顯著」的差距
prop.test(sum(female), length(female), 0.5)
1-sample proportions test with continuity correction
data: sum(female) out of length(female), null probability 0.5
X-squared = 1.44, df = 1, p-value = 0.23
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.47330 0.61494
sample estimates:
p
0.545
\(\because\) 如果族群中女生的比例等於0.5,我們會抽到『這種』樣本的機會小於 \(p=0.23\)
\(\therefore\) 給定這個樣本,我們不能拒絕「女生的比例等於0.5」這個虛無假設
嚴格來講,我們不能拒絕虛無假設,就不應該再往前做推論,但是我們可以說:
■ 根據樣本,族群中女生的比例和0.5之間「沒有顯著」的差距
chisq.test(table(race), p = c(10, 10, 10, 70)/100)
Chi-squared test for given probabilities
data: table(race)
X-squared = 5.03, df = 3, p-value = 0.17
\(\because\) 如果族群中race
的比例是1:1:1:7
,我們會抽到『這種』樣本的機會小於 \(p=0.17\)
\(\therefore\) 給定這個樣本,我們不能拒絕「族群中race
的比例是1:1:1:7
」這個虛無假設
嚴格來講,我們不能拒絕虛無假設,就不應該再往前做推論,但是我們可以說:
■ 根據樣本,族群中race
的比例和1:1:1:7
之間「沒有顯著」的差距
給定虛無假設(mu=50
)和樣本大小(n=length(write)=200
),我們可以算出X的抽樣分佈,也就是樣本的X變數的的平均值的分布;
par(mfrow=c(1,1), cex=0.7)
x = hsb2$write; mu = 50; n = length(x)
set.seed(1)
P = runif(50000, mu-25, mu+25)
v = replicate(20000, mean(sample(P, n)))
hist(v,freq=F)
# given confidence level => critical values
q95 = quantile(v, c(0.025, 0.975)) # 0.05
q99 = quantile(v, c(0.005, 0.995)) # 0.01
abline(v=c(mu, q95, q99), col=c('blue','cyan','cyan','pink','pink'), lty=2)
# given sample statistics => p-value
points(mean(x), 0, pch=16, col='red')
rect(mean(x), -1, 56, 1, col="#FFFF003F", border=NA)
rect(100-mean(x), -1, 44, 1, col="#FFFF003F", border=NA)
根據抽樣分佈:
c(mean(write), mean(v > mean(write)), mean(v > mean(write))*2)
[1] 52.77500 0.00325 0.00650
library(ggpubr)
set.seed(1234)
wdata = data.frame(
sex = factor(rep(c("F", "M"), each=200)),
weight = c(rnorm(200, 55), rnorm(200, 55.1)))
ggboxplot(
wdata, x = "sex", y = "weight",
color = "sex", palette =c("#00AFBB", "#E7B800"),
add = "jitter") + ylim(51.5, 58.5) +
stat_compare_means(method="t.test")
ggdensity(wdata, x = "weight",
add = "mean", rug = TRUE,
color = "sex", fill = "sex",
palette = c("#00AFBB", "#E7B800"))