of a random experiment. (The

(1)

http://www.hmwu.idv.tw

吳漢銘國立政治大學統計學系

機率分佈

(大數法則、中央極限定理)

B02

(2)

本章大綱 & 學習目標

 排列組合

 機率分佈 (Probability distribution)

 統計分配之描述、常見之分佈(二項式分佈、常態分佈)、

隨機抽樣

 以常態機率逼近二項式機率

 實作 Laplace Distribution (d, p, q, r)

 QQplot

 資料是否來自常態分佈之檢定 (Test for Normality)

 大數法則 (LLN)

 中央極限定理 (CLT)

 用R程式模擬算機率

2/61

(3)

常見統計名詞

 A random experiment (隨機實驗) is a process by which we observe something uncertain. After the experiment, the result of the random experiment is known.

 Outcome (結果): An outcome is a result of a random experiment.

 Sample space (樣本空間), S : the set of all possible outcomes.

 Event (事件), E : an event is a subset of the sample space.

 Trial (試驗): a single performance of an experiment whose outcome is in S .

 In the experiment of tossing 4 coins, we may consider tossing each coin as a trial and therefore say that there are 4 trials in the experiment.

 例子1: 投擲兩硬幣看看正反面之樣本空間 S={HH, HT, TH, TT}.

 例子2: In the context of an experiment, we may define the sample space of observing a person as S = {sick, healthy, dead} . The following are all events: {sick} , {healthy} , {dead} , {sick, healthy} , {sick, dead} , {healthy, dead} , {sick, healthy, dead} , {none of the above} .

3/61

(4)

機率與隨機變數

 Probability (機率): the probability of event E, P(E), is the value

approached by the relative frequency of occurrences of E in a

long series of replications

of a random experiment. (The

frequentist view)

 Random variable (隨機變數): A function that assigns real

numbers to events, including the null event.

4/61

Source: Statistics and Data with R

(5)

排列數與組合數

 combn{utils}

: Generate All Combinations of

n

Elements, Taken

m

at a Time.



Usage:

combn(x, m, FUN = NULL, simplify = TRUE, ...)



Arguments:

 x: vector source for combinations, or integer n for x <- seq_len(n).

 m: number of elements to choose.

 FUN: function to be applied to each combination; default NULL means the identity, i.e., to return the combination (vector of length m).

5/61

> x <- letters[1:3]

> library(combinat)

> permn(x) [[1]]

[1] "a" "b" "c"

[[2]]

[1] "a" "c" "b"

[[3]]

[1] "c" "a" "b"

[[4]]

[1] "c" "b" "a"

[[5]]

[1] "b" "c" "a"

[[6]]

[1] "b" "a" "c"

> # the number of distinct arrangements of 3 out of 5

> nCm(5, 3) [1] 10

> combn(5, 3)

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]

[1,] 1 1 1 1 1 1 2 2 2 3 [2,] 2 2 2 3 3 4 3 3 4 4 [3,] 3 4 5 4 5 5 4 5 5 5

> combn(5, 3, min)

[1] 1 1 1 1 1 1 2 2 2 3

> choose(5, 3) [1] 10

(6)

練習1: 算大樂透中獎機率 ^6/61

資料來源: http://www.taiwanlottery.com.tw

> sample(1:49, 6, replace = FALSE) [1] 14 45 36 25 38 28

電腦選號

> set.seed(12345)

(7)

中頭獎機率 ^7/61

資料來源: http://www.taiwanlottery.com.tw

(8)

算一下中獎機率 ^8/61

> 1 / choose(49, 6) [1] 7.151124e-08

> choose(6, 5) / choose(49, 6) [1] 4.290674e-07

> (choose(6, 5)*choose(49-6-1, 1)) / choose(49, 6) [1] 1.802083e-05

你知道嗎?

• 若1注50元，要花7億才可買遍所有號碼組合。

• 被雷擊幾率幾何？因地而異

(1) 全世界每年因雷擊造成的傷亡人數超過1萬，

按世界人口數量為70億計算，雷擊機率大約為70 萬分之一。

(2) 美聯邦應急管理局估計當前美國人平均遭雷擊的機率為60萬分之一，

(3) 中國國際防雷論壇公布中國人遭雷擊的機率大約為33萬分之一。

(9)

統計分配 (Statistical Distributions)

Four fundamental items can be calculated for a statistical distribution:

 機率密度函數值(

d

)



point probability P(X=x) or probability density function f(x):

dnorm()

 累積機率函數值 (

p )



cumulative probability distribution function, F(x) = P(X<=x) :

pnorm()

 分位數 (

q

)



the quantiles of the distribution:

qnorm()

Often, we are interested in the inverse of a distribution. That is, given a probability value p, we wish to find the quantile, x, such that P (X ≤ x|θ) = p.

 隨機數 (

r

)



the random numbers generated from the distribution:

rnorm()

9/61

(10)

常用機率分配 ^10/61

以常態分佈normal為例:

機率密度(分配)函數: dnorm() 累積機率(分配)函數: pnorm() 分位數: qnorm()

隨機數: rnorm()

(11)

機率密度函數 Density (d)

 The density for a continuous distribution is a measure of the relative probability of "getting a value close to x".

 The probability of getting a value in a particular interval is the area under the corresponding part of the curve.

 For discrete distributions, the term "density" is used for the point probability, the probability of getting exactly the value x.

11/61

> x <- seq(-4, 4, 0.1)

> plot(x, dnorm(x), type="l", main="N(0,1)")

> curve(dnorm(x), from=-4, to=4)

> x <- 0:50

> plot(x, dbinom(x, size=50, prob=0.33), type="h")

> #histogram-like

0.48

(12)

Probability Mass Function 機率質量函數

12/61

https://en.wikipedia.org/wiki/Probability_distribution

S = X₁ + X₂

X₁ ~ DiscreteUniform (1, 6), n=6.

X₂ ~ DiscreteUniform(1, 6), n=6.

f(X₁ = k) = f(X₂ = k) = 1/6, k = 1,..,6.

f(S = s) = p(S = s), s=2, ..., 12.

P(S = 2) = 1/36, P(S=3)=2/36, ..., P(S=12)=1/36 P(X₁ + X₂ > 9) = 1/12 + 1/18 + 1/36 = 1/6

The probability mass function (pmf) p(S) specifies the

probability distribution for the sum S of counts from two dice.

https://en.wikipedia.org/wiki/Probability_mass_function

(13)

Probability Density Function 機率密度函數

13/61

(14)

累積機率分配函數 CDF (p)

 It is an S-shaped curve showing for any value of x, the probability of obtaining a sample value that is less than or equal to x, P(X<= x).

 The probability density is the slope of this curve (its derivative) of the cumulative probability function.

14/61

> curve(pnorm(x), -3, 3)

> arrows(-1, 0, -1, pnorm(-1), col="red")

> arrows(-1, pnorm(-1), -3, pnorm(-1), col="green")

> pnorm(-1) [1] 0.1586553

PDF CDF

-1

0.1586553

(15)

Probability Distribution

 The probability distribution is a description of a random phenomenon in terms of the probabilities of events.

 A probability distribution is a mathematical function that can be thought of as providing the probabilities of occurrence of different possible outcomes in an experiment.

 EXAMPLE: if the random variable X is used to denote the outcome of a coin toss ("the experiment"), then the probability distribution of X would take the value 0.5 for X = heads, and 0.5 for X = tails (assuming the coin is fair).

15/61

NOTE:

• The terms "probability distribution function" and "probability function" have also sometimes been used to denote the probability density function.

• "probability distribution function" may be used when the probability

distribution is defined as a function over general sets of values, or it may refer to the cumulative distribution function.

(16)

CRAN Task View:

Probability Distribution

16/61

https://cran.r-project.org/web/views/Distributions.html

Univariate Distribution Relationships:http://www.math.wm.edu/~leemis/chart/UDR/UDR.html

http://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/

(17)

機率分佈在統計學中的重要性

統計改變了世界

 十九世紀初: 「機械式宇宙」的哲學觀

 二十世紀: 科學界的統計革命。

 二十一世紀: 幾乎所有的科學已經轉而運用統計模式了。

統計革命的起點

 1895-1898，發表一系列和相關性(correlation) 有關的論文，

涉及動差、相關係數、標準差、卡方適合度檢定，奠定了現代統計學的基礎。

 引入了統計模型的觀念: 如果能夠決定所觀察現象的機率分佈的參數，就可以了解所觀察現象的本質。

17/61

Schweizer, B. (1984), Distributions Are the Numbers of the Future, in Proceedings of The Mathematics of Fuzzy Systems Meeting, eds. A. di Nola and A. Ventre, Naples, Italy: University of Naples, 137–149.

(The present is that future.)

(18)

常用機率分佈的應用

 Normal distribution, for a single real-valued quantity that grow linearly (e.g. errors, offsets)

 Log-normal distribution, for a single positive real-valued quantity that grow exponentially (e.g. prices, incomes, populations)

 Discrete uniform distribution, for a finite set of values (e.g. the outcome of a fair die)

 Binomial distribution, for the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed total number of independent occurrences

 Negative binomial distribution, for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs.

 Chi-squared distribution, the distribution of a sum of squared standard normal variables; useful e.g. for inference regarding the sample variance of normally distributed samples.

 F-distribution, the distribution of the ratio of two scaled chi squared variables; useful e.g. for inferences that involve comparing variances or involving R-squared.

18/61

(19)

分位數 Quantiles (q)

 The quantile function is the inverse of the cumulative distribution function:

F^-1(p)= x.

 We say that q is the x%-quantile if x% of the data values are ≤ q.

19/61

> # 2.5% quantile of N(0, 1)

> qnorm(0.025) [1] -1.959964

> # the 50% quantile (the median) of N(0, 1)

> qnorm(0.5) [1] 0

> qnorm(0.975) [1] 1.959964

0.025

0.95

0.025

-1.959964 1.959964

(20)

隨機數 Random Numbers (r)

 Let X_i is a vector of measurements for the i-th object in the sample.

 (X₁, X₂,..., X_n) is said to be a random sample of size n from the common distribution if X₁, X₂,..., X_nas independent copies of an underlying

measurement vector. (an n-tuple of identically-distributed independent random variables).

20/61

> par(mfrow=c(2,2))

> hist.sym <- hist(rnorm(10000),nclas=100,freq=FALSE, + main="Symmetric Distribution", xlab="rnorm")

> hist.flat <- hist(runif(10000),nclas=100,freq=FALSE, + main="Symmetric Flat Distribution", xlab="runif")

> hist.skr <- hist(rgamma(10000,shape=2,scale=1),freq=FALSE, nclas=100, + main="Skewed to Right", xlab="rgamma")

> hist.skl <- hist(rbeta(10000,8,2),nclas=100,freq=FALSE, + main="Skewed to Left", xlab="rbeta")

(21)

隨機抽樣 (Random Sampling)



The concepts of randomness and probability are central to statistics.

> sample(x, size, replace = FALSE, prob = NULL)



sampling without replacement

> sample(1:40, 5) [1] 12 38 2 3 7



sampling with replacement

> sample(1:40, 5, replace=TRUE) [1] 35 4 4 16 22



Simulate 10 coin tosses (fair coin-tossing)

> sample(c("H", "T"), 10, replace=T)

[1] "T" "T" "T" "H" "H" "H" "T" "H" "T" "H"

> sample(c("succ", "fail"), 10, replace=T, prob=c(0.9, 0.1))

[1] "succ" "succ" "succ" "fail" "fail" "fail" "succ" "succ" "succ" "succ"

21/61

(22)

隨機抽樣 (Random Sampling)

> x <- 1:5

> sample(x) # permutation [1] 3 1 5 4 2

 Clinical trials: randomization: random assign to two groups, total 20 subjects random assigning treatment groups

> sample(2, size=20, replace=TRUE)

[1] 2 2 2 1 1 2 2 2 1 2 1 1 2 1 2 1 2 1 1 1

 random choose 10 subjects to group 1

> sample(20, size=10, replace=FALSE) [1] 10 13 16 8 4 14 7 11 1 5

22/61

> x <- 1:10

> x[x > 8]

[1] 9 10

> sample(x[x > 8]) # length 2 [1] 10 9

> x[x > 9]

[1] 10

> sample(x[x > 9]) # length 10

[1] 7 9 6 4 8 5 3 2 1 10

(23)

二項式分佈 (Binomial)

 X~B(n, p)表示n次伯努利試驗中(size)，成功結果出現的次數。

 例: 擲一枚公正銅板十次，那麼擲得正面次數就服從n = 10、p = 1/2 的二項分布。

 dbinom(x, size, prob) # 機率公式值 ^P(X=x)

 pbinom(q, size, prob) # 累加至^q的機率值 P(X <= q)

 qbinom(p, size, prob) # 已知累加機率值，對應的機率點。

 rbinom(n, size, prob) # 隨機樣本數⁼ⁿ的二項隨機變數值。

23/61

https://en.wikipedia.org/wiki/Binomial_distribution

(24)

二項式分佈

X~B(10, 0.8)



利用二項分配理論公式，計算機率公式值 P(X=3)。

> factorial(10)/(factorial(3)*factorial(7))*0.8^3*0.2^7 [1] 0.000786432



利用R函數，計算機率值 P(X=3)。

> dbinom(3, 10, 0.8) [1] 0.000786432



計算P(X<= 3)- P(X<= 2) ，並和P(X=3)相比較。

> pbinom(3, 10, 0.8)- pbinom(2, 10, 0.8) [1] 0.000786432



已知累加機率值為0.1208，求對應的分位數。

> qbinom(0.1208, 10, 0.8) [1] 6

> pbinom(6, 10, 0.8) [1] 0.1208739

24/61

(25)

二項式分佈

X~B(10, 0.8)

 產生隨機樣本數100的二項隨機數值，計算其平均數及變異數，並與

理論值比較。

 畫直方圖，x-axis="機率值"，label="probability", title="Histogram of Binom(10, 0.8)"。

25/61

> n <- 10

> p <- 0.8

> m <- 100

> xbin <- rbinom(m, n, p)

> table(xbin) xbin

4 5 6 7 8 9 10 1 6 12 21 31 18 11

> mu <- n*p; mu [1] 8

> sigma2 <- n*p*(1-p); sigma2 [1] 1.6

> mean(xbin) [1] 7.73

> var(xbin) [1] 1.956667

> hist(xbin, ylab="probability", main="Histogram of Binom(10, 0.8)", prob=T)

(26)

常態分佈 ^26/61

https://en.wikipedia.org/wiki/Normal_distribution

 dnorm(x, mean, sd)#機率密度函數值 f(x)

 pnorm(q, mean, sd)#累加機率值P(X<= x)

 qnorm(p, mean, sd)#累加機率值p 對應的分位數

 rnorm(n, mean, sd)#常態隨機樣本

(27)

常態分佈 (Normal Distribution)

> par(mfrow=c(1,4))

> curve(dnorm, -3, 3, xlab="z", ylab="Probability density", main="Density")

> curve(pnorm, -3, 3, xlab="z", ylab="Probability", main="Probability")

> curve(qnorm, 0, 1, xlab="p", ylab="Quantile(z)", main="Quantiles")

> hist(rnorm(1000), xlab="z", ylab="frequency", main="Random numbers")

27/61

Z ~N(0, 1)

> dnorm(0)

[1] 0.3989423

> pnorm(-1) [1] 0.1586553

> qnorm(0.975) [1] 1.959964

> dnorm(10, 10, 2) # X~N(10, 4) [1] 0.1994711

> pnorm(1.96, 10, 2) [1] 2.909907e-05

> qnorm(0.975, 10, 2) [1] 13.91993

> rnorm(5, 10, 2)

[1] 9.043357 11.721717 7.763277 9.563463 10.072386

> pnorm(15, 10, 2) - pnorm(8, 10, 2) # P(8<=X<=15) [1] 0.8351351

(28)

常態分佈 ^28/61

> norm.sample <- rnorm(250)

> summary(norm.sample)

> hist(norm.sample, xlim=c(-5, 5), ylab="probability", + main="Histogram of N(0, 1)", prob=T)

> x <- seq(from=-5, to=5, length=300)

> lines(x, dnorm(x))

標出最頂點的座標

> x <- seq(from=-5, to=5, length=300)

> plot(x, dnorm(x), type="l")

> points(0, dnorm(0))

> height <- round(dnorm(0), 4); height

> text(1.5, height, paste("(0,", height, ")"))

(29)

以常態機率逼近二項式機率 ^29/61

par(mfrow = c(1, 2)) n <- 20 # 4

p <- 0.4 # 0.04 mu <- n * p

sigma <- sqrt(n * p * (1 - p)) x <- 0:n

plot(x, dbinom(x, n, p), type = 'h', lwd = 2, xlab = "x", ylab = "P(X=x)",

main = "B(20, 0.4)") z <- seq(0, n, 0.1)

lines(z, dnorm(z, mu, sigma), col = "red", lwd = 2) abline(h = 0, lwd = 2, col = "grey")

(30)

Double Exponential (Laplace)



The double exponential, also called the Laplace, is the

density of the difference between two independent rv with identical exponential densities.

30/61

# bda: Density Estimation for Grouped Data

dlap(x, mu=0, rate=1) plap(q, mu=0, rate=1) qlap(p, mu=0, rate=1) rlap(n, mu=0, rate=1)

# smoothmest: Smoothed M-estimators for 1-dimensional location

ddoublex(x, mu=0, lambda=1) rdoublex(n, mu=0, lambda=1)

# distr: Object Oriented Implementation of Distributions

D <- DExp(rate = 1) # D is a Laplace distribution with rate = 1.

r(D)(1) # one random number generated from this distribution, e.g. 0.4190765 d(D)(1) # Density of this distribution is 0.1839397 for x = 1.

p(D)(1) # Probability that x < 1 is 0.8160603.

q(D)(.1) # Probability that x < -1.609438 is 0.1.

rate(D) # rate of this distribution is 1.

rate(D) <- 2 # rate of this distribution is now 2.

(31)

實作 Laplace Distribution ^31/61

http://www.math.uah.edu/stat/special/Laplace.html ddouble.exp <- function(x, mu = 0, sigma = 1){

d <- 1 / (2 * sigma) * exp(-abs((x - mu) / sigma)) d

}

pdouble.exp <- function(x, mu = 0, sigma = 1){

p <- ifelse(x - mu < 0,

1/2 * exp((x - mu) / sigma), 1 - 1/2 * exp(-(x - mu) / sigma)) p

}

qdouble.exp <- function(p, mu = 0, sigma = 1){

q <- ifelse(p <= 0.5, mu + sigma * log(2 * p),

mu - sigma * log(2 * (1 - p))) q

}

rdouble.exp <- function(n, mu = 0, sigma = 1){

u <- runif(n)

r <- ifelse(u <= 0.5, mu + sigma * log(2 * u),

mu - sigma * log(2 * (1 - u))) r

}

r <- mu + rexp(n, 1 / sigma) * ifelse(runif(n) <= 0.5, -1, 1)

(32)

實作 Laplace Distribution ^32/61

> set.seed(12345)

> mu <- 0; sigma <- 1

> x <- sort(rdouble.exp(n=500, mu=mu, sigma=sigma))

> par(mfrow = c(1, 2))

> # plot pdf

> hist(x, ylim = c(0, 0.5), freq=FALSE, main="Histogram of rdouble.exp")

> lines(x, ddouble.exp(x, mu=mu, sigma=sigma), col="red", lty=2, lwd=2)

> (mu.hat <- mean(x)) # [1] 0.1072196

> (sigma.hat <- sum(abs(x - mu.hat))/ length(x)) # [1] 1.009184

> # plot cdf

> F <- ecdf(x)

> plot(x, F, type="l", main="Empirical CDF of rdouble.exp")

> curve(pdouble.exp(x), col = "red", lty = 2, lwd = 2, add = TRUE) https://en.wikipedia.org/wiki/Laplace_distribution

Probability density function Cumulative distribution function

(33)

Quantile-Quantile Plots :

Testing for Normality Graphically

 The quantile-quantile (Q-Q) plot is used to determine if two data sets come from populations with a common density.

 Q-Q plots are sometimes called probability plots, especially when data are examined against a theoretical density.

 qqnorm(): produces a normal QQ plot of the values in sample

 qqline(): adds a line which passes through the first and third quartiles.

 Use the diagonal line would not make sense because the first axis is scaled in terms of the theoretical quantiles of a N (0,1) distribution.

 Using the first and third quartiles to set the line gives a robust approach for

estimating the parameters of the normal distribution, when compared with using the empirical mean and variance, say.

 Departures from the line (except in the tails) are indicative of a lack of normality.

 qqplot(): qqplot produces a QQ plot of two datasets.

33/61

(34)

qqnorm, qqline, qqplot ^34/61

> par(mfrow = c(1, 2))

> set.seed(12345);

> n <- 100; mu <- 0.5; sigma <- 0.15

> x <- rnorm(n, mu, sigma)

> hist(x, freq=FALSE, ylim=c(0, 3), main="")

> y <- seq(0, 1, length = n)

> lines(y, dnorm(y, mu, sigma), type = 'l')

> qqnorm(x, main = "rnorm(mu=0.5, sigma=0.15)");

> qqline(x)

> qqplot(x, rnorm(300))

> qqline(x, col = 2)

> qqplot(scale(x), rnorm(300))

> qqline(scale(x), col = 2)

(35)

The values (Y-axis) are then plotted against the normal quantiles (X-axis).

Approximately normal distribution.

Less variance than expected (short tails). While this distribution differs from the normal, it seldom presents any problems in statistical calculations.

More variance than you would expect in a normal distribution (long tails).

Left skew in the distribution. Right skew in the distribution. Outlier. If the outliers are due to known errors, they should be removed from the data before a more detailed analysis is performed.

qqnorm for Various Shapes of Distributions

^35/61

Source: https://docs.tibco.com/pub/spotfire/6.5.2/doc/html/prd/prd_available_diagnostic_visualizations.htm

(36)

課堂練習 ^36/61

par(mfcol=c(2,4)) n <- 100

x <- rcauchy(n, 0, 0.1) # a long tail density hist(x, freq=FALSE, main="long tail")

curve(dnorm(x, mean=mean(x), sd=sd(x)), add=TRUE, col="red") qqnorm(x, main = "rcauchy(n, 0, 1)");

qqline(x, col="red")

# different shapes of distributions

x <- rbeta(n, 2, 2) # a short tail density x <- rbeta(n, 2, 20) # a right skewed density x <- rbeta(n, 20, 2) # a left skewed density

Try

• Normal

• Symmetric, Non-Normal, Short-Tailed

• Symmetric, Non-Normal, Long-Tailed

• Symmetric and Bimodal

• Bimodal Mixture of 2 Normals

• Skewed (Non-Symmetric) Right

• Skewed (Non-Symmetric) Left

• Symmetric with Outlier

(37)

實作 Quantile-Quantile Plots

1. 計算樣本平均數及樣本變異數。

2. 將隨機樣本標準化並排序。

3. 查出 n個標準常態值: (將標準態分配，區分成n+1區塊，最左及最右區塊的機率分別為 1/2n, 中間的 n-1 區塊, 機率分別為1/n)。

4. 畫散佈圖: x軸: 排序的標準化樣本，y軸: 標準常態值。

5. 加入一條由(q(i), q(i))產生的標準常態直線。

(或加入一條通過第20%及第75%quantiles的直線)

37/61

(38)

課堂練習

> qqnorm(iris[,1])

> qqline(iris[,1])

> qqnorm(scale(iris[,1]))

> qqline(scale(iris[,1]))

> my.qqplot(iris[,1])

38/61

my.qqplot <- function(x){

x.mean <- mean(x) x.var <- var(x) n <- length(x)

z <- (x-x.mean)/sqrt(x.var) z.mean <- mean(z)

z.var <- var(z) z.sort <- sort(z) k <- 1:n

p <- (k-0.5)/n q <- qnorm(p)

plot(q, z.sort, xlim=c(-3, 3), ylim=c(-3, 3)) title("QQ plot")

lines(q, q, col=2) }

(39)

Formal Tests for Normality

 The hypotheses used are:

H₀: The sample data are not significantly different than a normal population.

H_a: The sample data are significantly different than a normal population

39/61

> par(mfrow=c(1, 2))

> hist(iris$Sepal.Width)

> qqnorm(iris$Sepal.Width)

> qqline(iris$Sepal.Width, col="red")

Packages: nortest Five omnibus tests for testing the composite hypothesis of normality:

ad.test, cvm.test, lillie.test,

pearson.test, sf.test

(40)

ks.test, ad.test, shapiro.test

^40/61



Kolmogorov-Smirnov (K-S) test (Chakravarti et al., 1967).



The Anderson-Darling test (Stephens, 1974).



The Shapiro-Wilk normality test (Shapiro and Wilk, 1965).



A large p -value (larger than, say, 0.05) indicates that the sample is not different from normal with the sample's mean and standard deviation.

> x <- iris$Sepal.Width

> ks.test(x, 'pnorm', mean(x), sd(x))

One-sample Kolmogorov-Smirnov test data: x

D = 0.10566, p-value = 0.07023 alternative hypothesis: two-sided Warning message:

In ks.test(x, "pnorm", mean(x), sd(x)) :

ties should not be present for the Kolmogorov-Smirnov test

> library(nortest)

> ad.test(iris$Sepal.Width)

Anderson-Darling normality test data: iris$Sepal.Width

A = 0.90796, p-value = 0.02023

> shapiro.test(iris$Sepal.Width) Shapiro-Wilk normality test data: iris$Sepal.Width

W = 0.98492, p-value = 0.1012

(41)

Which Normality Test Should I Use?

 Kolmogorov-Smirnov test:

 The test applies to continuous densities only.

 It is more sensitive near the center of the density than at the tails than other tests;

 For data sets n > 50.

 The Anderson-Darling test:

 A-D test is a modification of the K-S test and gives more weight to the tails of the density than does the K-S test. It is generally preferable to the K-S test.

 Shapiro-Wilks test:

 Doesn't work well if several values in the data set are the same.

 Works best for data sets with n < 50, but can be used with larger data sets.

 W/S test (range(x)/sd(x)): simple, but effective.

 Jarque-Bera test (jarque.test {moments}): tests for skewness and kurtosis, very effective.

 D'Agostino test (agostino.test{moments}) : powerful omnibus (skewness, kurtosis, centrality) test.

41/61

(42)

Which Normality Test Should I Use?



Asghar Ghasemi and Saleh Zahediasl, Normality Tests for

Statistical Analysis: A Guide for Non-Statisticians, Int J Endocrinol

Metab. 2012 Spring; 10(2): 486–489.

 assessing the normality assumption should be taken into account for using parametric statistical tests.

 The K-S test, should no longer be used owing to its low power.

 It is preferable that normality be assessed both visually and through normality tests, of which the Shapiro-Wilk test is highly

recommended.

 NOTE:

 If the data are not normal, use non-parametric tests.

 If the data are normal, use parametric tests.

 If you have groups of data, you MUST test each group for normality.

 It's common seen that a model is built from the training data and is then applied to the testing data. Did these two data sets follow the same distribution?

42/61

(43)

fitdistrplus :

An R Package for Fitting Distributions

43/61

> install.packages("fitdistrplus")

> library(fitdistrplus)

> x <- iris$Sepal.Length

> descdist(x, boot=1000) summary statistics

---

min: 4.3 max: 7.9 median: 5.8

mean: 5.843333

estimated sd: 0.8280661

estimated skewness: 0.314911 estimated kurtosis: 2.447936

> fit.n <- fitdist(x, "norm")

> summary(fit.n)

Fitting of the distribution ' norm ' by maximum likelihood

Parameters :

estimate Std. Error mean 5.8433333 0.06738557 sd 0.8253013 0.04764848 Loglikelihood: -184.0398 AIC: 372.0795 BIC: 378.1008 Correlation matrix:

mean sd mean 1 0

sd 0 1

# rapidFitFun {qAnalyst}: Function to obtain rapid fitting of multiple distributions

https://cran.r-project.org/web/packages/qAnalyst/index.html Package ‘qAnalyst’was removedfrom the CRAN repository.

Formerly available versions can be obtained from the archive.

(44)

fitdistrplus :

An R Package for Fitting Distributions

44/61

> fit.w <- fitdist(x, "weibull")

> fit.g <- fitdist(x, "gamma")

> par(mfrow=c(1 ,4))

> plot.legend <- c("normal", "weibull", "gamma")

> denscomp(list(fit.n, fit.w, fit.g), legendtext=plot.legend)

> qqcomp(list(fit.n, fit.w, fit.g), legendtext=plot.legend)

> cdfcomp(list(fit.n, fit.w, fit.g), legendtext=plot.legend)

> ppcomp(list(fit.n, fit.w, fit.g), legendtext=plot.legend)

> summary(fit.w)

Fitting of the distribution 'weibull' by maximum likelihood Parameters :

estimate Std. Error shape 7.454379 0.45168136 scale 6.208005 0.07209406 Loglikelihood: -190.7689 AIC: 385.5377 BIC: 391.559 Correlation matrix:

shape scale shape 1.0000000 0.3323758 scale 0.3323758 1.0000000

> summary(fit.g)

Fitting of the distribution 'gamma' by maximum likelihood Parameters :

estimate Std. Error shape 50.634073 5.827566 rate 8.665336 1.002253 Loglikelihood: -182.3061 AIC: 368.6122 BIC: 374.6335 Correlation matrix:

shape rate shape 1.0000000 0.9950669 rate 0.9950669 1.0000000

(45)

大數法則: The Law of Large Numbers

 由具有有限(finite)平均數μ的母體隨機抽樣，隨著樣本數

n

的增加，樣本平均數越接近母體的均數μ 。

 樣本平均數的這種行為稱為大數法則(law of large numbers)。

45/61

特別注意: 本投影片中符號n和m之區別。

(46)

Bernoulli試驗

 Bernoulli試驗(伯努利試驗): 擲一公平硬幣一次，可能出現正面或反面。

 令

X=1

為出現正面,

X=0

為出現反面。



X~Binomial(1, 0.5)

。

 伯努利分佈的平均數

p

。

46/61

rbinom(m, size=1, prob) m: number of observations (樣本數) size=1: number of trials

prob: probability of success on each trial

: 平均正面次數

m Bernoulli random samples:

rbinom(m, 1, 0.5)

(47)

利用Bernoulli試驗說明大數法則 ^47/61

sample.size <- seq(from=1, to=800, by=5) m <- length(sample.size)

xbar <- numeric(m) for(i in 1:m){

xbar[i] <- mean(rbinom(sample.size[i], 1, 0.5)) }

plot(sample.size, xbar, xlab="Number of observations, n", ylab="sample mean", main="Law of Large Numbers",

type="l", col="red", lwd=1.5) abline(h=0.5, col="blue")

(48)

中央極限定理 (Central Limit Theorem)^48/61

 由一具有平均數μ，標準差σ的母體中抽取樣本大小為n的簡單隨機樣本，當樣本大小n夠大時，樣本平均數的抽樣分配會近似於常態分配。

 在一般的統計實務上，大部分的應用中均假設當樣本大小為30(含)以上時，的抽樣分配即近似於常態分配。

 當母體為常態分配時，不論樣本大小，樣本平均數的抽樣分配仍為常態分配。

(49)

應用CLT算機率

 於某考試中，考生之通過標準機率為0.7，以隨機變數表示考生之通過與否(X=1表示通過) (X=0表示不通過)，其機率分配為 P(X=1)=0.7, P(X=0)=0.3。

1. 計算母體平均數及變異數。

2. 假如有210名考生，計算「平均通過人數」的平均數及變異數。

3. 計算通過人數> 126的機率。

49/61

1.

2.

3.

(50)

課堂練習

> z <- (126/210 -0.7)/sqrt(0.001) # 通過人數^>126的機率

> z

[1] -3.162278

> 1 - pnorm(z) [1] 0.9992173

50/61

> pass.prob <- function(x, n, mu, sigma2, digit=m){

xbar <- x/n

z <- (xbar-mu)/sqrt(sigma2) zvalue <- round(z, digit)

right.prob <- round(1-pnorm(z), digit) list(zvalue=zvalue, prob=right.prob) }

> pass.prob(126, 210, 0.7, 0.001, 4)

$zvalue [1] -3.1623

$prob

[1] 0.9992

寫一「通過人數大於某數的機率」之副程式

 n: 考生總數(n=210)

 X: 通過考生之人數, X~B(210, 0.7)

(51)

驗証中央極限定理

1. 先做隨機樣本的取樣。

2. 計算樣本平均。

3. 重復上述動作數百或數仟次，得到抽樣平均的分佈。

4. 描繒出抽樣平均之抽樣分配直方圖。

5. 畫出相對應的qqplot。

6. 再做各種不同樣本數(m₀=1, 5, 15, 30,...)的抽樣計算。

51/61

(52)

範例: Uniform Distribution ^52/61

umin <- 5 umax <- 80 n.sample <- 20 n.repeated <- 500

RandomSample <- matrix(0, n.sample, n.repeated) for(i in 1:n.repeated){

rnumber <- runif(n.sample, umin, umax) RandomSample[,i] <- as.matrix(rnumber) }

dim(RandomSample)

樣本數

重復數

(53)

抽樣樣本之直方圖 ^53/61

par(mfrow=c(2,2)) for(i in 1:4){

title <- paste(i,"-th sampling", sep="")

hist(RandomSample[,i], ylab="f(x)", xlab="random uniform", pro=T, main=title) }

(54)

抽樣樣本平均之直方圖&QQplot ^54/61

> SampleMean <- apply(RandomSample, 2, mean)

> hist(SampleMean, ylab="f(x)", xlab="sample mean", pro=T, main="n=20")

> qqnorm(SampleMean)

> qqline(SampleMean)

myfun <- function(x){

mu <- (umin + umax)/2 s2 <- (umax - umin)^2/12

z <- (mean(x) - mu)/sqrt(s2/n) z

}

https://en.wikipedia.org/wiki/Continuous_uniform_distribution

(55)

重複不同的樣本數 ^55/61

CLT.unif <- function(umin, umax, n.sample, n.repeated){

RandomSample <- matrix(0, n.sample, n.repeated) for(i in 1:n.repeated){

rnumber <- runif(n.sample, umin, umax) RandomSample[,i] <- as.matrix(rnumber) }

SampleMean <- apply(RandomSample, 2, mean) par(mfrow=c(1,2))

title <- paste("n=",n.sample, sep="")

hist(SampleMean, breaks=30, ylab="f(x)", xlab="sample mean", pro=T, main=title) qqnorm(SampleMean)

qqline(SampleMean) }

當樣本數 n 愈大時，從樣本平均數

的抽樣分配可以得到「中央極限定理」的主要結論。

CLT.unif(5, 80, 50, 500)

myfun <- function(x){

mu <- (umin + umax)/2 s2 <- (umax - umin)^2/12

z <- (mean(x) - mu)/sqrt(s2/n) z

}

(56)

CLT.unif(umin, umax, n.sample, n.repeated

^56/61

CLT.unif(5, 80, 1, 500) CLT.unif(5, 80, 5, 500)

CLT.unif(5, 80, 10, 500) CLT.unif(5, 80, 30, 500)

(57)

R package: animation ^57/61

http://cran.r-project.org/web/packages/animation/index.html http://yihui.name/animation/

A Gallery of Animations in Statistics and Utilities to Create Animations

(58)

用R程式模擬算機率: 我們要生女兒

一對夫婦計劃生孩子生到有女兒才停，或生了三個就停止。

他們會擁有女兒的機率是多少？

 第l 步：機率模型

 每一個孩子是女孩的機率是0.49 ，是男孩的機率是0.51。

各個孩子的性別是互相獨立的。

 第2 步：分配隨機數字。

 用兩個數字模擬一個孩子的性別: 00, 01, 02, …, 48 ＝女孩; 49, 50, 51, …, 99 ＝男孩

 第3 步：模擬生孩子策略

 從表A當中讀取一對一對的數字，直到這對夫婦有了女兒，或已有三個孩子。

 10次重複中，有9次生女孩。會得到女孩的機率的估計是9/10=0.9。

 如果機率模型正確的話，用數學計算會有女孩的真正機率是0.867。(我們的模

擬答案相當接近了。除非這對夫婦運氣很不好，他們應該可以成功擁有一個女兒。)

58/61

(59)

用R程式模擬算機率: 我們要生女兒 ^59/61

> girl.p <- 0.49 + 0.51*0.49 + 0.51^2*0.49

> girl.p [1] 0.867349

>

> girl.born(n=10, show.id = T) 1 : (73)男(18)女+

2 : (23)女+

3 : (53)男(74)男(64)男 4 : (95)男(20)女+

5 : (63)男(16)女+

6 : (48)女+

7 : (67)男(51)男(44)女+

8 : (74)男(99)男(25)女+

9 : (47)女+

10 : (81)男(41)女+

[1] 0.9

> girl.born(n=10000) [1] 0.8674

girl.born <- function(n, show.id = F){

girl.count <- 0 for (i in 1:n) {

if (show.id) cat(i,": ") child.count <- 0

repeat {

rn <- sample(0:99, 1, replace=T)

if (show.id) cat(paste0("(", rn, ")")) is.girl <- ifelse(rn <= 48, TRUE, FALSE) child.count <- child.count + 1

if (is.girl){

girl.count <- girl.count + 1 if (show.id) cat("女+")

break

} else if (child.count == 3) { if (show.id) cat("男") break

} else{

if (show.id) cat("男") }

}

if (show.id) cat("\n") }

p <- girl.count / n p

}

(60)

課堂練習: 模擬算機率: 腎臟移植

腎臟移植的病人資料：撐過移植手術的占90%，另外10%會死亡。在手術後存活的人中有60%移植成功，另外的40%還是得回去洗腎。五年存活率對於換了腎的人來說是70% ，對於回去洗腎的人來說是50%。計算能活過五年的機率。

 第l 步：機率模型如圖19.1。

 第2 步：對每個結果分配數字:

 階段1:

0 ＝死亡;

l, 2, 3, 4, 5, 6, 7, 8, 9 ＝存活

 階段2:

0, l, 2, 3, 4, 5 ＝移植成功 6, 7, 8, 9 ＝仍需洗腎

60/61

(61)

課堂練習: 模擬算機率: 腎臟移植

 階段3，換了腎：

0, l, 2, 3, 4, 5, 6 ＝存活五年 7, 8, 9 ＝未能存活五年

 階段3 ，洗腎：

0, l, 2, 3, 4 ＝存活五年

5, 6, 7, 8, 9 ＝未能存活五年

(階段3的數字分配，和階段2的結果有關。

所以二者間不獨立。)

 第3步：多次模擬的結果。

 在4次模擬中，有2次存活超過5年。

 經由電腦多次重覆計算，五年存活機率是0.558。

61/61

of a random experiment. (The

http://www.hmwu.idv.tw

B02

approached by the relative frequency of occurrences of E in a

of a random experiment. (The

frequentist view)

numbers to events, including the null event.

: Generate All Combinations of

Elements, Taken

at a Time.

Usage:

Arguments:

Four fundamental items can be calculated for a statistical distribution:

d

point probability P(X=x) or probability density function f(x):

p )

cumulative probability distribution function, F(x) = P(X<=x) :

q

the quantiles of the distribution:

Often, we are interested in the inverse of a distribution. That is, given a probability value p, we wish to find the quantile, x, such that P (X ≤ x|θ) = p.

r

the random numbers generated from the distribution:

Probability Mass Function 機率質量函數

Probability Density Function 機率密度函數

CRAN Task View:

Probability Distribution

The concepts of randomness and probability are central to statistics.

sampling without replacement

sampling with replacement

Simulate 10 coin tosses (fair coin-tossing)

X~B(10, 0.8)

利用二項分配理論公式，計算機率公式值 P(X=3)。

利用R函數，計算機率值 P(X=3)。

計算P(X<= 3)- P(X<= 2) ，並和P(X=3)相比較。

已知累加機率值為0.1208，求對應的分位數。

X~B(10, 0.8)

The double exponential, also called the Laplace, is the

density of the difference between two independent rv with identical exponential densities.

Quantile-Quantile Plots :

Testing for Normality Graphically

qqnorm for Various Shapes of Distributions

ks.test, ad.test, shapiro.test

Kolmogorov-Smirnov (K-S) test (Chakravarti et al., 1967).

The Anderson-Darling test (Stephens, 1974).

The Shapiro-Wilk normality test (Shapiro and Wilk, 1965).

A large p -value (larger than, say, 0.05) indicates that the sample is not different from normal with the sample's mean and standard deviation.

Which Normality Test Should I Use?

Which Normality Test Should I Use?

Asghar Ghasemi and Saleh Zahediasl, Normality Tests for

Metab. 2012 Spring; 10(2): 486–489.

fitdistrplus :

An R Package for Fitting Distributions

fitdistrplus :

An R Package for Fitting Distributions

n

X=1

X=0

X~Binomial(1, 0.5)

p

m Bernoulli random samples:

1.

2.

3.

當樣本數 n 愈大時，從樣本平均數

的抽樣分配可以得到「中央極限定 理」的主要結論。

CLT.unif(umin, umax, n.sample, n.repeated

用R程式模擬算機率: 我們要生女兒

一對夫婦計劃生孩子生到有女兒才停，或生了三個就停止。

他們會擁有女兒的機率是多少？

0 ＝ 死亡;

l, 2, 3, 4, 5, 6, 7, 8, 9 ＝ 存活

0, l, 2, 3, 4, 5 ＝移植成功 6, 7, 8, 9 ＝ 仍需洗腎

的抽樣分配可以得到「中央極限定理」的主要結論。

0 ＝死亡;

l, 2, 3, 4, 5, 6, 7, 8, 9 ＝存活

0, l, 2, 3, 4, 5 ＝移植成功 6, 7, 8, 9 ＝仍需洗腎