Random Samples and Descriptive Statistics

Statistics

Random Samples, Statistics and Sampling Distributions

Shiu-Sheng Chen

Department of Economics, National Taiwan University

Fall 2019

Section 1 Random Samples and Descriptive Statistics

Random Samples

Definition (Random Samples)

A random sample of size $n$, $\{X_i\}_{i=1}^n = \{X_1, X_2, \ldots, X_n\}$, is a set of i.i.d. random variables.

Random samples are also called i.i.d. samples.

Notation

$\{X_i\}_{i=1}^n \overset{i.i.d.}{\sim} (\mu, \sigma^2)$

Properties

$E(X_1) = E(X_2) = \cdots = E(X_n) = \mu$
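
As a minimal sketch of this notation, the R code below draws one random sample of size n. The normal population and the values of `n`, `mu`, and `sigma` are assumptions chosen for illustration; the definition itself only requires the draws to be i.i.d. with mean µ and variance σ².

```r
set.seed(1)          # fix the seed so the draw is reproducible
n     <- 1000        # sample size (assumed)
mu    <- 70          # population mean (assumed)
sigma <- 15          # population standard deviation (assumed)

# rnorm() returns n independent draws from the same N(mu, sigma^2) population
x <- rnorm(n, mean = mu, sd = sigma)

length(x)   # n observations
mean(x)     # close to mu when n is large
var(x)      # close to sigma^2 when n is large
```

Every X_i shares the same mean µ, which is why the average of a large sample lands near `mu`.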

Descriptive Statistics

Frequency Distribution Table

Empirical Density Function (Histogram)

Empirical Distribution Function

Statistics and Sampling Distribution

Example: Statistics Midterm Exam

1st Midterm Exam Scores of 167 students in 2018

69 5 66 88 73 96 88 92 67 79 74 72 73 63 66 73 60 78 50 86 64 69 40 59 71 32 74 72 87 83 71 87 90 79 57 84 67 78 71 80 51 70 56 99 61 31 46 96 87 73 72 81 72 84 77 75 38 91 82 15 69 75 49 62 13 58 74 79 44 72 84 70 68 37 57 61 43 71 71 36 48 36 35 65 83 69 63 59 46 79 58 82 81 68 50 88 35 55 80 71 59 76 87 71 50 65 76 29 37 68 40 72 47 39 84 58 49 43 83 55 44 73 54 53 56 54 59 79 61 98 69 84 82 74 59 85 64 70 85 78 84 78 63 59 85 57 25

R Code

R Example (Data Loading and Frequency Distribution Table)

## Load the data
dat = read.csv('2018Midterm1.csv.csv', header=TRUE)
Midterm = dat$Midterm

## Build the frequency distribution table
## right=FALSE gives left-closed classes [a, b)
breaks = seq(0, 100, by=5)
Midterm.cut = cut(Midterm, breaks, right=FALSE)
Midterm.freq = table(Midterm.cut)
Midterm.freq

Frequency Distribution Table

Midterm.cut
   [0,5)   [5,10)  [10,15)  [15,20)  [20,25)  [25,30)  [30,35)
       0        1        1        1        0        2        2
 [35,40)  [40,45)  [45,50)  [50,55)  [55,60)  [60,65)  [65,70)
       8        6        7        7       17       11       16
 [70,75)  [75,80)  [80,85)  [85,90)  [90,95) [95,100)
      27       17       18       13        4        6

Empirical Density Function

Empirical Density Function (Histogram)

Relative frequency distribution
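
A minimal sketch of how relative frequencies and histogram densities follow from the counts. The vector `scores` holds only the first ten exam scores, and the class width of 20 is an assumption made to keep the output short.

```r
# First ten scores from the midterm example, used as a small illustration
scores <- c(69, 5, 66, 88, 73, 96, 88, 92, 67, 79)
breaks <- seq(0, 100, by = 20)          # class width of 20 (assumed)

freq     <- table(cut(scores, breaks, right = FALSE))  # counts per class
rel.freq <- prop.table(freq)                           # counts / n
dens     <- rel.freq / diff(breaks)                    # relative frequency / class width

rel.freq
sum(rel.freq)                # relative frequencies sum to 1
sum(dens * diff(breaks))     # total area under the histogram equals 1
```

Dividing by the class width is what turns a relative frequency into a density, so the bar areas (not the bar heights) add up to one.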

R Code

R Example (Histogram)

## Draw the histogram

hist(Midterm, breaks=10, right=FALSE, xlab='Midterm', main='Histogram of Midterm')
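
In `hist()`, a single number passed to `breaks` is only a suggestion for the number of cells, so the bars need not match the 5-point classes of the frequency table. The sketch below, which assumes those 5-point classes are wanted, passes the break points explicitly. The vector `scores` again holds only the first ten exam scores.

```r
scores <- c(69, 5, 66, 88, 73, 96, 88, 92, 67, 79)
breaks <- seq(0, 100, by = 5)

# plot = FALSE returns the histogram object without drawing it
h <- hist(scores, breaks = breaks, right = FALSE, plot = FALSE)

# The counts agree with the frequency table built by cut() and table()
# (no score equals 100 here, so the last class is treated the same way)
tab <- table(cut(scores, breaks, right = FALSE))
all(h$counts == as.vector(tab))

# h$density rescales the counts so that the bar areas sum to 1
sum(h$density * diff(h$breaks))
```

Calling `hist()` with the same `breaks` and `freq = FALSE` would draw these densities instead of the counts.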

Empirical Density Function

[Figure: Histogram of Midterm; x-axis: Midterm (0 to 100), y-axis: Frequency (0 to 40)]

Empirical Distribution Function

Definition (Empirical Distribution Function)

Given a random sample $\{X_i\}_{i=1}^n \overset{i.i.d.}{\sim} F_X(x)$, the empirical distribution function (EDF) is defined as

$$\hat{F}_n(x) = \frac{\text{number of elements in the sample} \le x}{n} = \frac{1}{n}\sum_{i=1}^n I_{\{X_i \le x\}}$$
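
The definition translates directly into code: the mean of the indicator values is the share of observations at or below x. A minimal sketch, where the helper `edf.at` and the evaluation points in `grid` are hypothetical names introduced for illustration, and `scores` holds the first ten exam scores:

```r
scores <- c(69, 5, 66, 88, 73, 96, 88, 92, 67, 79)

# EDF at a single point x: the share of observations that are <= x
edf.at <- function(x, obs) mean(obs <= x)

# Evaluate on a few points and compare with R's built-in ecdf()
grid    <- c(0, 50, 69, 90, 100)
manual  <- sapply(grid, edf.at, obs = scores)
builtin <- ecdf(scores)(grid)

manual
all.equal(manual, builtin)
```

The EDF is a step function that jumps by 1/n at each observation (or by k/n where k observations are tied).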

R Code

R Example (Empirical Distribution Function)

## Plot the EDF

medf <- ecdf(Midterm)

plot(medf, main='EDF of Midterm')

Empirical Distribution Function

[Figure: EDF of Midterm; y-axis: Fn(x) (0.2 to 1.0)]

Section 2 Statistics

Statistics

Definition (Statistic)

Any function of the random sample is called a statistic:

$T_n = T(X_1, X_2, \ldots, X_n)$.

A statistic does not contain unknown parameters.

The subscript n indicates the sample size.

Examples of Statistics

Sample mean:

$$\bar{X}_n = \frac{\sum_{i=1}^n X_i}{n}$$

Sample variance:

$$S_n^2 = \frac{\sum_{i=1}^n (X_i - \bar{X}_n)^2}{n-1}$$

Sample r-th moment:

$$m_r = \frac{1}{n}\sum_{i=1}^n X_i^r$$

Sample covariance/correlation coefficient:

$$S_{XY} = \frac{\sum_{i=1}^n (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{n-1}, \qquad r_{XY} = \frac{S_{XY}}{S_X S_Y}$$
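
A minimal sketch that computes each of these statistics from its formula and checks it against the corresponding built-in R function. The vectors `x` and `y` are the first ten exam scores split in two; pairing them as two variables is an assumption made only to have something to correlate.

```r
x <- c(69, 5, 66, 88, 73)      # first five exam scores, for illustration
y <- c(96, 88, 92, 67, 79)     # next five scores, treated as a second variable (assumed)
n <- length(x)

xbar <- sum(x) / n                                   # sample mean
s2   <- sum((x - xbar)^2) / (n - 1)                  # sample variance
m2   <- sum(x^2) / n                                 # sample 2nd moment (r = 2)
sxy  <- sum((x - xbar) * (y - mean(y))) / (n - 1)    # sample covariance
rxy  <- sxy / (sqrt(s2) * sd(y))                     # sample correlation coefficient

# Each hand-computed value agrees with the built-in function
c(isTRUE(all.equal(xbar, mean(x))), isTRUE(all.equal(s2, var(x))),
  isTRUE(all.equal(sxy, cov(x, y))), isTRUE(all.equal(rxy, cor(x, y))))
```

Note that `var()`, `sd()`, `cov()`, and `cor()` in R all use the n − 1 divisor, matching the formulas here.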

Sampling Distributions

Definition (Sampling Distribution)

Let the random variable $T_n = T(X_1, X_2, \ldots, X_n)$ be a function of a random sample. The distribution of $T_n$ is then called its sampling distribution.

Example 1

If $\{X_i\}_{i=1}^n$ is a random sample from Bernoulli($p$), then

$$T_n = \sum_{i=1}^n X_i \sim \text{Binomial}(n, p).$$

That is, the Binomial distribution is the sampling distribution of $T_n$, which is a function of the Bernoulli random sample $\{X_i\}_{i=1}^n$.
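
This result can be checked by simulation. The sketch below draws many Bernoulli samples, sums each one, and compares the relative frequencies of the sums with the Binomial(n, p) probabilities. The values of `n`, `p`, and `reps` are assumptions chosen for illustration.

```r
set.seed(123)
n    <- 10       # sample size (assumed)
p    <- 0.3      # success probability (assumed)
reps <- 20000    # number of simulated samples (assumed)

# Each column is one Bernoulli(p) random sample of size n; T_n is its sum
samples <- matrix(rbinom(n * reps, size = 1, prob = p), nrow = n)
Tn <- colSums(samples)

# Relative frequency of each value 0..n versus the Binomial(n, p) pmf
empirical   <- as.vector(table(factor(Tn, levels = 0:n))) / reps
theoretical <- dbinom(0:n, size = n, prob = p)
round(cbind(value = 0:n, empirical, theoretical), 3)
```

The two columns should agree up to simulation noise, which shrinks as `reps` grows.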

Example 2

If $\{X_i\}_{i=1}^n$ is a random sample from $N(\mu, \sigma^2)$, then

$$T_n = \sum_{i=1}^n X_i \sim N(n\mu, n\sigma^2).$$

$$T_n = \underbrace{\frac{1}{n}\sum_{i=1}^n X_i}_{\bar{X}_n} \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$
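
A simulation sketch of both sampling distributions: the mean and variance of the simulated sums and sample means are compared with the theoretical values nµ, nσ², µ, and σ²/n. The values of `n`, `mu`, `sigma`, and `reps` are assumptions chosen for illustration.

```r
set.seed(42)
n <- 25; mu <- 2; sigma <- 3; reps <- 10000   # all values assumed

# Each column is one N(mu, sigma^2) random sample of size n
samples <- matrix(rnorm(n * reps, mean = mu, sd = sigma), nrow = n)
sums  <- colSums(samples)    # T_n = sum of the X_i
means <- colMeans(samples)   # T_n = sample mean

c(mean(sums),  n * mu)        # simulated vs. theoretical mean of the sum
c(var(sums),   n * sigma^2)   # simulated vs. theoretical variance of the sum
c(mean(means), mu)            # simulated vs. theoretical mean of the sample mean
c(var(means),  sigma^2 / n)   # simulated vs. theoretical variance of the sample mean
```

Averaging shrinks the variance by a factor of n, while summing inflates it by the same factor.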

Example 3

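Let $\{X_i\}_{i=1}^n \overset{i.i.d.}{\sim} N(\mu, \sigma^2)$, with

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i, \qquad S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2.$$

Then it can be shown that $\bar{X}_n \perp S_n^2$, and

$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1), \qquad \frac{(n-1)S_n^2}{\sigma^2} \sim \chi^2(n-1), \qquad \frac{\bar{X}_n - \mu}{S_n/\sqrt{n}} \sim t(n-1)$$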
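
A simulation sketch of these three results. It checks a few implications rather than the full distributions: the correlation between the sample mean and sample variance should be near zero (consistent with, though not proof of, independence), the scaled variance should average about n − 1, and the 5% two-sided tail probabilities should come out near 0.05. The values of `n`, `mu`, `sigma`, and `reps` are assumptions.

```r
set.seed(7)
n <- 5; mu <- 0; sigma <- 2; reps <- 20000    # all values assumed

samples <- matrix(rnorm(n * reps, mean = mu, sd = sigma), nrow = n)
xbar <- colMeans(samples)
s2   <- apply(samples, 2, var)

z     <- (xbar - mu) / (sigma / sqrt(n))      # should behave like N(0, 1)
chisq <- (n - 1) * s2 / sigma^2               # should behave like chi-square(n - 1)
tstat <- (xbar - mu) / (sqrt(s2) / sqrt(n))   # should behave like t(n - 1)

cor(xbar, s2)                                 # near 0
mean(chisq)                                   # near n - 1, the chi-square mean
mean(abs(z)     > qnorm(0.975))               # near 0.05
mean(abs(tstat) > qt(0.975, df = n - 1))      # near 0.05
```

With a small n such as 5, using normal critical values for `tstat` would reject too often, which is why the t(n − 1) quantile is used in the last line.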

Example 3

Theorem (Daly’s Theorem)

Let $\{X_i\}_{i=1}^n \overset{i.i.d.}{\sim} N(\mu, \sigma^2)$, and $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Suppose that $g(X_1, X_2, \ldots, X_n)$ is translation invariant, that is,

$$g(X_1 + c, X_2 + c, \ldots, X_n + c) = g(X_1, X_2, \ldots, X_n) \quad \text{for all constants } c.$$

Then $\bar{X}_n$ and $g(X_1, X_2, \ldots, X_n)$ are independent.

Proof: omitted here.
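
The sample variance is translation invariant, so the theorem covers the independence of the sample mean and sample variance used in Example 3. A minimal numerical check of translation invariance, where `x` holds the first five exam scores and the shift `cc` is an arbitrary assumed constant:

```r
x  <- c(69, 5, 66, 88, 73)   # first five exam scores, for illustration
cc <- 10                     # an arbitrary shift (assumed)

# Translation-invariant statistics: unchanged when every X_i is shifted by cc
all.equal(var(x + cc), var(x))                    # sample variance
all.equal(diff(range(x + cc)), diff(range(x)))    # sample range (max - min)

# The sample mean is not translation invariant: it moves by exactly cc
mean(x + cc) - mean(x)
```

Any statistic built only from the deviations X_i − X̄_n, or from differences between observations, is translation invariant for the same reason.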

Section 4 Biased Samples

Biased Samples

Ideally, we would like our data to be a random sample from the target population. In practice, samples can be tainted by a variety of biases.

Two typical biases:

Selection bias

Survivor bias

Reading: Gary Smith (2014) ‘Garbage In, Gospel Out’ in Standard Deviations: Flawed Assumptions, Tortured Data, and

Selection Bias

Definition

Selection bias occurs when the results are distorted because the sample systematically excludes or under-represents some elements of the population.

When people choose for themselves whether to be in the sample, the selection bias is also known as self-selection bias.

We should be careful making comparisons to people who made different choices.
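
A simulation sketch of how selection distorts a sample mean. Everything here is hypothetical: the population of outcomes `y`, and the selection rule under which units with larger outcomes are more likely to end up in the sample.

```r
set.seed(2019)
N <- 100000
y <- rnorm(N, mean = 50, sd = 10)     # outcome in the full population (assumed)

# Assumed selection rule: units with larger y are more likely to be sampled
p.select <- plogis((y - 50) / 10)
selected <- runif(N) < p.select

mean(y)             # population mean, about 50
mean(y[selected])   # mean of the selected units, noticeably above 50
```

The selected units are not a random sample, so their mean does not estimate the population mean no matter how many of them there are.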

Self-Selection Bias Example 1

Scott Geller, a psychology professor at Virginia Tech, studied drinking in three bars near campus. He found that a drinker consumes more than twice as much beer if it comes in a pitcher than in a glass or bottle.

Self-Selection Bias Example 2

A study found that Harvard freshmen who had not taken SAT preparation courses scored an average of 63 points higher on the SAT than did Harvard freshmen who had taken such courses.

Harvard’s admissions director said that this study suggested that SAT preparation courses are ineffective.

Survivor Bias

Definition

Survivor bias occurs when we choose a sample from a current population to draw inferences about a past population: we leave out members of the past population who are not in the current population, and look only at the survivors.

Prospective study vs. Retrospective study
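
A simulation sketch of survivor bias. The setting is hypothetical: a past population of firms with returns `ret`, and an assumed rule that firms with a return below -10% drop out before the data are collected.

```r
set.seed(2019)
N <- 100000
ret <- rnorm(N, mean = 0.05, sd = 0.20)   # returns of all firms in the past population (assumed)

# Assumed survival rule: firms with a return below -10% drop out
survived <- ret >= -0.10

mean(ret)             # average over the past population, about 0.05
mean(ret[survived])   # average over survivors only, biased upward
mean(survived)        # share of the past population still observed
```

A retrospective look at the survivors overstates the average return of the original population, because the worst performers are no longer there to be counted.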

Survivor Bias

Example 1: Which Places Need Protection?

In World War II, the British Royal Air Force (RAF) planned to attach heavy plating to its airplanes to protect them from German fighter planes and land-based antiaircraft guns. The protective plates weighed too much to cover an entire plane, so the RAF collected data on the location of bullet and shrapnel holes on planes that returned from bombing runs.

Survivor Bias

Most holes were on the wings and rear of the plane, with very few on the cockpit, engines, or fuel tanks.

Survivor Bias

Abraham Wald recognized that these data suffered from survivor bias.

During World War II, Wald was a member of the Statistical Research Group (SRG) at Columbia University, where he applied his statistical skills to various wartime problems.

Survivor Bias

Example 2: Success Secrets

In writing his bestselling book Good to Great, Jim Collins and his research team spent five years looking at the forty-year history of 1,435 companies and identified 11 stocks that clobbered the average stock.

After scrutinizing these eleven great companies, Collins identified several common characteristics and attached catchy names to each, like Level 5 Leadership – leaders who are personally humble, but professionally driven to make their company great.