## Statistics

Random Samples, Statistics and Sampling Distributions

Shiu-Sheng Chen

Department of Economics, National Taiwan University

Fall 2019

## Section 1

## Random Samples and Descriptive Statistics

Random Samples

Definition (Random Samples)

A random sample of size $n$, $\{X_i\}_{i=1}^{n} = \{X_1, X_2, \ldots, X_n\}$, is a set of i.i.d. random variables.

Random samples are also called I.I.D. samples.

Notation

$$\{X_i\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} (\mu, \sigma^2)$$

Properties

$$E(X_1) = E(X_2) = \cdots = E(X_n) = \mu$$
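The property above can be checked by simulation. The following is a minimal sketch, assuming a normal population and hypothetical values for `mu`, `sigma`, `n`, and `reps` (the notation itself fixes only the mean and variance, not the distribution). Each row of `samples` is one random sample of size `n`, so column `i` collects repeated draws of $X_i$.

```r
set.seed(1)

# Hypothetical population parameters and simulation sizes (assumptions)
mu <- 70
sigma <- 15
n <- 5          # sample size
reps <- 20000   # number of simulated random samples

# Each row is one i.i.d. sample (X_1, ..., X_n)
samples <- matrix(rnorm(reps * n, mean = mu, sd = sigma), nrow = reps, ncol = n)

# Average of X_i across the simulated samples, for i = 1, ..., n;
# every entry should be close to mu
colMeans(samples)
```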

Descriptive Statistics

Frequency Distribution Table

Empirical Density Function (Histogram)

Empirical Distribution Function

Statistics and Sampling Distribution

Example: Statistics Midterm Exam

1st Midterm Exam Scores of 167 students in 2018

69 5 66 88 73 96 88 92 67 79 74 72 73 63 66 73 60 78 50 86 64 69 40 59 71 32 74 72 87 83 71 87 90 79 57 84 67 78 71 80 51 70 56 99 61 31 46 96 87 73 72 81 72 84 77 75 38 91 82 15 69 75 49 62 13 58 74 79 44 72 84 70 68 37 57 61 43 71 71 36 48 36 35 65 83 69 63 59 46 79 58 82 81 68 50 88 35 55 80 71 59 76 87 71 50 65 76 29 37 68 40 72 47 39 84 58 49 43 83 55 44 73 54 53 56 54 59 79 61 98 69 84 82 74 59 85 64 70 85 78 84 78 63 59 85 57 25

R Code

**R Example (Data Loading and Frequency Distribution Table)**

```r
## Load the data
dat = read.csv('2018Midterm1.csv.csv', header=TRUE)
Midterm = dat$Midterm

## Build the frequency distribution table
breaks = seq(0, 100, by=5)
Midterm.cut = cut(Midterm, breaks, right=FALSE)  # left-closed bins [a, b)
Midterm.freq = table(Midterm.cut)
Midterm.freq
```

Frequency Distribution Table

```
Midterm.cut
   [0,5)   [5,10)  [10,15)  [15,20)  [20,25)  [25,30)  [30,35)
       0        1        1        1        0        2        2
 [35,40)  [40,45)  [45,50)  [50,55)  [55,60)  [60,65)  [65,70)
       8        6        7        7       17       11       16
 [70,75)  [75,80)  [80,85)  [85,90)  [90,95) [95,100)
      27       17       18       13        4        6
```

Empirical Density Function

Histogram

Relative frequency distribution
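A relative frequency distribution divides each bin count by the sample size. The sketch below is illustrative only: `scores` holds just the first 20 scores from the list above, and the bin width of 20 is an assumption chosen to keep the output short. It also checks that, on the density scale, the histogram bars have total area one.

```r
# First 20 midterm scores from the list above (a subset, for illustration)
scores <- c(69, 5, 66, 88, 73, 96, 88, 92, 67, 79,
            74, 72, 73, 63, 66, 73, 60, 78, 50, 86)

# Hypothetical bin width of 20 points
breaks <- seq(0, 100, by = 20)

# Counts per bin, then relative frequencies (counts divided by n)
freq <- table(cut(scores, breaks, right = FALSE))
rel.freq <- prop.table(freq)
rel.freq

# Histogram object without plotting; h$density is relative frequency / bin width
h <- hist(scores, breaks = breaks, right = FALSE, plot = FALSE)

# Total area of the bars on the density scale
sum(h$density * diff(h$breaks))
```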

R Code

**R Example (Histogram)**

```r
## Draw the histogram
hist(Midterm, breaks=10, right=FALSE, xlab='Midterm', main='Histogram of Midterm')
```

[Figure: Histogram of Midterm. Horizontal axis: Midterm (0 to 100); vertical axis: Frequency (0 to 40).]

Empirical Distribution Function

Definition (Empirical Distribution Function)

Given a random sample $\{X_i\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} F_X(x)$, the empirical distribution function (EDF) is defined as

$$\hat{F}_n(x) = \frac{\text{number of elements in the sample} \le x}{n} = \frac{1}{n}\sum_{i=1}^{n} I_{\{X_i \le x\}}$$
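The definition translates directly into code: the mean of the indicator `sample <= t` is the fraction of observations not exceeding `t`. The sketch below uses only the first 10 scores from the list above, and the helper name `edf` is hypothetical. It compares the hand-written version with R's built-in `ecdf()`.

```r
# First 10 midterm scores from the list above (a subset, for illustration)
x <- c(69, 5, 66, 88, 73, 96, 88, 92, 67, 79)

# EDF from the definition: (1/n) * sum of indicators I{X_i <= t}
# (edf is a hypothetical helper name)
edf <- function(t, sample) mean(sample <= t)

edf(70, x)   # 4 of the 10 scores are <= 70, so this is 0.4

# Built-in equivalent: ecdf() returns a function
Fn <- ecdf(x)
Fn(70)
```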

R Code

**R Example (Empirical Distribution Function)**

```r
## Plot the EDF
medf <- ecdf(Midterm)
plot(medf, main='EDF of Midterm')
```

[Figure: EDF of Midterm. Vertical axis: Fn(x), with ticks from 0.2 to 1.0.]

## Section 2

## Statistics

Statistics

Definition (Statistic)

Any function of the random sample is called a statistic:

$$T_n = T(X_1, X_2, \ldots, X_n).$$

A statistic does not contain unknown parameters.

The subscript n indicates the sample size.

Examples of Statistics

Sample mean:

$$\bar{X}_n = \frac{\sum_{i=1}^{n} X_i}{n}$$

Sample variance:

$$S_n^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)^2}{n - 1}$$

Sample r-th moment:

$$m_r = \frac{1}{n}\sum_{i=1}^{n} X_i^r$$

Sample covariance and correlation coefficient:

$$S_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X}_n)(Y_i - \bar{Y}_n)}{n - 1}, \qquad r_{XY} = \frac{S_{XY}}{S_X S_Y}$$
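Each of these statistics can be computed directly from its formula. The sketch below uses hypothetical paired data (the first ten scores from the midterm list, split arbitrarily into `x` and `y`) and checks the hand-computed values against R's built-in `mean()`, `var()`, `cov()`, and `cor()`.

```r
# Hypothetical paired data: first ten midterm scores, split into two variables
x <- c(69, 5, 66, 88, 73)
y <- c(96, 88, 92, 67, 79)
n <- length(x)

# Sample means
xbar <- sum(x) / n
ybar <- sum(y) / n

# Sample variances (divisor n - 1)
s2.x <- sum((x - xbar)^2) / (n - 1)
s2.y <- sum((y - ybar)^2) / (n - 1)

# Sample r-th moment, here with r = 2 (divisor n)
r <- 2
m.r <- sum(x^r) / n

# Sample covariance and correlation coefficient
s.xy <- sum((x - xbar) * (y - ybar)) / (n - 1)
r.xy <- s.xy / (sqrt(s2.x) * sqrt(s2.y))

c(mean = xbar, var = s2.x, m2 = m.r, cov = s.xy, cor = r.xy)

# Agreement with the built-in functions
all.equal(c(xbar, s2.x, s.xy, r.xy),
          c(mean(x), var(x), cov(x, y), cor(x, y)))
```

Note that the sample variance and covariance divide by $n - 1$, while the sample moment divides by $n$.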

Sampling Distributions

Definition (Sampling Distribution)

Let the random variable $T_n = T(X_1, X_2, \ldots, X_n)$ be a function of a random sample. The distribution of $T_n$ is called its sampling distribution.

Example 1

If $\{X_i\}_{i=1}^{n}$ is a random sample from Bernoulli($p$), then

$$T_n = \sum_{i=1}^{n} X_i \sim \text{Binomial}(n, p).$$

That is, the Binomial distribution is the sampling distribution of $T_n$, which is a function of the Bernoulli random sample $\{X_i\}_{i=1}^{n}$.
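A sampling distribution can be approximated by drawing many samples and computing the statistic on each. The following is a rough simulation sketch of Example 1; the values of `n`, `p`, and `reps` are assumptions chosen for illustration.

```r
set.seed(1)

# Hypothetical settings (assumptions)
n <- 10        # sample size
p <- 0.3       # Bernoulli success probability
reps <- 50000  # number of simulated samples

# Each row is one Bernoulli(p) random sample of size n
samples <- matrix(rbinom(reps * n, size = 1, prob = p), nrow = reps, ncol = n)

# The statistic T_n = sum of the sample, computed once per simulated sample
Tn <- rowSums(samples)

# Simulated sampling distribution of T_n versus the Binomial(n, p) pmf
sim <- as.vector(table(factor(Tn, levels = 0:n))) / reps
theo <- dbinom(0:n, size = n, prob = p)
round(cbind(k = 0:n, simulated = sim, binomial = theo), 3)
```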

Example 2

If $\{X_i\}_{i=1}^{n}$ is a random sample from $N(\mu, \sigma^2)$, then

$$T_n = \sum_{i=1}^{n} X_i \sim N(n\mu, n\sigma^2).$$

$$T_n = \underbrace{\frac{1}{n}\sum_{i=1}^{n} X_i}_{\bar{X}_n} \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$
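The parameters in Example 2 follow from linearity of expectation and from independence. The steps below are a short sketch; they rely on the standard fact, not stated in the slides, that a sum of independent normal random variables is again normal.

$$
\begin{aligned}
E\left(\sum_{i=1}^{n} X_i\right) &= \sum_{i=1}^{n} E(X_i) = n\mu, \\
Var\left(\sum_{i=1}^{n} X_i\right) &= \sum_{i=1}^{n} Var(X_i) = n\sigma^2 \quad \text{(by independence)}, \\
E(\bar{X}_n) &= \frac{1}{n} \cdot n\mu = \mu, \\
Var(\bar{X}_n) &= \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}.
\end{aligned}
$$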

Example 3

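Let $\{X_i\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$,

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S_n^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2.$$

Then it can be shown that $\bar{X}_n \perp S_n^2$, and

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \sim N(0, 1), \qquad \frac{(n - 1) S_n^2}{\sigma^2} \sim \chi^2(n - 1), \qquad \frac{\bar{X}_n - \mu}{S_n / \sqrt{n}} \sim t(n - 1).$$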
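These three results can be checked numerically. The simulation below is a sketch under assumed values of `mu`, `sigma`, `n`, and `reps`. It compares simulated quantiles with the theoretical ones, and uses the sample correlation between $\bar{X}_n$ and $S_n^2$ as a rough indication of their independence. A correlation near zero is consistent with independence but does not prove it.

```r
set.seed(1)

# Hypothetical settings (assumptions)
mu <- 0
sigma <- 2
n <- 5
reps <- 20000

# Each row is one N(mu, sigma^2) random sample of size n
samples <- matrix(rnorm(reps * n, mean = mu, sd = sigma), nrow = reps, ncol = n)

xbar <- rowMeans(samples)        # sample mean of each sample
s2 <- apply(samples, 1, var)     # sample variance of each sample

z <- (xbar - mu) / (sigma / sqrt(n))        # should be N(0, 1)
chi <- (n - 1) * s2 / sigma^2               # should be chi-square(n - 1)
tstat <- (xbar - mu) / (sqrt(s2) / sqrt(n)) # should be t(n - 1)

# Simulated versus theoretical quantiles
probs <- c(0.1, 0.5, 0.9)
rbind(z.sim = quantile(z, probs), z.theory = qnorm(probs))
rbind(chi.sim = quantile(chi, probs), chi.theory = qchisq(probs, df = n - 1))
rbind(t.sim = quantile(tstat, probs), t.theory = qt(probs, df = n - 1))

# Sample correlation between the sample mean and the sample variance
cor(xbar, s2)
```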

Example 3

Theorem (Daly’s Theorem)

*Let* $\{X_i\}_{i=1}^{n} \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$ *and* $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$*. Suppose that* $g(X_1, X_2, \ldots, X_n)$ *is translation invariant, that is,*

$$g(X_1 + c, X_2 + c, \ldots, X_n + c) = g(X_1, X_2, \ldots, X_n) \quad \text{for all constants } c.$$

*Then* $\bar{X}_n$ *and* $g(X_1, X_2, \ldots, X_n)$ *are independent.*

Proof: omitted here.
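As a sketch of how the theorem applies to Example 3, take $g$ to be the sample variance $S_n^2$ (this choice of $g$ is made here for illustration). Adding a constant $c$ to every observation also shifts the sample mean by $c$, so the deviations from the mean are unchanged:

$$
\frac{1}{n - 1}\sum_{i=1}^{n} \left[(X_i + c) - (\bar{X}_n + c)\right]^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (X_i - \bar{X}_n)^2 = S_n^2.
$$

Hence $S_n^2$ is translation invariant, and the theorem gives $\bar{X}_n \perp S_n^2$.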

## Section 4

## Biased Samples

Biased Samples

Ideally, we would like our data to be a random sample from the target population. In practice, samples can be tainted by a variety of biases.

Two typical biases:

Selection bias

Survivor bias

Reading: Gary Smith (2014), 'Garbage In, Gospel Out', in *Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics*

Selection Bias

Definition

Selection bias occurs when the results are distorted because the sample systematically excludes or under-represents some elements of the population.

When people choose for themselves whether to be in the sample, the selection bias is also known as self-selection bias.

We should be careful when comparing them with people who made different choices.

Self-Selection Bias Example 1

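Scott Geller, a psychology professor at Virginia Tech, studied drinking in three bars near campus. He found that a drinker consumes more than twice as much beer if it comes in a pitcher than in a glass or bottle. Because drinkers choose for themselves whether to order a pitcher, those who order one may simply be the ones who already intended to drink more.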

Self-Selection Bias Example 2

A study found that Harvard freshmen who had not taken SAT preparation courses scored an average of 63 points higher on the SAT than did Harvard freshmen who had taken such courses.

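Harvard's admissions director said that this study suggested that SAT preparation courses are ineffective. That conclusion ignores self-selection: students who choose to take such courses may be those who expected to score lower without them.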
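A small simulation shows how self-selection alone can produce this kind of gap. Everything in the sketch below is hypothetical (the score scale, the selection rule, and the course effect) and is not an attempt to reproduce the study's figures. The course is given a positive effect by construction, yet course takers still score lower on average, because weaker students are more likely to enrol.

```r
set.seed(1)

n <- 10000

# Hypothetical score each student would get without any course
baseline <- rnorm(n, mean = 1400, sd = 60)

# Self-selection (assumed rule): the lower the baseline, the more likely
# the student is to take a preparation course
p.course <- plogis(-(baseline - 1400) / 30)
course <- rbinom(n, size = 1, prob = p.course) == 1

# Hypothetical true effect of the course: +20 points
effect <- 20
score <- baseline + effect * course

# Naive comparison: mean score of non-takers minus mean score of takers
gap <- mean(score[!course]) - mean(score[course])
gap
```

The naive gap is positive even though the simulated course raises every taker's score, so the comparison says nothing about whether the course works.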

Survivor Bias

Definition

Survivor bias arises when we choose a sample from a current population to draw inferences about a past population: we leave out members of the past population who are not in the current population. We look at only the survivors.

Prospective study vs. Retrospective study

Example 1: Which Places Need Protection?

In World War II, the British Royal Air Force (RAF) planned to attach heavy plating to its airplanes to protect them from German fighter planes and land-based antiaircraft guns. The protective plates weighed too much to cover an entire plane, so the RAF collected data on the location of bullet and shrapnel holes on planes that returned from bombing runs.

Most holes were on the wings and rear of the plane, with very few on the cockpit, engines, or fuel tanks.

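Abraham Wald recognized that these data suffered from survivor bias: they covered only the planes that returned, so the scarcity of holes on the cockpit, engines, and fuel tanks suggested that planes hit there rarely made it back.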

During World War II, Wald was a member of the Statistical Research Group (SRG) at Columbia University, where he applied his statistical skills to various wartime problems.
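The mechanism can be illustrated with a simulation. The sketch below is entirely hypothetical: each plane takes one hit, hits are spread evenly over five sections, and the loss probabilities in `p.loss` are invented for illustration. Although every section is hit equally often, the planes that return show few holes in the sections where a hit is most often fatal.

```r
set.seed(1)

sections <- c("wings", "rear", "cockpit", "engines", "fuel tanks")

# Hypothetical probability that a plane is lost when hit in each section
p.loss <- c("wings" = 0.05, "rear" = 0.05, "cockpit" = 0.60,
            "engines" = 0.70, "fuel tanks" = 0.70)

n.planes <- 20000

# Assumption: one hit per plane, equally likely in every section
hit <- sample(sections, n.planes, replace = TRUE)

# Whether each plane is lost depends on where it was hit
lost <- rbinom(n.planes, size = 1, prob = p.loss[hit]) == 1

# Share of hits by section among ALL planes (roughly equal)
all.planes <- prop.table(table(factor(hit, levels = sections)))

# Share of hits by section among RETURNING planes only (what was observed)
observed <- prop.table(table(factor(hit[!lost], levels = sections)))

round(rbind(all.planes, observed), 2)
```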

Example 2: Success Secrets

In writing his bestselling book *Good to Great*, Jim Collins and his research team spent five years looking at the forty-year history of 1,435 companies and identified eleven stocks that clobbered the average stock.

After scrutinizing these eleven great companies, Collins identified several common characteristics and attached catchy names to each, like Level 5 Leadership – leaders who are personally humble, but professionally driven to make their company great.