
(1)

Statistical Computing

Hung Chen

hchen@math.ntu.edu.tw
Department of Mathematics

National Taiwan University

18th February 2004

Meets at NW 405 on Wednesdays from 9:10 to 12.

(2)

Course Overview

• Monte Carlo methods for statistical inference
1. Pseudo-random deviate
2. Non-uniform variate generation
3. Variance reduction methods
4. Jackknife and Bootstrap
5. Gibbs sampling and MCMC

• Data partitioning and resampling (bootstrap)
1. Simulation Methodology
2. Sampling and Permutations (Bootstrap and permutation methods)

• Numerical methods in statistics

1. Numerical linear algebra and linear regressions

(3)

2. Application to regression and nonparametric regression
3. Integration and approximations
4. Optimization and root finding
5. Multivariate analysis such as principal component analysis

• Graphical methods in computational statistics

• Exploring data density and structure

• Statistical models and data fitting

• Computing Environment: Statistical software R (“GNU’s S”)

– http://www.R-project.org/

– Input/Output

(4)

– Structured Programming

– Interface with other systems

• Prerequisite:

– Knowledge of regression analysis or multivariate analysis

– Mathematical statistics/probability theory; statistics with formula derivation

– Knowledge of statistical software and experience with programming languages such as Fortran, C, and Pascal.

• Text Books:

The course materials will be drawn from the following recommended resources (some are available via the Internet) and others that will be made available through handouts.

(5)

– Gentle, J.E. (2002) Elements of Computational Statistics. Springer.

– Hastie, T., Tibshirani, R., and Friedman, J.H. (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.

– Lange, K. (1999) Numerical Analysis for Statisticians. Springer-Verlag, New York.

– Venables, W.N. and Ripley, B.D. (2002) Modern Applied Statistics with S, 4th edition. Springer-Verlag, New York.

Refer to http://geostasto.eco.uniroma1.it/utenti/liseo/dispense R.pdf for a note on this book.

– Robert, C.P. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer-Verlag.

(6)

– Stewart, G.W. (1996). Afternotes on Numerical Analysis. SIAM, Philadelphia.

– An Introduction to R by William N. Venables, David M. Smith

(http://www.ats.ucla.edu/stat/books/#DownloadableBooks)

• Grading: HW 50%, Project 50%

(7)

Outline

• Statistical Learning

– Handwritten Digit Recognition
– Prostate Cancer
– DNA Expression Microarrays
– Language

• IRIS Data

– Linear discriminant analysis
– R programming
– Logistic regression
– Risk minimization

– Discriminant Analysis

• Optimization in Inference

(8)

– Estimation by Maximum Likelihood
– Optimization in R

– EM Methods

∗ Algorithm

∗ Genetics Example (Dempster, Laird, and Rubin, 1977)

– Normal Mixtures
– Issues

• EM Methods: Motivating example
– ABO blood groups
– EM algorithm
– Nonlinear Regression Models
– Robust Statistics

(9)

• Probability statements in Statistical Inference

• Computing Environment: Statistical software R (“GNU’s S”)

– http://www.R-project.org/

– Input/Output

– Structured Programming

– Interface with other systems

(10)

Introduction: Statistical Computing in Practice

Computationally intensive methods have become widely used both for statistical inference and for exploratory analysis of data.

The methods of computational statistics involve

• resampling, partitioning, and multiple transformations of a data set

• the use of randomly generated artificial data

• function approximation

Implementation of these methods often requires advanced techniques in numerical analysis, so there is a close connection between computational statistics and statistical computing.

(11)

In this course, we first address some areas of application of computationally intensive methods, such as

• density estimation

• identification of structure in data

• model building

• optimization

(12)

Handwritten Digit Recognition

In order to devise an automatic sorting procedure for envelopes, we consider the problem of recognizing the handwritten ZIP codes on envelopes from U.S. postal mail.

This is a so-called pattern recognition problem in the literature.

• Each image is a segment from a five digit ZIP code, isolating a single digit.

• The images are 16×16 eight-bit grayscale maps, with each pixel ranging in intensity from 0 to 255.

• The images have been normalized to have approximately the same size and orientation.

• The task is to predict, from the 16×16 matrix of pixel intensities, the identity of each image (0, 1, . . . , 9) quickly and accurately.

(13)

The dimensionality of x is 256.

Abstraction:

• Consider the space X as matrices with entries in the interval [0, 1], each entry representing a pixel in a certain grey scale of a photo of the handwritten letter, or some features extracted from the letters.

• Let Y be the set

Y = { y ∈ R^10 : y = Σ_{i=1}^{10} p_i e_i with Σ_{i=1}^{10} p_i = 1 }.

Here e_i is the ith coordinate vector in R^10 (each coordinate corresponding to a letter).

(14)

• If we only consider the set of points y with 0 ≤ p_i ≤ 1 for i = 1, . . . , 10, one can interpret y in terms of a probability measure on the set of digit classes {0, 1, . . . , 9}.

• The problem is to learn the ideal function f : X → Y which associates, to a given handwritten digit x, the point (Prob(x = 0), Prob(x = 1), . . . , Prob(x = 9)).

• Learning f means finding a sufficiently good approximation of f within a given prescribed class.

– For a two-class problem, think of logistic regression in survival analysis.

– Fisher discriminant analysis and SVM

Further mathematical abstraction:

• Consider a measure ρ on X × Y where Y is the label set and X is the set of handwritten letters.

(15)

• The pairs (xi, yi) are randomly drawn from X × Y according to the measure ρ.

• yi is a sample for a given xi.

• The function to be learned is the regression function f_ρ.

• f_ρ(x) is the average of the y values over {x} × Y.

• Difficulty: the probability measure ρ governing the sampling is not known in advance.

How do we learn f ?

(16)

Prostate Cancer

To identify risk factors for prostate cancer, Stamey et al. (1989) examined the correlation between the level of prostate specific antigen (PSA) and a number of clinical and demographic measures in 97 men who were about to receive a radical prostatectomy.

• The goal is to predict the log of PSA (lpsa) from a number of measurements including log cancer volume (lcavol), log prostate weight (lweight), age, log of benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45).

• This is a supervised learning problem, known as a regression problem, because the outcome measurement is quantitative.

(17)

Let Y denote lpsa and X be those explanatory vari- ables.

• We have data in the form of (x1, y1), . . . , (xn, yn).

• Use squared error loss as the criterion for choosing the best prediction function, i.e.,

min_θ E[Y − θ(X)]^2.

• What is the θ(·) which minimizes the above least-squares error in the population version?

– Find c to minimize E(Y − c)^2.

– For every x ∈ X, let E(Y | X = x) be the conditional expectation.

(18)

– Regression function: θ(x) = E(Y | X = x)
– Write Y as the sum of θ(X) and Y − θ(X).

∗ Conditional expectation: E[Y − θ(X) | X] = 0
∗ Conditional variance: Var(Y | X) = E[(Y − θ(X))^2 | X]

• ANOVA decomposition:

Var(Y) = E[Var(Y | X)] + Var[E(Y | X)]

– If we use E(Y) to minimize the prediction error E(Y − c)^2, its prediction error is Var(Y).

– If E(Y | X) is not a constant, the prediction error of θ(x) is smaller. (A small simulation check follows below.)
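As a quick illustration of the decomposition, the following R sketch (a minimal Monte Carlo check under a hypothetical model, Y = sin(2πX) + noise with X uniform, not a model from the text) estimates the three quantities:

# Monte Carlo check of Var(Y) = E[Var(Y|X)] + Var[E(Y|X)]
# hypothetical model: X ~ Unif(0,1), Y = sin(2*pi*X) + N(0, 0.5^2)
set.seed(1)
n <- 1e6
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.5)
var(y)                        # total variance of Y
0.5^2                         # E[Var(Y | X)]: the conditional variance is constant here
var(sin(2 * pi * x))          # estimate of Var[E(Y | X)]
0.5^2 + var(sin(2 * pi * x))  # should be close to var(y)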

• Nonparametric regression: No functional form is assumed for θ(·)

(19)

– Suppose that θ(x) is of the form Σ_i w_i φ_i(x).

∗ What is {φ_i(x); i = 1, 2, . . .}, and how do we determine {w_i; i = 1, 2, . . .} with a finite number of data?
∗ Estimate θ(x) by a k-nearest-neighbor method such as

Ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i,

where N_k(x) is the neighborhood of x defined by the k closest points x_i in the training sample.

How do we choose k and N_k(x)? (A small R sketch of this estimator follows below.)
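A minimal R sketch of the k-nearest-neighbor estimator above (one-dimensional, hypothetical simulated training data; knn_reg is an illustrative helper, not a library function):

# k-nearest-neighbor regression estimate of theta(x0)
knn_reg <- function(x0, x, y, k) {
  idx <- order(abs(x - x0))[1:k]   # indices of the k closest training points
  mean(y[idx])                     # average of their y values
}
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
knn_reg(0.5, x, y, k = 10)         # compare with the true value sin(pi) = 0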

• Empirical error:

– How do we measure the theoretical error E[Y − f(X)]^2?

∗ Consider n examples, {(x_1, y_1), . . . , (x_n, y_n)}, independently drawn according to ρ.

(20)

∗ Define the empirical error of f to be

E_n(f) = (1/n) Σ_{i=1}^{n} (y_i − f(x_i))^2.

What is the empirical cdf?
∗ Is E_n(f) close to E[Y − f(X)]^2?
∗ Think of the following problem:
Is the sample variance a consistent estimate of the population variance?
Or, does the following hold in some sense:

(1/n) Σ_i (y_i − ȳ)^2 → E[Y − E(Y)]^2.

(21)

∗ Can we claim that the minimizer f̂ of min_{f ∈ F} E_n(f) is close to the minimizer of min_{f ∈ F} E[Y − f(X)]^2? (A small numerical illustration of the first question follows below.)
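As a small numerical illustration of the first question, the sketch below uses a hypothetical model Y = 2X + noise with f held fixed, so the true error E[Y − f(X)]^2 equals the noise variance 1 and the empirical error should approach it as n grows:

# Empirical error E_n(f) versus the theoretical error E[Y - f(X)]^2
set.seed(2)
f <- function(x) 2 * x                     # a fixed candidate function
for (n in c(100, 10000, 1000000)) {
  x <- runif(n)
  y <- 2 * x + rnorm(n)                    # true error of f is Var(noise) = 1
  cat("n =", n, " empirical error =", mean((y - f(x))^2), "\n")
}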

• Consider the problem of email spam.

– Consider the data which consists of information from 4601 email messages.

– The objective was to design an automatic spam detector that could filter out spam before clogging the users’ mailboxes.

– For all 4601 email messages, the true outcome (email type), email or spam, is available, along with the relative frequencies of 57 of the most commonly occurring words and punctuation marks in the email message.

– This is a classification problem.

(22)

DNA Expression Microarrays

DNA is the basic material that makes up human chromosomes.

• The regulation of gene expression in a cell begins at the level of transcription of DNA into mRNA.

• Although subsequent processes such as differential degradation of mRNA in the cytoplasm and differential translation also regulate the expression of genes, it is of great interest to estimate the relative quantities of mRNA species in populations of cells.

• The circumstances under which a particular gene is up- or down-regulated provide important clues about gene function.

• The simultaneous expression profiles of many genes can provide additional insights into physiological processes or disease etiology that is mediated by the coordinated action of sets of genes.

(23)

Spotted cDNA microarrays (Brown and Botstein 1999) are emerging as a powerful and cost-effective tool for large scale analysis of gene expression.

• In the first step of the technique, samples of DNA clones with known sequence content are spotted and immobilized onto a glass slide or other substrate, the microarray.

• Next, pools of mRNA from the cell populations under study are purified, reverse-transcribed into cDNA, and labeled with one of two fluorescent dyes, which we will refer to as "red" and "green."

(24)

• Two pools of differentially labeled cDNA are combined and applied to a microarray.

• Labeled cDNA in the pool hybridizes to complementary sequences on the array, and any unhybridized cDNA is washed off.

The result is a few thousand numbers, typically ranging from say −6 to 6, measuring the expression level for each gene in the target relative to the reference sample. As an example, we have a data set with 64 samples (columns) and 6830 genes (rows). The challenge is to understand how the genes and samples are organized. Typical questions are as follows:

• Which samples are most similar to each other, in terms of their expression profiles across genes?

(25)

– Think of the samples as points in 6830-dimensional space, which we want to cluster together in some way.

• Which genes are most similar to each other, in terms of their expression profile across samples?

• Do certain genes show very high (or low) expression for certain cancer samples?

– Feature selection problem.

(26)

Statistical Language

The examples we just described have several components in common.

• For each there is a set of variables that might be denoted as inputs, which are measured or present.

These have some influence on one or more outputs.

• For each example, the goal is to use the inputs to predict the values of the outputs. In machine learning language, this exercise is called supervised learning.

• In the statistical literature the inputs are often called the predictors or, more classically, the independent variables. The outputs are called responses, or classically the dependent variables.

(27)

• The output can be a quantitative measurement, where some measurements are bigger than others, and measurements close in value are close in nature.

– EX 1. Consider the famous Iris discrimination example (Fisher, 1936). In this data set, there are 150 cases with 50 cases per class. The output is qualitative (species of Iris) and assumes values in a finite set G = {Virginica, Setosa, Versicolor}.

There are four predictors: sepal length, sepal width, petal length, and petal width.

– EX 2. In the handwritten digit example, the output is one of 10 different digit classes.

– In EX 1 and EX 2, there is no explicit ordering in the classes, and in fact often descriptive labels rather than numbers are used to denote the classes.

(28)

Qualitative variables are often referred to as categorical or discrete variables, as well as factors.

– EX 3. Given specific atmospheric measurements today and yesterday, we want to predict the ozone level tomorrow.

Given the grayscale values for the pixels of the digitized image of a handwritten digit, we want to predict its class label.

– For all three examples, we think of using the inputs to predict the output. The distinction in output type has led to a naming convention for the prediction tasks: regression when we predict quantitative outputs, and classification when we predict qualitative outputs.

(29)

– Both regression and classification can be viewed as tasks in function approximation.

– Qualitative variables are typically represented numerically by codes.

• A third variable type is ordered categorical, such as small, medium, and large, where there is an ordering between the values, but no metric notion is appropriate (the difference between medium and small need not be the same as that between large and medium).

(30)

IRIS Data

First applied in 1935 by M. Barnard at the suggestion of R.A. Fisher (1936).

• Fisher linear discriminant analysis (FLDA): It consists of

– Finding a linear combination x^T a of x = (x_1, . . . , x_p) to maximize the ratio of the "between group" and "within group" variances.

– x^T a is called the discriminant variable.

– Predicting the class of an observation x as the class whose mean vector is closest to x in terms of the discriminant variables.

– Represent class k by (µ_k, Σ).

– Define B_0 = Σ_{k=1}^{3} (µ_k − µ̄)(µ_k − µ̄)^T.

(31)

– Identify the eigenvalues and eigenvectors of Σ^{−1} B_0, i.e., solve

max_a (a^T B_0 a) / (a^T Σ a).

– The problem of learning is that of choosing, from the given set of functions x^T a, the one which predicts the supervisor's response in the best possible way.

∗ How do we quantify this?

∗ How will the variability of â affect the prediction?

• R programming
– data(iris)
– attach(iris)
– help.search("discriminant analysis")

(32)

– Linear Discriminant Analysis: lda (a minimal sketch follows below)
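A minimal sketch of the commands above, using lda from the MASS package on the iris data (the objects shown, fit$scaling and pred$class, are standard components of lda and predict.lda):

# Fisher linear discriminant analysis on the iris data
library(MASS)
data(iris)
fit <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = iris)
fit$scaling                       # coefficients of the discriminant variables x^T a
pred <- predict(fit, iris)
table(iris$Species, pred$class)   # resubstitution confusion matrix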

• Logistic regression
– Model:

log [ P(y = 1 | x) / (1 − P(y = 1 | x)) ] = α + x^T β

– The coefficients α and β need to be estimated iteratively (writing down the likelihood and finding the MLE), using the "scoring method" or "iterative weighted least squares."

– This converts a classification problem to an estimation problem. (A short R sketch using glm follows below.)
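A short sketch using glm, which fits the logistic model by iterative weighted least squares; the two-class problem below (virginica versus the other iris species) is a hypothetical illustration, not one taken from the text:

# Logistic regression for a two-class problem derived from the iris data
data(iris)
iris$is.virginica <- as.numeric(iris$Species == "virginica")
fit <- glm(is.virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           family = binomial, data = iris)
summary(fit)          # estimates of alpha (intercept) and beta
head(fitted(fit))     # estimated P(y = 1 | x) for the first few observations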

• Problem of risk minimization

(33)

– The loss or discrepancy: L(y, f(x|α)).
Here y is the supervisor's response to a given input x and f(x|α) is the response provided by the learning machine.

– The risk functional: the expected value of the loss, or

R(α) = ∫ L(y, f(x|α)) dρ(x, y).

– Goal: Find the function which minimizes the risk functional R(α) over the class of functions f(x|α), α ∈ A, in the situation where the joint probability distribution is unknown and the only available information is contained in the training set.

– Pattern recognition

∗ The supervisor's output y takes on only two values {0, 1}.

(34)

∗ f(x|α), α ∈ A, are a set of indicator functions (functions which take on only the two values zero and one).

∗ The loss function:

L(y, f(x|α)) = 0 if y = f(x|α), and 1 if y ≠ f(x|α).

∗ For this loss function, the risk functional gives the probability of classification error (i.e., of the answers given by the supervisor and the answers given by the indicator function differing).

∗ The problem, therefore, is to find the function which minimizes the probability of classification error when the probability measure is unknown but the data are given.

(35)

• Formulation of Discriminant Analysis

– Objective: Distinguish two classes based on the observed covariates (and training data).

– Data: {(x_1, y_1), . . . , (x_ℓ, y_ℓ)}, y_i = 0 or 1.

– Goal: Make decision on Y based on X.

(x, y) has an unknown probability distribution ρ(x, y).

– The Bayes solution (minimum error rule):

P(Y = y | X) = P(X | Y = y) π(Y = y) / [ P(X | Y = 0) π(Y = 0) + P(X | Y = 1) π(Y = 1) ].

It assumes that the costs of the different types of misclassification are the same. (Costs may not be the same.)

– Practical problems: Prior? Distributions?

(36)

Role of Optimization in Inference

Many important classes of estimators are defined as the point at which some function that involves the parameter and the random variable achieves an optimum.

• Estimation by Maximum Likelihood:

– Intuitive concept

– Nice mathematical properties

– Given a sample y_1, . . . , y_n from a distribution with pdf or pmf p(y | θ), the MLE of θ is the value that maximizes the joint density or joint probability, viewed as a function of θ, at the observed sample values: ∏_i p(y_i | θ).

(37)

• Likelihood function

L_n(θ; y) = ∏_{i=1}^{n} p(y_i | θ).

The value of θ for which L_n achieves its maximum is the MLE of θ.

• Critique: How to determine the form of the function p(y | θ)?

• The data (that is, the realizations of the variables in the pdf or pmf) are considered fixed, and the parameters are considered the variables of the optimization problem,

max_θ L_n(θ; y).

• Questions:

(38)

– Does the solution exist?

– Existence of local optima of the objective function.

– Constraints on possible parameter values.

• MLE may not be a good estimation scheme.

• Penalized maximum likelihood estimation

• Optimization in R

– nlm is the general function for "nonlinear minimization."

∗ It can make use of a gradient and/or Hessian, and it can give an estimate of the Hessian at the minimum.

∗ This function carries out a minimization of the function f using a Newton-type algorithm.

(39)

∗ It needs starting parameter values for the minimization.

∗ See demo(nlm) for examples.

– Minimize function of a single variable

∗ optimize: searches the interval from 'lower' to 'upper' for a minimum or maximum of the function f with respect to its first argument.

∗ uniroot: searches the interval from 'lower' to 'upper' for a root of the function f with respect to its first argument.

– D and deriv do symbolic differentiation. (Short usage sketches of these tools follow below.)
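Short usage sketches of these tools; the functions being minimized, solved, and differentiated here are arbitrary illustrations:

# optimize: minimize (x - 2)^2 + 1 over [0, 5]
optimize(function(x) (x - 2)^2 + 1, interval = c(0, 5))   # minimum near 2
# uniroot: find a root of x^2 - 2 in [0, 2]
uniroot(function(x) x^2 - 2, interval = c(0, 2))          # root near sqrt(2)
# D: symbolic derivative of x^3 with respect to x
D(expression(x^3), "x")                                   # 3 * x^2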

• Example 1: Consider x ∼ mult(n, p) where p = ((2 + θ)/4, (1 − θ)/2, θ/4).

– Suppose we observe the data x = (100, 94, 6).

(40)

– The log likelihood is

ln n! − Σ_i ln x_i! + x_1 ln((2 + θ)/4) + x_2 ln((1 − θ)/2) + x_3 ln(θ/4);

dropping the constant terms, its negative can be minimized numerically.

– In R, the negative log likelihood (without the constants) can be written as

nll <- function(theta, x) -sum(x * log(c((2 + theta)/4, (1 - theta)/2, theta/4)))

– Using the nlm function,

nlm(nll, 0.1, typsize = 0.01, hessian = TRUE, x = c(100, 94, 6))

– We have θ̂ = 0.104 with an estimated SE ≈ 1/√689.7 ≈ 0.04. (A complete runnable version follows below.)
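Putting the pieces together, a runnable version of this example might look as follows (the standard error is taken from the inverse of the estimated Hessian; exact digits may differ slightly from the values quoted above):

# MLE of theta for x ~ Mult(n, ((2 + theta)/4, (1 - theta)/2, theta/4))
nll <- function(theta, x)
  -sum(x * log(c((2 + theta)/4, (1 - theta)/2, theta/4)))
x   <- c(100, 94, 6)
fit <- nlm(nll, 0.1, typsize = 0.01, hessian = TRUE, x = x)
fit$estimate                   # approximately 0.104
sqrt(1 / fit$hessian[1, 1])    # estimated SE, approximately 0.04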

(41)

EM Methods

• Goals: Provide an iterative scheme for obtaining maximum likelihood estimates, replacing a hard problem by a sequence of simpler problems.

• Context: Though most apparent in the context of missing data, it is quite useful in other problems as well.

The key is to recognize a situation where, if you had more data, the optimization would be simplified.

• Approach: By augmenting the observed data with some additional random variables, one can often convert a difficult maximum likelihood problem into one which can be solved simply, though requiring iteration.

(42)

– Treat the observed data Y as a function Y = Y(X) of a larger set of unobserved complete data X, in effect treating the density

g(y; θ) = ∫_{X(y)} f(x; θ) dx,

where X(y) = {x : Y(x) = y}.
The trick is to find the right f so that the resulting maximization is simple, since you will need to iterate the calculation.

• Computational Procedure: The two steps of the calculation that give the algorithm its name are

1. Estimate the sufficient statistics of the complete data X given the observed data Y and current parameter values,
2. Maximize the X-likelihood associated with these estimated statistics.

(43)

• Genetics Example (Dempster, Laird, and Rubin; 1977):

Observe counts

y = (y_1, y_2, y_3, y_4) = (125, 18, 20, 34)
∼ Mult(1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4), where 0 ≤ θ ≤ 1.

– Estimate θ by solving the score equation

y_1/(2 + θ) − (y_2 + y_3)/(1 − θ) + y_4/θ = 0.

It is a quadratic equation in θ.

– Think of y as a collapsed version (y_1 = x_0 + x_1) of

x = (x_0, x_1, x_2, x_3, x_4) ∼ Mult(1/2, θ/4, (1 − θ)/4, (1 − θ)/4, θ/4).

(44)

Step 1. Estimating x_0 and x_1 given y_1 = 125 and an estimate θ^(i) gives

x_0^(i) = 125 · (1/2) / (1/2 + θ^(i)/4) and x_1^(i) = 125 · (θ^(i)/4) / (1/2 + θ^(i)/4).

The conditional distribution of X_1 given X_0 + X_1 = 125 is

Bin(125, (θ/4) / (1/2 + θ/4)).

Step 2. Maximize the resulting binomial problem, obtaining θ^(i+1) = (x_1^(i) + 34) / (x_1^(i) + 18 + 20 + 34).
Given the complete data, the MLE of θ is

θ̂ = (x_1 + x_4) / (x_1 + x_2 + x_3 + x_4).

– Starting from an initial value of 0.5, the algorithm moved for eight steps as follows: 0.608247423, 0.624321051, 0.626488879, 0.626777323, 0.626815632, 0.626820719, 0.626821395, 0.626821484. (An R sketch reproducing these iterates follows below.)

(45)

– If the E-step is hard,

∗ replace it by a Monte Carlo approach (MCEM);

∗ see Wei and Tanner (1990, JASA).
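A minimal R sketch of the EM iteration for the genetics example, which reproduces the sequence of iterates listed above when started from 0.5:

# EM for the Dempster-Laird-Rubin genetics example
# y ~ Mult(1/2 + theta/4, (1 - theta)/4, (1 - theta)/4, theta/4)
y     <- c(125, 18, 20, 34)
theta <- 0.5
for (s in 1:8) {
  x1    <- y[1] * (theta / 4) / (1/2 + theta / 4)     # E step: E(x1 | y1, theta)
  theta <- (x1 + y[4]) / (x1 + y[2] + y[3] + y[4])    # M step: complete-data MLE
  cat(sprintf("step %d: %.9f\n", s, theta))
}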

• Mixture models:

Suppose that the observed data Y is a mixture of samples from k populations, but that the mixture indicators Ymiss are unknown.

Think of each mixture indicator in Ymiss as a k-vector with one position equal to one and the rest zero.

– The complete data is X = (Y, Ymiss).

Step 1. Estimate the group membership probability for each Yi given the current parameter estimates.

(46)

Step 2. Maximize the resulting likelihood, finding in effect the weighted parameter estimates.

• References

– Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977).

Maximum likelihood from incomplete data via the EM algorithm. JRSS-B, 39, 1-38.

– Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing Data. Wiley, New York.

– Tanner, M.A. (1993). Tools for Statistical Infer- ence. Springer, New York.

• Success?

Theory shows that the EM algorithm has some very appealing monotonicity properties, improving the likelihood at each iteration.

(47)

Though often slow to converge, it does get there!

(48)

EM Methods: Motivating example

Consider the example of ABO blood groups.

Genotype   Phenotype   Genotype frequency
AA         A           p_A^2
AO         A           2 p_A p_O
BB         B           p_B^2
BO         B           2 p_B p_O
OO         O           p_O^2
AB         AB          2 p_A p_B

• The genotype frequencies above assume Hardy-Weinberg equilibrium.

• Imagine we sample n individuals (at random) and observe their phenotype (but not their genotype).

• We wish to obtain the MLEs of the underlying allele frequencies p_A, p_B, and p_O.

(49)

• Observe n_A = n_AA + n_AO, n_B = n_BB + n_BO, n_O = n_OO, and n_AB, the numbers of individuals with each of the four phenotypes.

• We could, of course, form the likelihood function and find its maximum. (There are two free parameters.) But long ago, R.A. Fisher (or others?) came up with the following (iterative) "allele counting" algorithm.

Allele counting algorithm:

Let n_AA, n_AO, n_BB, and n_BO be the (unobserved) numbers of individuals with genotypes AA, AO, BB, and BO, respectively.

Here’s the algorithm:

(50)

1. Start with initial estimates p̂^(0) = (p̂_A^(0), p̂_B^(0), p̂_O^(0)).

2. Calculate the expected numbers of individuals in each of the genotype classes, given the observed numbers of individuals in each phenotype class and given the current estimates of the allele frequencies.

For example:

n_AA^(s) = E(n_AA | n_A, p̂^(s−1)) = n_A p̂_A^(s−1) / (p̂_A^(s−1) + 2 p̂_O^(s−1)).

3. Get new estimates of the allele frequencies, imagining that the expected n's were actually observed:

(51)

p̂_A^(s) = (2 n_AA^(s) + n_AO^(s) + n_AB) / (2n)
p̂_B^(s) = (2 n_BB^(s) + n_BO^(s) + n_AB) / (2n)
p̂_O^(s) = (n_AO^(s) + n_BO^(s) + 2 n_O) / (2n).

4. Repeat steps (2) and (3) until the estimates converge. (A sketch of this algorithm in R follows below.)
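A sketch of the allele counting algorithm in R. The phenotype counts nA, nB, nO, nAB below are hypothetical, only to make the code runnable; the updates follow steps 2 and 3 above:

# Allele counting (EM) for ABO allele frequencies
nA <- 186; nB <- 38; nO <- 284; nAB <- 13      # hypothetical phenotype counts
n  <- nA + nB + nO + nAB
p  <- c(A = 1/3, B = 1/3, O = 1/3)             # step 1: initial estimates
for (s in 1:20) {
  # step 2: expected genotype counts given phenotypes and current p
  nAA <- nA * p["A"] / (p["A"] + 2 * p["O"]);  nAO <- nA - nAA
  nBB <- nB * p["B"] / (p["B"] + 2 * p["O"]);  nBO <- nB - nBB
  # step 3: re-estimate allele frequencies from the expected counts
  p["A"] <- (2 * nAA + nAO + nAB) / (2 * n)
  p["B"] <- (2 * nBB + nBO + nAB) / (2 * n)
  p["O"] <- (nAO + nBO + 2 * nO) / (2 * n)
}
p                                              # estimated allele frequencies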

(52)

EM algorithm

Consider X ∼ f(x | θ) where X = (X_obs, X_miss) and f(x_obs | θ) = ∫ f(x_obs, x_miss | θ) dx_miss.

• Observe x_obs but not x_miss.

• Wish to find the MLE θ̂ = arg max_θ f(x_obs | θ).

• In many cases, this can be quite difficult directly, but if we had also observed x_miss, it would be easy to find

θ̂_C = arg max_θ f(x_obs, x_miss | θ).

EM algorithm:

E step:

ℓ^(s)(θ) = E{log f(x_obs, x_miss | θ) | x_obs, θ̂^(s)}

(53)

M step:

θ̂^(s+1) = arg max_θ ℓ^(s)(θ)

Remarks:

• Nice property: The sequence ℓ[θ̂^(s)] is non-decreasing.

• Exponential family: ℓ(θ | x) = T(x)^t η(θ) − B(θ).

– T(x) are the sufficient statistics.

– Suppose x = (y, z) where y is observed and z is missing.

E step: Calculate W^(s) = E{T(x) | y, θ̂^(s−1)}.

M step: Determine θ̂^(s) by solving E{T(x) | θ} = W^(s).

– Refer to Wu (1983, Ann. Statist. 11:95-103) on the convergence of the EM algorithm.

(54)

Normal Mixtures

Finite mixtures are a common modelling technique.

• Suppose that an observable y is represented as n observations y = (y_1, . . . , y_n).

• Suppose further that there exists a finite set of J states, and that each y_i is associated with an unobserved state.

Thus, there exists an unobserved vector q = (q_1, . . . , q_n), where q_i is the indicator vector of length J whose components are all zero except for one equal to unity, indicating the unobserved state associated with y_i.

• Define the complete data to be x = (y, q).

A natural way to conceptualize mixture specifications is to think of the marginal distribution of the indicators q, and then to specify the distribution of y given q.

(55)

• Assume that the y_i given q_i are conditionally independent with densities f(y_i | q_i).

• Consider x_1, . . . , x_n iid ∼ Σ_{j=1}^{J} p_j f(x | µ_j, σ), where f(· | µ, σ) is the normal density.

(Here we parameterize by the SD rather than the variance.)

• Let

y_ij = 1 if x_i is drawn from N(µ_j, σ), and 0 otherwise,

so that Σ_j y_ij = 1.

(x_i) is the observed data; (x_i, y_i) is the complete data.

(56)

• The unobserved complete-data log likelihood is

ℓ(µ, σ, p | x, y) = Σ_i Σ_j y_ij { log p_j + log f(x_i | µ_j, σ) }.

– How do we estimate it?

– Sufficient statistics:

S_1j = Σ_i y_ij,  S_2j = Σ_i y_ij x_i,  S_3j = Σ_i y_ij x_i^2.

– E step:

w_ij^(s) = E[y_ij | x_i, p̂^(s−1), µ̂^(s−1), σ̂^(s−1)]
         = Pr[y_ij = 1 | x_i, p̂^(s−1), µ̂^(s−1), σ̂^(s−1)]
         = p̂_j^(s−1) f(x_i | µ̂_j^(s−1), σ̂^(s−1)) / Σ_j p̂_j^(s−1) f(x_i | µ̂_j^(s−1), σ̂^(s−1)).

(57)

Hence

S_1j^(s) = Σ_i w_ij^(s),  S_2j^(s) = Σ_i w_ij^(s) x_i,  S_3j^(s) = Σ_i w_ij^(s) x_i^2.

– M step:

p̂_j^(s) = S_1j^(s) / n,
µ̂_j^(s) = S_2j^(s) / S_1j^(s),
σ̂^(s) = sqrt( Σ_j { S_3j^(s) − [S_2j^(s)]^2 / S_1j^(s) } / n ).

(A compact R sketch of this EM follows below.)
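A compact R sketch of this EM for a two-component normal mixture with a common SD; the simulated data and starting values are hypothetical:

# EM for a J = 2 normal mixture with a common sigma
set.seed(3)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 1))
n <- length(x); J <- 2
p <- c(0.5, 0.5); mu <- c(-1, 5); sigma <- 2          # starting values
for (s in 1:200) {
  # E step: posterior membership weights w[i, j]
  dens <- sapply(1:J, function(j) p[j] * dnorm(x, mu[j], sigma))
  w    <- dens / rowSums(dens)
  # M step: weighted sufficient statistics S1, S2, S3
  S1 <- colSums(w); S2 <- colSums(w * x); S3 <- colSums(w * x^2)
  p     <- S1 / n
  mu    <- S2 / S1
  sigma <- sqrt(sum(S3 - S2^2 / S1) / n)
}
round(c(p = p, mu = mu, sigma = sigma), 3)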

(58)

Issues

I. Stopping rules

1. |ℓ(θ̂^(s+1)) − ℓ(θ̂^(s))| < ε for m consecutive steps.
This is bad! ℓ may not change much even when θ does.

2. ‖θ̂^(s+1) − θ̂^(s)‖ < ε for m consecutive steps.
This runs into problems when the components of θ are of quite different magnitudes.

3. |θ̂_j^(s+1) − θ̂_j^(s)| < ε_1 (|θ̂_j^(s)| + ε_2) for j = 1, . . . , p.
In practice, take ε_1 = √(machine ε) ≈ 10^−8 and ε_2 = 10 ε_1 to 100 ε_1.

II. Local vs. global max

– There may be many modes.

(59)

– EM may converge to a saddle point.
Solution: Many starting points.

III. Starting points

– Use information from the context

– Use a crude method (such as the method of moments)

– Use an alternative model formulation

IV. Slow convergence

The EM algorithm can be painfully slow to converge near the maximum.

Solution: Switch to another optimization algorithm when you get near the maximum.

V. Standard errors

(60)

– Numerical approximation of the Fisher information (i.e., the Hessian)

– Louis (1982), Meng and Rubin (1991)

(61)

Probability statements in Statistical Inference

In hypothesis testing, the inferential methods depend on the probabilities of two types of errors. In confidence intervals, the decisions are associated with probability statements about coverage of the parameters. In both cases the probability statements are based on the distribution of a random sample, Y_1, . . . , Y_n.

• Simulation of the data-generating process (statistical experiment)

• Monte Carlo expectation

• Study the random process
