Statistical Computing
Hung Chen
hchen@math.ntu.edu.tw Department of Mathematics
National Taiwan University
18th February 2004
Meets in NW 405 on Wednesdays from 9:10 to 12:00.
Course Overview
• Monte Carlo methods for statistical inference
1. Pseudo-random deviates
2. Non-uniform variate generation
3. Variance reduction methods
4. Jackknife and Bootstrap
5. Gibbs sampling and MCMC
• Data partitioning and resampling (bootstrap)
1. Simulation Methodology
2. Sampling and Permutations (Bootstrap and permutation methods)
• Numerical methods in statistics
1. Numerical linear algebra and linear regressions
2. Application to regression and nonparametric regression
3. Integration and approximations
4. Optimization and root finding
5. Multivariate analysis such as principal component analysis
• Graphical methods in computational statistics
• Exploring data density and structure
• Statistical models and data fitting
• Computing Environment: Statistical software R (“GNU’s S”)
– http://www.R-project.org/
– Input/Output
– Structured Programming
– Interface with other systems
• Prerequisite:
– Knowledge of regression analysis or multivariate analysis
– Mathematical statistics/probability theory; statistics with formula derivation
– Knowledge of statistical software and experience with programming languages such as Fortran, C, and Pascal.
• Text Books:
The course materials will be drawn from the following recommended resources (some are available via the Internet) and others that will be made available through handouts.
– Gentle, J.E. (2002). Elements of Computational Statistics. Springer.
– Hastie, T., Tibshirani, R., and Friedman, J.H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.
– Lange, K. (1999). Numerical Analysis for Statisticians. Springer-Verlag, New York.
– Venables, W.N. and Ripley, B.D. (2002). Modern Applied Statistics with S, 4th edition. Springer-Verlag, New York.
Refer to http://geostasto.eco.uniroma1.it/utenti/liseo/dispense R.pdf for notes on this book.
– Robert, C.P. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer-Verlag.
– Stewart, G.W. (1996). Afternotes on Numerical Analysis. SIAM, Philadelphia.
– An Introduction to R by William N. Venables and David M. Smith
(http://www.ats.ucla.edu/stat/books/#DownloadableBooks)
• Grading: HW 50%, Project 50%
Outline
• Statistical Learning
– Handwritten Digit Recognition
– Prostate Cancer
– DNA Expression Microarrays
– Statistical Language
• IRIS Data
– Linear discriminant analysis
– R-programming
– Logistic regression
– Risk minimization
– Discriminant Analysis
• Optimization in Inference
– Estimation by Maximum Likelihood
– Optimization in R
– EM Methods
∗ Algorithm
∗ Genetics Example (Dempster, Laird, and Rubin, 1977)
– Normal Mixtures
– Issues
• EM Methods: Motivating example
– ABO blood groups
– EM algorithm
– Nonlinear Regression Models
– Robust Statistics
• Probability statements in Statistical Inference
• Computing Environment: Statistical software R (“GNU’s S”)
– http://www.R-project.org/
– Input/Output
– Structured Programming
– Interface with other systems
Introduction: Statistical Computing in Practice
Computationally intensive methods have become widely used both for statistical inference and for exploratory analysis of data.
The methods of computational statistics involve
• resampling, partitioning, and multiple transformations of a data set
• the use of randomly generated artificial data
• function approximation
Implementation of these methods often requires advanced techniques in numerical analysis, so there is a close connection between computational statistics and statistical computing.
In this course, we first address some areas of application of computationally intensive methods, such as
• density estimation
• identification of structure in data
• model building
• optimization
Handwritten Digit Recognition
In order to devise an automatic sorting procedure for envelopes, we consider the problem of recognizing the handwritten ZIP codes on envelopes from U.S. postal mail.
This is a so-called pattern recognition problem in the literature.
• Each image is a segment from a five digit ZIP code, isolating a single digit.
• The images are 16×16 eight-bit grayscale maps, with each pixel ranging in intensity from 0 to 255.
• The images have been normalized to have approximately the same size and orientation.
• The task is to predict, from the 16×16 matrix of pixel
intensities, the identity of each image (0, 1, . . . , 9) quickly and accurately.
The dimensionality of x is 256.
Abstraction:
• Consider the space X of matrices with entries in the interval [0, 1], each entry representing a pixel in a certain grey scale of a photo of the handwritten letter, or some features extracted from the letters.
• Let Y be
Y = {y ∈ R^10 | y = Σ_{i=1}^{10} p_i e_i with Σ_{i=1}^{10} p_i = 1}.
Here e_i is the ith coordinate vector in R^10 (each coordinate corresponding to a letter).
• If we only consider the set of points y with 0 ≤ p_i ≤ 1 for i = 1, . . . , 10, one can interpret y in terms of a probability measure on the set {0, 1, 2, . . . , 9}.
• The problem is to learn the ideal function f : X → Y which associates, to a given handwritten digit x, the point (Prob(x = 0), Prob(x = 1), . . . , Prob(x = 9)).
• Learning f means to find a sufficiently good approximation of f within a given prescribed class.
– For a two-class problem, think of logistic regression in survival analysis.
– Fisher discriminant analysis and SVM
Further mathematical abstraction:
• Consider a measure ρ on X × Y where Y is the label set and X is the set of handwritten letters.
• The pairs (xi, yi) are randomly drawn from X × Y according to the measure ρ.
• yi is a sample for a given xi.
• The function to be learned is the regression function fρ.
• fρ(x) is the average of the y values of {x} × Y .
• Difficulty: the probability measure ρ governing the sampling is not known in advance.
How do we learn f ?
Prostate Cancer
To identify the risk factors for prostate cancer, Stamey et al. (1989) examined the correlation between the level of prostate-specific antigen (PSA) and a number of clinical and demographic measures in 97 men who were about to receive a radical prostatectomy.
• The goal is to predict the log of PSA (lpsa) from a number of measurements including log cancer volume (lcavol), log prostate weight (lweight), age, log of benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45).
• This is a supervised learning problem, known as a regression problem, because the outcome measurement is quantitative.
Let Y denote lpsa and X be those explanatory variables.
• We have data in the form of (x1, y1), . . . , (xn, yn).
• Use squared error loss as the criterion for choosing the best prediction function, i.e.,
min_θ E[Y − θ(X)]^2.
• What is the θ(·) that minimizes this population least-squares error?
– Find c to minimize E(Y − c)2.
– For every x ∈ X, let E(Y | X = x) be the conditional expectation.
– Regression function: θ(x) = E(Y | X = x)
– Write Y as the sum of θ(X) and Y − θ(X).
∗ Conditional expectation: E(Y − θ(X) | X) = 0
∗ Conditional variance: Var(Y | X) = E[(Y − θ(X))^2 | X]
• ANOVA decomposition:
Var(Y) = E[Var(Y | X)] + Var[E(Y | X)]
– If we use E(Y) to minimize the prediction error E(Y − c)^2, its prediction error is Var(Y).
– If E(Y | X) is not a constant, the prediction error of θ(x) is smaller.
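To make the decomposition concrete, here is a minimal R simulation sketch; the model Y = 2X + e with X and e standard normal is an arbitrary illustration, chosen so that E[Var(Y | X)] = 1 and Var[E(Y | X)] = 4.

set.seed(1)
x <- rnorm(1e6)
y <- 2 * x + rnorm(1e6)       # Y = E(Y | X) + error
var(y)                        # close to 1 + 4 = 5
mean((y - 2 * x)^2)           # E[Var(Y | X)], close to 1
var(2 * x)                    # Var[E(Y | X)], close to 4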
• Nonparametric regression: no functional form is assumed for θ(·).
– Suppose that θ(x) is of the form Σ_{i=1}^∞ w_i φ_i(x).
∗ What is {φ_i(x); i = 1, 2, . . .}, and how do we determine {w_i; i = 1, 2, . . .} from a finite number of data points?
∗ Estimate θ(x) by the k-nearest-neighbor method:
Ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i,
where N_k(x) is the neighborhood of x defined by the k closest points x_i in the training sample.
How do we choose k and N_k(x)? (A small R sketch follows.)
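As a concrete illustration, here is a minimal R sketch of the k-nearest-neighbor estimate for one-dimensional x; the helper name knn_reg, the simulated data, and the choice k = 10 are illustrative assumptions rather than part of the notes.

knn_reg <- function(x0, x, y, k) {
  nbrs <- order(abs(x - x0))[1:k]   # indices of the k closest training points
  mean(y[nbrs])                     # average their responses
}
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)
knn_reg(0.5, x, y, k = 10)          # estimate of theta(0.5)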
• Empirical error:
– How do we measure the theoretical error E[Y − f(X)]^2?
∗ Consider n examples, {(x_1, y_1), . . . , (x_n, y_n)}, independently drawn according to ρ.
∗ Define the empirical error of f to be
E_n(f) = (1/n) Σ_{i=1}^n (y_i − f(x_i))^2.
What is the empirical cdf?
∗ Is E_n(f) close to E[Y − f(X)]^2?
∗ Think of the following problem: is the sample variance a consistent estimate of the population variance? That is, does
(1/n) Σ_i (y_i − ȳ)^2 → E[Y − E(Y)]^2
hold in some sense?
∗ Can we claim that the minimizer f̂ of min_{f∈F} Σ_i (y_i − f(x_i))^2 is close to the minimizer of min_{f∈F} E[Y − f(X)]^2? (A small R illustration follows.)
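The gap between the empirical error of a fitted f̂ and its theoretical error can be examined by simulation; a minimal R sketch follows, in which the linear model, sample sizes, and noise level are arbitrary illustrations.

set.seed(1)
n <- 50
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
fhat <- lm(y ~ x)
mean(residuals(fhat)^2)             # empirical (training) error E_n(fhat)
xnew <- runif(10000)
ynew <- 1 + 2 * xnew + rnorm(10000, sd = 0.5)
mean((ynew - predict(fhat, data.frame(x = xnew)))^2)   # error on fresh data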
• Consider the problem of email spam.
– Consider the data which consists of information from 4601 email messages.
– The objective was to design an automatic spam detector that could filter out spam before clogging the users’ mailboxes.
– For all 4601 email messages, the true outcome (email type), email or spam, is available, along with the relative frequencies of 57 of the most commonly occurring words and punctuation marks in the email message.
– This is a classification problem.
DNA Expression Microarrays
DNA is the basic material that makes up human chromosomes.
• The regulation of gene expression in a cell begins at the level of transcription of DNA into mRNA.
• Although subsequent processes such as differential degradation of mRNA in the cytoplasm and differential translation also regulate the expression of genes, it is of great interest to estimate the relative quantities of mRNA species in populations of cells.
• The circumstances under which a particular gene is up- or down-regulated provide important clues about gene function.
• The simultaneous expression profiles of many genes can provide additional insights into physiological processes or disease etiology that is mediated by the coordinated action of sets of genes.
Spotted cDNA microarrays (Brown and Botstein 1999) are emerging as a powerful and cost-effective tool for large scale analysis of gene expression.
• In the first step of the technique, samples of DNA clones with known sequence content are spotted and immobilized onto a glass slide or other substrate, the microarray.
• Next, pools of mRNA from the cell populations under study are purified, reverse-transcribed into cDNA, and labeled with one of two fluorescent dyes, which we will refer to as “red” and “green.”
• Two pools of differentially labeled cDNA are combined and applied to a microarray.
• Labeled cDNA in the pool hybridizes to complementary sequences on the array and any unhybridized cDNA is washed off.
The result is a few thousand numbers, typically ranging from say −6 to 6, measuring the expression level for each gene in the target relative to the reference sample. As an example, we have a data set with 64 samples (columns) and 6830 genes (rows). The challenge is to understand how the genes and samples are organized. Typical questions are as follows:
• Which samples are most similar to each other, in terms of their expression profiles across genes?
– Think of the samples as points in 6830-dimensional space, which we want to cluster together in some way.
• Which genes are most similar to each other, in terms of their expression profile across samples?
• Do certain genes show very high (or low) expression for certain cancer samples?
– Feature selection problem.
Statistical Language
The examples just described have several components in common.
• For each there is a set of variables that might be denoted as inputs, which are measured or preset.
These have some influence on one or more outputs.
• For each example, the goal is to use the inputs to predict the values of the outputs. In machine learning language, this exercise is called supervised learning.
• In the statistical literature the inputs are often called the predictors or, more classically, the independent variables.
The outputs are called responses, or classically the
dependent variables.
• The output can be a quantitative measurement, where some measurements are bigger than others, and measurements close in value are close in nature.
– EX 1. Consider the famous Iris discrimination example (Fisher, 1936). In this data set, there are 150 cases with 50 cases per class. The output is qualitative (species of Iris) and assumes values in a finite set G = {Virginica, Setosa, Versicolor}.
There are four predictors: sepal length, sepal width, petal length, and petal width.
– EX 2. In the handwritten digit example, the output is one of 10 different digit classes.
– In EX 1 and EX 2, there is no explicit ordering in the classes, and in fact often descriptive labels rather than numbers are used to denote the classes.
Qualitative variables are often referred to as categorical or discrete variables, as well as factors.
– EX 3. Given specific atmospheric measurements today and yesterday, we want to predict the ozone level tomorrow.
Given the grayscale values for the pixels of the digitized image of the handwritten digit, we want to predict its class labels.
– For all three examples, we think of using the inputs to predict the output. The distinction in output type has led to a naming convention for the prediction tasks: regression when we predict quantitative outputs, and classification when we predict qualitative outputs.
– Both regression and classification can be viewed as tasks in function approximation.
– Qualitative variables are typically represented numerically by codes.
• A third variable type is ordered categorical, such as small, medium, and large, where there is an ordering between the values, but no metric notion is appropriate (the difference between medium and small need not be the same as that between large and medium).
IRIS Data
First applied in 1935 by M. Barnard at the suggestion of R.A. Fisher (1936).
• Fisher linear discriminant analysis (FLDA) consists of:
– Finding the linear combination x^T a of x = (x_1, . . . , x_p) that maximizes the ratio of the “between group” and “within group” variances.
– x^T a is called the discriminant variable.
– Predicting the class of an observation x as the class whose mean vector is closest to x in terms of the discriminant variables.
– Represent class k by (µ_k, Σ).
– Define B_0 = Σ_{k=1}^{3} (µ_k − µ̄)(µ_k − µ̄)^T.
– Identify the eigenvalues and eigenvectors of Σ^{−1} B_0, i.e., solve
max_a (a^T B_0 a) / (a^T Σ a).
– The problem of learning is that of choosing, from the given set of functions x^T a, the one which predicts the supervisor’s response in the best possible way.
∗ How do we quantify this?
∗ How will the variability of â affect the prediction?
• R-programming
– data(iris)
– attach(iris)
– help.search("discriminant analysis")
– Linear Discriminant Analysis: lda
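A minimal R sketch of the steps listed above, assuming the MASS package (which provides lda) is installed; the confusion matrix at the end is computed on the training sample itself.

library(MASS)
data(iris)
fit <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = iris)
pred <- predict(fit, iris)        # predicted classes and discriminant scores
table(iris$Species, pred$class)   # training-sample confusion matrix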
• Logistic regression
– Model:
log [P(y = 1 | x) / (1 − P(y = 1 | x))] = α + x^T β
– The coefficients α and β need to be estimated iteratively (writing down the likelihood and finding the MLE), using the “scoring method” or “iterative weighted least squares.”
– Convert a classification problem to an estimation problem.
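A minimal R sketch of such a fit with glm, assuming we reduce the iris data to a two-class problem (virginica versus the rest) so that binary logistic regression applies; glm estimates α and β iteratively by Fisher scoring (iterative weighted least squares).

data(iris)
iris$is.virginica <- as.numeric(iris$Species == "virginica")
fit <- glm(is.virginica ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = iris, family = binomial)
summary(fit)$coefficients         # estimates of alpha (intercept) and beta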
• Problem of risk minimization
– The loss or discrepancy: L(y, f(x|α)).
Here y is the supervisor’s response to a given input x, and f(x|α) is the response provided by the learning machine.
– The risk functional: the expected value of the loss, or
R(α) = ∫ L(y, f(x|α)) dρ(x, y).
– Goal: find the function which minimizes the risk functional R(α) over the class of functions f(x|α), α ∈ A, in the situation where the joint probability distribution is unknown and the only available information is contained in the training set.
– Pattern recognition
∗ The supervisor’s output y takes on only two values, {0, 1}.
∗ f(x|α), α ∈ A, is a set of indicator functions (functions which take on only the two values zero and one).
∗ The loss function:
L(y, f(x|α)) = 0 if y = f(x|α), 1 if y ≠ f(x|α).
∗ For this loss function, the risk functional provides the probability of classification error (i.e., when the answers given by the supervisor and the answers given by the indicator function differ).
∗ The problem, therefore, is to find the function which minimizes the probability of classification errors when probability measure is unknown, but the data are given.
• Formulation of Discriminant Analysis
– Objective: Distinguish two classes based on the observed covariates (and training data).
– Data: {(x_1, y_1), . . . , (x_ℓ, y_ℓ)}, y_i = 0 or 1.
– Goal: Make decision on Y based on X.
(x, y) has an unknown probability distribution ρ(x, y).
– The Bayes solution (minimum error rule):
P(Y = y | X) = P(X | Y = y) π(Y = y) / [P(X | Y = 0) π(Y = 0) + P(X | Y = 1) π(Y = 1)].
It assumes that the costs of the different types of misclassification are the same. (Costs may not be the same.) (A small R sketch appears after this list.)
– Practical problems: Prior? Distributions?
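As an illustration of the minimum-error rule, here is a minimal R sketch for a toy setting with two classes, known univariate normal class-conditional densities, and known priors; in practice these quantities would have to be estimated, which is exactly the practical problem raised above.

bayes_posterior <- function(x, prior = c(0.5, 0.5),
                            mean = c(0, 2), sd = c(1, 1)) {
  lik <- cbind(dnorm(x, mean[1], sd[1]), dnorm(x, mean[2], sd[2]))
  post <- t(t(lik) * prior)       # likelihood times prior, class by class
  post / rowSums(post)            # P(Y = 0 | x) and P(Y = 1 | x)
}
post <- bayes_posterior(c(-1, 1, 3))
yhat <- max.col(post) - 1         # classify to the class with larger posterior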
Role of Optimization in Inference
Many important classes of estimators are defined as the point at which some function that involves the parameter and the random variable achieves an optimum.
• Estimation by Maximum Likelihood:
– Intuitively appealing concept
– Nice mathematical properties
– Given a sample y_1, . . . , y_n from a distribution with pdf or pmf p(y | θ*), the MLE of θ is the value that maximizes the joint density or joint probability, viewed as a function of θ, at the observed sample values: Π_i p(y_i | θ).
• Likelihood function:
L_n(θ; y) = Π_{i=1}^n p(y_i | θ).
The value of θ at which L_n achieves its maximum is the MLE of θ*.
• Critique: How to determine the form of the function p(y | θ)?
• The data (that is, the realizations of the variables in the pdf or pmf) are considered fixed, and the parameters are considered the variables of the optimization problem
max_θ L_n(θ; y).
• Questions:
– Does the solution exist?
– Existence of local optima of the objective function.
– Constraints on possible parameter values.
• MLE may not be a good estimation scheme.
• Penalized maximum likelihood estimation
• Optimization in R
– nlm is the general function for “nonlinear minimization.”
∗ It can make use of a gradient and/or Hessian, and it can give an estimate of the Hessian at the minimum.
∗ This function carries out a minimization of the function f using a Newton-type algorithm.
∗ It needs starting parameter values for the minimization.
∗ See demo(nlm) for examples.
– Minimize a function of a single variable:
∗ optimize: searches the interval from 'lower' to 'upper' for a minimum or maximum of the function f with respect to its first argument.
∗ uniroot: searches the interval from 'lower' to 'upper' for a root of the function f with respect to its first argument.
– D and deriv do symbolic differentiation.
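A minimal illustration of these single-variable tools; the function f and the search intervals below are arbitrary choices.

f <- function(x) (x - 2)^2 - 1
optimize(f, lower = 0, upper = 5)      # minimum near x = 2
uniroot(f, lower = 2, upper = 5)       # root of f near x = 3
D(expression((x - 2)^2 - 1), "x")      # symbolic derivative: 2 * (x - 2)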
• Example 1: Consider x ∼ mult(n, p) where p = ((2 + θ)/4, (1 − θ)/2, θ/4).
– Suppose we observe the data x = (100, 94, 6).
– The negative log likelihood is
−[ln n! − Σ_i ln x_i! + x_1 ln((2 + θ)/4) + x_2 ln((1 − θ)/2) + x_3 ln(θ/4)];
the terms not involving θ can be dropped in the minimization.
– In R, it can be written (dropping the constant terms) as
nll <- function(theta, x) -sum(x * log(c((2 + theta)/4, (1 - theta)/2, theta/4)))
– Using the nlm function, we call
nlm(nll, 0.1, typsize = 0.01, hessian = TRUE, x = c(100, 94, 6))
– We obtain θ̂ ≈ 0.104 with an estimated SE ≈ 1/√689.7 ≈ 0.04.
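A minimal sketch putting the pieces together and recovering the standard error from the returned Hessian (for a negative log likelihood, the Hessian at the minimum estimates the observed information); it should reproduce the numbers quoted above approximately.

nll <- function(theta, x)
  -sum(x * log(c((2 + theta)/4, (1 - theta)/2, theta/4)))
fit <- nlm(nll, 0.1, typsize = 0.01, hessian = TRUE, x = c(100, 94, 6))
fit$estimate                # MLE of theta, about 0.104
sqrt(1 / fit$hessian)       # approximate standard error, about 0.04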
EM Methods
• Goals: Provide an iterative scheme for obtaining maximum likelihood estimates, replacing a hard problem by a sequence of simpler problems.
• Context: Though most apparent in the context of missing data, it is quite useful in other problems as well.
The key is to recognize a situation where, if you had more data, the optimization would be simplified.
• Approach: By augmenting the observed data with some additional random variables, one can often convert a difficult maximum likelihood problem into one which can be solved simply, though requiring iteration.
– Treat the observed data Y as a function Y = Y(X) of a larger set of unobserved complete data X, in effect treating the density
g(y; θ) = ∫_{X(y)} f(x; θ) dx.
The trick is to find the right f so that the resulting maximization is simple, since you will need to iterate the calculation.
• Computational Procedure: The two steps of the calculation that give the algorithm its name are
1. Estimate the sufficient statistics of the complete data X given the observed data Y and current parameter values,
2. Maximize the X-likelihood associated with these
estimated statistics.
• Genetics Example (Dempster, Laird, and Rubin, 1977):
Observe counts
y = (y_1, y_2, y_3, y_4) = (125, 18, 20, 34) ∼ Mult(1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4),
where 0 ≤ θ ≤ 1.
– Estimate θ by solving the score equation
y_1/(2 + θ) − (y_2 + y_3)/(1 − θ) + y_4/θ = 0.
It is a quadratic equation in θ.
– Think of y as a collapsed version (y_1 = x_0 + x_1) of
x = (x_0, x_1, x_2, x_3, x_4) ∼ Mult(1/2, θ/4, (1 − θ)/4, (1 − θ)/4, θ/4).
Step 1. Estimate x_0 and x_1 given y_1 = 125 and a current estimate θ^(i):
x_0^(i) = 125 · (1/2)/(1/2 + θ^(i)/4)  and  x_1^(i) = 125 · (θ^(i)/4)/(1/2 + θ^(i)/4).
The conditional distribution of X_1 given X_0 + X_1 = 125 is
Bin(125, (θ/4)/(1/2 + θ/4)).
Step 2. Maximize the resulting binomial problem, obtaining
θ^(i+1) = (x_1^(i) + 34)/(x_1^(i) + 18 + 20 + 34).
Given the complete data, the MLE of θ is
θ̂ = (x_1 + x_4)/(x_1 + x_2 + x_3 + x_4).
– Starting from an initial value of 0.5, the algorithm moved through eight steps as follows: 0.608247423, 0.624321051, 0.626488879, 0.626777323, 0.626815632, 0.626820719, 0.626821395, 0.626821484. (See the R sketch below.)
– If the E-step is hard,
∗ replace it by a Monte Carlo approach (MCEM);
∗ see Wei and Tanner (1990, JASA).
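A minimal R sketch of the EM iteration for the genetics example above; starting from θ = 0.5 it should reproduce the sequence of estimates just listed.

y <- c(125, 18, 20, 34)
theta <- 0.5
for (s in 1:8) {
  # E-step: expected count x1 given y1 = 125 and the current theta
  x1 <- y[1] * (theta / 4) / (1/2 + theta / 4)
  # M-step: complete-data MLE with the expected count plugged in
  theta <- (x1 + y[4]) / (x1 + y[2] + y[3] + y[4])
  print(theta)
}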
• Mixture models:
Suppose that the observed data Y is a mixture of samples from k populations, but that the mixture indicators Ymiss are unknown.
Think of Ymiss for each observation as a k-vector with one position equal to one and the rest zero.
– The complete data is X = (Y, Ymiss).
Step 1. Estimate the group membership probability for each Yi given the current parameter estimates.
Step 2. Maximize the resulting likelihood, finding in effect the weighted parameter estimates.
• References
– Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. JRSS-B, 39, 1-38.
– Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing Data. Wiley, New York.
– Tanner, M.A. (1993). Tools for Statistical Inference. Springer, New York.
• Success?
Theory shows that the EM algorithm has some very appealing monotonicity properties, improving the likelihood at each iteration.
Though often slow to converge, it does get there!
EM Methods: Motivating example
Consider the example of ABO blood groups.
Genotype   Phenotype   Genotype frequency
AA         A           p_A^2
AO         A           2 p_A p_O
BB         B           p_B^2
BO         B           2 p_B p_O
OO         O           p_O^2
AB         AB          2 p_A p_B
• The genotype frequencies above assume Hardy-Weinberg equilibrium.
• Imagine we sample n individuals (at random) and observe their phenotype (but not their genotype).
• We wish to obtain the MLEs of the underlying allele frequencies p_A, p_B, and p_O.
• Observe n_A = n_AA + n_AO, n_B = n_BB + n_BO, n_O = n_OO, and n_AB, the numbers of individuals with each of the four phenotypes.
• We could, of course, form the likelihood function and find its maximum. (There are two free parameters.) But long ago, R.A. Fisher (or others?) came up with the following (iterative) “allele counting” algorithm.
Allele counting algorithm:
Let n_AA, n_AO, n_BB, and n_BO be the (unobserved) numbers of individuals with genotypes AA, AO, BB, and BO, respectively.
Here’s the algorithm:
1. Start with initial estimates p̂^(0) = (p̂_A^(0), p̂_B^(0), p̂_O^(0)).
2. Calculate the expected numbers of individuals in each of the genotype classes, given the observed numbers of individuals in each phenotype class and given the current estimates of the allele frequencies.
For example:
n_AA^(s) = E(n_AA | n_A, p̂^(s−1)) = n_A p̂_A^(s−1) / (p̂_A^(s−1) + 2 p̂_O^(s−1)).
3. Get new estimates of the allele frequencies, imagining that the expected n's were actually observed:
p̂_A^(s) = (2 n_AA^(s) + n_AO^(s) + n_AB) / (2n)
p̂_B^(s) = (2 n_BB^(s) + n_BO^(s) + n_AB) / (2n)
p̂_O^(s) = (n_AO^(s) + n_BO^(s) + 2 n_O) / (2n)
4. Repeat steps (2) and (3) until the estimates converge.
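A minimal R sketch of the allele-counting algorithm; the function name allele_em, the starting values, the convergence tolerance, and the phenotype counts in the final call are all illustrative assumptions.

allele_em <- function(nA, nB, nO, nAB,
                      p = c(A = 1/3, B = 1/3, O = 1/3),
                      maxit = 100, tol = 1e-8) {
  n <- nA + nB + nO + nAB
  for (s in 1:maxit) {
    # Step 2 (E-step): expected genotype counts given the phenotype counts
    nAA <- nA * p["A"] / (p["A"] + 2 * p["O"]);  nAO <- nA - nAA
    nBB <- nB * p["B"] / (p["B"] + 2 * p["O"]);  nBO <- nB - nBB
    # Step 3 (M-step): allele frequencies as if these counts were observed
    p.new <- c(A = (2 * nAA + nAO + nAB) / (2 * n),
               B = (2 * nBB + nBO + nAB) / (2 * n),
               O = (nAO + nBO + 2 * nO) / (2 * n))
    if (max(abs(p.new - p)) < tol) break
    p <- p.new
  }
  p.new
}
allele_em(nA = 44, nB = 27, nO = 88, nAB = 9)   # hypothetical phenotype counts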
EM algorithm
Consider X ∼ f(x | θ) where X = (X_obs, X_miss) and f(x_obs | θ) = ∫ f(x_obs, x_miss | θ) dx_miss.
• Observe x_obs but not x_miss.
• Wish to find the MLE θ̂ = arg max_θ f(x_obs | θ).
• In many cases, this can be quite difficult directly, but if we had also observed x_miss, it would be easy to find
θ̂_C = arg max_θ f(x_obs, x_miss | θ).
EM algorithm:
E step:
ℓ^(s)(θ) = E{log f(x_obs, x_miss | θ) | x_obs, θ̂^(s)}
M step:
θ̂^(s+1) = arg max_θ ℓ^(s)(θ)
Remarks:
• Nice property: the sequence ℓ[θ̂^(s)] is non-decreasing.
• Exponential family: ℓ(θ | x) = T(x)^T η(θ) − B(θ).
– T(x) are the sufficient statistics.
– Suppose x = (y, z) where y is observed and z is missing.
E step: Calculate W^(s) = E{T(x) | y, θ̂^(s−1)}.
M step: Determine θ̂^(s) by solving E{T(x) | θ} = W^(s).
– Refer to Wu (1983, Ann Stat 11:95-103) on the convergence of EM algorithm.
Normal Mixtures
Finite mixtures are a common modelling technique.
• Suppose that an observable y is represented as n observations y = (y1, . . . , yn).
• Suppose further that there exists a finite set of J states, and that each y_i is associated with an unobserved state.
Thus, there exists an unobserved collection of indicators q = (q_1, . . . , q_n), where q_i is the indicator vector of length J whose components are all zero except for one equal to unity, indicating the unobserved state associated with y_i.
• Define the complete data to be x = (y, q).
A natural way to conceptualize mixture specifications is to think of the marginal distribution of the indicators
q, and then to specify the distribution of y given q.
• Assume that the y_i given q_i are conditionally independent with densities f(y_i | q_i).
• Consider x_1, . . . , x_n iid ∼ Σ_{j=1}^J p_j f(· | µ_j, σ), where f(· | µ, σ) is the normal density.
(Here we parameterize by the SD rather than the variance.)
• Let
y_ij = 1 if x_i is drawn from N(µ_j, σ), and 0 otherwise,
so that Σ_j y_ij = 1.
(x_i) is the observed data; (x_i, y_i) is the complete data.
• The (unobserved) complete-data log likelihood is
ℓ(µ, σ, p | x, y) = Σ_i Σ_j y_ij {log p_j + log f(x_i | µ_j, σ)}
– How do we estimate it?
– Sufficient statistics:
S_1j = Σ_i y_ij,   S_2j = Σ_i y_ij x_i,   S_3j = Σ_i y_ij x_i^2
– E step:
w_ij^(s) = E[y_ij | x_i, p̂^(s−1), µ̂^(s−1), σ̂^(s−1)]
         = Pr[y_ij = 1 | x_i, p̂^(s−1), µ̂^(s−1), σ̂^(s−1)]
         = p̂_j^(s−1) f(x_i | µ̂_j^(s−1), σ̂^(s−1)) / Σ_j p̂_j^(s−1) f(x_i | µ̂_j^(s−1), σ̂^(s−1))
Hence
S_1j^(s) = Σ_i w_ij^(s),   S_2j^(s) = Σ_i w_ij^(s) x_i,   S_3j^(s) = Σ_i w_ij^(s) x_i^2.
– M step:
p̂_j^(s) = S_1j^(s)/n,   µ̂_j^(s) = S_2j^(s)/S_1j^(s),
σ̂^(s) = √( Σ_j { S_3j^(s) − [S_2j^(s)]^2/S_1j^(s) } / n )
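A minimal R sketch of this EM iteration for a J-component normal mixture with common SD σ; the function name mixture_em, the starting values, the tolerance, and the simulated two-component data are illustrative assumptions.

mixture_em <- function(x, p, mu, sigma, maxit = 200, tol = 1e-8) {
  n <- length(x)
  for (s in 1:maxit) {
    # E-step: posterior membership weights w[i, j]
    dens <- sapply(mu, function(m) dnorm(x, m, sigma))   # n x J matrix
    w <- sweep(dens, 2, p, "*")
    w <- w / rowSums(w)
    # M-step via the weighted sufficient statistics S1, S2, S3
    S1 <- colSums(w); S2 <- colSums(w * x); S3 <- colSums(w * x^2)
    p.new <- S1 / n
    mu.new <- S2 / S1
    sigma.new <- sqrt(sum(S3 - S2^2 / S1) / n)
    delta <- max(abs(c(p.new - p, mu.new - mu, sigma.new - sigma)))
    p <- p.new; mu <- mu.new; sigma <- sigma.new
    if (delta < tol) break
  }
  list(p = p, mu = mu, sigma = sigma, iterations = s)
}
set.seed(1)
x <- c(rnorm(100, 0, 1), rnorm(100, 4, 1))   # simulated two-component data
mixture_em(x, p = c(0.5, 0.5), mu = c(-1, 5), sigma = 2)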
Issues
I. Stopping rules
1. |ℓ(θ̂^(s+1)) − ℓ(θ̂^(s))| < ε for m consecutive steps.
This is bad! ℓ may not change much even when θ does.
2. ||θ̂^(s+1) − θ̂^(s)|| < ε for m consecutive steps.
This runs into problems when the components of θ are of quite different magnitudes.
3. |θ̂_j^(s+1) − θ̂_j^(s)| < ε_1 (|θ̂_j^(s)| + ε_2) for j = 1, . . . , p.
In practice, take ε_1 = √(machine epsilon) ≈ 10^−8 and ε_2 = 10 ε_1 to 100 ε_1.
II. Local vs. global maxima
– There may be many modes.
– EM may converge to a saddle point.
Solution: Many starting points.
III. Starting points
– Use information from the context.
– Use a crude method (such as the method of moments).
– Use an alternative model formulation.
IV. Slow convergence
The EM algorithm can be painfully slow to converge near the maximum.
Solution: Switch to another optimization algorithm when you get near the maximum.
V. Standard errors
– Numerical approximation of the Fisher information (i.e., the Hessian)
– Louis (1982), Meng and Rubin (1991)
Probability statements in Statistical Inference
In hypothesis testing, the inferential methods depend on probabilities of two types of errors. In confidence intervals the decisions are associated with probability statements about coverage of the parameters. For both cases the probability statements are based on the distribution of a random sample, Y1, . . . , Yn.
• Simulation of the data-generating process (statistical experiment)
• Monte Carlo expectation
• Study the random process
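As an illustration of studying such probability statements by simulation, here is a minimal R sketch that estimates the coverage probability of the usual 95% t interval by repeatedly simulating the data-generating process; the sample size, the normal model, and the number of replicates are arbitrary choices.

set.seed(1)
n <- 20; mu <- 1; reps <- 10000
covered <- replicate(reps, {
  y <- rnorm(n, mean = mu, sd = 2)    # one draw of the random sample Y1, ..., Yn
  ci <- t.test(y)$conf.int            # 95% t interval for the mean
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)                         # Monte Carlo estimate of the coverage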