Introduction to Bayesian Statistics
Lecture 10: Model Checking
Rung-Ching Tsai
Department of Mathematics, National Taiwan Normal University
The role of model checking
• Three common steps in Bayesian modeling
◦ specify the prior based on historical data or substantive knowledge
◦ construct a reasonable probability model
◦ compute the posterior distribution of model parameters or posterior predictive distribution of future observations
• Yet another crucial step: Model checking
◦ assessing the adequacy of the fit of the model to the data and to our substantive knowledge.
◦ investigating what aspects of reality are not captured by the model.
◦ checking the plausibility of the model for the purposes for which it will be used.
What is to be checked?
• Idea: check “ALL” aspects of the “model”
• Why? Prior-to-posterior inference involves the whole structure (including hierarchies) of the Bayesian model and can produce spurious conclusions if the model is poor.
◦ sensitivity analysis to the prior distribution
◦ the appropriateness of the likelihood model (i.e., the sampling distribution)
◦ any hierarchical structure
◦ other issues, such as which explanatory variables or covariates should have been included in a model
◦ any particular feature of the data one wishes to capture
Sensitivity Analysis
• Basic idea: typically more than one reasonable probability model can provide an adequate fit to the data in a scientific problem. Sensitivity analysis examines how much posterior inferences change when other probability models are used in place of the present model.
◦ Do the inferences from the model make sense?
◦ Is the model consistent with the data? Posterior predictive checking
◦ How can we compare or rank different plausible models (including the prior, the likelihood, etc.) in order of preference with respect to a given data set?
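As a minimal sketch of a prior sensitivity check, the Python snippet below compares posterior summaries for a binomial success probability under three Beta priors. The counts (s = 7 successes in n = 20 trials), the three priors, and the helper name `beta_posterior_summary` are illustrative assumptions, not part of the lecture; the only fact relied on is the standard conjugate update Beta(a, b) → Beta(a + s, b + n − s).

```python
import math


def beta_posterior_summary(a, b, s, n):
    """Posterior mean and sd of theta for a Beta(a, b) prior and s successes in n trials."""
    a_post = a + s            # conjugate update of the first shape parameter
    b_post = b + (n - s)      # conjugate update of the second shape parameter
    total = a_post + b_post
    mean = a_post / total
    sd = math.sqrt(a_post * b_post / (total ** 2 * (total + 1)))
    return mean, sd


# Illustrative data and candidate priors (assumptions for this sketch).
s, n = 7, 20
priors = {
    "uniform Beta(1, 1)": (1, 1),
    "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
    "informative Beta(5, 5)": (5, 5),
}

summaries = {
    name: beta_posterior_summary(a, b, s, n) for name, (a, b) in priors.items()
}
for name, (mean, sd) in summaries.items():
    print(f"{name}: posterior mean {mean:.3f}, sd {sd:.3f}")
```

If the summaries differ little across the candidate priors, the inference is insensitive to the prior choice; large differences suggest the prior deserves closer scrutiny.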
Do the inferences from the model make sense?
• In any applied problem, there will be knowledge that is not included formally in either the prior distribution or the likelihood, for reasons of convenience or objectivity. If the additional information suggests that posterior inferences of interest are false, then this suggests a potential for creating a more accurate probability model for the parameters and data collection process.
Is the model consistent with the data? Posterior predictive checking
• If the model fits, then replicated data generated under the model should look similar to observed data.
• The observed data should look plausible under the posterior predictive distribution. Therefore, one needs to check whether an observed discrepancy is due to model misfit or to chance.
• The basic technique is to draw simulated values from the posterior predictive distribution of replicated data and compare these samples to the observed data.
• Any systematic differences between the simulations and the data indicate potential failings of the model.
How can we compare (or rank) different plausible models?
• Model expansion
• Model comparison
Posterior predictive checks (I)
• Let $y = (y_1, y_2, \ldots, y_n)'$ be the observed data and $\theta$ be the set of all parameters (including all hyperparameters) for a model
$$p(\theta \mid y) \propto p(\theta) \times p(y \mid \theta).$$
• Let $y^{\mathrm{rep}} = (y^{\mathrm{rep}}_1, y^{\mathrm{rep}}_2, \ldots, y^{\mathrm{rep}}_n)'$ be the replicated data that we would see if the experiment was replicated with the same model and the same value of $\theta$ that produced the observed data $y$.
• Replicated data $y^{\mathrm{rep}}$, like predictions $\tilde{y}$, has two components of uncertainty:
$$p(y^{\mathrm{rep}} \mid y) = \int p(y^{\mathrm{rep}} \mid \theta)\, p(\theta \mid y)\, d\theta$$
◦ the fundamental variability of the model, represented by the posited variability in the data
◦ the posterior uncertainty in the estimation of θ
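The two components of uncertainty can be made visible with a small simulation. The sketch below assumes, purely for illustration, a Beta(8, 14) posterior for a Bernoulli success probability and n = 20 replicated trials. It compares the spread of the replicated success count when θ is redrawn from the posterior each time against the spread when θ is frozen at its posterior mean. The function names and the number of draws M are hypothetical choices.

```python
import random
import statistics


def replicated_sum_full(a, b, n, rng):
    """Draw theta from Beta(a, b), then count successes in n Bernoulli(theta) trials."""
    theta = rng.betavariate(a, b)
    return sum(rng.random() < theta for _ in range(n))


def replicated_sum_plugin(theta_hat, n, rng):
    """Count successes in n Bernoulli trials with theta fixed at theta_hat."""
    return sum(rng.random() < theta_hat for _ in range(n))


rng = random.Random(0)              # fixed seed so the sketch is reproducible
a, b, n, M = 8, 14, 20, 5000        # illustrative values (assumptions)
theta_hat = a / (a + b)             # posterior mean of theta

full = [replicated_sum_full(a, b, n, rng) for _ in range(M)]
plugin = [replicated_sum_plugin(theta_hat, n, rng) for _ in range(M)]

var_full = statistics.variance(full)      # sampling variability + posterior uncertainty
var_plugin = statistics.variance(plugin)  # sampling variability only
print(f"variance with theta redrawn: {var_full:.2f}")
print(f"variance with theta fixed:   {var_plugin:.2f}")
```

The variance with θ redrawn should be noticeably larger, since it adds the posterior uncertainty about θ on top of the Bernoulli sampling variability.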
Posterior predictive checks (II)
• Test quantities $T(y, \theta)$
◦ measure the discrepancy between model and data in the aspects of the data one wishes to check
◦ Test quantities play the role in Bayesian model checking that test statistics, $T(y)$, play in classical testing. In classical statistics, the test statistic $T(y)$ does not depend on model parameters.
• Tail-area probability
◦ Lack of fit of the data regarding the posterior predictive distribution can be measured by the tail-area probability, or p-value of the test quantity.
◦ It is commonly computed using posterior simulations of $(\theta, y^{\mathrm{rep}})$.
Posterior predictive checks (III)
• Classical p-values (for test statistic $T(y)$):
$$p_C = \Pr(T(y^{\mathrm{rep}}) \ge T(y) \mid \theta)$$
◦ the probability is taken over the distribution of $y^{\mathrm{rep}}$ with $\theta$ fixed.
◦ the test statistic $T(y)$ does not depend upon model parameters.
• Posterior predictive p-values (for test quantity $T(y, \theta)$):
$$p_B = \Pr(T(y^{\mathrm{rep}}, \theta) \ge T(y, \theta) \mid y) = \iint I_{T(y^{\mathrm{rep}}, \theta) \ge T(y, \theta)}\, p(y^{\mathrm{rep}} \mid \theta)\, p(\theta \mid y)\, dy^{\mathrm{rep}}\, d\theta$$
◦ test quantities can be a function of the parameters and the data because the test quantity is evaluated over draws from the posterior distribution of the unknown $\theta$ and $y^{\mathrm{rep}}$.
Example. Checking the assumption of independence in binomial trials (I)
• Consider a sequence of binary outcomes, $y_1, \ldots, y_n$, modeled as iid Bernoulli trials.
• uniform prior distribution on the probability of success, $\theta$.
• the posterior density under the model is
$$p(\theta \mid y) \propto \theta^{s} (1 - \theta)^{n - s}, \quad \text{with } s = \sum_{i} y_i.$$
• Data: 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0.
• The observed autocorrelation seems to be evidence that the model is flawed.
Example. Checking the assumption of independence in binomial trials (II)
• $T(y) = 3$, where the test quantity $T$ is the number of switches between 0 and 1 in the sequence.
• draw θ from its posterior distribution, Beta(8,14).
• draw $y^{\mathrm{rep}} = (y^{\mathrm{rep}}_1, y^{\mathrm{rep}}_2, \ldots, y^{\mathrm{rep}}_n)'$ as independent Bernoulli with probability $\theta$.
• p-value $= \Pr(T(y^{\mathrm{rep}}, \theta) \ge T(y, \theta) \mid y) \approx 0.972$
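The check above can be sketched in a few lines of Python. The snippet assumes that the test quantity is the number of switches between 0 and 1 in the sequence, which reproduces T(y) = 3 for the listed data. The helper name `count_switches`, the seed, and the number of draws M are hypothetical choices, so the estimate will only agree with the quoted p-value up to Monte Carlo error.

```python
import random


def count_switches(seq):
    """Number of adjacent pairs in seq whose values differ (0 -> 1 or 1 -> 0)."""
    return sum(1 for prev, cur in zip(seq, seq[1:]) if prev != cur)


# Observed data from the example.
y = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
n, s = len(y), sum(y)
t_obs = count_switches(y)

rng = random.Random(1)   # fixed seed for reproducibility (assumption)
M = 10000                # number of posterior predictive draws (assumption)
exceed = 0
for _ in range(M):
    # Uniform prior + Bernoulli likelihood gives a Beta(s + 1, n - s + 1) posterior.
    theta = rng.betavariate(s + 1, n - s + 1)
    # Replicated data: n independent Bernoulli(theta) outcomes.
    y_rep = [1 if rng.random() < theta else 0 for _ in range(n)]
    if count_switches(y_rep) >= t_obs:
        exceed += 1

p_value = exceed / M
print(f"T(y) = {t_obs}, estimated posterior predictive p-value = {p_value:.3f}")
```

A p-value this close to 1 says that almost every replicated sequence has at least as many switches as the observed one, that is, the observed data are unusually "sticky" for an independent Bernoulli model.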
$\chi^2$-type discrepancy measure
• Choose a discrepancy measure or test measure
$$T(y, \theta) = \sum_{i=1}^{n} \frac{(y_i - E[y_i \mid \theta])^2}{\mathrm{var}(y_i \mid \theta)}$$
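• Compute $T(y, \theta_j)$ and the set of $T(y^{\mathrm{rep},j}, \theta_j)$ and obtain the posterior predictive p-values:
$$p_B = \Pr(T(y^{\mathrm{rep}}, \theta) > T(y, \theta) \mid y) \approx \frac{1}{M} \sum_{j=1}^{M} 1[T(y^{\mathrm{rep},j}, \theta_j) > T(y, \theta_j)]$$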
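The Monte Carlo estimate above might be implemented as follows. As an illustration only, the sketch reuses the Bernoulli model with a uniform prior from the binomial-trials example, so that E[y_i | θ] = θ and var(y_i | θ) = θ(1 − θ). The function name `chi2_discrepancy`, the seed, and M are hypothetical. For this model the discrepancy depends on the data only through the number of successes, so it checks the overall success rate rather than independence.

```python
import math
import random


def chi2_discrepancy(y, theta):
    """Sum over i of (y_i - E[y_i|theta])^2 / var(y_i|theta) for Bernoulli(theta) data."""
    mean = theta
    var = theta * (1.0 - theta)
    # math.fsum makes the result independent of the order of the terms,
    # so two data sets with the same number of successes give identical values.
    return math.fsum((yi - mean) ** 2 / var for yi in y)


y = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
n, s = len(y), sum(y)

rng = random.Random(2)   # fixed seed for reproducibility (assumption)
M = 5000                 # number of posterior draws (assumption)
exceed = 0
for _ in range(M):
    theta_j = rng.betavariate(s + 1, n - s + 1)                       # posterior draw
    y_rep_j = [1 if rng.random() < theta_j else 0 for _ in range(n)]  # replicated data
    # Compare realized and replicated discrepancies at the same theta_j.
    if chi2_discrepancy(y_rep_j, theta_j) > chi2_discrepancy(y, theta_j):
        exceed += 1

p_b = exceed / M
print(f"estimated posterior predictive p-value: {p_b:.3f}")
```

Because both discrepancies are evaluated at the same posterior draw θ_j, the comparison accounts for parameter uncertainty instead of plugging in a single point estimate.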
Interpreting posterior predictive p-values
• If the observed discrepancy measure has a tail-area probability close to 0 or 1, it implies that the observed pattern would be unlikely to be seen in replications of the data if the model were true.
• An extreme p-value implies that the model cannot be expected to capture this aspect of the data. If a p-value is close to 0 or 1, it is not so important exactly how extreme it is. A p-value of 0.00001 is virtually no stronger, in practice, than 0.001; in either case, the aspect of the data measured by the test quantity is inconsistent with the model.
• Major failures of the model, typically corresponding to extreme tail-area probabilities (less than 0.01 or more than 0.99), can be addressed by expanding the model in an appropriate way. Lesser failures might also suggest model improvements, or might be ignored in the short term if the failure appears not to affect the main inferences.
• The p-value measures statistical significance, not practical significance.
Limitations of posterior tests
• Finding an extreme p-value and thus rejecting a model is never the end of an analysis; the departures of the test quantity in question from its posterior predictive distribution will often suggest improvements of the model or places to check the data.
• Conversely, even when the current model seems appropriate for drawing inferences, the next scientific step will often be a more rigorous experiment incorporating additional factors, thereby providing better data.