I thank the authors for reminding us that inference from designed experiments is about the choice of models as well as about estimating parameters. In their general set-up there are two models, M1 and M2, with M2
contained in M1. If there are n data, then there are n – dim(M1) degrees of freedom for pure error, and dim(M1) – dim(M2) degrees of freedom for lack of fit. This approach respects marginality (Nelder, 1977, 1994), and avoids the problems caused by routinely pooling small mean squares (Janky, 2000).
However, I do not think that they go far enough. Designed experiments typically have far more than two potential linear models for the expectation of the responses. For example, Fig. 3 shows the collection of models, and their relationships, in the comparatively simple case of a factorial experiment with two three-level quantitative factors.
How do we analyse data? We start with a family of models and their marginality relations, like those in Fig. 3. We use F-tests to choose the smallest model which explains the data adequately. Then we estimate the parameters of the chosen model. We need an optimality criterion which takes account of the whole process.
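As a minimal illustration of this two-model analysis, the sketch below (hypothetical data and model matrices; the nesting of M2 within M1 is assumed) computes the lack-of-fit F-test with n − dim(M1) pure error degrees of freedom and dim(M1) − dim(M2) lack-of-fit degrees of freedom.

```python
# Sketch: lack-of-fit F-test for a submodel M2 nested in a full model M1,
# using pure error from the residual of M1 (hypothetical data and matrices).
import numpy as np
from scipy import stats

def rss(X, y):
    """Residual sum of squares from an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def lack_of_fit_test(X1, X2, y):
    """F-test of submodel M2 (matrix X2) against the full model M1 (matrix X1)."""
    n = len(y)
    df_pe = n - np.linalg.matrix_rank(X1)                           # pure error df
    df_lof = np.linalg.matrix_rank(X1) - np.linalg.matrix_rank(X2)  # lack-of-fit df
    ss_pe = rss(X1, y)
    ss_lof = rss(X2, y) - ss_pe
    F = (ss_lof / df_lof) / (ss_pe / df_pe)
    p = 1.0 - stats.f.cdf(F, df_lof, df_pe)
    return F, p, df_lof, df_pe
```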
I have two small comments about the designs for experiments conducted in blocks (Section 3.3). It is not clear whether the authors’ algorithm includes blocks from the beginning or whether it first finds the best treatment design and then blocks it. The latter method can produce designs which are far from optimal.
If blocks are random, the authors recommend that we should construct the design as if they are fixed. An alternative is to find a design which is good throughout a plausible range of the ratio of stratum variances (Bailey, 1999).
Timothy Waite and David Woods (University of Southampton)
We thank the authors for their paper, which has made us think more deeply about optimality criteria and their justification.
Theoretical justifications
Traditional D-optimality and the authors’ new DP-optimality both have statistical justifications when the submodel (2) is true, as the confidence regions on which they are based are both valid for β in this case.
Fig. 3. Collection of expectation models for a factorial experiment with two three-level quantitative factors
Moreover, the confidence region from using the pooled error estimate will typically be smaller than that obtained by using the pure error estimate, and so inference will be more efficient in the former case.
In practice, we can never be completely confident that the submodel (2) is correct. However, suppose that we can make the weaker assumption that the full treatment model (1) holds. Then
$$ R_{\mathrm{p.e.}} = \{\beta : (\hat{\beta} - \beta)^{\mathrm{T}} X^{\mathrm{T}}X (\hat{\beta} - \beta) \le p s^2 F_{p,d;1-\alpha}\}, $$
whose volume is minimized by a DP-optimal design, is in fact a valid 100(1 − α)% confidence region for the vector βs of pseudotrue values of the submodel parameters. These are the values of β in submodel (2) which are ‘least bad’ (Davison (2003), pages 147–148) and are given by
$$ \beta_s = (X^{\mathrm{T}}X)^{-1} X^{\mathrm{T}} E(Y), $$
where the expectation is taken with respect to the true (full treatment) model. This robustness of the inference based on pure error gives us a more rigorous reason to adopt DP- and similar criteria.
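As a small numerical sketch of the pseudotrue values βs = (X^TX)^{-1}X^T E(Y), the toy design and full-model mean below are hypothetical; they merely show the submodel intercept absorbing curvature that the submodel omits.

```python
# Sketch: pseudotrue submodel parameters beta_s = (X'X)^{-1} X' E(Y),
# where E(Y) comes from a hypothetical full treatment model with curvature.
import numpy as np

x = np.array([-1.0, -1.0, 0.0, 1.0, 1.0])        # hypothetical one-factor design
X = np.column_stack([np.ones_like(x), x])        # submodel (2): intercept + linear term
EY = 1.0 + 2.0 * x + 0.5 * x ** 2                # true mean under the fuller model

beta_s = np.linalg.solve(X.T @ X, X.T @ EY)      # 'least bad' values of the submodel parameters
print(beta_s)                                    # [1.4, 2.0]: the intercept absorbs the curvature
```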
Compromise designs
If we believe that the full treatment model (1) holds and are interested in whether the submodel (2) is an appropriate approximation, a compromise design incorporating model uncertainty might be a more realistic comparator for the (DP)S-design than the naive DS-optimal design. Let us assign probability γ that model (2) is true, assuming that model (1) is true otherwise. Then we can find the compromise design ξC which maximizes the compound objective function
$$ \Phi(\xi) = \gamma \log\{|(M^{-1})_{22}|^{-1/p_2}\} + (1 - \gamma)\log(|M_\mu|^{1/t}), $$
where Mμ is the information matrix for model (1), and
$$ |M_\mu|^{1/t} = \prod_{i=1}^{t} n_i^{1/t}. $$
See Läuter (1974) and Atkinson and Cox (1974) for related criteria.
The optimal design for γ = 0 is the balanced design for the full treatment model, and for γ = 1 we obtain the DS-optimal design. Clearly, for 0 < γ < 1, the second term in Φ(ξ) will act as a penalty which encourages replication.
We found designs for example 1. For 0 < γ ≤ 0.67, ξC = ξ(DP)S and, for γ = 0.7, ξC = ξ(AP)S. For γ = 0.8, 0.9, ξC is highly DS-efficient (99%) but with df(PE) = 2. Thus explicit acknowledgement of model uncertainty within the framework of traditional criteria can result in designs similar to those proposed by the authors and allows a compromise between experimental objectives.
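A rough sketch of how the compound objective Φ(ξ) above might be evaluated for a candidate exact design is given below; the model matrix, treatment grouping and parameter indices are hypothetical, and |Mμ|^{1/t} is computed as the geometric mean of the replications ni, as in the display above.

```python
# Sketch: compound objective Phi(xi) = gamma*log{|(M^{-1})_22|^{-1/p2}} + (1-gamma)*log(|M_mu|^{1/t})
# for a candidate exact design (hypothetical inputs).
import numpy as np

def compound_objective(X, groups, p2_idx, gamma):
    """X: model matrix for submodel (2); groups: treatment label of each run;
    p2_idx: indices of the p2 parameters of interest; gamma: weight on model (2)."""
    M = X.T @ X                                     # information matrix for submodel (2)
    M22 = np.linalg.inv(M)[np.ix_(p2_idx, p2_idx)]  # (M^{-1})_22 block
    p2 = len(p2_idx)
    term1 = np.log(np.linalg.det(M22) ** (-1.0 / p2))
    _, counts = np.unique(groups, return_counts=True)
    t = len(counts)
    term2 = np.sum(np.log(counts)) / t              # log(|M_mu|^{1/t}) = log geometric mean of n_i
    return gamma * term1 + (1.0 - gamma) * term2
```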
Mervyn Stone (Ruislip)
This paper foresees a healthy future for its compound design criteria, thereby defying the now historical predictions that the Third Millennium would be Bayesian for everything. The authors refer to run-to-run variation but would, it seems, treat each run as a one-off experiment to be addressed by ‘classical’ inference tools. They do not consider the (strictly un-Bayesian) empirical Bayes approach, which would use those tools to estimate a ‘prior’ Π for the full parameterization (including β) of any sequence of runs, and deliver a multipurpose ‘posterior’ for β from the results of the chosen design for each run. A unique design criterion can be defined as the gain ΔI in expected Shannon information I about β. ΔI is invariant under 1–1 transformation of β and may appeal to any classical practitioner who cannot decide on the weights κ1, κ2, κ3 and κ4.
Consider the simple empirical Bayes case of an error variance σ² assumed constant (either known or reliably estimated from ‘pure’ replication in a large number of runs) rather than varying from run to run (and taken into Π). Suppose that Y is N(Xβ, Iσ²) and that β is N(a, Aσ²). The prior value of I is
$$ \int \log\{\pi(\beta)\}\, \mathrm{d}\Pi(\beta) $$
and the expected posterior value is
$$ \int\!\!\int \log\{\pi(\beta \mid Y)/\pi(\beta)\}\, \mathrm{d}\Pi(\beta \mid Y)\, \mathrm{d}P(Y), $$
where π denotes a density and P is the marginal distribution of Y under Π. By Stone (1959), the gain in I is
$$ \Delta I = \tfrac{1}{2}\log|I + A X^{\mathrm{T}}X| = \tfrac{1}{2}\log(S), $$
say. The determinant S shows the influence of the estimated ‘prior’ and also reminds us that ‘ignorance priors’ have no place in experimental design, since such priors would here dictate an infinity in A that could dominate the choice of X.
Example
Suppose that E(Y|β) = α + βx + γx² and A = diag(u, v, w). With n = 4, design A is the three-point 1–2–1 design on x = 0, 1, 2. The superficially riskier design B, 2–0–2, is willing to leave the centre point unobserved. My algebra gives the respective S-values as 1 + 4u + 6v + 18w + 8uv + 36uw + 8vw + 8uvw and 1 + 4u + 8v + 32w + 64uw + 16uv. The difference in favour of design A is 2w(4uv + 4v − 14u − 7) − 2v − 8uv, which is negative for w small. Sufficiently small w/v may be seen as knowledge that γ is sufficiently close to 0 for curvature to be ignored relative to linearity; if u = v = 10, this happens when w/v < 82/586 ≈ 1/7.
However, if 4v < (14u + 7)/(u + 1), there is no value of w that would favour 1–2–1 over 2–0–2, which is food for thought (for significance testers) when w/v is large.
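The determinants above are easy to check numerically; in the sketch below the prior variances u, v and w are illustrative.

```python
# Sketch: S = |I + A X'X| for Stone's designs A (1-2-1) and B (2-0-2),
# with E(Y|beta) = alpha + beta*x + gamma*x^2 and A = diag(u, v, w).
import numpy as np

def S(xs, u, v, w):
    X = np.column_stack([np.ones(len(xs)), xs, np.square(xs)])
    A = np.diag([u, v, w])
    return np.linalg.det(np.eye(3) + A @ X.T @ X)

u, v, w = 10.0, 10.0, 1.0                  # illustrative prior variances
S_A = S([0.0, 1.0, 1.0, 2.0], u, v, w)     # 1-2-1 design on x = 0, 1, 2
S_B = S([0.0, 0.0, 2.0, 2.0], u, v, w)     # 2-0-2 design
print(S_A, S_B)                            # gain difference is 0.5*(log(S_A) - log(S_B))
```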
Frank Critchley (The Open University, Milton Keynes)
It is a pleasure to welcome this paper, with its emphasis on optimal design in practice, in particular, its argument for, where possible, separating lack of fit from true error. Overall, the paper illustrates strongly the truism that the way a problem is formulated affects—focuses, limits, (potentially) prejudges, . . . —the form or outcomes of its subsequent analysis. This is certainly so for the compound criteria of the paper and raises the following questions.
(a) What are the effects—intended, or otherwise—of different possible choices of such criteria?
(b) In particular, what are the effects of the multiplicative form used?
(c) In this case of positive power transforms mapping (0, 1) to (0, 1), is there, if not a full implied utility function, at least an implied set of non-linear scales on which trade-offs between (transformed) efficiencies can be made? If so, are these interpretable in some way?
(d) Collecting the exponents into a vector κ, what more can be said, either theoretically or practically, about how a κ-optimal design varies with κ? Where are the (discrete) discontinuities? Is some form of sensitivity analysis possible?
Finally, I have two questions about possible extensions.
(a) Throughout, the paper assumes a common variance across observations. In common with earlier discussants, I wonder how far heterogeneity could be accommodated.
(b) The methodology presented is invariant to overall scale changes in X. If or where appropriate, is (some form of) affine invariance achievable?
It will be clear that I warmly welcome both tonight’s paper and the wide discussion and further developments that it will undoubtedly bring. I thank you for it.
Martin Ridout (University of Kent, Canterbury)
I have enjoyed reading this interesting and carefully argued paper. One issue that the authors do not discuss is the robustness of their designs to missing values. For the new design criteria that are considered here, a missing value may change the pure error degrees of freedom as well as modifying the X-matrix.
In small designs, even a single missing value can cause the design to break down in the sense that the matrix X^TX becomes singular. In Table 1, for example, design IV breaks down if any of the first four (corner) points is missing and design II breaks down if any of the design points 5, 8, 15 or 16 is lost.
Even in larger designs, the loss of a single point can have a substantial effect. For example, in Table 5, design IV is (AP)S-optimal but, if design point 10 is lost, the (AP)S-efficiency drops to 0.793, below the corresponding minimal (AP)S-efficiency of 0.852 for design V (Table 8), which arises when point 37 is missing.
A modification of design IV that replaces the original design points 34 and 36 by replicates of points 1 and 10 is more robust. The (AP)S-efficiency of the full design is 0.992 and the (DP)S-efficiency is 0.954, which is slightly better than design IV. The lowest (AP)S-efficiency following loss of a single point is 0.865.
So I wonder whether it would be useful to incorporate some measure of efficiency in the presence of missing values into composite design criteria, for situations where missing values might occur and where it is difficult to repeat missing experimental runs. In several of the designs in this paper, the loss of efficiency from a single missing value is much greater for a small subset of points than for the majority of points and designing to minimize the maximum loss (e.g. Ahmad and Gilmour (2010)), although natural, might be unduly cautious. From a practical point of view, it may be worth alerting the experimenter to any critical design points, in case it is possible to take extra precautions to avoid losing these runs.
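A hedged sketch of the kind of check described above is given below: it recomputes a simple D-efficiency proxy after deleting each run in turn, so as to flag critical design points. The design matrix is left to the user, and the criterion is plain D rather than the authors' (DP)S or (AP)S criteria, which would additionally track the pure error degrees of freedom.

```python
# Sketch: flag design points whose loss most damages |X'X| (a plain D-criterion proxy;
# the (DP)S and (AP)S criteria would additionally track the pure error degrees of freedom).
import numpy as np

def leave_one_out_d_eff(X):
    """Relative D-efficiency of the design after deleting each single run."""
    n, p = X.shape
    full = np.linalg.det(X.T @ X)
    effs = []
    for i in range(n):
        Xi = np.delete(X, i, axis=0)
        d = np.linalg.det(Xi.T @ Xi)
        effs.append(0.0 if d <= 0 else (d / full) ** (1.0 / p))  # 0 flags a breakdown (singular X'X)
    return np.array(effs)
```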
Ben Torsney (University of Glasgow)
The authors are to be commended for raising the profile of experimental design, and for doing so by extending, not regressing, the subject. They do themselves an injustice in suggesting ‘that the D-criterion has no place in the design of . . . experiments’. This is for two main reasons.
(a) On the positive side they have extended standard criteria to df(PE;LoF)-dependent criteria which are on a par with parameter-dependent non-linear local design criteria.
(b) On the negative side, the new criteria rely heavily on the assumption of normality, whereas the standard criteria can be viewed as focusing on the properties of ordinary least squares estimation, although they have interpretations under normality. There is no avoiding this. If there is no normality, there are no degrees of freedom. If it is argued that we can be relaxed about the issue, then surely this also applies to standard criteria. Conversely, if normality is seen to be crucial, but is in doubt, then it may be safer to resort to standard criteria.
Moreover there may be further extended roles for these criteria; for example, optimize one subject to a
‘start-up’ design which ensures a given df(PE;LoF) structure. This compares with a design which is optimal for a second stage, given a first-stage design; see Covey-Crump and Silvey (1970). Designing subject to constraints is not new.
The authors’ designs can be viewed as made up of two component designs: one for pure error or lack of fit and one for parameter inference. One might opt for a D-optimal design for the latter, subject to a given design for the former, or maybe vice versa. For example, in the (AP)S-design III of Table 1, do the six singly replicated design points define a constrained optimal design, given the five doubly replicated points? Possibly the two component designs can be chosen to be jointly optimal for a given number of runs for each: another compound criterion.
Finally it is the variance term which has given rise to the authors’ new criteria. What extensions might there be for mixed effects models with several variance terms? Of course in generalized linear models, there is no variance term. So standard design criteria can reign supreme for the time being. I hope that this is of as much comfort to the authors as it is to me! I thank them for their stimulating paper.
The following contributions were received in writing after the meeting.
Christine M. Anderson-Cook and Lu Lu (Los Alamos National Laboratory)
We commend the authors on an excellent paper and for challenging the commonly accepted assertion that D-optimality is the appropriate metric for constructing optimal designs. We agree that this single-minded strategy often leads to undesirable solutions, unless augmented (often arbitrarily) to compensate for other properties. Devoting design resources to obtaining a model-independent estimate of error variance is an important priority which has not been consistently articulated in the literature.
The new metrics proposed arise naturally from common design analyses and seek to balance efficient estimation of model parameters, unbiased estimation of error variance and the ability to assess lack of fit. They help to broaden the focus away from exclusively D-optimality in a helpful way. Lu and Anderson-Cook (2012) propose alternative metrics for considering the same three objectives in design construction. Our metrics are simpler than those proposed by the authors but also seek to capture these three key elements of a good design, which we contend represent a common set of priorities for many screening design situations.
Fig. 4. Set of graphical tools for facilitating decision making by showing the trade-off between criteria values for promising designs, robustness of best designs to different weight combinations of the three criteria and relative efficiency of an individual design’s desirability compared with the best possible for different weightings: (a) trade-off plot (D-efficiency; pure error degrees of freedom; lack of fit); (b) mixture plot; (c) synthesized efficiency plot
The compound criterion designs proposed by Gilmour and Trinca nicely moderate the extremes found by focusing on a single objective. We understand that any combination of weights could have been used for the optimization, but we consider one or two particular choices to oversimplify what is needed. A key aspect of design construction and selection is how to combine the metrics! We favour keeping the metrics separate and using a Pareto front optimization approach (Lu et al., 2011) to assemble a suite of candidate designs which are best for different weightings of the metrics. In our experience, specification of the relative weightings of different aspects of goodness is usually quite subjective and imprecise. Although computationally somewhat more demanding than optimizing a single objective, the Pareto approach allows this user uncertainty to be explored flexibly. Decision making when balancing different criteria feels most natural when a set of promising designs is assembled for comparison and then their relative merits are evaluated numerically and graphically. Fig. 4 shows several graphics developed in Lu and Anderson-Cook (2012) and Lu et al. (2011) to help experimenters to examine trade-offs between candidates (Fig. 4(a)), to identify the best designs for different weight combinations (Fig. 4(b)) and to evaluate the synthesized efficiency of any potential final design relative to the best among all competitors (Fig. 4(c)).
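As a minimal sketch under the assumption that each candidate design has already been scored on criteria oriented so that larger is better (the scores below are hypothetical), a Pareto front can be assembled as follows.

```python
# Sketch: Pareto front over candidate designs, each scored on several criteria
# (all criteria oriented so that larger is better; the scores are hypothetical).
import numpy as np

def pareto_front(scores):
    """Return the indices of designs that are not dominated by any other design."""
    scores = np.asarray(scores, dtype=float)
    front = []
    for i, s in enumerate(scores):
        dominated = any(np.all(t >= s) and np.any(t > s)
                        for j, t in enumerate(scores) if j != i)
        if not dominated:
            front.append(i)
    return front

# columns might be D-efficiency, scaled pure error df and a scaled lack-of-fit measure
candidates = [[0.95, 0.2, 0.3], [0.80, 0.6, 0.5], [0.70, 0.9, 0.6], [0.65, 0.5, 0.4]]
print(pareto_front(candidates))   # [0, 1, 2]: the last design is dominated and is dropped
```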
We wholeheartedly endorse considering multiple aspects of design goodness but feel that there are several good options for how to proceed from this starting point.
Alexis Boukouvalas and Dan Cornford (Aston University, Birmingham)
We thank the authors for a thought-provoking paper. Our work has been focusing on optimal design for
estimation of covariance parameters of Gaussian processes (Boukouvalas, 2011) and is based on the work of Zhu and Stein (2005).
In the correlated error case, there is no general result under infill asymptotics to prove the optimality of using the log-determinant of the Fisher information matrix as a design criterion. In our work we have empirically observed that, under complex noise models, the approximation error of the Fisher information matrix to the log-determinant of the covariance of maximum likelihood parameter estimates increases as the ratio of single to multiple replicated points is increased.
Do the authors believe that their paper can be extended to this domain and to help to explain the increase in the approximation error? Do they believe that this is due to biased estimation of the covariance parameters?
Chris Brien (University of South Australia, Adelaide)
I applaud the authors for developing the new optimality criteria. They are essential when using software packages to search for designs and the proposed compound criteria are particularly appealing for their ability to deal with multiple objectives.
However, I believe that there are two distinct issues underlying Section 2:
(a) the need for a pure error estimate and
(b) the desirability of using a pooled versus a pure estimate of the error variance.
In terms of (a), a good estimate of pure error is required for testing lack of fit, at least, and the new
‘P’-criteria help to ensure that it is available. As for (b), an objective approach exists in the form of deciding between ‘never-pool’, ‘sometimes-pool’ and ‘always-pool’ procedures for estimating the error variance.
Janky (2000) has provided a technical review of the area and Hines (1996) a review of the practice. The always-pool procedure is generally not considered viable.
Relevant to the paper is the pooling of fixed effects with a pure error estimate, which both Mead et al.
(1975) and Hines (1996) addressed. Rather than the never-pool option that the authors settle on, the recommendations of Mead et al. (1975) appear, on balance, to have worthwhile benefits. Their investigation of the size and power of the competing procedures suggests that, in the context of the paper,
(i) the sometimes-pool procedure should only be used if the lack-of-fit degrees of freedom are about equal to or considerably larger than the pure error degrees of freedom and
(ii) the never-pool procedure is recommended if the lack-of-fit degrees of freedom are smaller than the pure error degrees of freedom, and the latter are reasonably large.
The authors generally commend designs with approximately equal pure error and lack-of-fit degrees of freedom. On the basis of Mead et al. (1975), the sometimes-pool procedure, with a preliminary test for lack of fit that uses α = 0.50, can be expected to provide useful gains in power for such designs. Design VI for example 4 and design VI for example 6 fit into this category. Similarly, the sometimes-pool procedure would be used for exercise 11.6 of Box and Draper (2007) discussed in Section 2; because the p-value for the lack-of-fit test is 0.73, and so greater than 0.50, the use of the pooled estimate s²p is indicated. However, the pooled estimate would never be used with design IV for example 5 because of the very small lack-of-fit degrees of freedom.
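A sketch of the sometimes-pool rule described above is given below, assuming the lack-of-fit and pure error sums of squares and degrees of freedom are already available; the preliminary test at α = 0.50 follows Mead et al. (1975) as cited.

```python
# Sketch: sometimes-pool estimate of the error variance, with a preliminary
# lack-of-fit test at alpha = 0.50 (Mead et al., 1975) deciding whether to pool.
from scipy import stats

def error_variance(ss_lof, df_lof, ss_pe, df_pe, alpha=0.50):
    """Pool lack of fit with pure error unless the preliminary test signals lack of fit."""
    F = (ss_lof / df_lof) / (ss_pe / df_pe)
    p_value = 1.0 - stats.f.cdf(F, df_lof, df_pe)
    if p_value > alpha:                                   # no evidence of lack of fit: pool
        return (ss_lof + ss_pe) / (df_lof + df_pe), "pooled"
    return ss_pe / df_pe, "pure error only"
```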
Elvan Ceyhan (Koç University, Istanbul)
I congratulate the authors for this interesting paper on optimal experimental design, especially for developing compound criteria which incorporate multiple objectives.
Kiefer (1975) introduced the concept of universal optimality and defined the conditions for a design to be universally optimal. Kiefer’s optimality conditions were generalized by Yeh (1986) for binary block designs. Kiefer’s optimality criteria include many criteria such as D- and A-optimality among others. One wonders whether the criteria introduced in the paper (such as DP and AP) are Kiefer universally optimal.
The authors write ‘quite different designs can be optimal under the new criteria’. This is to be expected, because all traditional criteria agree on several design classes in the sense of Kiefer’s optimality, which suggests that optimality is more robust to changes in the optimality criterion than to changes in the model.
The authors use Bonferroni adjustment to correct for multiple testing; other correction methods (e.g. Šidák’s correction) could also be employed, as the Bonferroni method is known to cause a reduction in power. Let β be the level of significance for each of n tests; for experimentwise error rate α, the Bonferroni method suggests βB = α/n, whereas Šidák’s method suggests βS = 1 − (1 − α)^{1/n}. However, Šidák’s method requires independence of the tests, and the difference seems negligible (for example, with n = 9 as in example 1 and α = 0.05, βB ≈ 0.0056 and βS ≈ 0.0057). In this context a single-step correction is more sensible, and a step-down approach like Holm’s (1979) correction may not be immediately applicable.
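The two per-test levels are quick to verify numerically; the snippet below reproduces the n = 9, α = 0.05 case mentioned above.

```python
# Sketch: per-test significance levels under the Bonferroni and Sidak corrections.
n, alpha = 9, 0.05
beta_bonferroni = alpha / n                     # approximately 0.0056
beta_sidak = 1.0 - (1.0 - alpha) ** (1.0 / n)   # approximately 0.0057
print(beta_bonferroni, beta_sidak)
```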
To form a compound criterion, perhaps the desirability function of Harrington (1965) can also be employed (with some modifications) by transforming the functions to a common scale in [0, 1]. The transformed functions are then combined and optimized as the overall metric. Desirability functions are popular in response surface methodology and have previously been applied in experimental design (see, for example, Rafati and Mirzajani (2011) and Azharul Islam et al. (2009)). For example, the efficiency functions in Section 4.1, Ei, i = 1, 2, 3, 4, can be transformed to functions di whose range is [0, 1] (and the value of di increases as the desirability increases). For example, if a response is to be maximized, the desirability function
$$ d_i(E_i) = \begin{cases} 0 & \text{if } E_i < A_i, \\ \left(\dfrac{E_i - A_i}{B_i - A_i}\right)^{w_i} & \text{if } A_i \le E_i < B_i, \\ 1 & \text{if } E_i \ge B_i, \end{cases} \qquad (7) $$
can be used, where Ai, Bi and wi are chosen by the researcher. So the goal is to optimize D = (d1 d2 d3 d4)^{1/Σwi}, which is very similar to equation (5) in the text. For desirability functions where the response is to be minimized, or where ‘target is best’ is needed, see, for example, Derringer and Suich (1980).
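As a sketch, the ‘larger is better’ desirability of equation (7) and the combined index D can be written directly; the constants Ai, Bi and wi and the efficiency values below are illustrative.

```python
# Sketch: 'larger is better' desirability of equation (7) and the combined index D
# (the constants A_i, B_i, w_i and the efficiency values are illustrative).
import numpy as np

def desirability(E, A, B, w):
    if E < A:
        return 0.0
    if E >= B:
        return 1.0
    return ((E - A) / (B - A)) ** w

effs = [0.90, 0.75, 0.60, 0.85]                   # hypothetical efficiencies E_1, ..., E_4
A, B, w = [0.5] * 4, [1.0] * 4, [1.0] * 4         # user-chosen cut-offs and weights
d = [desirability(E, a, b, wi) for E, a, b, wi in zip(effs, A, B, w)]
D = np.prod(d) ** (1.0 / sum(w))                  # overall index, cf. equation (5) in the paper
print(d, D)
```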
Marion J. Chatfield (GlaxoSmithKline, Stevenage)
In the pharmaceutical industry experimental design is used to develop, understand and validate processes.
Optimal design is becoming increasingly important, as it allows the designer to cater for aspects such as constrained regions, specific models and flexibility in the number of runs.
I welcome this paper as another step towards providing optimal designs which meet the practical needs of industry because
(a) it is usually desirable to have a pure error estimate (excepting screening of many factors) and
(b) better composite criteria could produce appropriate optimal designs with reduced designer time and knowledge.
Industrial applications usually seek a pure error estimate, albeit often with poor precision. Although Bayesian analysis is rarely applied (and is beyond most scientists), informally an experimenter compares their prior expectations with the estimated fixed effects, the model lack of fit and the variability observed. The pure error estimate is used as a benchmark against which to assess lack of fit, and to signal when variation is higher than expected (indicating a problem with the experimentation). Classical designs which include repeat points (unlike D-optimal designs) are popular, though the repeats are typically centre points. The ability to spread repeats through the design region, as observed by using the DP(α) criterion, would be desirable, providing more protection against missing or rogue observations and an ‘average’ level of the background variation (which is useful in the case of heterogeneity).
‘Efficiency’, in industry, includes the time to produce the design, to plan and implement the experimental work, and to analyse and interpret the results, as well as the quality and use of the information gained. Classical designs are often used, especially by trained scientists, as they seem ‘safe’ designs, incorporating many desirable properties (Box and Draper, 1975), and are relatively quick to design and analyse (no detailed evaluation is required and they are easily analysed in commercial software). Appropriate composite criteria, making optimal design more accessible and encompassing, are desirable.
I have the following specific comments.
(a) Often in small designs lack of fit is combined with pure error to improve power, given a prior expectation that variation is small and inference is just one objective. The consequent unknown bias in the ‘error’ estimate is perceived as a reasonable risk taken to reduce resource. The authors’ strong recommendation to base inference on pure error, and not to take the risk of bias in exercise 11.6 of Box and Draper (2007), seems practically too severe.
(b) Compound criteria in which the lack-of-fit component estimates some higher order parameters (e.g. DuMouchel and Jones (1994) and Jones and Nachtsheim (2011)) would be useful.