To form a compound criterion, the desirability function of Harrington (1965) could also be employed (with some modifications) by transforming the functions to a common scale in [0, 1]. The transformed functions are then combined and optimized as the overall metric. Desirability functions are popular in response surface methodology and have previously been applied in experimental design (see, for example, Rafati and Mirzajani (2011) and Azharul Islam et al. (2009)). For example, the efficiency functions in Section 4.1, E_i, i = 1, 2, 3, 4, can be transformed to functions d_i whose range is [0, 1] (the value of d_i increasing as the desirability of the function increases). If a response is to be maximized, the desirability function

d_i(E_i) = 0                                   if E_i < A_i,
d_i(E_i) = {(E_i − A_i)/(B_i − A_i)}^{w_i}     if A_i ≤ E_i < B_i,        (7)
d_i(E_i) = 1                                   if E_i ≥ B_i,

can be used, where A_i, B_i and w_i are chosen by the researcher. So, the goal is to optimize D = (d_1^{w_1} d_2^{w_2} d_3^{w_3} d_4^{w_4})^{1/Σw_i}, which is very similar to equation (5) in the text. For desirability functions where the response is 'minimized' or the 'target is best' is needed, see, for example, Derringer and Suich (1980).
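As a concrete illustration of equation (7) and of the combined metric D, the transformation can be sketched as follows (a minimal sketch only; the constants A_i, B_i, w_i and the example efficiency values are invented for illustration):

```python
import numpy as np

# Sketch of equation (7) and the overall metric D. The constants A_i, B_i,
# w_i and the example efficiency values below are invented for illustration.

def desirability(E, A, B, w=1.0):
    """d_i(E_i): 0 below A_i, 1 above B_i, power-law ramp in between."""
    if E < A:
        return 0.0
    if E >= B:
        return 1.0
    return ((E - A) / (B - A)) ** w

def overall_D(effs, A, B, w):
    """Weighted geometric mean D = (prod_i d_i^{w_i})^{1 / sum_i w_i}."""
    d = [desirability(E, a, b, wi) for E, a, b, wi in zip(effs, A, B, w)]
    if min(d) == 0.0:
        return 0.0                      # any fully undesirable E_i kills D
    return float(np.prod([di**wi for di, wi in zip(d, w)])) ** (1.0 / sum(w))

# Four efficiencies transformed to [0, 1] and combined into one criterion
D = overall_D([0.9, 0.7, 0.6, 0.95], A=[0.5]*4, B=[1.0]*4, w=[1, 1, 1, 1])
```

The geometric mean rewards designs that are simultaneously desirable on every E_i; a zero on any single component drives the whole criterion to zero, which is exactly the behaviour a compound criterion needs.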

**Marion J. Chatﬁeld (GlaxoSmithKline, Stevenage)**

In the pharmaceutical industry experimental design is used to develop, understand and validate processes.

Optimal design is becoming increasingly important, as it allows the designer to cater for aspects such as constrained regions, speciﬁc models and ﬂexibility in the number of runs.

I welcome this paper as another step towards providing optimal designs which meet the practical needs of industry because

(a) it is usually desirable to have a pure error estimate (excepting screening of many factors) and
(b) better composite criteria could produce appropriate optimal designs with reduced designer time and knowledge.

Industrial application usually seeks a pure error estimate, albeit often with poor precision. Although
Bayesian analysis is rarely applied (and beyond most scientists), informally an experimenter compares
their prior expectations with estimated fixed effects, model lack of fit and the variability observed. The pure error estimate is the benchmark against which lack of fit is assessed, and it signals when variation is higher than expected (indicating a problem with the experimentation). Classical designs which include repeat points (unlike D-optimal designs) are popular, though repeats are typically centre points. The ability to spread repeats through the design region, as observed by using the DP(α) criterion, would be desirable, providing more protection against missing or rogue observations and an 'average' level of the background variation (which is useful in the case of heterogeneity).

'Efficiency', in industry, includes the time to produce the design, to plan and implement the experimental work, and to analyse and interpret the results, as well as the quality and use of the information gained. Classical designs are often used, especially by trained scientists, as they seem 'safe' designs, incorporating many desirable properties (Box and Draper, 1975), and are relatively quick to design and analyse (detailed evaluation is not required and they are easily analysed in commercial software). Appropriate composite criteria making optimal design more accessible and encompassing are desirable.

I have the following speciﬁc comments.

(a) Often in small designs lack of fit is combined with pure error to improve power, given prior expectation that variation is small and inference is just one objective. The consequent unknown bias in the 'error' estimate is perceived as a reasonable risk to reduce resource. The authors' strong recommendation to base inference on pure error, and not to take the risk of bias in exercise 11.6 of Box and Draper (2007), seems practically too severe.

(b) Compound criteria including lack-of-ﬁt estimating some higher order parameters (e.g. DuMouchel and Jones (1994) and Jones and Nachtsheim (2011)) would be useful.

producing spuriously high or low apparent precision, and their use is quite sensitive to the often, but not always, unimportant normality assumption. How often is it that there is no external information about error variance? In traditional agricultural field trials, where typically more degrees of freedom are available for error estimation, I believe the practice was to make informal comparison with the coefficient of variation of yield anticipated from experience. In the present context estimates of variance, individually imprecise, might become available from a chain of related studies; what would the authors recommend in such cases?

One possibility would be a partially (empirical) Bayes approach in which estimation of the current variance would be from its posterior distribution, whereas the primary parameters would, unless there were good reason otherwise, be regarded as unknown constants.

Do the authors have any comments on experiments with a Poisson- or binary-distributed response, the role here of error estimation being the assessment of overdispersion?

**David Draper (University of California, Santa Cruz)**

The subject of this interesting paper—optimal experimental design for industrial research—would seem to be a good task for Bayesian decision theory, for which there are two approaches, involving

(a) non-adaptive and (b) adaptive designs.

In the cassava bread example, for instance, in case (a), consider initially a single organoleptic characteristic
such as ‘pleasing taste’, and let T > 0 be the mean value of this characteristic for wheat-based white bread
in the population of potential customers for the new cassava bread product. Each choice of the control
variables X = (X_1, X_2, X_3) has associated with it a population mean θ_X on the pleasing taste scale, which is estimated in an unbiased manner by each design point with control variable values equal to X; let θ collect all the θ_X-values (there are 3^3 = 27 of them in the authors' formulation). The action space A consists of vectors (a_1, a_2, ...) of non-negative integers a_i = (n_1, ..., n_27), keeping track of the numbers of runs made under action a_i at each of the 27 control variable settings. Any sensible utility function U(a, θ) for this problem would have two components:

(i) a problem-relevant measure of the distance D_θ ≥ 0 between θ and T, and
(ii) a measure of the total cost C_a > 0 of all the runs made under strategy a.

These would need to be combined into a single real-valued utility measure by trading off accuracy against cost, e.g. through U(a, θ) = −{λ D_θ + (1 − λ) C_a} for a λ ∈ [0, 1] that is appropriate to the problem context; the optimal design then maximizes expected utility, where the expectation is over uncertainty quantified by the posterior distribution for θ. In the adaptive case (b), the Bayesian decision theoretic approach might go like this: pick an X, make a run with the control variables set to that X, update the current posterior for θ with the new data on θ_X, thereby generating a new posterior for D_θ, choose a new X where the uncertainty about θ_X is largest and continue in this manner until you exhaust your experimentation budget. It would be interesting to compare this approach with that of the authors, which appears to depend to a disturbing extent on

‘whether the experimenters’ objectives will be met mainly through the interpretation of point estimates or through conﬁdence intervals or hypothesis tests’,

none of which would seem to be relevant when the problem is considered decision theoretically.
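Draper's trade-off can be mimicked in a toy Monte Carlo calculation. This is only a sketch of the idea, not the method proposed above: all numerical values are invented, and D_θ is simplified here to the distance between the estimated mean and the target, rather than a full posterior quantity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Monte Carlo illustration of U(a, theta) = -{lam*D + (1 - lam)*C}:
# all numbers are invented, and D is simplified to the distance between
# the estimated population mean and the target T (a full treatment would
# average D over the posterior for theta instead).

T = 7.0              # target mean score for the reference product
lam = 0.8            # accuracy-versus-cost trade-off weight
cost_per_run = 0.05  # cost C_a grows linearly in the number of runs

def expected_utility(n_runs, prior_mean=6.5, prior_sd=1.0, sigma=1.0, n_sim=4000):
    """Average utility over draws of theta and of the resulting sample mean."""
    theta = rng.normal(prior_mean, prior_sd, n_sim)
    ybar = theta + rng.normal(0.0, sigma / np.sqrt(n_runs), n_sim)
    D = np.abs(ybar - T)                 # accuracy component
    C = cost_per_run * n_runs            # cost component
    return float(np.mean(-(lam * D + (1 - lam) * C)))

# More runs shrink D on average but raise C; the optimum balances the two.
utils = {n: expected_utility(n) for n in (1, 4, 16, 64)}
best_n = max(utils, key=utils.get)
```

The point of the sketch is only that expected utility is maximized at an intermediate number of runs, so the run allocation falls out of the decision analysis rather than being imposed by a design criterion.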

**Wenceslao González Manteiga (Universidade de Santiago de Compostela) and Emilio Porcu (University of Castilla-La Mancha, Ciudad Real)**

We congratulate the authors for this beautiful paper, where some new optimality criteria for optimal design are proposed and where the necessity of compound criteria to take into account multiple objectives rather than inferential purposes only is emphasized.

(a) The optimality methods enunciated by the authors correspond to different philosophies of weighting the positive definite matrix X'X or its inverse, as well as to different metrics (determinant, trace and max). We wonder how this comparison can be done for processes that are correlated over time or space, where the index j in equations (1) and (2) would represent the index set of the observations.

(b) Another interesting point of discussion would be to consider more sophisticated models (see, for example, Dette and Wong (1999)) where the variance is a function of the mean. This assumption is common in applied works (see, for example, Jobson and Fuller (1980)) where some functional relationships between the variance of the process and the parameter vector have been assumed.

**Table 10.** I-optimal 16-run design for a full quadratic model in three factors (example 1)

X1 X2 X3
−1 −1 −1
−1 −1  1
−1  0  0
−1  1 −1
−1  1  1
 0 −1  0
 0  0 −1
 0  0  0
 0  0  0
 0  0  1
 0  1  0
 1 −1 −1
 1 −1  1
 1  0  0
 1  1 −1
 1  1  1

(c) There are probably some computational burdens associated with some efficiency criteria, since some of them imply computation of the inverse of a matrix whose dimension can be large. Is there any trade-off between statistical efficiency and computational complexity for this kind of model?

(d) It would be interesting to study optimality criteria that can allow us to take into account hypothesis testing on the structure of the mean μ_i in equation (2).

(e) Finally, an extension of potential interest may be that of a completely or partially unknown function
**f, corresponding respectively to non-parametric or semiparametric modelling.**

**Peter Goos (Universiteit Antwerpen and Erasmus Universiteit Rotterdam)**

I compliment the authors on presenting an innovative view on optimal experimental design. However,
given the focus on response surface designs, I would have expected them to pay more attention to the prediction-oriented I-optimality criterion. The I-optimality criterion seeks designs that minimize the average
prediction variance, which, for a completely randomized design, is proportional to

average variance = ∫_χ f'(x) (X'X)^{−1} f(x) dx / ∫_χ dx,

with χ representing the experimental region. If a pure error estimate is used and the interest is in prediction intervals, this I-optimality criterion can be easily modified to the IP(α) criterion

IP(α) = ∫_χ F_{1,d;1−α} f'(x) (X'X)^{−1} f(x) dx / ∫_χ dx,

in the spirit of the paper. This quantity can be computed as

IP(α) = {F_{1,d;1−α} / ∫_χ dx} tr{(X'X)^{−1} M} = {F_{1,d;1−α} / ∫_χ dx} tr{M (X'X)^{−1}},

**Table 11.** I-optimal 26-run design for a full quadratic model in three factors (example 3)

X1 X2 X3
−1 −1 −1
−1 −1  0
−1 −1  1
−1  0 −1
−1  0  1
−1  1 −1
−1  1  0
−1  1  1
 0 −1 −1
 0 −1  1
 0  0  0
 0  0  0
 0  0  0
 0  0  0
 0  0  0
 0  0  0
 0  1 −1
 0  1  1
 1 −1 −1
 1 −1  0
 1 −1  1
 1  0 −1
 1  0  1
 1  1 −1
 1  1  0
 1  1  1

where M is the moments matrix ∫_{x∈χ} f(x) f'(x) dx. This final expression for the IP(α) criterion has exactly the same form as the LP-criterion in the paper.
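The trace expression above is convenient computationally. The following sketch evaluates it for the 16-run design of Table 10, approximating the moments matrix M by Monte Carlo over χ = [−1, 1]³ and using the tabulated F quantile for d = 1 pure-error degree of freedom and α = 0.05:

```python
import numpy as np

def model_matrix(pts):
    """Full quadratic model in three factors: intercept, linear,
    pure quadratic and two-factor interaction terms."""
    x1, x2, x3 = pts[:, 0], pts[:, 1], pts[:, 2]
    return np.column_stack([np.ones(len(pts)), x1, x2, x3,
                            x1**2, x2**2, x3**2,
                            x1*x2, x1*x3, x2*x3])

# Moments matrix M = int_chi f(x) f'(x) dx over chi = [-1, 1]^3,
# approximated by Monte Carlo (exact moments could be used instead).
rng = np.random.default_rng(1)
u = rng.uniform(-1, 1, size=(200_000, 3))
F = model_matrix(u)
vol = 8.0                             # volume of [-1, 1]^3
M = vol * (F.T @ F) / len(u)

# Table 10 design: I-optimal 16 runs, two centre points -> d = 1
X = model_matrix(np.array([
    [-1, -1, -1], [-1, -1, 1], [-1, 0, 0], [-1, 1, -1], [-1, 1, 1],
    [0, -1, 0], [0, 0, -1], [0, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 0],
    [1, -1, -1], [1, -1, 1], [1, 0, 0], [1, 1, -1], [1, 1, 1]], dtype=float))

XtXinv = np.linalg.inv(X.T @ X)
avg_var = np.trace(XtXinv @ M) / vol  # average prediction variance
F_1_1_095 = 161.45                    # tabulated F_{1,d;1-alpha} for d = 1, alpha = 0.05
IP = F_1_1_095 * avg_var              # IP(alpha) for this design
```

With only d = 1 pure-error degree of freedom the F quantile is enormous, so IP(α) penalizes this design heavily relative to its average prediction variance, which is precisely the effect the criterion is designed to capture.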

Interestingly, the I-optimality criterion itself generally produces attractive designs for completely randomized experiments, in that they usually strike a reasonable balance between degrees of freedom for pure error estimation and lack of fit. The I-optimal design for example 1 in the paper, shown in Table 10, has 1 degree of freedom for pure error (due to two centre point replicates) and 5 degrees of freedom for lack of fit. The design is therefore more useful than the (DP)_S-optimal design for that example, or the (AP)_S-optimal design corrected for multiple comparisons, because these designs leave no degrees of freedom for lack of fit. The I-optimal design for example 3 in the paper, shown in Table 11, has 5 degrees of freedom for pure error (due to six centre point replicates) and 11 for lack of fit. The I-optimal design for example 5 in the paper, shown in Table 12, has 4 degrees of freedom for pure error (due to five replicates of the point (−1, 0, 0, 0, 0)) and 16 for lack of fit. So, also for examples 3 and 5, the I-optimal design provides a better balance between degrees of freedom for pure error and for lack of fit than the (DP)_S-criterion.

In conclusion, the I-optimality criterion yields completely randomized designs that allow for pure error estimation and testing for lack of fit. It would be interesting to verify whether the IP(α) criterion results in better designs than the classical I-optimality criterion, and to embed that criterion in compound criteria.

It would be interesting to extend the present work to multistratum response surface experiments, where
more than one variance component needs to be estimated, and where, in general, the non-orthogonality of
the designs makes it less straightforward to determine degrees of freedom for the global F-test, confidence
intervals and prediction intervals, and even for tests on individual model parameters.

**Linda M. Haines (University of Cape Town)**

I congratulate the authors on a stimulating paper which encourages traditional optimal designers to move out of their equivalence theorem mind set and to explore meaningful criteria through computation.

**Table 12.** I-optimal 40-run design for five factors, one at two levels and four at three levels, using a full quadratic model (example 5)

X1 X2 X3 X4 X5
−1 −1 −1 −1  0
−1 −1 −1  1 −1
−1 −1 −1  1  1
−1 −1  0  0  1
−1 −1  1 −1 −1
−1 −1  1 −1  1
−1 −1  1  1 −1
−1  0 −1 −1 −1
−1  0  0  0  0
−1  0  0  0  0
−1  0  0  0  0
−1  0  0  0  0
−1  0  0  0  0
−1  0  1  1  1
−1  1 −1 −1  1
−1  1 −1  1 −1
−1  1 −1  1  1
−1  1  0  0 −1
−1  1  1 −1 −1
−1  1  1 −1  1
−1  1  1  1  0
 1 −1 −1 −1  1
 1 −1 −1  0 −1
 1 −1  0 −1 −1
 1 −1  0  1  0
 1 −1  1  0  0
 1 −1  1  1  1
 1  0 −1  0  1
 1  0 −1  1  0
 1  0  0 −1  1
 1  0  0  0  0
 1  0  0  1 −1
 1  0  1 −1  0
 1  0  1  0 −1
 1  1 −1 −1 −1
 1  1 −1  0  0
 1  1  0 −1  0
 1  1  0  1  1
 1  1  1  0  1
 1  1  1  1 −1

I have one particular comment which relates to (DP)_S-optimality. Specifically, the authors observe that the (DP)_S-criterion on its own is 'quite extreme' and I therefore wondered whether other simply formulated single criteria would produce more pleasing designs. To this end I explored the idea of maximizing two such criteria: ln|X^T Q0 X| + ln(d), which combines information on the parameter estimates with information on the estimate of pure error in a natural way, and an attenuated version, ln|X^T Q0 X| + ln(d) + ln(n − p − d), which additionally incorporates information on the estimation of lack of fit. The results that were obtained from implementing an exchange algorithm to construct the requisite designs for examples 1 and 2 with 1 million random starts are summarized in Table 13. Clearly the designs are not unique. This is not unexpected, since three-dimensional symmetry considerations hold, but it is interesting to note that the numbers of (DP)_S-optimal designs are inordinately large. The problem of which design to use in practice could be

**Table 13.** Number of optimal designs and pure error degrees of freedom for examples 1 and 2†

Criterion                                     Example 1       Example 2
                                              N       d       N       d
ln|X^T Q0 X|                                  24      0       8       1
ln|X^T Q0 X| − (p − 1) ln(F_{p−1,d;1−α})      6341    6       1064    8
ln|X^T Q0 X| + ln(d)                          24      3       24      5
ln|X^T Q0 X| + ln(d) + ln(n − p − d)          24      3       12      3

†N is the lower bound on the number of designs.

addressed by introducing a two-stage design procedure, i.e. by selecting designs from the class of (DP)_S-optimal designs which are optimal with respect to secondary criteria. In addition, symmetry-based classes of optimal designs could be identified by invoking group theoretic arguments and, more specifically in the present case, the theory of molecular symmetry and point groups. Search procedures could then be restricted to non-isomorphic groups. To return to the main point of my exercise, however, the two simple criteria that I have introduced do indeed provide optimal designs with fewer repeated points than their (DP)_S-optimal counterparts and may be worth further consideration.
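The two criteria are simple to evaluate for any candidate design. The sketch below assumes the unblocked case, so that Q0 = I − (1/n)11′, with X holding the non-intercept terms of a three-factor full quadratic model; the helper names are mine:

```python
import numpy as np

def pure_error_df(pts):
    """d = n minus the number of distinct design points."""
    return len(pts) - len(np.unique(pts, axis=0))

def haines_criteria(pts):
    """Evaluate ln|X'Q0X| + ln(d) and the attenuated version
    ln|X'Q0X| + ln(d) + ln(n - p - d) for a three-factor full
    quadratic model, assuming an unblocked experiment so that
    Q0 = I - (1/n)11' and X holds the non-intercept terms."""
    n = len(pts)
    x1, x2, x3 = pts[:, 0], pts[:, 1], pts[:, 2]
    X = np.column_stack([x1, x2, x3, x1**2, x2**2, x3**2,
                         x1*x2, x1*x3, x2*x3])
    p = X.shape[1] + 1                      # parameters incl. intercept
    Q0 = np.eye(n) - np.ones((n, n)) / n
    logdet = np.linalg.slogdet(X.T @ Q0 @ X)[1]
    d = pure_error_df(pts)
    lof = n - p - d                         # lack-of-fit df
    crit1 = logdet + (np.log(d) if d > 0 else -np.inf)
    crit2 = crit1 + (np.log(lof) if lof > 0 else -np.inf)
    return d, crit1, crit2

# Example: a 16-run design with two replicated centre points, so d = 1
pts = np.array([
    [-1, -1, -1], [-1, -1, 1], [-1, 0, 0], [-1, 1, -1], [-1, 1, 1],
    [0, -1, 0], [0, 0, -1], [0, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 0],
    [1, -1, -1], [1, -1, 1], [1, 0, 0], [1, 1, -1], [1, 1, 1]], dtype=float)
d, crit1, crit2 = haines_criteria(pts)
```

A design with no replication gives d = 0 and both criteria equal to −∞, so an exchange algorithm driven by either criterion is automatically pushed towards designs with at least some pure-error degrees of freedom.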

I have two other smaller comments. First I wondered whether it would be advantageous, computationally, to use candidate-set-free co-ordinate exchange algorithms in the construction of the optimal designs, particularly when p is large or when n is large and benchmark near-continuous designs are sought. I also
noted that results from experiments based on repeated points can be ﬁtted by using linear mixed models.

However, I am not clear what advantages in terms of criterion formulation and design construction, if any, might accrue from this insight.

**Heinz Holling (University of Muenster) and Rainer Schwabe (Otto von Guericke University, Magdeburg)**

The authors provide interesting access into the definition of meaningful design criteria based on the needs of statistical inference, particularly for the construction of confidence intervals. This attempt is in full agreement with the requirements 'Estimates of treatment effects should be accompanied by confidence intervals, whenever possible' and 'Statistical analysis is generally based on the use of confidence intervals' for equivalence trials stated in European Medicines Agency (1998), pages 27 and 18. Therefore, this approach shows great promise.

However, the restriction to variance estimates based on pure error rather than on residuals presupposes that the underlying model assumed may not be correct. In that case the variance estimate based on residuals may be biased, as the authors state. But then the bias in the parameter estimates and in the prediction should also be taken into account, which should result in completely different criteria based on the mean-squared error instead of the variance. Moreover, in this case of lack of fit it may become unclear what the meaning of the estimated parameters is, in particular if the deviation from the assumed model is substantial.

Therefore, criteria based on the quality of estimation of the response function or for prediction of further observations, like the integrated mean-squared error and the G-criterion suitably adjusted for confidence or prediction intervals, are to be preferred over criteria based directly on the performance of the parameter estimates. As a side remark, the proposed GP-criterion, which is good for (pointwise) confidence intervals, must be adjusted to F_{1,d;1−α}[1 + max_x {f'(x)(X'X)^{−1} f(x)}] for prediction intervals. This may result in even more replications for the corresponding optimal design. In contrast with the above emphasis on replications, non-parametric regression would be more adequate than fitting a low dimensional response surface if there may be a substantial lack of fit, and equidistributed designs then turn out to be optimal for pointwise estimation of the response (Rafajlowicz and Schwabe, 2003).

As a conclusion, under severe uncertainty with respect to the model it may be doubtful to strive for optimal designs, and safeguarding against bias in the variance estimate may propagate bias in the point estimate of the response. Hence, an optimal design is only as good as the adequacy of the underlying model.

**Bradley Jones (SAS Institute, Cary) and Dibyen Majumdar (University of Illinois at Chicago)**

We congratulate the authors on a scholarly and useful paper. The idea of incorporating the need for the estimation of the error variance in the optimality criterion has merit. Forcing replicate runs into the optimal design is an interesting idea. The use of a compound criterion for design will give the practitioner flexibility to tailor the design to the specific goals of the experiment.

The level of significance of the test seems to be a tuning parameter of the algorithm. We wonder whether using a larger value for the level of significance might result in designs that do not require so many replicate runs. The authors do not discuss the details of their algorithm except for saying that it is an exchange algorithm. However, it seems that the number of replicate runs is a free variable in the optimization. We think that it would be useful to set this number to a range of values and to do several optimizations with respect to this constraint. The properties of the resulting designs would allow the practitioner to make trade-offs between the utility of extra degrees of freedom for pure error and other criteria of design quality.

The authors have stated a preference for using pure error degrees of freedom, rather than lack-of-fit degrees of freedom, for the estimation of the error variance and further testing of hypotheses. It is true that lack-of-fit estimates are only unbiased if higher order effects not in the a priori model are zero. However, design optimality theory relies heavily on the assumed model and, if the authors are concerned about the bias in the error variance estimate due to misspecification of the model, then it seems reasonable that they should also be concerned about the bias in the estimators of the treatment parameters. The fact that their optimality criterion tends to use up the extra degrees of freedom for replication has consequences. First, tests for lack of fit will lack power. Second, little information will be available for estimating a higher order effect if it exists. Third, if indeed the a priori model is correct, forcing replicate runs will reduce the efficiency of inference.

Despite these concerns the paper makes a signiﬁcant contribution in raising the issue of error estimation in design optimality. We see the need for an evenly divided approach to the assignment of pure error and lack-of-ﬁt degrees of freedom, coupled with model robustness considerations, in the choice of design.

**Joseph B. Kadane (Carnegie Mellon University, Pittsburgh)**

An experimental design is a gamble, in that one does not know what data will ensue, and hence what inferences will result. Thus experimental design is a statistical decision problem, characterized by several inputs: the set of allowable designs to choose between, the purpose(s) to be served by the observations once collected and a belief about the nature of the observations not yet available.

To discuss optimal design, it is necessary to be precise about these inputs.

The paper is particularly strong in its insistence on the second input, which can be thought of as specification of a utility (or, equivalently, loss) function. It is less strong on being explicit about the third component: the probability model. If one intends to use model (2) to estimate β, why would one want to use model (1) to estimate σ² (called the 'true error' in the paper)? I think that it would be more faithful to the beliefs of the experimenter to draw observations towards the subspace of model (1) specified by model (2). The extent to which such smoothing is done should depend on how strongly model (2) is believed as a special case of model (1). There are well-studied models that have this behaviour. A compromise view of σ² results.

Although my taste in inference is different from those of the authors, I recognize their paper as a contribution towards more accurately finding good designs that better reflect their purposes and the underlying beliefs about the data-generating process.

**Joachim Kunert (Technische Universität Dortmund )**

The choice between the various methods of estimating the error is not always as clear cut as Gilmour and Trinca state in their paper. There can be situations when it is not appropriate to base inference on the pure error and when lack of ﬁt must be used instead.

One case where this is true is if we want to make predictions of the response under conditions that have not been used in the experiment. Speciﬁcally, consider sensory experiments with consumers. Assume that we have performed an experiment with 11 consumers to compare A and B, where each consumer evaluates both A and B twice. Further assume that consumers 1–6 consistently give 10 out of 10 points to A and 8 out of 10 to B, whereas consumers 7–11 always rate A as 8 and B as 10. We observe that the average difference between A and B is 2/11 and the pure error estimate is 0; hence A is signiﬁcantly better than B. Most likely, however, we are not interested in these 11 consumers only but would like to predict the preferences for a larger set of consumers. If one consumer consistently prefers product A to B, this does not mean that other consumers will also prefer A to B. It is well known, therefore, that for this experiment we should not base our inference on the pure error. Instead, we should introduce a random interaction between assessors and products into the model and use the interaction sum of squares to estimate the variance.
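The arithmetic of this example is easy to verify numerically (a small sketch reproducing the numbers quoted above):

```python
import numpy as np

# Numbers from Kunert's consumer example: consumers 1-6 always score
# A = 10, B = 8; consumers 7-11 always score A = 8, B = 10. Each consumer
# evaluates both products twice, and the replicate scores are identical.
scores_A = np.array([10]*6 + [8]*5, dtype=float)   # one entry per consumer
scores_B = np.array([8]*6 + [10]*5, dtype=float)

diffs = scores_A - scores_B          # per-consumer A - B difference
mean_diff = diffs.mean()             # (6*2 - 5*2)/11 = 2/11

# Pure error: within each consumer-product cell the two replicates agree
# exactly, so the pure-error sum of squares is 0 and A appears
# 'significantly' better than B.
pure_error_ss = 0.0

# The consumer-by-product interaction variation is what should drive
# inference about a wider population of consumers.
interaction_var = diffs.var(ddof=1)
```

The within-consumer replicates carry no information about consumer-to-consumer disagreement, which is exactly why the pure-error estimate is misleading here.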

Another class of examples where pure error might not always be appropriate is fractional factorial 2^{k−l}-designs. If we want to predict the response at a factor combination which has not been used in the design, the pure error might grossly underestimate the prediction variance. The sum of squares for lack of fit, however, will contain information about the size of possible interactions which we did not have in the model.

Thus there can be good reasons why some textbooks recommend use of the residual mean square.

Increasing only the degrees of freedom of pure error has some disadvantages. In particular, it may reduce information about the lack of ﬁt of our model. Increasing the degrees of freedom for the lack of ﬁt can help to detect mistakes in the way that the experiment was run or in the choice of the model for the analysis.

**Jesús López-Fidalgo (Universidad de Castilla-La Mancha)**

A typical criticism of optimal experimental design theory is that it usually does not provide enough distinct points to estimate the variance and to make inferences. This paper launches an interesting and useful idea on experimental design for inference through a proper estimation of the error. The real examples considered add value to the paper.

Although the main objective of the paper is not focused on discriminating between models, there is a possible link between the approach of this paper and optimal designs for discriminating between models (1) and (2). For this approach see, for example, Atkinson and Fedorov (1975) and López-Fidalgo et al. (2007), where they use T-optimality and Kullback-Leibler distance optimality as an extension of the first to non-normal models. If there is interest in estimating the p parameters of model (2) then t > p. This means that model (2) is somehow nested into model (1). Let μ^{(0)} = (μ_1^{(0)}, ..., μ_t^{(0)})' be nominal values for the means, where each mean μ_i^{(0)} is repeated n_i times in the vector. Thus, considering model (1) as the true model, the criterion consists of maximizing
model the criterion consists of maximizing

T(t, n_i) = min_β {(μ^{(0)} − Xβ)'(μ^{(0)} − Xβ)}
         = (μ^{(0)} − Xβ̂)'(μ^{(0)} − Xβ̂)
         = μ^{(0)'}(I − X(X'X)^{−1}X')μ^{(0)}.

This is the non-centrality parameter of the likelihood ratio test for the lack of fit of model (2) when model (1) is considered to be the true model. In this formula β̂ are the maximum likelihood estimates of model (2) assuming that μ^{(0)} are the responses. The most disturbing point here is the need for nominal values for the means. Note that this vector depends in any case on the design. This means that the general function T(t, n_i) requires nominal values for each particular possible design point.
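The projection form of T(t, n_i) is straightforward to compute. The sketch below uses an invented toy example in which model (2) is a straight line and the nominal means μ^{(0)} follow a quadratic, so the non-centrality parameter is strictly positive:

```python
import numpy as np

def t_criterion(X, mu0):
    """T = mu0' (I - X (X'X)^{-1} X') mu0: the residual sum of squares
    of the nominal means after projection onto the model (2) space."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix of model (2)
    return float(mu0 @ (np.eye(len(mu0)) - H) @ mu0)

# Toy illustration (design and nominal values invented): model (2) is a
# straight line, while the nominal cell means follow a quadratic, each of
# the t = 3 support points being replicated n_i = 2 times.
x = np.array([-1.0, -1.0, 0.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])           # intercept + slope
mu0 = 1.0 + 0.5*x + 2.0*x**2                        # nominal means, repeats included

T = t_criterion(X, mu0)

# Cross-check: T equals the minimised sum of squares min_beta ||mu0 - X beta||^2
beta_hat, rss, *_ = np.linalg.lstsq(X, mu0, rcond=None)
```

The dependence of μ^{(0)} on the candidate design points, noted above, is visible here: changing the support of x changes both X and the entries of μ^{(0)} that have to be supplied.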

**Hugo Maruri-Aguilar (Queen Mary University of London)**

Response surface methodology is often performed in stages, with the first stage concerned with screening out unimportant factors in a first-order model. A second stage may involve a more complex model, such as a second-order model, possibly with interactions; see Box et al. (2005). Assume that the design region remains unchanged for the second stage. Could the methodology that is described in the paper be adapted for use in a sequential design approach?

Although one of the criteria, (DP)_S, is tailored to distinguish between nested models, it would seem that
some adaptation would be required, as the list of active factors may not be known in advance. A block
design with separate models for each block as in equation (3) might also be used. However, it would still
present the same inconvenience of not knowing active factors in advance.

A simpler approach for this sequential setting would be to search for a DP(α)-optimal design in which the part of the design matrix X corresponding to the first stage remains fixed, while the search for the optimal design is only carried out for the new design points and a bigger model. This approach could be refined to produce a block optimal design in which the first block has already been fixed. What would the authors suggest in this case?

I would also like to join those who thanked the authors. This is a thought-provoking, most welcome advance in response surface modelling.

**Jorge Mateu, Francisco J. Rodriguez-Cortés and Jonatan A. González (University Jaume I, Castellón)**

The authors are to be congratulated on a valuable contribution and thought-provoking paper in the context of design of experiments. They develop compound criteria by taking into account multiple objectives. We would like to bring to the authors' attention a particular problem that could benefit from this strategy.

Spatial point processes describe stochastic models where events that are generated by the model have an associated random location in space (Diggle, 2003; Illian et al., 2008). Second-order properties provide information on the interaction between points in a spatial point pattern. The K-function K(r) is one such second-order characteristic, which can be expressed as an expectation of the number of events within a distance r of an arbitrary event.

A common problem in the analysis of spatial point patterns arising in some disciplines, such as neuroanatomy (Diggle et al., 1991), neuropathology or dermatology, has to do with the identification of differences between several replicated patterns or groups of patterns, and the K-function plays a crucial role here. This function could be significantly affected by a set of exogenous factors, and a linear fixed effects model could be implemented in a classical way as K(r) = Xβ + ε. As usual, we may assume (although with some caution) that the within-group errors are independent and identically normally distributed with zero mean. In this context, we aim at maximizing the goodness of fit while checking the lack of fit by considering an optimum design. We wonder whether the compound criterion that is proposed by the authors and described in Section 4.1 and equation (5) can be adapted to this context to gain information on the (minimum) sampling size of the point patterns, which is a particular topic that remains open in the spatial literature.

This idea can also be extended to a linear mixed model case, in which K(r) = Xβ + Zu + ε, where u is a vector of random effects. Fixed effects describe the behaviour of the entire population, and random effects are associated with individual experimental units sampled from the population. Instead of the second-order K-function, we could also assume that the conditional intensity is log-linear or comes through a mixed model (Bell and Grunwald, 2004) of the form

log{λ(x_0; x)} = X_{(x_0,x)} β + Z_{(x_0,x)} u + ε,

where X_{(x_0,x)} is a fixed-effects design vector at location x_0 and Z_{(x_0,x)} is the random-effects design vector at x_0. We pose the question whether a correct extension of the compound criterion in equation (5) by adding a covariance matrix of the random effects could be used to treat optimal designs in the spatial contexts considered here.

**Werner G. Müller and Milan Stehlík (Isaac Newton Institute, Cambridge, and Johannes Kepler University, Linz)**

We thank the authors for bringing up the important issue of estimating error and its inﬂuence on optimum design. However, we are unsure that their mistrust in the mean model at the same time justiﬁes their strong reliance on other invariances of the underlying distributions.

In fact their motivating example, exercise 11.6 of Box and Draper (2007), is particularly well suited for illustrating our point. Box and Draper first eliminated observation y_6 = 86.6 owing to a significant lack-of-fit test. Then, as the authors point out, the test for second-order parameters based on the pure error does not reject the null hypothesis whereas the test based on pooled error does. However, if we do not drop observation y_6 the conclusions completely reverse. Moreover, if we vary its value as is done in Fig. 5, this reversal persists for a great range of values that would not be detected by the lack-of-fit test, namely those to the left of the vertical line. Thus we wonder whether, for design purposes, it may not be advantageous to enter model uncertainties explicitly into the criterion as, for instance, in Liu and Wiens (1997) or Bischoff and Miller (2006).

Another issue concerns the use of compound designs later in the paper. After their conception in Läuter (1976) these have been successfully employed for many purposes (see, for example, McGree et al. (2008) or Müller and Stehlík (2010)). The challenge in their efficient use, however, remains in the proper selection of the weights κ involved. We would have appreciated a more detailed discussion of the consequences of the particular choices for κ that were taken in the paper and the sensitivity of the design to other choices.

**Kalliopi Mylona (University of Antwerp and University of Southampton)**

The authors introduce modified optimality criteria for the construction and evaluation of fractional factorial designs and response surface designs. The new criteria are applied to several cases with respect to the number of runs and factors, and a comparison is provided of the optimal designs constructed by the new approach with those constructed by traditional approaches. I believe that the results convincingly demonstrate the effectiveness and usefulness of the new criteria, especially the results of the compound criteria, which offer substantial flexibility to the experimenter to tailor the design to the experimenter's objectives. I was wondering whether, instead of a compound criterion, a sequential optimization of the corresponding criteria for the construction of the optimal designs would be possible. In this way, there would be no need to define weights each time.

Furthermore, in the case of blocked designs, the authors consider the block effects as fixed. However, it would be interesting to investigate to what extent the results would be affected if the block effects were considered as random and how effective the new criteria would be. In this case, a Bayesian adaptation of the criteria to deal with the problem of the unknown variance components could be possible and helpful.