
Central Composite Discrepancy-Based Uniform Designs for Irregular Experimental Regions


PREPRINT

Department of Mathematics, National Taiwan University

www.math.ntu.edu.tw/~mathlib/preprint/2012-11.pdf

November 1, 2012


Central Composite Discrepancy-Based Uniform Designs for Irregular Experimental Regions

Ray-Bing Chen (a), Yen-Wen Hsu (b), Ying Hung (c), Weichung Wang (b,*)

(a) Department of Statistics, National Cheng Kung University, Tainan 701, Taiwan

(b) Department of Mathematics, National Taiwan University, Taipei 10617, Taiwan

(c) Department of Statistics and Biostatistics, Rutgers University, NJ 08854, USA

Abstract

Central composite discrepancy (CCD) has been proposed to measure the uniformity of a design over irregular experimental regions. However, how CCD-based optimal uniform designs can be computed efficiently remains a challenge. Focusing on this issue, we propose a set of particle swarm optimization-based algorithms to efficiently find optimal uniform designs with respect to the CCD criterion. Parallel computing techniques based on state-of-the-art graphics processing units (GPUs) are employed to accelerate the computations. Several two- to five-dimensional benchmark problems are used to illustrate the advantages of the proposed algorithms. By solving a real application in data center thermal management, we further demonstrate that the proposed algorithm can be extended to incorporate the non-collapsing property.

Keywords: Uniform Design, Central Composite Discrepancy, Discrete Particle Swarm Optimization, Graphic Processing Unit, Parallel Computing, Non-collapsing

1. Introduction

Efficient experimental design is crucial in the study of scientific experiments. These experiments may have high-dimensional inputs, which often prohibit straightforward approaches such as running the experiment over a dense grid of input configurations. In particular, uniform design is one of the most widely used approaches in the literature (Fang, 1980; Fang et al., 2000, 2002, 2003; Ma and Fang, 2004). It possesses a desirable space-filling property in which the design points are spread over the entire experimental region. Further descriptions of space-filling designs can be found in Santner et al. (2003).

The concept of uniform design is attractive. However, such designs are developed mainly for regular experimental regions, such as rectangular or hypercubic regions. In practice, the assumption of regularity of the experimental region is often violated, especially in computer experiments (Ranjan et al., 2008; Hung et al., 2010; Hung, 2011). Although a few design approaches have been

*Corresponding author. Email address: wwang@ntu.edu.tw (Weichung Wang)


proposed for irregular regions, such as Stinstra et al. (2003), Auffray et al. (2012), and Draguljić et al. (2012), they were not developed for uniform designs. Recently, Chuang and Hung (2010) proposed the central composite discrepancy (CCD) as a new uniformity criterion for a large class of irregular regions. Lin et al. (2010) proposed a threshold accepting algorithm to generate U-type designs for irregular regions based on the CCD.

In this article, we focus on the CCD criterion for optimal uniform designs on arbitrary irregular experimental regions. Although the CCD appears to be a reasonable extension to irregular regions, its computational challenges in optimal design search and uniformity evaluation remain unsolved, especially for larger numbers of factors and run sizes. This computational issue is the main focus of the current paper. The contributions of this paper are outlined as follows:

• A new particle swarm optimization (PSO) based method for searching for optimal CCD designs is proposed. The algorithm is a stochastic population-based heuristic that inherits the efficiency and capability of PSO for solving high-dimensional optimization problems with multiple optima.

• The complexity of the discrete version of the CCD criterion is analyzed, and the result shows that the computational cost increases rapidly. Parallel computing techniques based on the latest graphics processing units (GPUs) are thus applied to significantly accelerate the CCD function evaluations.

• The proposed algorithms are implemented and the numerical results are compared with existing methods for two- and three-dimensional irregular experimental regions. Results on higher-dimensional regions are also provided to demonstrate the capability of the proposed algorithms.

• The algorithms are used to solve a computer data center thermal management problem (Hung et al., 2010; Hung, 2011). In this real-world application, we also show how the proposed algorithms can be easily extended to handle disconnected experimental regions and the non-collapsing property.

This paper is organized as follows. In Section 2, we introduce the central composite discrepancy criterion and state our optimization goal. The complexity of evaluating the CCD criterion is also analyzed. In Section 3, we propose a GPU-accelerated discrete particle swarm optimization (DPSO) algorithm to find optimal designs over irregular experimental regions in terms of the CCD criterion. In Section 4, we apply the proposed algorithm to the flexible region benchmark examples and compare its performance to an existing method. In Section 5, we demonstrate how the algorithm can be extended to disconnected experimental regions and to reducing collapse.

Final conclusions are discussed in Section 6.

2. Central Composite Discrepancy and the Complexity Analysis

After introducing the definitions of the uniform design and the CCD, we study the computational complexity of the central composite discrepancy in this section. First, the definition of the uniform design is given as follows. To spread out the design points uniformly over an experimental region $D$, where $D \subseteq \mathbb{R}^K$ and $K$ is the number of variables (or factors), Fang (1980) proposed the following uniform design:

Definition 1 (Uniform design). Suppose $D$ is the experimental region in $\mathbb{R}^K$. Let $\mathcal{P} = \{p_1, \ldots, p_n \mid p_i \in D,\ i = 1, \ldots, n\}$ be an $n$-point design in $D$ and let $Z(n)$ be the set of all possible $n$-point designs in $D$. Given a measure of uniformity $MU$, a design $\mathcal{P}^* \in Z(n)$ is called a uniform design if

$$MU(\mathcal{P}^*) = \min_{\mathcal{P} \in Z(n)} MU(\mathcal{P}). \tag{1}$$

For uniformity measurement on a regular experimental region (i.e., a hypercube), the $L_p$-star discrepancy (Hua and Wang, 1981) is widely used. Generalizations of the $L_p$-star discrepancy include the centered $L_p$-discrepancy and the wrap-around $L_p$-discrepancy (Hickernell, 1998). However, these discrepancies cannot be used when the experimental region is not a hypercube, such as the convex polygon in Chuang and Hung (2010). For an irregular experimental region, Chuang and Hung (2010) proposed the central composite discrepancy, which can be treated as an extension of the centered $L_2$-discrepancy. The definition of the CCD is given as follows.

Definition 2 (Central composite discrepancy). Given an $n$-point design $\mathcal{P} = \{p_1, \ldots, p_n\}$ on $D \subset \mathbb{R}^K$, the central composite discrepancy value of the design $\mathcal{P}$ is defined as

$$\mathrm{CCD}_p(\mathcal{P}) = \left\{ \frac{1}{v(D)} \int_D \frac{1}{M^K} \sum_{k=1}^{M^K} \left| \frac{N(D_k(x), \mathcal{P})}{n} - \frac{v(D_k(x))}{v(D)} \right|^p dx \right\}^{1/p}. \tag{2}$$

Here, $v(D_k(x))$ and $v(D)$ denote the volumes of $D_k(x)$ and $D$, respectively, $p > 0$, and $N(D_k(x), \mathcal{P}) = \sum_{i=1}^{n} I\{p_i \in D_k(x)\}$. The experimental region $D$ is partitioned into $M^K$ sub-domains $D_k(x)$. In particular, letting $x_j^{(i)} = \{r \in \mathbb{R} : x_j + a_i < r \le x_j + a_{i+1}\}$, $i = 0, \ldots, M-1$, where $a_0 \equiv -\infty$, $a_M \equiv \infty$, $a_1 < a_2 < \cdots < a_{M-1}$, and some $a_j = 0$ for $1 \le j \le M-1$, we have $D_k(x) = \{x_1^{(i_1)} \times \cdots \times x_K^{(i_K)}\} \cap D$.

The idea behind the CCD is to first partition the experimental region $D$ into several sub-regions and then compute both the ratio of the number of design points in each sub-region to the total number of design points and the ratio of the sub-region volume to the overall region volume. In a uniform design, we expect these two ratios to be close, so their difference can serve as a criterion of uniformity. Because the CCD criterion takes every point in the region as a partition center and integrates over the whole region, it can be applied to all kinds of experimental regions. More details about the CCD can be found in Chuang and Hung (2010) and Lin et al. (2010).

As mentioned in Chuang and Hung (2010), searching for the optimal design over a continuous experimental region based on the CCD criterion is an NP-hard problem. Consequently, it is common (Chuang and Hung, 2010; Lin et al., 2010) to consider a discrete version of CCD and then search for the discrete uniform design (or near uniform design) for practical considerations.

In particular, we consider the following discrete version of the CCD. Instead of choosing design points arbitrarily from the experimental domain as in Definition 1, the candidate design points are limited to a set of grid points in the experimental region. We apply the idea of "grid level" proposed in Fang et al. (2000) to define the following discrete uniform design (denoted as $\overline{UD}$) and the corresponding discrete CCD (denoted as $\overline{\mathrm{CCD}}$).

Next, we define the discrete uniform design and the discrete CCD. Without loss of generality, we assume in the definitions that the experimental region can be transformed to a region within a unit hypercube. Based on these definitions, Theorem 5 provides a complexity analysis for computing the discrete CCD over the candidate grid points, which is helpful for designing an efficient search approach.

Definition 3 (Discrete uniform design). Define $\bar{D}(q^K) = \{x = (x_1, \ldots, x_K) \mid x_i = \frac{2l_i - 1}{2q},\ l_i = 1, \ldots, q,\ i = 1, \ldots, K\} \cap D$, where $K$ is the dimension, $q$ is the grid level, and $D$ is the experimental region. Let $\bar{\mathcal{P}} = \{p_1, \ldots, p_n\}$ be a discrete design if its $n$ design points satisfy $p_i \in \bar{D}(q^K)$, $i = 1, \ldots, n$. Let $\bar{Z}(n, q^K)$ be the set of all possible discrete designs with $n$ design points. A design $\bar{\mathcal{P}}^* \in \bar{Z}(n, q^K)$ is called a discrete uniform design of $n$ design points under grid level $q$ if

$$MU(\bar{\mathcal{P}}^*) = \min_{\bar{\mathcal{P}} \in \bar{Z}(n, q^K)} MU(\bar{\mathcal{P}}). \tag{3}$$

The corresponding discrete CCD value is defined below. For simplicity, $\bar{D}(q^K)$ is denoted as $\bar{D}$. In addition, we count the numbers of grid points falling in $\bar{D}_k(x)$ and $\bar{D}$, written $|\bar{D}_k(x)|$ and $|\bar{D}|$, to approximate the volumes $v(\bar{D}_k(x))$ and $v(\bar{D})$, respectively. The details of how to compute $\mathrm{CCD}_p(\bar{\mathcal{P}})$ are presented in Appendix A.1.

Definition 4 (Discrete CCD). We define the discrete CCD value

$$\mathrm{CCD}_p(\bar{\mathcal{P}}) = \left\{ \frac{1}{v(\bar{D})} \sum_{x \in \bar{D}} \frac{1}{M^K} \sum_{k=1}^{M^K} \left| \frac{N(\bar{D}_k(x), \bar{\mathcal{P}})}{n} - \frac{v(\bar{D}_k(x))}{v(\bar{D})} \right|^p \frac{1}{q^K} \right\}^{1/p} = \left\{ \frac{1}{|\bar{D}|} \sum_{x \in \bar{D}} \frac{1}{M^K} \sum_{k=1}^{M^K} \left| \frac{N(\bar{D}_k(x), \bar{\mathcal{P}})}{n} - \frac{|\bar{D}_k(x)|}{|\bar{D}|} \right|^p \right\}^{1/p}. \tag{4}$$

Here, $\bar{D}_k(x) = D_k(x) \cap \bar{D}$, $N(\bar{D}_k(x), \bar{\mathcal{P}}) = \sum_{i=1}^{n} I\{p_i \in \bar{D}_k(x)\}$, $v(\bar{D}_k(x)) = |\bar{D}_k(x)|/q^K$, and $v(\bar{D}) = |\bar{D}|/q^K$.

The following theorem quantifies the computational complexity of $\mathrm{CCD}_p(\bar{\mathcal{P}})$. It also shows that the cost of evaluating $\mathrm{CCD}_p(\bar{\mathcal{P}})$ increases rapidly with the number of factors and the number of candidate design points.

Theorem 5 (Complexity analysis of the discrete CCD). Let $n$ be the number of design points, $K$ the dimension, and $|\bar{D}|$ the number of grid points in $D$. Then, for an $n$-point discrete design $\bar{\mathcal{P}} = \{p_1, \ldots, p_n\}$, the computational complexity of $\mathrm{CCD}_p(\bar{\mathcal{P}})$ in Eq. (4) is $O(|\bar{D}| K M^K (n + |\bar{D}|))$. Furthermore, if we choose $|\bar{D}| = n^K$ (in a hypercubic domain this is obtained by setting the grid level $q = n$), the complexity becomes $O(K M^K n^{2K})$.

Proof. The main cost for computing $\mathrm{CCD}_p(\bar{\mathcal{P}})$ is the computation of $N(\bar{D}_k(x), \bar{\mathcal{P}})$ and $v(\bar{D}_k(x))$, where $1 \le k \le M^K$ and $x \in \bar{D}$. To compute $N(\bar{D}_k(x), \bar{\mathcal{P}})$, we must check whether each design point lies in the corresponding sub-region. Each check requires $O(K)$ operations, so one evaluation requires $O(nK)$ in total. The computation of $v(\bar{D}_k(x))$ is similar, over the $|\bar{D}|$ grid points instead of the $n$ design points. Therefore, the costs of the former and the latter are $O(nK)$ and $O(|\bar{D}|K)$, respectively. The two summations entail computing these two quantities $O(|\bar{D}| M^K)$ times; thus, computing a single $\mathrm{CCD}_p(\bar{\mathcal{P}})$ requires $O(|\bar{D}| M^K (nK + |\bar{D}|K)) = O(|\bar{D}| K M^K (n + |\bar{D}|))$ operations.
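To make the quantities in the proof concrete, the following brute-force sketch evaluates the discrete criterion of Eq. (4) with p = 2 and M = 2, so each center x splits the region into the 2^K orthants determined by coordinate-wise comparisons. The function and variable names are ours; this follows the definition directly and is not an optimized implementation from the paper.

```python
import itertools

def discrete_ccd2(design, grid):
    """Brute-force discrete CCD_2 of `design` over `grid` (Eq. (4), p = 2, M = 2).

    design, grid: lists of K-dimensional points (tuples).
    For each center x in the grid, the region is split into the 2^K orthants
    {coordinate <= x_j} / {coordinate > x_j}, and the point-count ratio
    N(D_k(x), P)/n is compared with the volume ratio |D_k(x)|/|D|.
    """
    n, K = len(design), len(grid[0])
    total = 0.0
    for x in grid:
        for signs in itertools.product((False, True), repeat=K):
            # orthant test: signs[j] == False means coordinate j <= x_j
            def in_orthant(p):
                return all((p[j] > x[j]) == signs[j] for j in range(K))
            n_pts = sum(in_orthant(p) for p in design)   # N(D_k(x), P)
            vol = sum(in_orthant(g) for g in grid)       # |D_k(x)|
            total += (n_pts / n - vol / len(grid)) ** 2
    # average over centers and orthants, then take the 1/p-th root (p = 2)
    return (total / (len(grid) * 2 ** K)) ** 0.5
```

Each call performs the nested loops counted in the proof, so its cost grows as $O(|\bar{D}|\, 2^K K (n + |\bar{D}|))$, matching Theorem 5 with M = 2.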

3. Computational Methods for Optimal Discrete CCD Uniform Design

The discussions in Section 2 result in the following discrete optimization problem for the discrete uniform design:

$$\min_{\bar{\mathcal{P}} \in \bar{Z}(n, q^K)} \mathrm{CCD}_p(\bar{\mathcal{P}}). \tag{5}$$

The fundamental difficulty in solving this discrete optimization problem is the large number of candidate designs. For example, to find an optimal uniform design with 11 points on a square domain with 31 × 31 = 961 grid points (i.e., $K = 2$, $n = 11$, and $q = 31$), the number of possible designs is $|\bar{Z}(11, 31^2)| \approx 1.53 \times 10^{25}$. It is thus essential to develop an efficient algorithm to solve the discrete optimization problem defined in (5).
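The count quoted above is simply the number of ways to choose 11 of the 961 grid points, which can be verified in one line (a sketch; `math.comb` performs the exact integer computation):

```python
import math

# number of distinct 11-point designs on a 31 x 31 grid (K = 2, n = 11, q = 31)
n_designs = math.comb(31 * 31, 11)
print(f"{n_designs:.2e}")  # about 1.53e25, matching the count quoted in the text
```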

Particle swarm optimization is a population-based stochastic optimization technique. Many studies have shown its ability to provide good results quickly and cheaply, especially for problems with multiple extrema and high-dimensional search regions. PSO has also been used in statistical problems. For example, Toala et al. (2011) studied kriging parameter estimation in computer experiments via a modified PSO. In this section, we propose a discrete variant of PSO to solve the optimization problem defined in (5). In addition, to accelerate the computation of the discrete CCD values for each design (Theorem 5 has shown its fast-growing complexity), a GPU is employed. The power of GPU acceleration is demonstrated in Section 4.

3.1. Discrete particle swarm optimization

The PSO was first introduced by Kennedy and Eberhart (1995; 2002). The idea originated from simulations of large-scale social behavior, such as in bird flocks or fish schools. At the beginning of PSO, a group of particles is randomly initialized on the solution space. Then, the particles move around the solution space to search for the optimal solutions. These particles keep their own cognitive memories on the best values visited. This “personal best” location is denoted as yi(t) for the ith particle and the tth iteration. The particles also exchange social information to obtain the best value obtained by the swarm. This “group best” location is denoted as ˆy(t). At each iteration, the position of each particle is updated by the formula

$$x_i(t + 1) = x_i(t) + v_i(t + 1). \tag{6}$$

In particular, the $j$th component of the velocity update $v_i(t+1)$ is defined as

$$v_{ij}(t+1) = v_{ij}(t) + c_1 r_{1ij}(t)[y_{ij}(t) - x_{ij}(t)] + c_2 r_{2ij}(t)[\hat{y}_j(t) - x_{ij}(t)], \tag{7}$$

Figure 1: An example of TowardBestMove: (a) group best particle; (b) personal best particle corresponding to the current particle; (c) current particle before TowardBestMove; (d) current particle after TowardBestMove.

where $c_1$ and $c_2$ are positive acceleration constants used to scale the contributions of the cognitive and social components, respectively, and $r_{1ij}(t)$ and $r_{2ij}(t)$ are independently generated from $U[0, 1]$.
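The continuous update in Eqs. (6)-(7) can be written as a small NumPy sketch; the vectorized swarm layout, the fixed seed, and the choice $c_1 = c_2 = 2.0$ are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility (our choice)

def pso_update(x, v, y, y_hat, c1=2.0, c2=2.0):
    """One continuous PSO step following Eqs. (6)-(7).

    x, v  : (ns, K) arrays of positions and velocities of the swarm
    y     : (ns, K) array of personal best positions y_i(t)
    y_hat : (K,) array, the group best position
    """
    r1 = rng.uniform(size=x.shape)  # r_{1ij}(t) ~ U[0, 1]
    r2 = rng.uniform(size=x.shape)  # r_{2ij}(t) ~ U[0, 1]
    v_new = v + c1 * r1 * (y - x) + c2 * r2 * (y_hat - x)  # Eq. (7)
    return x + v_new, v_new                                # Eq. (6)
```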

The aforementioned PSO is designed for continuous domains and cannot be used directly to solve the discrete optimization problem (5). Another obstacle arises from the irregular domain: it is difficult to develop a general scheme that keeps the particles moving within various irregular domains, so that the PSO does not waste objective function evaluations on points outside the irregular experimental region.

To overcome these difficulties, we propose the following discrete particle swarm optimization (DPSO) algorithm to solve problem (5) directly. The DPSO algorithm differs from the original PSO algorithm in how the particles are updated; DPSO ensures that the particles move within the irregular experimental domain. Unlike the continuous update formula (7), DPSO updates the particle positions in the following two steps.

TowardBestMove: We represent an $n$-point uniform design with $K$ variables as an $n \times K$ matrix $\bar{\mathcal{P}} = [p_1^\top, \ldots, p_n^\top]^\top$, where $p_i$ is a $K \times 1$ vector denoting the position of the $i$th run. For each row of the current particle design matrix $\bar{\mathcal{P}}$ (i.e., a point in the design), we assume it has probability probG (probP) of being replaced by the corresponding row of the group (personal) best particle.

Figure 2: An example of RandMove: (a) RandMove on the x-dimension element of the current particle; (b) RandMove on the y-dimension element of the current particle.

We illustrate how the particles move toward the group or personal best points with the example shown in Figure 1. This example is a 3-point design in 2 variables, so the designs are represented by 3 × 2 matrices. Suppose the group best particle is [2 4; 1 1; 4 2], the personal best particle is [3 0; 0 2; 5 3], and the current particle is [1 2; 0 1; 5 2]. After TowardBestMove, the current particle design matrix becomes [2 4; 0 2; 5 2]. That is, the first row is replaced by the group best, the second row is replaced by the personal best, and the third row is unchanged.

RandMove: The TowardBestMove step may lead the search process to a local optimum. Therefore, we insert an additional step that allows a particle to depart from the neighborhood of a local optimum: each element of the current particle design matrix has probability probR of being replaced by another value drawn uniformly at random from the feasible grid levels in that dimension.

For example, in Figure 2(a), the current particle design matrix is [2 4; 1 1; 4 2]. After a RandMove step, the element in the second row and first column of the design matrix is replaced by a number uniformly chosen from the set {0, 1, 2, 3, 4, 5}. Figure 2(b) shows an example in which the element in the third row and the second column of the design matrix is replaced by a number uniformly chosen from the set {0, 1, 2, 3} due to the limitation of the experimental region.
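The two update steps can be sketched as follows. This is one plausible reading of the row-replacement rule (group best with probability probG, otherwise personal best with probability probP); the helper names, the `in_region` predicate, and the `levels` list are ours, not from the paper.

```python
import random

def toward_best_move(P, personal_best, group_best, probG=0.4, probP=0.4):
    """TowardBestMove: each row (design point) of P is replaced by the
    corresponding row of the group best with probability probG, else by
    the personal best row with probability probP, else kept unchanged."""
    new_P = []
    for i, row in enumerate(P):
        u = random.random()
        if u < probG:
            new_P.append(list(group_best[i]))
        elif u < probG + probP:
            new_P.append(list(personal_best[i]))
        else:
            new_P.append(list(row))
    return new_P

def rand_move(P, in_region, levels, probR=0.05):
    """RandMove: each coordinate is, with probability probR, redrawn
    uniformly from the grid levels that keep the point inside the region."""
    for row in P:
        for j in range(len(row)):
            if random.random() < probR:
                feasible = [l for l in levels
                            if in_region(row[:j] + [l] + row[j + 1:])]
                row[j] = random.choice(feasible)
    return P
```

Because the replacement rows come from feasible particles and RandMove only draws from levels passing the `in_region` test, both steps keep every particle inside the irregular experimental region, as required.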

We conclude this section with the following four remarks.

1. Both TowardBestMove and RandMove steps ensure that the movement of particles remains within the irregular experimental region.

2. The DPSO algorithm is outlined in Algorithm 1.

3. For the tuning parameters probG, probP , and probR in the algorithm, we suggest choosing probG = probP = 0.4 and probR = 0.05 or 0.1. We also use these settings in the numerical experiments.


4. Because the RandMove step makes it possible for the particles to explore the whole design space $\bar{Z}(n, q^K)$, and because this design space contains a finite number of candidate designs, we make the following conjecture.

Conjecture 6 (Convergence of DPSO). Let $\bar{D}(q^K) = \{x = (x_1, \ldots, x_K) \mid x_i = \frac{2l_i - 1}{2q},\ l_i = 1, \ldots, q,\ i = 1, \ldots, K\} \cap D$, where $K$ is the dimension, $q$ is the grid level, and $D$ is the experimental region. Let $\bar{\mathcal{P}} = \{p_1, \ldots, p_n\}$ be a discrete design with $n$ design points $p_i \in \bar{D}(q^K)$, $i = 1, \ldots, n$. Let $\bar{Z}(n, q^K)$ be the set of all possible discrete designs with $n$ design points, and let $\bar{\mathcal{P}}^* \in \bar{Z}(n, q^K)$ be the discrete uniform design. Then, the solution design found by DPSO converges to $\bar{\mathcal{P}}^*$ as the iteration number $I \to \infty$.

Algorithm 1 The Discrete Particle Swarm Optimization (DPSO) Algorithm.

Require: probG, probP, probR; an initial swarm of $n_s$ particles $S = \{x_1, \ldots, x_{n_s}\} \subset \Omega$, where $\Omega$ is the experimental region.
Ensure: Approximate global best position $\hat{y}$
1: while stopping criterion is not satisfied do
2:   for each particle $i = 1, \ldots, n_s$ do
3:     if $f(x_i) < f(y_i)$ then
4:       $y_i = x_i$
5:     end if
6:     if $f(y_i) < f(\hat{y})$ then
7:       $\hat{y} = y_i$
8:     end if
9:   end for
10:  for each particle $i = 1, \ldots, n_s$ do
11:    update the position via TowardBestMove
12:    update the position via RandMove
13:  end for
14: end while

3.2. Accelerating function evaluations via a GPU

The GPU was originally designed for rendering graphics. Driven by the insatiable market demand for real-time, high-definition 3D graphics, programmable GPUs have evolved into highly parallel, multithreaded, many-core processors with tremendous computational horsepower and very high memory bandwidth. GPUs are well suited to data-parallel computations with high arithmetic intensity, where the ratio of arithmetic operations to memory operations is large. In this article, we use an NVIDIA GPU with the CUDA (NVIDIA, 2010) development environment to accelerate the function evaluations.

From the complexity analysis for computing the CCD in Theorem 5, we know that the cost of computing the CCD value for a given design increases quickly as the number of design points $n$ and/or the dimension $K$ increases. For example, when $K = 5$ and $n = 7$ in a hypercubic domain, $|\bar{D}| = 16{,}807$ and the number of partitioned sub-regions for every grid point is $2^K = 2^5 = 32$ (see Table 1). Therefore, to compute the CCD defined in Eq. (4), we need to:

Step 1. Compute $32 \times 16{,}807 = 537{,}824$ values of $N(\bar{D}_k(x), \bar{\mathcal{P}})$ and $v(\bar{D}_k(x))$,

Step 2. Compute the 537,824 values of $\left\{ \frac{N(\bar{D}_k(x), \bar{\mathcal{P}})}{n} - \frac{|\bar{D}_k(x)|}{|\bar{D}|} \right\}^2$, and

Step 3. Sum up the 537,824 values of $\left\{ \frac{N(\bar{D}_k(x), \bar{\mathcal{P}})}{n} - \frac{|\bar{D}_k(x)|}{|\bar{D}|} \right\}^2$.

Observing these three-step computations, we see that, in Steps 1 and 2, the computations of $N(\bar{D}_k(x), \bar{\mathcal{P}})$, $v(\bar{D}_k(x))$, and the squared differences are independent between different $x_i, x_j \in \bar{D}$. Therefore, we can easily parallelize the computations over each $x \in \bar{D}$. In Step 3, we can also use multiple threads to share the workload of summing these numbers. In particular, our first version of parallelization on the GPU, named DPSO-GPU-1, proceeds as follows.

Step 1. To compute the $2^K|\bar{D}|$ values of $N(\bar{D}_k(x), \bar{\mathcal{P}})$ and $v(\bar{D}_k(x))$, we initialize $|\bar{D}|$ threads on the GPU. Each thread computes $N(\bar{D}_k(x), \bar{\mathcal{P}})$ and $v(\bar{D}_k(x))$, $1 \le k \le 2^K$, for some $x \in \bar{D}$.

Step 2. To compute the $2^K|\bar{D}|$ squared differences, we simply initialize $2^K|\bar{D}|$ threads on the GPU. Each thread computes $\{N(\bar{D}_k(x), \bar{\mathcal{P}})/n - |\bar{D}_k(x)|/|\bar{D}|\}^2$ for some $x \in \bar{D}$ and some $1 \le k \le 2^K$.

Step 3. To sum up the $2^K|\bar{D}|$ squared differences, we initialize 512 threads within a single block, indexed by $i$ for $1 \le i \le 512$. We treat the squared differences as an array of length $2^K|\bar{D}|$, whose elements are indexed by $j$ for $1 \le j \le 2^K|\bar{D}|$. We first let the thread with index $i$ sum up the array elements whose index mod 512 equals $i$, and then we sum these 512 partial results.
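The two-stage reduction of Step 3 can be mimicked on the CPU to make the indexing concrete: "thread" i sums the elements whose index is congruent to i modulo 512, and the 512 partial sums are then added. This is an illustration of the indexing scheme only, not GPU code.

```python
def strided_sum(values, n_threads=512):
    """Two-stage reduction mimicking Step 3: 'thread' i first sums the
    elements whose index is congruent to i modulo n_threads, and the
    n_threads partial sums are then added together."""
    partial = [sum(values[i::n_threads]) for i in range(n_threads)]
    return sum(partial)
```

On a GPU the 512 strided partial sums run concurrently, so the wall-clock cost of the reduction drops roughly in proportion to the thread count.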

Further improvements to GPU efficiency can be realized through the following observations, which lead to another version of the GPU parallelization called DPSO-GPU-2.

• As the $2^K|\bar{D}|$ values of $v(\bar{D}_k(x))$ depend only on the grid level $q$ and the experimental region $D$, they do not change during the whole DPSO run. Therefore, we compute them only once during DPSO initialization and store them as an array in the global memory of the GPU to accelerate array reads. In addition, we bind the array to the texture cache so that the data are stored in read-only texture memory, which provides fast access. Note that we also bind the data of $\bar{D}$ to texture memory.

• To compute the $2^K|\bar{D}|$ values of $N(\bar{D}_k(x), \bar{\mathcal{P}})$, we need to read the design $\bar{\mathcal{P}}$ on the GPU. While we could map the CPU memory that stores $\bar{\mathcal{P}}$ into the GPU, a more efficient way is to first copy the data into the shared memory within each GPU block. Like the texture memory, shared memory provides loads and stores that are nearly as fast as registers.

• If $2^K|\bar{D}|$ is too large, we modify the parallel scheme of Step 3. We partition the elements into groups of $4 \times 512 = 2{,}048$ elements each, and initialize as many 512-thread blocks as there are groups. First, each block sums one group using the Step 3 scheme above, and then the per-block results are added together.


Figure 3: Shapes of the flexible region for different values of the parameter m (9,999, 2, 1, 0.5, 0.3), for dimensions K = 2 (panels a-e) and K = 3 (panels f-j).

Table 1: Chosen grid level and feasible grid point number $|\bar{D}|$ for different dimensions K and flexible parameters m.

(a) Grid level

K   m=9,999   m=2   m=1   m=0.5   m=0.3
2        11    13    17      27      49
3        11    15    21      45     117
4        11    15    25      71       -
5        11    15    29      99       -

(b) Feasible grid point number $|\bar{D}|$

K   m=9,999      m=2      m=1    m=0.5    m=0.3
2       121      137      145      137      129
3     1,331    1,791    1,561    1,385    1,377
4    14,641   15,649   16,641   15,609        -
5   161,051  126,767  177,045  166,363        -

We have proposed the DPSO algorithm and its GPU-based acceleration. To evaluate the performance of DPSO in searching for uniform designs via the discrete CCD, we consider in the next sections a set of K-dimensional flexible region problems (Lin et al., 2010; Draper and Guttman, 1986) and a real-world application (Hung et al., 2010). This application concerns a sensor placement problem for a data center located in the IBM T. J. Watson Research Center.

4. Numerical Results for the Flexible Region Problems

Our first set of benchmark problems is the flexible region problems. As defined in Lin et al. (2010), a flexible region can be written as

$$R_m = \{x = (x_1, \ldots, x_K) \in [0,1]^K : |2(x_1 - 1/2)|^m + \cdots + |2(x_K - 1/2)|^m \le 1\},$$


Figure 4: Execution time comparison between the CPU (no parallelization), DPSO-GPU-1 (parallelization of the CCD function on the GPU), and DPSO-GPU-2 (parallelization of the CCD function on the GPU with code optimizations) for dimensions K = 2 to 5 (annotated speedups: 55.2× and 1.34×). The number of particles is 50 and the number of DPSO iterations is 200.

and the corresponding discrete grid point set with $q$ levels is

$$\bar{R}_m(q) = \{x = (x_1, \ldots, x_K) \mid x_i = \tfrac{2l_i - 1}{2q},\ l_i = 1, \ldots, q,\ i = 1, \ldots, K\} \cap R_m.$$

Figure 3 illustrates some examples of such flexible regions for K = 2, 3 and various values of m. Our choices of grid level q for different flexible parameters m and dimensions K are listed in Table 1. For the same dimension, we choose q such that the numbers of feasible grid points are similar across different flexible parameters m (which correspond to different shapes of the experimental region).
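The grid construction is easy to check directly: the candidate coordinates are (2l − 1)/(2q) for l = 1, …, q, and a point is kept when the flexible-region inequality holds. The sketch below (function name ours) reproduces two of the feasible grid point counts in Table 1(b).

```python
from itertools import product

def flexible_grid(K, q, m):
    """Feasible grid points of the flexible region R_m at grid level q."""
    coords = [(2 * l - 1) / (2 * q) for l in range(1, q + 1)]
    return [x for x in product(coords, repeat=K)
            if sum(abs(2 * (xi - 0.5)) ** m for xi in x) <= 1]

# two entries of Table 1(b)
print(len(flexible_grid(2, 11, 9999)))  # 121 (K = 2, m = 9,999, q = 11)
print(len(flexible_grid(2, 17, 1)))     # 145 (K = 2, m = 1, q = 17)
```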

All of the numerical experiments in this and the next sections were performed on a Linux workstation with Intel Xeon CPU E5520 at 2.27GHz, 24 GB main memory, and an NVIDIA TESLA C1060 GPU.

4.1. Time saving due to GPU

To illustrate how the GPU reduces the execution time of the DPSO algorithm, we compare three versions of the DPSO implementation: the CPU version, DPSO-GPU-1, and DPSO-GPU-2. The only difference between the CPU and GPU versions is the evaluation of the CCD function.

We chose grid level q = 11 on the hypercubic experimental region and executed the three versions of the code with 50 particles and 200 iterations of the DPSO algorithm to find the optimal 11-point design. Each execution of the DPSO algorithm evaluates the CCD function 10,000 times. We compared the execution times of the different versions for dimensions K = 2, 3, 4, 5. As shown in Figure 4, the computational load grows with the dimension, and so does the time saving due to the GPU. DPSO-GPU-2 achieves a 55× time saving for K = 5.

4.2. Numerical results and comparisons for K = 2 and 3

Algorithm 2 The Threshold Accepting (TA) Algorithm used in Lin et al. (2010).

Require: nRounds, nSteps, initial solution $x_c \in \Omega$, where $\Omega$ is the solution space.
Ensure: Optimal solution $x_c \in \Omega$
Initialize the threshold sequence $\{\tau_r\}$, $r = 1, \ldots,$ nRounds
for $r = 1$ to nRounds do
  for $i = 1$ to nSteps do
    randomly move to a neighborhood solution $x_n \in \Omega$
    if $f(x_n) - f(x_c) < \tau_r$ then
      $x_c = x_n$
    end if
  end for
end for

In this section, the nearly uniform design for several different flexible regions is studied based on pre-specified grid point sets. Lin et al. (2010) has used the Threshold Accepting (TA) algorithm to find the U -type design over the flexible regions based on the CCD criterion. Here we also compare DPSO (Algorithm 1 with DPSO-GPU-2 implementation) with the TA. Note that the TA is a local search method proposed by Dueck and Scheuer (1990), and the details of the TA algorithm are shown in Algorithm 2.
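The TA scheme of Algorithm 2 can be sketched compactly as follows; the function and parameter names are ours, and the threshold sequence {τr} is typically chosen to decrease toward zero.

```python
import random

def threshold_accepting(f, x0, neighbor, thresholds, n_steps):
    """Threshold Accepting (Dueck and Scheuer, 1990): accept a neighboring
    solution whenever it is not worse than the current one by more than
    the current round's threshold tau_r."""
    x = x0
    for tau in thresholds:            # outer rounds (tau usually decreasing)
        for _ in range(n_steps):      # inner iterations
            xn = neighbor(x)
            if f(xn) - f(x) < tau:    # small deteriorations are also accepted
                x = xn
    return x
```

Unlike DPSO's population of particles, TA refines a single solution; accepting mildly worse neighbors (controlled by τr) is what lets it escape shallow local optima.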

According to Lin et al. (2010), the flexible parameters are chosen from {9999, 2, 1, 0.5, 0.3}. Lin et al. (2010) reported search results for K = 2 and 3. In addition to these two cases, we further consider K = 4 and 5. The numbers of design points are set to 5 and 11, as in Lin et al. (2010). In DPSO, we set probG = probP = 0.4 for all cases. For the "RandMove" step in DPSO, probR = 0.05 for m = 9999, 2, 1, 0.5, and probR = 0.1 for m = 0.3.

To make fair comparisons, both the DPSO and TA algorithms are set to have the same number of function evaluations, equal to $10^5$ or $10^6$. To evaluate algorithm stability, we independently repeat the experiment 100 times for each flexible region case. In particular, we consider the following settings:

TA: 1,000 inner iterations, 100 outer iterations, $10^5$ function evaluations in total.

DPSO: 200 particles, 500 iterations, $10^5$ function evaluations in total.

TA+: 1,000 inner iterations, 1,000 outer iterations, $10^6$ function evaluations in total.

DPSO+: 500 particles, 2,000 iterations, $10^6$ function evaluations in total.

First, we consider the cases in which K = 2 and 3. The comparison results are summarized via the box-plots shown in Figures 5 to 7. To remove the scale effects in all cases, we draw the box-plots

Figure 5: Box-plot comparisons of the TA and DPSO algorithms for dimension K = 2 with point numbers n = 5, 11 and flexible parameters m = 9999, 2, 1, 0.5.

Figure 6: Box-plot comparisons of the TA and DPSO algorithms for K = 2 with m = 0.3 and K = 3 with m = 9999, 2, 1, for point numbers n = 5, 11.

Figure 7: Box-plot comparisons of the TA and DPSO algorithms for K = 3 with point numbers n = 5, 11 and flexible parameters m = 0.5, 0.3.


Table 2: The CCD values (scaled by $10^3$) found by DPSO with $I = 10^5, 10^6$ function evaluations, K = 4, various point numbers n, and flexible parameters m. Average timing results are shown in seconds.

(a) $I = 10^5$

m        n    Best    Median  Std    Time
9,999    5    66.738  66.955  0.201  56.3
9,999   11    38.323  38.950  0.285  58.8
2        5    66.425  66.653  0.258  58.3
2       11    37.114  37.897  0.451  71.8
1        5    65.520  65.718  0.284  59.5
1       11    37.014  37.257  0.232  74.3
0.5      5    62.993  63.245  1.050  58.4
0.5     11    34.672  35.294  0.351  78.5

(b) $I = 10^6$

m        n    Best    Median  Std    Time
9,999    5    66.738  66.738  0.175  634.9
9,999   11    37.478  37.758  0.319  686.3
2        5    66.343  66.494  0.159  580.8
2       11    36.930  37.283  0.196  719.0
1        5    65.305  65.640  0.216  592.7
1       11    36.444  36.827  0.259  735.3
0.5      5    62.993  63.091  0.199  569.7
0.5     11    34.376  34.780  0.471  729.1

Table 3: CCD values (scaled by $10^3$) for optimized designs via the DPSO algorithm with $I = 10^5, 10^6$ function evaluations, K = 5 factors, different point numbers n, and flexible parameters m. Average timing results are shown in seconds.

(a) $I = 10^5$

m        n    Best    Median  Std    Time
9,999    5    55.598  55.843  0.139  362.9
9,999   11    32.990  33.533  0.345  550.3
2        5    55.806  55.925  0.227  342.3
2       11    33.080  33.234  0.265  488.8
1        5    55.231  55.503  0.167  397.5
1       11    32.673  33.107  0.345  599.6

(b) $I = 10^6$

m        n    Best    Median  Std    Time
9,999    5    55.727  55.760  0.080  3613.0
9,999   11    32.552  32.811  0.133  5387.7
2        5    55.769  55.838  0.137  3336.8
2       11    32.100  32.512  0.226  6541.9
1        5    55.318  55.437  0.133  3889.8
1       11    31.957  32.302  0.213  7580.0

as follows. In each case, the median CCD value obtained by TA is used as the reference value. The box-plots are then drawn from the ratios of the CCD values to this TA median. These figures suggest the following observations:

• Under the same number of function evaluations, DPSO (DPSO+) outperforms TA (TA+) in terms of the minimal values, the range, and the inter-quartile range of the CCD values.

• DPSO (with 10^5 function evaluations) performs better than TA+ (with 10^6 function evaluations) in most cases.

• DPSO and DPSO+ usually obtain smaller CCD values. In some small problems, such as K = 2, n = 5, almost all of the 100 repetitions of DPSO converge to the smallest value we have ever found.


Figure 8: Schematic of the data center.

• In most cases, the DPSO algorithm leads to smaller ranges and inter-quartile ranges than those of the TA algorithms. In other words, DPSO is more stable among the 100 replications.

• For the cases with K = 3, n = 11, and m = 9999, 2, 1, the performance differences between the two algorithms decrease. This could be because these problems are too large; consequently, both the DPSO and TA algorithms search only a small portion of the solution space.
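The scale of that solution space is easy to quantify: an n-point design chosen from M feasible grid points is one of C(M, n) subsets. The snippet below illustrates how quickly this count outgrows a budget of 10^5 to 10^6 function evaluations; the grid sizes M are illustrative assumptions, not the actual |D̄| values of the regions studied here.

```python
from math import comb

# Illustrative grid sizes M (assumed, not the paper's |D-bar| values):
# an n-point design is one of C(M, n) possible subsets.
for M, n in [(100, 5), (100, 11), (1000, 11)]:
    print(f"C({M:4d}, {n:2d}) = {comb(M, n):.3e} candidate designs")
```

Even the smallest of these counts dwarfs 10^6 evaluations, which is consistent with both algorithms sampling only a tiny fraction of the candidate designs.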

4.3. Numerical results for K = 4 and 5

In addition to the cases of K = 2 and 3, we also report the results for K = 4, 5 obtained by DPSO in Tables 2 and 3. The results suggest that DPSO is a stable method capable of solving these larger problems efficiently. Using more function evaluations, i.e., I = 10^6, leads to better results in terms of the minimal values and the standard deviations of the CCD values, especially for n = 11. However, the computational times for I = 10^6 are usually about 10 times those for I = 10^5.

5. A Real Application with Slid-Rectangular Region

The proposed method is next applied to a data center thermal management problem introduced by Hung et al. (2010) and Hung (2011). A data center is an integrated facility housing multiple-unit servers, providing application services or management for data processing. The objective of this portion of the study is to obtain an optimal sensor placement plan for monitoring the thermal distribution in the data center, which is a crucial step in designing a data center with an efficient heat-removal mechanism. The sensors should be attached to facility surfaces to measure temperature in the data center, but the facilities are often stored irregularly due to usage limitations.

(a) The uniform design found by DPSO using CCD2 as the uniformity criterion; the corresponding CCD2(P) = 0.0141439. (b) The uniform design found by DPSO using \widehat{CCD}_2 as the uniformity criterion; the corresponding CCD2(P) = 0.0344518.

Figure 9: Results of uniform designs for data center modeling.

Therefore, this study leads to an experimental design problem for the irregular region shown in Figure 8. Hung et al. (2010) defined such an irregular region as slid-rectangular.

In this example, we plan to select n = 24 positions among 106 feasible ones over the slid-rectangular region to place sensors. This example is intended to demonstrate the performance of DPSO in extending uniform design searches to disconnected regions. According to the definition of the discrete CCD in Definition 4, the CCD criterion works whether or not the experimental region is connected. Using DPSO with the discrete CCD as the selection criterion, we generate a uniform design over the slid-rectangular region, illustrated in Figure 9(a). The corresponding CCD2 value is 0.0141439.
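The DPSO of this paper operates directly on discrete designs. As a rough sketch of how a swarm can search for an n-subset of feasible positions, the following uses a generic random-key encoding: a standard continuous PSO runs on real-valued key vectors, and each particle is decoded by taking the candidates with the n largest keys. Everything here (the function names, parameters, and the toy maximin objective) is an illustrative assumption, not the authors' algorithm; in practice the objective would be the discrete CCD.

```python
import random

def pso_subset(candidates, n, objective, iters=200, swarm=20,
               w=0.7, c1=1.5, c2=1.5):
    """Minimize objective(subset) over size-n subsets of candidates.
    Random-key encoding: a particle is a real vector; its design is
    the n candidates with the largest keys. (Generic sketch, not the
    authors' DPSO.)"""
    dim = len(candidates)

    def decode(keys):
        order = sorted(range(dim), key=lambda i: -keys[i])
        return [candidates[i] for i in order[:n]]

    X = [[random.random() for _ in range(dim)] for _ in range(swarm)]
    V = [[0.0] * dim for _ in range(swarm)]
    pbest, pval = [x[:] for x in X], [objective(decode(x)) for x in X]
    g = min(range(swarm), key=lambda i: pval[i])
    gbest, gval = pbest[g][:], pval[g]
    for _ in range(iters):
        for i in range(swarm):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (pbest[i][d] - X[i][d])
                           + c2 * r2 * (gbest[d] - X[i][d]))
                X[i][d] += V[i][d]
            val = objective(decode(X[i]))
            if val < pval[i]:
                pbest[i], pval[i] = X[i][:], val
                if val < gval:
                    gbest, gval = X[i][:], val
    return decode(gbest), gval

# Toy run: pick 3 of 10 collinear positions, maximizing the minimum gap.
pts = [(i, 0) for i in range(10)]
obj = lambda S: -min(abs(a[0] - b[0])
                     for idx, a in enumerate(S) for b in S[idx + 1:])
design, value = pso_subset(pts, 3, obj)
```

A random-key decoding keeps every particle feasible (always exactly n distinct positions), which is one common way to adapt continuous PSO to subset selection.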

By the definition of the uniform design (1), the non-collapsing property might not hold. In the optimal uniform design shown in Figure 9(a), projecting the design points onto the x-axis produces replicates at x = 5, 7, 9, 14, which violates the non-collapsing property. To enforce this property, we suggest modifying the CCD by adding a penalty term for replicates. For this example, where the replicates occur on the x-axis, we modify the CCD criterion to

    \widehat{CCD}_2(P) = CCD2(P) × √( (number of collapsing points on the x-axis in P) + 1 ).

The DPSO algorithm remains applicable to optimize \widehat{CCD}_2. The optimal \widehat{CCD}_2 design generated by DPSO is illustrated in Figure 9(b); its CCD2 value is 0.0346129 (i.e., transformed from \widehat{CCD}_2 back to CCD2).
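Under one reading of this penalty — counting each replicated x-coordinate beyond the first as a "collapse"; the exact counting rule is our assumption — the modified criterion is a thin wrapper around any CCD2 evaluator:

```python
import math
from collections import Counter

def x_collapses(design):
    # Number of collapsing points on the x-axis: each x value shared by
    # c design points contributes c - 1 replicates (assumed counting rule).
    counts = Counter(x for x, _ in design)
    return sum(c - 1 for c in counts.values())

def ccd2_hat(ccd2_value, design):
    # Penalized criterion: CCD2(P) * sqrt(#collapses on the x-axis + 1).
    return ccd2_value * math.sqrt(x_collapses(design) + 1)

# A design with replicates at x = 5 is penalized; a collision-free one is not.
print(ccd2_hat(0.1, [(5, 1), (5, 2), (7, 3)]))  # 0.1 * sqrt(2)
print(ccd2_hat(0.1, [(5, 1), (6, 2), (7, 3)]))  # 0.1
```

Because the penalty factor is 1 for any non-collapsing design, minimizing the wrapped criterion steers the search toward designs that are both uniform and non-collapsing.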

For the same example, we note that the probability-based Latin hypercube design (PLHD) reported in Hung (2011) has a CCD2 value of 0.0384592, which is more than 10% higher than that of the optimal \widehat{CCD}_2 design found by DPSO. This result indicates that the DPSO algorithm can obtain designs with better uniformity than PLHDs in terms of the CCD2 measure.

6. Conclusion

We have studied how optimal CCD-based uniform designs for irregular experimental regions can be determined efficiently. On the theoretical side, we have conducted a complexity analysis for the discrete CCD. On the computational side, we have developed DPSO to find optimal designs with respect to the discrete CCD criterion. Our implementation of DPSO is further accelerated by state-of-the-art GPU parallelism. Numerical results on two- and three-dimensional flexible regions show that the DPSO algorithm outperforms the TA algorithm in both efficiency and stability. The efficiency of DPSO is analyzed numerically for four- and five-dimensional flexible regions. By solving a real-world application, we further demonstrate how DPSO can be extended to handle disconnected irregular experimental regions and to avoid collapsing designs.

Several topics warrant further investigation. To generate non-collapsing uniform designs, we have suggested modifying the uniformity measure by adding a penalty for collapsing designs; one example is shown in Section 5. However, designing this penalty term may depend on the experimental region and is an interesting research problem. Another direction for future research is to extend the DPSO algorithm to generate nested space-filling designs (Qian et al., 2009). Rennen et al. (2010) have proposed a numerical method for constructing nested space-filling designs based on a specific optimality criterion. It would be interesting to see how DPSO works for such nested design problems, which can be formulated as constrained optimization problems.

Finally, it is possible to further improve the computational efficiency by using multiple GPUs. Some techniques can be found in Hung and Wang (2012).

Acknowledgments

The research of Hung is supported by National Science Foundation grants DMS 0905753 and CMMI 0927572. This specific work was supported in part by the National Science Council under grant NSC 99-2118-M-006-006-MY2 (Chen) and NSC 100-2628-M-002-011-MY4 (Wang), the Taida Institute of Mathematical Sciences, the Center for Advanced Study in Theoretical Sciences, the National Center for Theoretical Sciences (Taipei Office) and the Mathematics Division of the National Center for Theoretical Sciences (South) in Taiwan.

Auffray, Y., Barbillon, P., and Marin, J.-M. (2012). Maximin design on non hypercube domains and kernel interpolation. Statistics and Computing, 22:703–712.


Chuang, S. C. and Hung, Y. C. (2010). Uniform design over general input domains with applications to target region estimation in computer experiments. Computational Statistics & Data Analysis, 54:219–232.

Draguljić, D., Dean, A. M., and Santner, T. J. (2012). Noncollapsing space-filling designs for bounded nonrectangular regions. Technometrics, 54:169–178.

Draper, N. R. and Guttman, I. (1986). Response surface designs in flexible regions. Journal of the American Statistical Association, 81(396):1089–1094.

Dueck, G. and Scheuer, T. (1990). Threshold accepting: a general purpose optimization algorithm appearing superior to simulated annealing. Journal of Computational Physics, 90(1):161–175.

Fang, K.-T. (1980). The uniform design: application of number-theoretic methods in experimental design. Acta Mathematicae Applicatae Sinica, 3:363–372.

Fang, K.-T., Lin, D. K. J., Winker, P., and Zhang, Y. (2000). Uniform design: theory and application. Technometrics, 42:237–248.

Fang, K.-T., Lu, X., and Winker, P. (2003). Lower bounds for centered and wrap-around L2-discrepancies and construction of uniform designs by threshold accepting. Journal of Complexity, 19:692–711.

Fang, K.-T., Ma, C.-X., and Winker, P. (2002). Centered L2-discrepancy of random sampling and Latin hypercube design, and construction of uniform designs. Mathematics of Computation, 71:275–296.

Hickernell, F. J. (1998). A generalized discrepancy and quadrature error bound. Mathematics of Computation, 67:299–322.

Hua, L. K. and Wang, Y. (1981). Applications of number theory to numerical analysis. Springer and Science Press, Berlin and Beijing.

Hung, Y. (2011). Adaptive probability-based Latin hypercube designs. Journal of the American Statistical Association, 106(493):213–219.

Hung, Y., Amemiya, Y., and Wu, C. F. J. (2010). Probability-based Latin hypercube designs for slid-rectangular regions. Biometrika, 97(4):961–968.

Hung, Y. and Wang, W. (2012). Accelerating parallel particle swarm optimization via GPU. Optimization Methods and Software, 27:33–51.

Kennedy, J. and Eberhart, R. (1995). Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks, volume 4, pages 1942–1948. Piscataway, NJ: IEEE.



Lin, D. K. J., Sharpe, C., and Winker, P. (2010). Optimized U-type designs on flexible regions. Computational Statistics & Data Analysis, 54(6):1505–1515.

Ma, C. and Fang, K.-T. (2004). A new approach to construction of nearly uniform designs. International Journal of Materials and Product Technology, 20:115–126.

NVIDIA (2010). NVIDIA CUDA Programming Guide Version 3.0.

Qian, P. Z. G., Ai, M., and Wu, C. F. J. (2009). Construction of nested space-filling designs. The Annals of Statistics, 37:3616–3643.

Ranjan, P., Bingham, D., and Michailidis, G. (2008). Sequential experiment design for contour estimation from complex computer codes. Technometrics, 50(4):527–541.

Rennen, G., Husslage, B., and van Dam, E. R. (2010). Nested maximin Latin hypercube designs. Structural and Multidisciplinary Optimization, 41:371–395.

Santner, T. J., Williams, B. J., and Notz, W. I. (2003). The Design and Analysis of Computer Experiments. Springer.

Stinstra, E., den Hertog, D., Stehouwer, P., and Vestjens, A. (2003). Constrained maximin designs for computer experiments. Technometrics, 45:340–346.

Toal, D. J. J., Bressloff, N. W., Keane, A. J., and Holden, C. M. E. (2011). The development of a hybridized particle swarm for kriging hyperparameter tuning. Engineering Optimization, 43:675–699.

Appendix A. Appendix

Appendix A.1. Computational details of CCDp

We illustrate the calculation of CCD2(P) with the following example. The design P = {p1 = (1/6, 1/2), p2 = (1/2, 1/2)} contains two design points (n = 2) on the domain D̄(3) (see Definition 3), where D = {x = (x1, x2) : |2(x1 − 0.5)| + |2(x2 − 0.5)| ≤ 1}. As K = 2 and m = 2, the inner summation index is k = 1, 2, 3, 4. For every point x ∈ D̄, we divide the region D into four sub-regions: D1(x) = {r = (r1, r2) ∈ D : r1 ≤ x1, r2 ≤ x2}, D2(x) = {r ∈ D : r1 > x1, r2 ≤ x2}, D3(x) = {r ∈ D : r1 ≤ x1, r2 > x2}, and D4(x) = {r ∈ D : r1 > x1, r2 > x2}. In this example, D̄(3) = {x1 = (1/6, 1/2), x2 = (1/2, 5/6), x3 = (1/2, 1/2), x4 = (1/2, 1/6), x5 = (5/6, 1/2)}, which contains five grid points; that is, |D̄| = 5.

Figure A.10: Examples for calculating the approximate CCD (panels (a)-(f)).

Figure A.11: A trivial example: the 1-point design P = {(1/2, 1/2)} on the domain D = [0, 1]^2.

From Eq. (4), we must compute N(D̄k(x), P) and |D̄k(x)| at these five grid points. For example, at x1, as shown in Figure A.10(b), N1(x1) = 0, N2(x1) = 0, N3(x1) = 2, N4(x1) = 0, |D̄1(x1)| = 1, |D̄2(x1)| = 0, |D̄3(x1)| = 3, and |D̄4(x1)| = 1. At x2, as shown in Figure A.10(c), N1(x2) = 1, N2(x2) = 1, N3(x2) = 0, N4(x2) = 0, |D̄1(x2)| = 1, |D̄2(x2)| = 3, |D̄3(x2)| = 0, and |D̄4(x2)| = 1. In short, we have

CCD2(P)^2 = (1/5){ (1/4)[(0/2 − 1/5)^2 + (0/2 − 0/5)^2 + (2/2 − 3/5)^2 + (0/2 − 1/5)^2]
                 + (1/4)[(1/2 − 1/5)^2 + (1/2 − 3/5)^2 + (0/2 − 0/5)^2 + (0/2 − 1/5)^2]
                 + (1/4)[(2/2 − 3/5)^2 + (0/2 − 1/5)^2 + (0/2 − 1/5)^2 + (0/2 − 0/5)^2]
                 + (1/4)[(2/2 − 4/5)^2 + (0/2 − 0/5)^2 + (0/2 − 1/5)^2 + (0/2 − 0/5)^2]
                 + (1/4)[(2/2 − 4/5)^2 + (0/2 − 1/5)^2 + (0/2 − 0/5)^2 + (0/2 − 0/5)^2] }

and therefore CCD2(P) = 0.197484176581315.

In addition, we compute the approximate CCD2 for different grid levels q in the trivial example shown in Figure A.11. In this example, the CCD integral has an explicit form, so we can compute the exact value of CCD2(P)^2:

CCD2(P)^2 = ∫_D (1/4) Σ_{k=1}^{4} | N_k(x)/1 − v_k(x)/1 |^2 dA
          = 4 ∫_{[0,1/2]^2} (1/4) [ (xy)^2 + ((1−x)y)^2 + (x(1−y))^2 + (1 − (1−x)(1−y))^2 ] dA   (by symmetry)
          = ∫_0^{1/2} ∫_0^{1/2} [ (xy)^2 + ((1−x)y)^2 + (x(1−y))^2 + (1 − (1−x)(1−y))^2 ] dx dy
          = ∫_0^{1/2} [ (1/3)y^2 + (1/3)(1−y)^2 + 1/2 − (3/4)(1−y) ] dy
          = 23/288.

That is, CCD2(P) = (23/288)^{1/2}

≈ 0.282597083. We also compute CCD2(P) for grid levels q = 11, 51, and 101 and obtain the following results: for q = 11, CCD2(P) ≈ 0.2814774653; for q = 51, CCD2(P) ≈ 0.2825451145; for q = 101, CCD2(P) ≈ 0.2825838355.
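Both worked examples can be checked numerically. The sketch below implements the discrete CCD2 for K = 2 as we read Eq. (4), using the ≤/> quadrant conventions of this appendix (boundary tie-breaking is our assumption). It reproduces CCD2(P) ≈ 0.19748 for the two-point design on the diamond region and, for the one-point design on [0, 1]^2, approaches (23/288)^{1/2} ≈ 0.28260 as the grid level q grows.

```python
import math
from itertools import product

def ccd2(design, grid):
    """Discrete CCD_2 for K = 2: for each grid point x, compare the
    design-point and grid-point fractions in the four quadrants at x."""
    n, m = len(design), len(grid)
    total = 0.0
    for x in grid:
        quad = lambda r: (r[0] > x[0]) + 2 * (r[1] > x[1])  # D1..D4
        N, M = [0] * 4, [0] * 4
        for p in design:
            N[quad(p)] += 1
        for g in grid:
            M[quad(g)] += 1
        total += sum((N[k] / n - M[k] / m) ** 2 for k in range(4)) / 4
    return math.sqrt(total / m)

def centered_grid(q, inside=lambda x, y: True):
    # Grid level q: cell centers (2i - 1) / (2q), kept if inside the region.
    c = [(2 * i - 1) / (2 * q) for i in range(1, q + 1)]
    return [(x, y) for x, y in product(c, c) if inside(x, y)]

# Two-point design on the diamond region D, grid level q = 3 (|D-bar| = 5).
diamond = lambda x, y: abs(2 * (x - 0.5)) + abs(2 * (y - 0.5)) <= 1
g3 = centered_grid(3, diamond)
val_diamond = ccd2([(1/6, 1/2), (1/2, 1/2)], g3)
print(len(g3), val_diamond)  # 5 grid points, CCD2 ≈ 0.19748417658...

# One-point design on [0, 1]^2: grid values approach (23/288) ** 0.5.
val_square = ccd2([(0.5, 0.5)], centered_grid(31))
print(val_square, (23 / 288) ** 0.5)
```

Because the grid fractions |D̄k(x)|/|D̄| converge to the volume fractions vk(x), the discrete values approach the exact integral regardless of how boundary ties are resolved.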
