Data Depth - A Nonparametric Monitoring Method for Profile Data Based on Simplicial Data Depth

2.2.1 Introduction

Multivariate data and profiles are usually analyzed under the normality assumption, under which the characteristics of the data can be estimated using classical statistical methods. Many multivariate statistical methods have been developed under normality, an assumption that is often hard to justify and is sometimes violated in practice. Hence nonparametric methods for multivariate analysis are desirable. Data depth is completely nonparametric because it analyzes data based on the relative position or rank of the data points, without parametric assumptions on the underlying distribution.

A data depth is a measure of the “centrality” or “outlyingness” of a multivariate observation with respect to a set of reference data points (or their probability distribution). It provides a natural center-outward ordering of data points in a given sample, so data depth can be used to reduce each multivariate observation (or to quantify some complex features of the underlying distribution) to its univariate center-outward rank. In general, the greater the depth of a point, the more densely it is surrounded by other sample points.

For example, on the real line R, the median of a given set of points has the maximum depth. In R², high depth corresponds to “centrality” and low depth corresponds to “outlyingness”: a point has high depth when it is centrally located among the sample points.

Over the years, a large number of depth measures have been proposed. Existing data depths [14] include: Mahalanobis depth, half-space depth, simplicial depth, projection depth, spherical depth, majority depth, location depth, Oja depth, zonoid depth, L-1 depth, etc.

Different notions of depth are capable of capturing different characteristics and may lead to different ordering schemes. However, all the depth orderings are based on the notion of center-outward ranking. All the notions of depth produce their “deepest” points, which have been considered as multivariate medians. For convenience, the simplicial depth will be used for the demonstrations throughout the paper.

2.2.2 Simplicial Depth

The simplicial depth of a point X with respect to a probability distribution F on R² is the probability that X belongs to a random triangle in R². The simplicial depth of a point X with respect to a data set S in R² is defined by Liu [9] as the proportion of the triangles (constructed by three of the data points) that contain the point X (a point on the boundary is considered as contained in the triangle). The dimension can be easily extended to R^p, p > 2, but we only consider the bivariate setting here. In this paper, we utilize simplicial depth as a measure of centrality of a given point relative to a given sample in R². In general, a point with larger depth value indicates that the point is contained in many triangles constructed from the data set, so the point lies deeper within the data set.

A direct count appears to require O(m³) computer operations to calculate the simplicial depth of a point X relative to a set of m points. Rousseeuw and Ruts [18] proposed a faster algorithm that computes the simplicial depth in O(m log m) operations by combining geometric properties with certain sorting and updating mechanisms. They implemented both the algorithm and the “naive” method to verify the result. For instance, the algorithm is about 90000 times as fast as the “naive” method when m = 1000. The algorithm is very useful because our simulation study requires the simplicial depth to be computed at many X's. In 2009, Masse and Plante compiled a package named “depth” for the statistical software R, based on Fortran code from Rousseeuw and Ruts [18]. The description of the package is available at http://cran.r-project.org/web/packages/depth/depth.pdf.
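The “naive” O(m³) count can be sketched in a few lines of Python. This is only an illustration of the definition, not the Rousseeuw and Ruts algorithm and not the code of the “depth” package; the function names are hypothetical, and the points are assumed to be in general position (distinct, no three on a line).

```python
from itertools import combinations


def cross(o, a, b):
    """Twice the signed area of triangle o-a-b; the sign gives its orientation."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_triangle(x, a, b, c):
    """True if x lies inside or on the boundary of the non-degenerate triangle a-b-c."""
    d1, d2, d3 = cross(a, b, x), cross(b, c, x), cross(c, a, x)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    # x is outside exactly when the three orientations have mixed signs.
    return not (has_neg and has_pos)


def simplicial_depth(x, sample):
    """Proportion of triangles with vertices in `sample` that contain x (naive O(m^3))."""
    triangles = list(combinations(sample, 3))
    hits = sum(in_triangle(x, a, b, c) for a, b, c in triangles)
    return hits / len(triangles)


sample = [(0, 0), (4, 0), (0, 4), (4, 4)]
print(simplicial_depth((2, 1), sample))    # 2 of the 4 triangles contain (2, 1)
print(simplicial_depth((10, 10), sample))  # no triangle contains a far-away point
```

Enumerating all triples is only practical for small m, which is why the faster algorithm matters in a simulation study.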

The simplicial depth of a point X with respect to a continuous distribution F is defined as

SD_F(X) = P_F\{X \in \triangle(X_{i_1}, X_{i_2}, X_{i_3})\}, (3)

where X_{i_1}, X_{i_2}, X_{i_3} are three random observations from the distribution F. When the distribution F is unknown, a sample version of simplicial depth is defined as follows. Let T be the set of all triangles formed by vertices from a reference sample {X₁, . . . , X_m} following distribution F. Each triangle requires three vertices, so T has \binom{m}{3} triangles, assuming all m points are distinct and no three points are on a line. For any X ∈ {X₁, . . . , X_m}, the sample version of simplicial depth SD_F_m(X) is defined as

SD_{F_m}(X) = \binom{m}{3}^{-1} \sum_{\triangle \in T} I(X \in \triangle), (4)

where I(X \in \triangle) = 1 if X belongs to the triangle \triangle and 0 otherwise. It is shown in Liu [10] that SD_F_m(·) converges uniformly and strongly to SD_F(·) under some regularity conditions. So we can approximate SD_F(·) by SD_F_m(·) when F is unknown. It is also shown in Liu [10] that Liu’s control charts are coordinate free because SD_F(·) is affine invariant.

Let {Y₁, . . . , Y_n} be the new observations to be monitored. Assume Y₁, . . . , Y_n are i.i.d. following a continuous distribution G. The monitoring scheme is aimed at comparing G with F by testing whether there are differences between F and G. We might attribute the differences to a location shift and/or a scale shift. To test whether there is any difference, we need to calculate the simplicial depth of all X ∈ {X₁, . . . , X_m} and Y ∈ {Y₁, . . . , Y_n}. With probability one, Y_i ∉ {X₁, . . . , X_m} for any i ∈ {1, . . . , n}. We remark that when computing the simplicial depth of a new observation Y with respect to a reference sample, we should treat Y as one of the sample points. Hence the computation should be based on the total number of triangles generated from the m + 1 points {Y, X₁, . . . , X_m}, counting those triangles that contain the point Y. Writing T^* for the set of these \binom{m+1}{3} triangles and I(·) for the indicator function, the simplicial depth of Y can be calculated by

SD^*_{F_m}(Y) = \binom{m+1}{3}^{-1} \sum_{\triangle \in T^*} I(Y \in \triangle). (5)
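The remark about treating Y as one of the sample points can be illustrated with a minimal Python sketch. It pools Y with the reference sample and counts, among all triangles on the m + 1 points, those containing Y; a triangle with Y as a vertex counts because boundary points are considered contained. The function names are hypothetical and the points are assumed to be in general position.

```python
from itertools import combinations


def cross(o, a, b):
    """Twice the signed area of triangle o-a-b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_triangle(x, a, b, c):
    """True if x lies inside or on the boundary (vertices included) of triangle a-b-c."""
    d1, d2, d3 = cross(a, b, x), cross(b, c, x), cross(c, a, x)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)


def depth_of_new_point(y, reference):
    """Depth of a new observation y, with y pooled into the reference sample."""
    pooled = [y] + list(reference)
    total = 0
    hits = 0
    for a, b, c in combinations(pooled, 3):
        total += 1                      # all triangles on the m + 1 points
        hits += in_triangle(y, a, b, c)  # includes triangles with y as a vertex
    return hits / total


reference = [(0, 0), (4, 0), (0, 4), (4, 4)]
# 6 triangles have y as a vertex and 2 reference triangles contain y: 8 of 10.
print(depth_of_new_point((2, 1), reference))
```

Because every triangle with Y as a vertex is counted, the depth of a new observation is bounded away from zero by a term that depends only on m, which does not affect the ranking among new observations.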

In the following, we introduce three control charts given in the literature, which are constructed based on the simplicial depth values of multivariate observations to simultaneously detect a location change and/or a scale increase in a process [11] [14].

2.2.3 r-charts

Liu [11] proposed a control chart called the r-chart that monitors the values of the relative rank (r-value) of new observations with respect to a distribution or a reference sample. The r-value of an observation Y is defined as

r_F(Y ) = P {SD_F(X) < SD_F(Y )|X ∼ F }, (6)

or, for the sample version,

r_{F_m}(Y) = \frac{1}{m} \sum_{i=1}^{m} I\big(SD_{F_m}(X_i) < SD^*_{F_m}(Y)\big). (7)

It indicates the relative position of point Y with respect to the reference sample {X₁, . . . , X_m}.

A large r-value indicates that there are many points in the reference sample more outlying than point Y. Conversely, a small r-value means that Y is located at an outlying position with respect to the reference sample, which means that Y is unlikely to come from the same distribution F as that of the reference sample. Thus, a very small r-value of an observation Y would suggest a possible deviation from the in-control state of the process. This is the main idea behind the r-chart and the other two charts.
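As a small illustration of the sample r-value, the following Python sketch computes the fraction of reference depths that are strictly smaller than the depth of a new observation. The function name and the depth values are made up for the example; in practice the depths would come from a simplicial depth computation.

```python
def r_value(reference_depths, new_depth):
    """Fraction of reference points whose depth is strictly below the depth of Y."""
    below = sum(1 for d in reference_depths if d < new_depth)
    return below / len(reference_depths)


# Hypothetical depths of five reference points.
reference_depths = [0.10, 0.25, 0.40, 0.55, 0.70]

print(r_value(reference_depths, 0.60))  # 0.8: Y is deeper than 4 of the 5 points
print(r_value(reference_depths, 0.05))  # 0.0: Y is more outlying than every point
```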

Briefly speaking, the r-chart is analogous to the X-chart in the univariate case (also called the individual control chart), but it monitors the r-values {r_F_m(Y₁), . . . , r_F_m(Y_n)} rather than the original values {Y₁, . . . , Y_n}. Suppose the false-alarm rate is set at a. We can then choose the center line CL = 0.5 and the lower control limit LCL = a, based on the following proposition, which was established in Liu [11].

Proposition 2.2.1 Assume that F = G and Y ∼ G. Let U[0, 1] denote the uniform distribution on the interval [0, 1], and let the notation \xrightarrow{L} stand for convergence in law. If SD_F(Y) has a continuous distribution, then

(1) r_F(Y) ∼ U[0, 1];

(2) as m → ∞, r_{F_m}(Y) \xrightarrow{L} U[0, 1] along almost all {X₁, . . . , X_m} sequences, provided that SD_F_m(·) converges to SD_F(·) uniformly as m → ∞.

The process is considered to be out of control if r_F_m(Y) falls below LCL = a. Such a signal indicates quality deterioration, such as a loss of accuracy and/or a loss of precision.

The r-chart with LCL = a corresponds to an a-level test of the following hypotheses:

H₀ : F = G vs. H_a : there is a location shift and/or a scale increase from F to G. (8)

In particular, while many r-values falling below LCL = a would indicate quality deterioration, many r_F_m(Y)’s close to 1 would suggest a possible reduction in dispersion, which may indicate a process improvement in reality.

If we are sure that there is no location change, the r-chart could be revised to detect the scale change only. Liu, Singh, and Teng [14] suggested that we could remove any possible location change by centering all data to the same location, i.e., subtracting the deepest point (with largest simplicial depth) from all data. Then we use the centered data to construct the r-chart with CL = 0.5, LCL = a/2, and UCL = 1 − a/2.

2.2.4 Q-charts

Liu [11] proposed another control chart called the Q-chart that monitors the Q-values calculated from the r-values. The idea of the Q-chart is analogous to that of the univariate X̄-chart. It plots the subgroup averages of consecutive r-values. Denote the subgroup size by q. Now we give the notation of Q-values. Denote the average of the r_F(Y_i)’s (r_F_m(Y_i)’s) in the j-th subgroup of q consecutive observations by Q(F, G^j_q) (Q(F_m, G^j_q)).

The center line is always set at CL = 0.5, but the LCL depends on the choice of q. The following result regarding LCL was given in Liu [11] and Liu, Singh, and Teng [14]. When q is large, by the Central Limit Theorem, the LCL is approximately 0.5 − z_a (12q)^{−1/2} for plotting the Q(F, G^j_q)’s and 0.5 − z_a [(1/m + 1/q)/12]^{1/2} for plotting the Q(F_m, G^j_q)’s. When q is relatively small and a ≤ 1/q!, the LCL is exactly (q! a)^{1/q}/q. This LCL can also be a reasonable approximation when a is slightly over 1/q!. Similar to the r-chart, the Q-chart could be used to detect only scale changes by using the centered data as described before.
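The two kinds of limits can be evaluated with a short Python sketch. It rests on two assumptions that are ours rather than the text's: z_a is read as the upper a quantile of the standard normal distribution, and the function names are hypothetical. The exact limit follows from the fact that the sum S of q independent U[0, 1] variables satisfies P(S ≤ t) = t^q/q! for t ≤ 1, so P(S/q ≤ c) = a gives c = (q! a)^{1/q}/q whenever a ≤ 1/q!.

```python
from math import factorial, sqrt
from statistics import NormalDist


def q_chart_lcl_exact(a, q):
    """LCL (q! a)^(1/q) / q, exact when a <= 1/q! and the r-values are U[0, 1]."""
    return (factorial(q) * a) ** (1 / q) / q


def q_chart_lcl_clt(a, q, m=None):
    """Large-q LCL from the Central Limit Theorem.

    With m=None the limit is for Q(F, G_q^j); with a reference sample of
    size m it is for Q(F_m, G_q^j).
    """
    z_a = NormalDist().inv_cdf(1 - a)  # assumption: upper a quantile of N(0, 1)
    if m is None:
        return 0.5 - z_a * sqrt(1 / (12 * q))
    return 0.5 - z_a * sqrt((1 / m + 1 / q) / 12)


print(q_chart_lcl_exact(0.025, 4))      # small q, a <= 1/4! = 0.0417
print(q_chart_lcl_clt(0.025, 12))       # large q, F known
print(q_chart_lcl_clt(0.025, 12, 100))  # large q, F estimated from m = 100 points
```

For q = 1 the exact formula reduces to LCL = a, which agrees with the r-chart.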

For a given reference sample {X₁, . . . , X_m} ∼ F and an incoming new sample Y ∼ G, Liu and Singh [13] proposed a quality index Q(F, G) = P{SD_F(X) ≤ SD_F(Y) | X ∼ F, Y ∼ G} (= E_G(r_F(Y))), where r_F(Y) = P{SD_F(X) < SD_F(Y) | X ∼ F}, and used it to measure the difference between the distributions F and G. The previous definitions of Q(F, G_q) and Q(F_m, G_q) are sample approximations of Q(F, G). Based on this index, Liu, Singh, and Teng [14] showed that the Q-chart could be inefficient in detecting a minor location shift. They proposed the following data-depth-moving-average chart (DDMA-chart) to overcome this drawback.

2.2.5 DDMA-charts

The idea of the DDMA-chart is analogous to that of the univariate Moving Average chart. The DDMA-chart monitors the DDMA-values calculated from the moving averages of the original reference sample {X₁, . . . , X_m} and new observations {Y₁, . . . , Y_n} as follows. Let q be the length of the moving window, and let X̃_i = (X_i + · · · + X_{i+q−1})/q, i = 1, . . . , m − q + 1, be the moving averages of the reference sample; the moving averages Ỹ_i of the new observations are formed in the same way. With these values, we can calculate the new r-value for each moving average Ỹ_i with respect to the reference sample {X̃₁, . . . , X̃_{m−q+1}} in the same way as in (7), with the moving averages in place of the individual observations.

Since the DDMA-chart is the r-chart of moving averages, it has CL = 0.5 and LCL = a. The only difference is the data used for calculating simplicial depth values: the r-chart uses r-values of individual data points, while the DDMA-chart uses r-values of the moving averages of q data points.
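The moving-average step can be sketched as follows. This is a minimal illustration with a hypothetical function name; it assumes a forward window, so the i-th average uses observations i, . . . , i + q − 1, and a series of length m yields m − q + 1 averaged points.

```python
def moving_averages(points, q):
    """Coordinate-wise averages of every q consecutive points."""
    if not 1 <= q <= len(points):
        raise ValueError("q must be between 1 and the number of points")
    dim = len(points[0])
    return [
        tuple(sum(p[k] for p in points[i:i + q]) / q for k in range(dim))
        for i in range(len(points) - q + 1)
    ]


points = [(0, 0), (2, 4), (4, 2), (6, 6)]
print(moving_averages(points, 2))  # three averaged points from four observations
```

The same function would be applied to both the reference sample and the new observations before depths and r-values are computed on the averaged points.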

Liu, Singh, and Teng [14] explained why the DDMA-chart is more sensitive to minor location shifts than the Q-chart while retaining the same ability to detect scale shifts. More specifically, let the length of the moving window be q > 1 and assume there is only a location shift between F and G. Then the DDMA-chart will exhibit a location shift √q times as large. In other words, the DDMA-chart amplifies the effect of the location shift by a factor of √q. If the proportion of the points falling below LCL is larger than the false-alarm rate a, it may suggest that there is a location shift and/or a scale increase between the distributions F and G. Furthermore, if the proportion increases as q increases, then there is a strong indication of a location shift. On the other hand, if the proportion does not increase as q increases, then it indicates a scale shift between the distributions F and G. In summary, the DDMA-chart improves on the Q-chart in its power to detect location shifts while retaining the same power to detect scale shifts as the Q-chart. Hence it is suggested that the Q-chart and the DDMA-chart be used side by side in general practice.
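One way to see where the √q factor comes from is the following short calculation. It assumes, for illustration only, that the new observations are independent with common covariance matrix Σ, that the location shift is a change of mean by δ affecting all q observations in the window, and that the size of a shift is measured by the Mahalanobis distance; these symbols are ours.

```latex
\[
\begin{aligned}
\tilde{Y}_i &= \frac{1}{q}\sum_{k=0}^{q-1} Y_{i+k},
\qquad
\operatorname{Cov}(\tilde{Y}_i) = \frac{1}{q}\,\Sigma, \\
\sqrt{\delta^{\top}\Bigl(\tfrac{1}{q}\Sigma\Bigr)^{-1}\delta}
&= \sqrt{q}\,\sqrt{\delta^{\top}\Sigma^{-1}\delta}.
\end{aligned}
\]
```

Averaging leaves the shift δ unchanged but shrinks the spread by a factor of √q, so relative to the spread of the moving averages the shift appears √q times as large.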

If we observe the same effect in both the Q-chart and the DDMA-chart, we could conclude that only a scale shift has occurred. If the out-of-control proportion in the DDMA-chart is larger than that in the Q-chart based on the same moving window (q), then we could conclude that there is a location change in the process, in addition to potential scale changes.

3 Methodology

3.1 Data Smoothing

Data smoothing techniques are used to “eliminate” noise and extract the real trends and patterns of profiles. For a given nonlinear profile, smoothing returns a profile that contains less noise than the original profile and yet retains the basic shape and important features of interest in the original data. The most popular approach is to use a basis function expansion, with bases such as Fourier, spline, power, exponential, and wavelet bases. In this paper, we adopt spline smoothing by fitting a cubic smoothing spline to the data.
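For reference, a cubic smoothing spline is commonly defined as the minimizer of a penalized least squares criterion. The notation below is ours and is given only as a sketch: (x_i, y_i), i = 1, . . . , N, are the observed points of a profile and λ is the smoothing parameter.

```latex
\[
\hat{f} = \arg\min_{f}\left\{ \sum_{i=1}^{N} \bigl(y_i - f(x_i)\bigr)^2
          + \lambda \int f''(t)^2 \, dt \right\},
\qquad \lambda \ge 0 .
\]
```

A larger λ penalizes curvature more heavily and gives a smoother fit, while λ = 0 leaves the data unsmoothed in the sense that the fit interpolates the observations.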

We use the command “smooth.spline” in the statistical software R to perform data smoothing. More commands are available in R for data smoothing, for example “splineDesign” or “bs”, “locpoly”, and “ksmooth”, corresponding to other commonly used methods: B-spline regression, local polynomial smoothing, and kernel regression smoothing, respectively. We remark, based on our experience, that by filtering out noise the actual signals can be better extracted from the data and the subsequent principal component analysis (PCA) can explore the variation among the profiles more effectively. In particular, smoothing tends to be more advantageous as the noise level (σ_ε²) gets larger.
