國 立 交 通 大 學
統計學研究所
碩 士 論 文
單調無母數迴歸在廣義線性模型上之應用
Nonparametric Monotone Regression for
Generalized Linear Models
研 究 生 : 文誠智
指導教授 : 洪志真 博士
單調無母數迴歸在廣義線性模型上之應用
Nonparametric Monotone Regression for
Generalized Linear Models
研 究 生:文誠智
Student
:Cheng-Chih Wen
指導教授:洪志真
Advisor
:Dr. Jyh-Jen Horng Shiau
國 立 交 通 大 學
統計學研究所
碩 士 論 文
A Thesis
Submitted to Institute of Statistics
College of Science
National Chiao Tung University
in Partial Fulfillment of the Requirements
for the Degree of
Master
in
Statistics
June 2007
Hsinchu, Taiwan, Republic of China
單調無母數迴歸在廣義線性模型上之研究
研究生:文誠智 指導教授:洪志真 博士
國立交通大學統計學研究所
摘要
本篇文章裡,為解決WAT(Wafer Acceptance Test)-EC(Engineering
Control) 的 問 題 , 我 們 發 展 了 單 調 無 母 數 迴 歸 。 藉 由 Gu(2002) ,
Zhang(2004)所提出的方法加以結合及修正,將反應變數拓展至整個指
數族上,與此相關的演算法也會在本文中提出。我們利用Natural Cubic
Splines的性質發展出有效率的計算法,並用模擬資料來探討其效率。
當反應變數為Bernoulli或Poisson分部時,其模擬的結果都有不錯的
表現。此外,在“真實函數"具有單調性的情形下,有單調限制估計
量之ASE(Averages Square Error)與無單調限制並沒有明顯差異。然
而,當無單調限制之估計量呈現出非單調時,則單調限制估計量在ASE
上之表現會明顯優於前者。最後,我們將說明如何利用此方法來篩選
EC中的WAT測試項目並且建立適當的管制上下限。
Nonparametric Monotone Regression for
Generalized Linear Models
Student: Cheng-Chih Wen
Advisor: Dr. Jyh-Jen Horng Shiau
Institute of Statistic
National Chiao Tung University
Abstract
In this study, motivated by the WAT-EC problem, we develop a nonparametric monotone smoothing spline smoother for analyzing responses from exponential families by combining the methodologies provided in Gu (2002) and Zhang (2004) along with our modification. An algorithm with implementation details is provided. Computation is efficient because we utilize the characteristics of the natural cubic splines. The effectiveness of the proposed method is studied by simulation. The simulation results demonstrate that the proposed method performs well in the regression models with both the Bernoulli and Poisson responses. When the “true” function is monotonic, the proposed monotone estimator performs about the same as the unconstrained smoother in terms of the averaged squared error for the cases when the latter performs well. On the other hand, constrained smoother outperforms the unconstrained smoother when the unconstrained smoother produces non-monotone estimates. As an illustrative example, we demonstrate the proposed method can be used in screening WAT test items for more stringent engineering control and in setting appropriate control limits.
誌 謝
陳鄰安教授曾經說過:「數學可以因數學而數學,統計不能因統計而統計」,我
用了兩年的碩士生涯才體會到這兩句話的涵義。不僅如此,除了在統計方面學到許多
新知識,也認識很多好朋友。對於此篇論文的完成,我要感謝我的指導老師 洪志真
教授仔細認真的批閱,以及教導我做學問應有的態度,並耐心回答我在論文上所遇到
的問題,使我順利完成論文。也要感謝我的研究所同學們俊睿、益銘、花花、雪芳、
永在和建威,給我鼓勵與歡笑,帶給我美好的碩士生回憶。
最後要感謝博士班的碩慧學姐,除了論文上的協助外,也時常不吝惜分享其人生
經驗與價值觀,讓我了解自己欠缺的知識,並在我人生的規劃中給予適時的建議,受
益良多。
此外還要感謝家人的支持,以及女朋友的體諒,使我可以毫無顧慮地專心完成論
文。僅將此論文獻給我的老師、學姐、家人以及朋友們
文 誠 智 謹誌于
國立交通大學統計研究所
中華民國九十六年六月
Contents
1 Introduction 1
2 Literature Review 5
2.1 Smoothing Splines - Natural Cubic Splines . . . 5
2.2 Monotone Smoothing Splines . . . 8
2.3 Regression with Responses from Exponential Families . . . 10
3 Proposed Method 11 3.1 Penalized Weighted Least Squares Function Estimator for Exponential Families . . . 11
3.2 Monotone Natural Spline Estimator . . . 12
3.3 Algorithm . . . 15
4 Simulation Studies 16 4.1 Performance Evaluation . . . 16
4.1.1 Bernoulli Data . . . 16
4.1.2 Poisson Data . . . 17
4.2 Monotone vs. Constraint-Free Smoothing Spline Estimator . . . 18
5 Examples 20
6 Conclusions 21
A Appendix: Proofs 23
1
Introduction
The Wafer Acceptance Test (WAT) in semiconductor manufacturing is aimed at mon-itoring whether the electric characteristics of devices, such as voltage, current, and re-sistance, are regular or not. In Fab, every wafer must go through WAT device testing. In addition, a more stringent control, hereinafter referred to as “engineering control (EC)”, is further imposed on passing wafers. For implementing EC, engineers would need to select a set of critical WAT test items and determine more stringent “control” limits (than that of the regular WAT) for each of them. A wafer will be “held” up for further investigation when any of the EC items fails (i.e., exceeding the prescribed EC limits). However, without any objective assessment on how critical the WAT test items are toward the yield, engineers tend to select potential items as many as possible (sometimes even in hundreds) to perform the extra engineering control. Unfortunately, more than tolerable number of false alarms often occur with this practice. As a result, annoyed by the excess number of false alarms, engineers tend to ignore them, despite the extra efforts and costs spent in performing EC. Hence, to make EC more effective, it would be helpful if an assessment tool for screening EC test items is available for engineers and for evaluating the adequacy of the predetermined EC control limits as well.
Motivated by the above problem, as an assessment tool, we propose developing an EC performance curve for each WAT test item. The proposed EC performance curve of a test item aims at presenting the relationship between the EC passing rate and the
circuit probe (Cp) yield of the wafers, where the EC passing rate is the probability that
a randomly selected wafer passes this EC test item and the Cp yield is the proportion
of the chips on the wafer that pass the functional test called the circuit probe test. The logic behind defining the EC performance curve as such is that an effective control of
a critical EC test item would improve the process, which in turn leads to a higher Cp
yield. Thus an EC test item that can discriminate the Cp yield to some extent would be
1 displays the EC performance curves of three EC test items showing respectively the
probabilities, denoted by p(Cp) as a function of the Cp yield, of a wafer passing these
EC test items. The dashed curve depicts that the passing probability is one or almost
one for Cp > .15, indicating that most of wafers with Cp greater than .15 would pass
this test easily. In other words, such an EC test item can only discriminate low-yield wafers. Conversely, the dotted curve discriminates only wafers with high yields. The
solid curve discriminates better for the middle values of the Cp yield. Thus engineers
can pick the EC test items based on the process under study or monitoring. For
example, for a process with a fairly high Cp yield, an EC test with a performance curve
similar to the dotted curve may be a good candidate for the engineering control. We remark that a set of inappropriate control limits may make the performance curve of a critical-in-nature EC test item lose some or all of its discrimination power. Thus engineers may be able to adjust the control limits of an EC test to an appropriate level so that this particular EC test would have a desirable discrimination power for effective control.
Furthermore, although the WAT-EC tests and the circuit probe tests are quite different in nature, one tends to expect that the two test results of the same wafer should be somewhat positively associated. Thus it is reasonable to assume that the EC performance curves are monotonic.
An example of some similarity in nature as our WAT-EC problem is the item responses estimation problem discussed in Rossi, Wang, and Ramsay (2002). The data set consists of the responses of N examinees to n question items in a test. Assume that each item is answered either right or wrong. The authors proposed to estimate nonparametrically the probability that examinee j gets item i right from the discrete data and a covariate such as the IQ score for each of the N examinees via the EM Algorithm. Moreover, the discrimination power of the test items was also discussed in the paper.
To develop the EC performance curve for an EC test item, we adopt the non-parametric regression approach to estimating the functional relationship between the
passing rate and the Cp yield. In the nonparametric regression approach, the only
assumption on the regression function is smoothness and no functional form needs to be specified, which provides a great advantage of flexibility in function estimation and some convenience in modeling. The price we pay for adopting the nonparametric regres-sion approach instead of the parametric regresregres-sion approach is the slight inefficiency. However, this inefficiency only happens when the specified parametric regression model is adequate.
The statistical model we consider for the WAT-EC data is as follows. The indepen-dent variable (i.e., the covariate) of the regression is a random variable X representing
the Cp yield of a wafer. The dependent variable Y is the corresponding pass/fail result
of the WAT-EC test item for that wafer. Recall that p(x) is the passing probability of
the wafer with the Cp yield X = x. Then the dependent variable Y has a Bernoulli
distribution with a passing probability p(x). As mentioned before, we will estimate the EC performance curve p(·) by nonparametric monotone regression.
Although this study is motivated by the WAT-EC application, the nonparametric monotone regression estimation method developed in this work can be applied to ap-plications with the dependent variable Y from the exponential family. For example, the number of particles on a wafer with a covariate affecting the number of particles may be modeled by a Poisson distribution in which the mean number of particles may be of interest and could be described as a monotone function of the covariate. Or the number of defects in a product item may be again modeled by a Poisson distribution and the covariate could be a process condition that is monotonically associated with the number of defects when the product item was manufactured.
The topic of monotone function smoothing has been discussed for quite a long time in the literature. One of the major techniques used in the monotone regression focuses on the first derivative of the function to be estimated. Under the assumption that the random errors follow the normal distribution, Ramsay (1998) proposed expressing the first derivative of a monotone function as the exponential of a smooth function and estimating the smooth exponent by a B-spline. Under the same Gaussian model
for random errors, by adopting the penalized least squares approach, Zhang (2004) developed a simple method trying to obtain a monotone function estimate by forc-ing the estimated first derivative of the function to be non-negative or non-positive in computation. Based on the method constructed by Ramsay (1998), Wang (2000) ex-tended the distribution to the exponential family and developed a two-step algorithm to implement a monotonic regression technique. For more research works on nonpara-metric monotone regression, see Ramsay (1998), Wang (2000), Zhang (2004), and the references cited therein.
In this study, we adopt a different approach from that in Wang (2000). Instead of the exponent approach by Ramsay (1998), We combine the penalized likelihood approach given in Gu (2002) for estimating the parameter function when responses are from exponential families with the Zhang’s approach of forcing the estimated parameter function to be monotonic. With this approach, it is more natural that we use smoothing splines rather than B-splines. We remark that Zhang’s method has a hard-to-see flaw and with this flaw the monotonicity of the estimated function cannot be guaranteed. We modify Zhang’s method to ensure the monotonicity.
The rest of the paper is organized as follows. Section 2 reviews the three components we use in developing our estimation method, including smoothing splines as described in Green and Silverman (1994), Zhang’s approach to forcing a smoothing spline estimate to be monotonic, and Gu’s approach and algorithm for nonparametric regression when data are from exponential families. Section 3 describes the estimation method and the algorithm we propose in this paper. Section 4 examines the effectiveness of the proposed method by a simulation study for the cases of Bernoulli data and Poisson data. A comparison study is conducted to demonstrate the value of adding the monotone constraint. Section 5 returns to the motivated WAE-EC example and illustrates how to use the proposed method as an assessment tool. Section 6 concludes the paper with a brief summary and some discussions.
2
Literature Review
In the following subsections, we review the three components we use in developing the proposed method, including smoothing splines (in natural cubic splines), monotone smoothing splines (also in natural cubic splines), and smoothing splines for responses from exponential families.
2.1
Smoothing Splines - Natural Cubic Splines
Smoothing splines has been a very popular smoothing technique for decades. For computational purpose, we only review smoothing cubic splines as described in Green and Silverman (1994). For other aspects of smoothing splines, readers are referred to Wahba (1990), Eubank (1999), and Gu (2002) and the research works cited therein.
Consider the problem of fitting a curve from a set of noisy data {(x1, Y1), . . . , (xn, Yn)},
where xi ∈ [a, b], i = 1, . . . , n. Let S2[a, b] denote the space of functions that are
dif-ferentiable on [a, b] with absolutely continuous first derivative and square integrable
second derivative. Given any function g in S2[a, b], let the penalized sum of squares of
g be S(g) = 1 n n X i=1 {Yi− g(xi)} 2 + λ Z b a {g00(x)}2dx, (1)
where λ > 0 is the smoothing parameter controlling the tradeoff between the closeness
to data and the smoothness of the fitted curve. The smoothing spline estimate ˆg is
defined as the minimizer of the functional S(g) over all g ∈ S2[a, b].
Suppose the real numbers x1, . . . , xn are given on interval [a, b] such that a ≤ x1 <
x2 < . . . < xn ≤ b. A function g defined on [a, b] is called a cubic spline when
the following two conditions are satisfied: (i) g is a cubic polynomial on each of the
subintervals (a, x1), (x1, x2), . . . , (xn, b); (ii) the first and second derivatives of g are
continuous at each knot xi. If, in addition, the second and third derivatives of g are
zero at a and b, then g is said to be a natural cubic spline. More specifically, a natural
cubic spline is linear on the two boundary subintervals [a, x1] and [xn, b]. It is well
Wahba, 1990).
Suppose that g is a natural cubic spline on interval [a, b] with knots x1, . . . , xn.
Denote
gi = g(xi) and γi = g00(xi) for i = 1, . . . , n.
According to the definition of a natural cubic spline, the second derivative of g at x1
and xn are zero, that is, γ1 = γn = 0. Let g be the vector (g1, . . . , gn)
T
and γ be
the vector (γ2, . . . , γn−1)
T
. They constructed the following two matrices Q and R for
computation. Denote hi = xi+1− xi for i = 1, . . . , n − 1. Let Q be the n × (n − 2)
matrix with elements qij given by
qj−1,j = h−1j−1, qj,j = −h−1j−1− h −1
j , qj+1,j = h−1j ,
and qij = 0 for |i − j| ≥ 2 for i = 1, . . . , n and j = 2, . . . , n − 1. Note that the top
left element of Q is q12 and the bottom right element is qn,n−1. R is the symmetric
(n − 2) × (n − 2) band matrix with nonzero elements rij given by
rii = 1 3(hi−1+ hi) for 2 ≤ i ≤ n − 1, ri,i+1 = ri+1,i = 1 6hi for 2 ≤ i ≤ n − 2.
The matrix R is strictly diagonal dominant, that is, |rii| > Pj6=i|rij| for each i. It
follows that R is strictly positive definite. Define an n × n matrix K by
K = QR−1QT.
Green and Silverman (1994) proved that the vectors g and γ can specify a natural
cubic spline g if and only if the condition QTg = Rγ is satisfied. When this condition
holds, the roughness penalty satisfies Z b
a
{g00(t)}2dt = γTRγ = gTKg.
Return to the curve fitting problem. Since the smoothing spline estimator ˆg is a
need to search over a finite-dimensional class of functions, i.e., the natural cubic splines
with knots at the xi’s, instead of the infinite-dimensional space S2[a, b].
Let g be the natural cubic spline formed by the vectors g and γ, and matrices Q and R. Rewrite S(g) in terms of these vectors and matrices as follows. Let Y =
(Y1, . . . , Yn)T. Express the residual sum of squares about g as (Y − g)T(Y − g) and
the roughness penalty term R
{g00(x)}2dx as gTKg to obtain S(g) = 1 n(Y − g) T (Y − g) + λgTKg = 1 n n gT(I + nλK)g − 2YTg + YTYo.
Since K is non-negative definite, the matrix I + nλK is strictly positive definite. It therefore follows that S(g) has a unique minimum, which can be expressed as
g = (I + nλK)−1Y. (2)
Green and Silverman (1994) showed that the vector g can define the smoothing spline ˆ
g uniquely. That is, over the space of all natural cubic splines with knots xi, S(g) has
the unique minimum satisfying (2). Furthermore, the value of g(x) at any point x can
be specified by the vectors g and γ, where γ can be obtained by solving QTg = Rγ.
More specifically, on each subinterval [xi, xi+1], 1 ≤ i ≤ n − 1, it can be shown that
g(x) = (x − xi) gi+1+ (xi+1− x) gi hi −1 6(x − xi)(xi+1− x) 1 + x − xi hi γi+1+ 1 + xi+1− x hi γi for xi ≤ x ≤ xi+1.
If x is in the two boundary subintervals, by the fact that a natural cubic spline is linear on the boundary subintervals, we have
g(x) = g1− (x1− x)g0(x1) for x ≤ x1,
g(x) = gn+ (x − xn)g0(xn) for x ≥ xn,
where g0(x1) and g0(xn) are derivatives of g at x1 and xn, respectively, which can be
obtained by g0(x1) = g2− g1 x2− x1 − 1 6(x2− x1) γ2, g0(xn) = gn− gn−1 xn− xn−1 + 1 6(xn− xn−1) γn−1.
2.2
Monotone Smoothing Splines
Zhang (2004) proposed a simple and efficient monotone smoother based on smoothing spline estimation. The main idea is to impose a monotone constraint on the derivative of the estimated regression function.
Assume that data {(xi, Yi), i = 1, . . . , n} are sampled from the following
nonpara-metric regression model
Yi = f (xi) + i, i = 1, . . . , n,
where f (x) is an unknown smooth function with thrice continuous derivatives on
inter-val [a, b] and the random errors i’s are white noise with mean zero and standard
devi-ation σ. For simplicity, assume that the design points xi satisfy a ≤ x1 < . . . < xn ≤ b.
A smooth estimator of f can be defined as the minimizer ˆf of the following penalized
least squares criterion: 1 n n X i=1 {Yi− f (xi)}2+ λ Z {f000(x)}2dx, (3)
where again λ > 0 is the smoothing parameter.
To derive closed-form formulas for both ˆf (x) and its derivative, write f (x) in terms
of g(x) = f0(x) as
f (x) = f (x1) +
Z x
x1
g(u)du. (4)
Substitute (4) into (3) to obtain the following regularization criterion: 1 n n X i=1 {Yi− f (x1) − Z xi x1 g(x)dx} 2 + λ Z {g00(x)}2dx. (5)
Let ( ˆf (x1), ˆg) be the minimizer of (5).
By treating g as a natural cubic spline with knots xi’s for i = 1, . . . , n, Zhang
(2004) established the relationship between the function f and its derivative g using
the method of Green and Silverman (1994) and gave the closed-form formulas for ˆf
and ˆg, respectively, as in the following.
Denote fi = f (xi), gi = g(xi), and γi = g00(xi) for i = 1, . . . , n. Since g is a natural
γ = (γ2, . . . , γn−1)T. Let hi = xi+1−xi, i = 1, 2, . . . , n−1, and Q, R, K be the matrices
as defined in Green and Silverman (1994, pages 12,13) (also defined in Subsection 2.1). According to their Theorem 2.1, it can be shown that g is a natural cubic spline with
knots x1, . . . , xn if and only if γ = R−1QTg. As mentioned before,
K = QR−1QT and
Z b
a
g00(x)2dx = gTKg.
Denote the n-dimensional column vector of zeros by 0n . Let C = (c1, . . . , cn)T and
D = (d1, . . . , dn)T, where c1 = 0n, d1 = 0n−2, and, for i = 2, 3, . . . , n,
ci = (h1, h1+ h2, . . . , hi−2+ hi−1, hi−1, 0, . . . , 0)
T , di = (h31+ h 3 2, . . . , h 3 i−2+ h 3 i−1, h 3 i−1, 0, . . . , 0) T .
Note that ci and di are n × 1 and (n − 2) × 1 vectors, respectively. According to
Proposition 1 in Zhang (2004), the vector f can be expressed in terms of C, D, Q, R, and g as f = f (x1) · 1n+ { 1 2C − 1 24DR −1 QT}g.
To find the estimator of the function f , denote
M = 1 2C − 1 24DR −1 QT , ˜g = f1 g ! , ˜M = 1n M , and ˜K = 1 0 T n 0n K ! . Then the minimizer of the regularization problem (5) is
ˆ ˜ g = fˆ1 ˆ g ! = ( ˜MTM + nλ ˜˜ K)−1M˜TY, (6)
where Y = (Y1, · · · , Yn)T. We then have the estimator ˆf = ˜M ˆ˜g.
We remark that Zhang (2004) has a typo that the matrix ˜M is defined as an
(n + 1) × (n + 1) matrix ˜ M = 1 0 T n 1n M ! ,
which cannot be right since the vector Y in (6) is an n × 1 vector.
Zhang (2004) developed a simple method with an attempt to obtain to a monotone estimate. Without loss of generality, assume that f is decreasing, hence g is
this goal, Zhang (2004) replaced ˆg(x) by ˆg+(x) = max(ˆg(x), 0). He then estimated f by ˆ f = ˜M fˆ1 ˆ g+ ! , where ˆg+ = (ˆg+(x1), · · · , ˆg+(xn))T.
Unfortunately, the monotonicity of ˆf cannot be guaranteed. The reason is that the
i-th element of ˆf , denoted by ˆfi, is not exactly the value of ˆf (x1) +
Rxi
x1 gˆ+(u)du as
desired. Instead, it is ˆf (x1) +
Rxi
x1 g˜+(u)du, where the function ˜g+(·) is the natural cubic
spline interpolating {(xi, ˆg+(xi), i = 1, ..., n} and there is no guarantee that it would
be nonnegative.
2.3
Regression with Responses from Exponential Families
For the response variable Y from an exponential family distribution along with a co-variate x, consider the following conditional density function
f (y|x) = exp{yη(x) − q(η(x))
φ + c(y, φ)}, (7)
where q, and c are known functions, η(x) is the parameter function to be estimated, and φ > 0 is a known dispersion parameter independent of x. It is well known that
E[Y |x] = q0(η(x)) = µ(x) and V ar[Y |x] = q00(η(x))φ.
Assume the responses Yi corresponding to the covariate xi, i = 1, · · · , n, are i.i.d.
from the exponential family distribution (7). Then the penalized log-likelihood func-tional can be expressed as
−1 n n X i=1 {Yiη(xi) − q(η(xi))} + λ 2J (η), (8)
where J (λ) is the roughness penalty. Gu (2002) showed that the minimizer of (8) can
be computed via the Newton iteration, which updates ˜η (the η obtained in the last
iteration) by the minimizer of the following penalized weighted least squares functional 1 n n X i=1 ˜ ωi{ ˜Yi− η(xi)} 2 + λJ (η),
where ˜Yi = ˜η(xi) − ˜ui/ ˜wi with ˜ui = −Yi + q0(˜η(xi)) and ˜wi = q00(˜η(xi)). Note that
˜
3
Proposed Method
3.1
Penalized Weighted Least Squares Function Estimator for
Exponential Families
Consider the nonparametric regression for a generalized linear model specified within
the exponential family. The data observed are i.i.d. samples (xi, Yi), for i = 1, . . . , n,
from an exponential family distribution. The covariate variable xi is within some
interval [a, b] and the density of Yi is
f (y|x) = exp{yη(x) − q(η(x))
φ + c(y, φ)},
where q and c are known functions, and φ > 0 is either known or considered as a nuisance parameter. The unknown parameter function η(x) is the central aim of esti-mation. In our study, η(x) needs to meet both smoothness and monotone conditions on interval [a, b]. Instead of maximizing the log-likelihood, we choose to minimize the following penalized log-likelihood functional
−1 n n X i=1 {Yiη(xi) − q(η(xi))} + λ 2 Z b a [D(m)η(x)]2dx, (9)
where λ is the smoothing parameter.
Gu (2002) gave a quadratic approximation of −Yiη(xi) + q(η(xi)) at ˜η(xi) as
1 2ω˜i{η(xi) − ˜η(xi) + ˜ ui ˜ wi } 2 + Ci,
where ˜ui = −Yi + q0(˜η(xi)), ˜ωi = q00(˜η(xi)), and Ci is not related to η(xi). Without
imposing the constraint that η(x) is monotonic, the minimizer of the penalized log likelihood functional (9) can be obtained by recursively finding the minimizer of the penalized weighted least squares functional
l = 1 n n X i=1 ˜ ωi{ ˜Yi− η(xi)} 2 + λ Z b a [D(m)η(x)]2dx (10)
3.2
Monotone Natural Spline Estimator
To impose the monotone constraint in the estimation of η(x), we focus on the estimation of the first derivative of η(x). Since η(x) is non-increasing (or non-decreasing), the
derivative η0(x) is non-positive (or non-negative). There are reasons for dealing with
η0(x). Firstly, in practice, it is easier to impose non-positiveness (or non-negativeness)
on η0(x) than to impose monotonicity on η(x). Secondly, we can derive a closed-form
formula for η0(x) easily by the property of natural cubic splines as given in Green and
Silverman (1994).
Assume that xi, i = 1, . . . , n, are real numbers on interval [a, b], which satisfy
a ≤ x1 < x2 < . . . < xn ≤ b. Since we are more focused on the smoothness of η0(x)
than that of η(x), it is natural to choose m = 3 in the roughness penalty functional of (10). For any x ∈ [a, b],
η(x) = η(a) + Z x
a
g(u)du. Substituting the above expression into (10), we have
l = 1 n n X i=1 ˜ ωi{ ˜Yi− η(a) − Z xi a g(x)dx} 2 + λ Z b a [g00(x)]2dx. (11)
We remark that Zhang (2004) specified that ˆg, the minimizer of (5), is a natural
cubic spline; however, according to Wahba (1990), ˆg should be a piecewise quartic
polynomial in [x1, xn] and linear in the two boundary subintervals if the function space
to search for the minimizer is S2[a, b].
Mainly for the computational purpose and also for simplicity, we shall restrict the
function space to search for the minimizer of (11), denoted by ˆg, to be the class of
natural cubic splines. That is, ˆg is set to be a natural cubic spline. We remark that
the class of natural cubic splines is a smaller space than S2[a, b] but it is rich enough
for approximating any reasonably smooth function.
Using the same notation as in Green and Silverman (1994), let η = (η1, . . . , ηn),
γ = (γ2, . . . , γn−1)
T
, g = (g1, . . . , gn)
T
, where ηi = η(xi), gi = g(xi), and γi = g00(xi)
Zhang (2004) by expressing η(x) as η(x) = η(a)+Rx
a g(u)du instead of η(x1)+
Rx
x1g(u)du.
Then the matrices of C and D need to be modified as well. Denote η(a) by ηa.
Let hi = xi+1−xi, for i = 1, . . . , n−1, h0 = x1−a, k1 = h0(2+h0/h1), k2 = −h02/h1,
and k3 = −2h20h1. Let the matrices Q, R, and K be defined as in Subsection 2.1.
Modify the matrices C and D by replacing ci with ci + k1 · e1n+ k2 · en2 and di with
di+ k3· en1, where eki is the k-dimensional unit vector with the i th element being 1 and
0 elsewhere. In words, the matrix C is modified by adding k1 to the first column and
k2 to the second column. Similarly, D is modified by adding k3 to the first column.
Let M = 12C − 241DR−1QT as before.
Proposition 1 gives the relationship between η and g. All the proofs in this paper are given in the Appendix A.
Proposition 1 η = ηa· 1n+ M g.
The main reason that we consider η(a) rather than η(x1) is that the matrix M
constructed in this way is invertible while the matrix M in Zhang (2004) is not. Another advantage is that η(a) can be specified by the vector g, utilizing the fact that a natural
cubic spline is linear on the boundary subinterval [a, x1], see the Appendix A.
Proposition 2 shows that the matrix M in Proposition 1 is invertible. In the follow-ing, when the dimension of a matrix or vector helps readfollow-ing, we will add the dimension
as the subscript in the notation. For example, Mn,n denotes the n by n matrix M .
Proposition 2 Mn,n is invertible for all n ≥ 3.
According to Proposition 1, the vector η can be specified by g and ηa. Given ηa,
(11) can be written in matrix form as
l = 1 n( ˜Y − M g) T ˜ W ( ˜Y − M g) + λgTKg = 1 n{g T(MTW M + nλK)g − 2 ˜˜ YTW M g + ˜˜ YTW ˜˜Y}, (12) where ˜Y = { ˜Y1− ηa, . . . , ˜Yn− ηa} T and ˜W = diag(˜ω1, . . . , ˜ωn).
Since K is semi-positive definite, the matrix MTW M is strictly positive definite by˜
at
ˆ
g = (MTW M + nλK)˜ −1MTW ˜˜Y. (13)
Suppose the parameter function ˆη is monotone increasing. Then ˆg must be non-negative
everywhere. To achieve this, we follow the same approach adopted by Zhang (2004)
by setting ˆg(xi) to 0 when ˆg(xi) < 0. Let ˆg+= (ˆg+(x1), · · · , ˆg+(xn))T, where ˆg+(xi) =
max(ˆg(xi), 0). We then use ˆg+ to construct vector ˆη instead of using ˆg. That is,
ˆ
η = ˆηa· 1n+ M ˆg+. (14)
It is interesting to observe that M ˆg+ can be conceived as integrating the natural cubic
spline that interpolates the points {(xi, ˆg+(xi)), i = 1, . . . , n}. Denote the interpolating
natural cubic spline by ˜g+. Unfortunately, when ˆg+(xk) = ˆg+(xk+1) = 0 for some k, it
is impossible to have ˆg+(x) ≥ 0 for all x ∈ [xk, xk+1]. Then the monotonicity of ˆη is
lost. To remedy this problem, we modify the estimate to ensure the monotonicity as follows.
For xi ≤ x ≤ xi+1, i = 1, ..., n − 1, let
ˆ ηmon(x) = ˆη(a) + i X j=1 τj Z xj xj−1 ˆ g+(u)du + τx Z x xi ˆ g+(u)du,
where τi is an indicator function defined as 1 if ˆηi > ˆηi−1 and 0 otherwise and τx = 1
when Rx
xigˆ+(u)du > 0 and 0 otherwise. Note that the value of
Rx
xiˆg+(u)du can be
obtained as described in Subsection 2.1 since ˆg+ is a natural cubic spline. It is obvious
ˆ
ηmon is monotonic.
For computational purpose, we shall express ˆηmon = (ηmon,1, ..., ηmon,n)T in matrix
form. Let ˜M be the (n + 1) × (n + 1) matrix given by
˜ M = 1 0 T n 1n M ! .
Let N be the n × (n + 1) matrix with elements nij, 1 ≤ i, j ≤ n, given by nii = −1,
ni,i+1 = 1 for 1 ≤ i ≤ n and nij = 0 elsewhere. Let S = (s1, s2, . . . , sn) be the n × n
matrix given by si = (0Ti−1, τi · 1Tn−i+1)
T
for i = 1, . . . , n.
Proposition 3 ηˆmon= ˆηa· 1n+ SN ˜M ¯g , where ¯g =
ˆ ηa ˆ g+ ! .
Note that N ˜M ¯g = N ˜M ηˆa ˆ g+ ! = N ˆ ηa ˆ η1 .. . ˆ ηn = ˆ η1− ˆηa ˆ η2− ˆη1 .. . ˆ ηn− ˆηn−1 ,
which computes the integral Rxi+1
xi ˜g+(u)du for each subinterval. The effect of S is to
accumulate these integrals up to xi but skip those integrals Rxxii+1g(u)du for which
ˆ
ηi ≤ ˆηi−1.
3.3
Algorithm
We propose using the back-fitting approach to obtain estimators ˆηa and ˆg+. For
back-fitting, see Hastie and Tibshirani (1990). Recall that the log-likelihood function for the exponential family is
l = −1 n n X i=1 {Yiη(xi) − q(η(xi))} + λ 2 Z b a [g00(x)]2dx, (15)
where η(xi) = ηa+ M [i, .]g, if g is a natural cubic spline. Given g, we can get a suitable
value of ηa by solving the equation
∂l ∂ηa = −1 n n X i=1 {Yi− ∂q(η(xi)) ∂ηa } = 0.
Substitute the new ηa into (12) to obtain the new estimate of g by (13). Thus,
re-peat the iteration until the value of ηa converges. Since the equation ∂l/∂ηa = 0 is
non-linear, we propose using the Newton-Ralphson method to get the new estimate of η(a). For details of the Newton-Ralphson method, see, for example, Burden and Faires (2001).
Back-fitting Algorithm Step 1
Given the covariate (x1, . . . , xn), calculate the n × n matrix C, n × (n − 2) matrices
Q and D, (n − 2) × (n − 2) matrix R, then compute the matrices K = QR−1QT and
Step 2
Begin with iteration k = 0. Set ˆg(0) = (ˆg(0)
1 , . . . , ˆgn(0))
T
and ˆη(0)
a to some initial values.
Set the tolerance level T. Step 3
Given ˆg(k), update ˆηa(k) by ˆηa(k+1), the minimizer of (15), with the Newton-Ralphson
method. Step 4
Construct the n × n diagonal matrix ˜W and ˜Y with ˆg(k) and ˆη(k+1)
a . Then update ˆg(k) by ˆg(k+1) = (MTW M + nλK)˜ −1MTW ˜˜Y. Step 5 If ˆ η(k+1)(a)−ˆη(k)(a) ˆ η(k)(a)
< T , stop iterating and set vector ˆη = ˆη
(k+1)
a · 1n+ SN ˜M ¯g+.
Oth-erwise, set ˆg(k)← ˆg(k+1)
+ and ˆη(k)a ← ˆη(k+1)a , return to Step 3.
4
Simulation Studies
4.1
Performance Evaluation
To evaluate the effectiveness of the proposed method, we apply the monotone regression smoother on some data generated from exponential family models, including Bernoulli and Poisson data as illustrative examples. The smoothing parameter λ is chosen by the Generalized Cross Validation (GCV) method proposed by Craven and Wahba (1979).
4.1.1 Bernoulli Data
n observations, {(xi, Yi), i = 1, · · · , n}, are generated independently, in which xi’s are
generated independently from interval [0, 1] uniformly and Yi is generated from the
Bernoulli distribution with the probability function P (Yi = 1|xi) = p(xi), where p(x)
the Bernoulli response Y given the covariate x can be written as f (y|x) = exp{y η(x) − log(1 + exp(η(x)))}, where η(x) = log(p(x)/(1 − p(x))).
As an illustrative example, we choose n=200 and p(x) = 1−(1−x4.5)2.5for x ∈ [0, 1].
That is, xi ∼ U (0, 1), Yi ∼ Bernoulli(p(xi)), i = 1, . . . , 200. The simulation results
are displayed in Figure 2. The solid line is the target function p(x), the dotted line is the estimated smooth curve under monotone constraint with smoothing parameter λ = 0.00005 chosen by GCV. It is observed that the estimated function is fairly close to the target function for this example.
In Figure 2(a), the dots are raw data {(xi, Yi)}, while the dots in Figure 2(b) are
binned data. For binning data, the interval [0, 1] is divided into 25 equally spaced subintervals. For each subinterval, we count the points and calculate the proportion of the 1’s in that subinterval. The points plotted on Figure 2(b) are the proportion
of 10s versus the midpoint of the corresponding subinterval. While it is hard to read
from the raw data the information of p(x), the binned data can follow the trend of the underlying target function p(x) pretty well.
4.1.2 Poisson Data
The conditional density of the Poisson distribution can be expressed as
f (y|x) = φ(x)
y
exp (−φ(x))
y! = exp{yη(x) − e
η(x)− log(y!)},
where η(x) = log φ(x). Assume the target function is
φ(x) = log(x2+ 1) for 1 ≤ x ≤ 3.
200 covariates {xi} are generated independently from U (1, 3), and the response Yi
follows Poisson(φ(xi)), for 1 ≤ i ≤ 200. Figure 3 shows the fitting results. In Figure 3,
the solid line is the target function φ(x) and the dotted line is the estimated parameter
see how well the GCV estimate performs, we also show the “optimal” estimated curve
(the dash-dot line) for which λopt = 0.5 about this example is the λ minimizing the
averaged squared error (ASE) defined as
ASE( ˆφ(x)) = 1 n n X i=1 ˆ φ(xi) − φ(xi) 2 .
Note that ˆφ(·) depends on λ.
Again, the dots in Figure 3(a) are raw data {(xi, Yi)} while the dots in Figure 3(b)
are the binned data. When comparing the two estimated curves in terms of ASE, we
find the values of ASE are 0.01190 and 0.01189 for λopt and λGCV, respectively. There
is no obvious difference between the curves corresponding to λ = 1.5 and λ = 0.5. Consider another example in which the curvature of the mean function has more variation. Let φ(x) = 1 e − e −x + x −sin(π x) π for 1 ≤ x ≤ 3.
Similarly, generate 200 xi’s from U (1, 3) randomly and Yi’s accordingly. λGCV = 0.25
while λopt = 2.5 × 10−5 by minimizing ASE criterion. The results are shown in Figure
4. The estimated curve with λGCV has obvious departures from the target function φ.
On the other hand, the “optimal” curve with λopt captures the main trend of the target
function. In addition, the ASE is 0.008 for λopt and is 0.031 for λGCV. Note that, as
observed from Figure 4(b), the binned data are so noisy that it is hard to expect a data-driven method like GCV would perform well.
4.2
Monotone vs. Constraint-Free Smoothing Spline
Estima-tor
We are interested in knowing whether adding the monotone constraint is value-added in estimation of monotone functions. In other words, if the underlying function is monotonic, would a regular (unconstrained) smoother performs poorer than or as well as a constrained smoother? Also, how often a regular smoother would produce a non-monotone estimate when the true function is monotonic? To answer these questions,
we compare the proposed method with the method given in Gu (2002) by a simulation study.
Consider the Bernoulli example in Subsection 4.1.1. The true parameter function
under study is of the form p(x) = 1 − (1 − xα)β, where α and β control the shape and
the speed of going upward of the function. When α ≥ 1 and β ≥ 1, p(x) is increasing for all x ∈ [0, 1]. If β is much larger than α, the curve p(x) climbs up to 1 rapidly. On the other hand, if β is much smaller than α, the curve reaches 1 slowly. Thus, three settings are studied: (α, β) = (1.98, 28), (6.28, 17.67), and (6.9, 1.1) (in the order from fast to slow). Figure 1 shows these three functions. The effect of the sample size is also studied with n = 50, 100, and 200.
Let ˆpm(x) denote the estimate under the monotone constraint and ˆpg(x) denote the
unconstrainted estimate developed by Gu (2002). Define ASE(ˆp) = 1nPn
i=1(ˆp(xi) − p(xi))2
and calculate ASE(ˆpm) and ASE(ˆpg) for the same data set. For each setting of (α, β)
and n, repeat the trial 10000 times. Table 1 displays the percentage of monotone ˆ
pg in 10000 trials. It is observed that Gu’s estimate (ˆpg) tends to produce more
non-monotone estimates for smaller n. Table 2 shows the proportion of ASE(ˆpg) ≥
ASE(ˆpm). We see that, for all n under study, when the true function rises slowly, ˆpm
performs better than ˆpg, while ˆpg performs better than ˆpm for functions rising rapidly.
And for all three target functions, the performance of ˆpm gets better when the sample
size n gets larger. For illustration, Figure 5 shows some cases with monotone ˆpg and
some with non-monotone ˆpg along with the corresponding ˆpm.
For distribution comparison, boxplots of the 10000 ASE(ˆpg) and the corresponding
10000 ASE(ˆpm) are displayed in Figures 6-8. The sample quartiles of these estimates
are given in Table 3. It seems there is no significant difference between the distributions
of ASE(ˆpg) and ASE(ˆpm). In summary, ˆpm performs as well as ˆpg in terms of ASE,
while ensuring monotonicity.
To see how bad (well) the unconstrained smoother can be when it produces a non-monotone (non-monotone) estimate, we generate 10000 cases of non-non-monotone (non-monotone)
ˆ
10000 ˆpg versus the corresponding 10000 ˆpm under the condition that ˆpg is monotonic
(in panel (a)) or not monotonic (in panel (b)). It is clear to see from these figures that
the constrained smoother outperforms the unconstrained one when ˆpg is not
mono-tonic, while performing as well as ˆpg when ˆpg is monotonic. In summary, constraining
monotonicity prevents the chance of poor estimation while attaining about the same performance for cases the unconstrained method performs well.
5
Examples
Return to the motivated WAT-EC example described above. In this section, we demon-strate the proposed method can be useful in setting appropriate EC limits to achieve better discrimination power and in the process of screening WAT test items for further engineering control.
For the first purpose, we generate responses Yi’s directly by checking if the
measure-ments are within the control limits. Suppose that 200 wafers are taken from some lots and measured the voltage at some testkeys on each wafer. Consider the relationship
between the Cp yield and the mean voltage µ(Cp,i) for the i-th wafer satisfying
µ(Cp,i) = 1 + 0.1 exp(−1.5 Cp,i3 ) Ti,
where Cp,i is the yield of the i-th wafer generated randomly from the beta distribution
Beta(8, 2) and Ti is a random variable with probability P (Ti = 1) = P (Ti = −1) = 0.5,
for i = 1 . . . , 200. Beta(8, 2) is chosen to mimic the reality because it is skewed to the
left. The function µ(Cp) indicates that the mean voltage approaches to the target
voltage level 1 as Cp increases to 1. Suppose the distribution of voltage for the i-th
wafer follows the normal distribution with mean µ(Cp,i) and standard deviation 0.1. 9
measurements are generated for each wafer and the mean voltage ¯viis computed. Figure
12 presents a histogram of these 200 mean voltages, for which the sample mean is 1.005 and sample standard deviation is 0.06. Suppose an engineer sets the upper (UCL) and
lower control limits (UCL) at LCL=0.88 and UCL=1.12, respectively. Define Yi = 1 if
¯
rate for Cp yield is estimated and displayed in Figure 13(a). The passing rate curve
seems to lack the discrimination power for Cp ≥ .6 while showing high discrimination
power for Cp in (0.4, 0.5). Such an EC performance curve is not useful for most of the
current processes in IC industries If this test item is critical in nature, the undesirable low discrimination power may be caused by the ad hoc choices of control limits. To demonstrate this, we reset the control limits to LCL=.93 and UCL=1.07. Then the EC performance curve displayed in Figure 13(b) indicates a fairly good discrimination power in the range of (.4, 1). A further reduction of the control limits to LCL=.97 and UCL=1.03 is too stringent, since the passing rate drops down to about 60% or below
even for very high Cp.
Moreover, consider the case that there is no significant relationship between the Cp
yield and voltage. Assume Cp,i ∼ Beta(8, 2) and mean voltage µi ∼ Beta(10, 10) + 0.5,
for i = 1, . . . , 200, and they are independent. Generate 200 ¯vi accordingly as described
above. For these 200 ¯vi, the sample mean is 0.99 and sample standard deviation is
0.10. Let UCL=1.1 and LCL=0.9. The corresponding passing rate presented in Figure
13(d) is flat for most of the Cp range. If we change the control limits, the resulting
passing rate changes only in the level but has a similar pattern. A test items with such kind of EC performance curve indicates the test results may be not so much related to
the Cp yield. Including this test item for EC may merely increase the number of false
alarms. Consequently, this test item should not be chosen for engineering control.
6
Conclusions
In this study, motivated by the WAT-EC problem, we develop a nonparametric mono-tone smoothing spline smoother for analyzing responses from exponential families by combining the methodologies provided in Gu (2002) and Zhang (2004) along with our modification. An algorithm with implementation details is provided. Computation is efficient because we utilize the characteristics of the natural cubic splines. The simu-lation results demonstrate that the proposed method performs well in the regression
models with both the Bernoulli and Poisson responses. When the “true” function is monotonic, the proposed monotone estimator performs about the same as the uncon-strained smoother in terms of the averaged squared error for the cases when the latter performs well. On the other hand, constrained smoother outperforms the unconstrained smoother when the unconstrained smoother produces non-monotone estimates. Thus, the choice is obvious. If the function is monotonic in nature, then we should choose the method with the monotone constraint imposed.
As an illustrative example, we demonstrate the proposed method can be used in screening test items for engineering control and in setting appropriate control limits.
With the nonparametric regression in nature, the proposed method has the great advantage of model flexibility. Also since the method can be applied to all kinds of data as long as they follow the exponential family, it can find many potential applications in areas such as industries, medicine, education, social studies, and so on.
A
Appendix: Proofs
Proposition 1.
η = ηa· 1n+ M g.
Proof. Define xi− xi−1= hi−1, for i = 2, . . . , n, and let h0 = x1− a. Since the function
g is linear in subinterval [a, x1], it shows that
Z x1
a
g(u)du = h0(g1 + g(a))
2 . (16)
By Green and Silverman (1994),
g01 = g2− g1 x2− x1 − 1 6(x2− x1)γ2 and g(x) = g1− (x1− x)g10 for x ≤ x1.
Then g(a) can be expressed in terms of g1, g2, and γ2 as
g(a) = (1 +h0 h1 )g1− h0 h1 g2+ 1 6h0h1γ2.
Substituting the above expression into (16), we get Z x1 a g(u)du = 1 2{h0(2 + h0 h1 )g1 − h02 h1 g2} − 1 24{−2h0 2 h1γ2}. (17)
According to the proposition in Zhang (2004), Z xi x1 g(x)dx = 1 2c T i g − 1 24d T i γ for 1 ≤ i ≤ n,
where g and ci are n × 1 vectors, γ and di are (n − 2) × 1 vectors. ci and di are
the same as that in Zhang (2004). Denoting k1 = (2 + h0/h1)h0, k2 = −h02/h1, and
k3 = −2h02h1, we have ηi = η(a) + Z xi a g(x)dx = η(a) + Z x1 a g(x)dx + Z xi x1 g(x)dx = η(a) + 1 2(k1g1 + k2g2+ c T i g) − 1 24(d T i γ + k3γ2) = η(a) + 1 2(ci+ k1· e n 1 + k2· en2) T g − 1 24(di+ k3· e n−2 1 ) T γ, (18) where ek
i is the k-dimensional unit vector with the i th element being 1 and 0 elsewhere.
Proposition 2. Mn,n is invertible for all n ≥ 3.
Proof. Let Gn,n = 12Cn,nand Hn,n−2 = 241Dn,n−2. Then Mn,n = Gn,n−Hn,n−2Rn−2,n−2−1 QTn−2,n.
It suffices to show that det(Mn,n) 6= 0. Define the (2n − 2) × (2n − 2) matrix N2n−2 by
N2n−2 = " Gn,n Hn,n−2 QTn−2,n Rn−2,n−2 # . By the properties of block matrix decomposition,
" Gn,n Hn,n−2 QT n−2,n Rn−2,n−2 # = " I HR−1 O I # " Mn,n O O R # " I O R−1QT I # .
Since the matrix R is invertible, we then know that det(Mn,n) ∝ det(N2n−2). Thus, we
only need to prove that det(N2n−2) 6= 0 for all n ≥ 3. More specifically, we simplify
the matrix N by some elementary matrix operations into the matrix N0
det(N2n−2) ∝ det(N2n−20 ) = G0n,n Hn,n−20 Q0T n−2,n R 0 n−2,n−2 , (19)
where G0n,n is the matrix with entries gi,j given by gi,i = gi,i+1 = h−1i−1 for 2 ≤ i ≤ n,
g1,1 = k1/2, g1,2 = k2/2, and gi,j = 0 elsewhere; the matrix Hn,n−20 has elements
νi,j, for 2 ≤ i ≤ (n − 1), νi,i−2 = νi,i−1 = h12i−1, ν1,1 =
−h2 0h1 12 , ν2,1 = h1 12,νn,n−2 = hn−1 12 , and 0 elsewhere; Q 0T
n−2,n is the (n − 2) × n matrix with elements qi,j, where
qi,i+1 = −2h−1i − 2h −1
i+1 for 1 ≤ i ≤ (n − 2) and 0 elsewhere; the matrix R
0
n−2,n−2 is
a symmetric band matrix with entries ri,i, ri,i = 14(hi + hi+1) for 1 ≤ i ≤ (n − 2),
ri,i+1 = ri+1,i = 121hi+1 for 1 ≤ i ≤ (n − 3), and 0 otherwise. When n = 2, N20 is a
2 × 2 matrix, where the first row is (k1/2, k2/2) and the second row is (h−11 , h
−1
1 ). It is
clearly that det(N20) > 0 since k1 > k2.
According to Lemma 1 stated below, for all n ≥ 3, the determinant of N2n−20 can
be expressed in terms of N2(n−1)−20 by
det(N2n−20 ) = 1
4det(N
0
2(n−1)−2) + Kn,
where Knis a positive number. Since det(N20) > 0, it follows clearly that det(N
0
2n−2) > 0
For illustration, Appendix B gives N2n−20 for n = 4 and 5 and expresses N2∗5−20 in
terms of N2∗4−20 .
Lemma 1. For the square matrix N2n−20 defined by (19), for n ≥ 3,
det(N2n−20 ) = 1
4det(N
0
2(n−1)−2) + Kn, (20)
for some Kn> 0.
Proof. We will show the equality (20) by induction. Step 1:
When n = 2, it is clear that det(N20) = h0 (h0+ h1)/h12 > 0.
When n = 3, det(N40) = h0(3h0(h1+ h2) + h1(2h1+ 3h2)) 12h2 1h22 = h0 (3 h0+ 3 h1) 12 h12 +h0 3 h0h12+ 2 h13 12 h12h22 + h0 6 h0h1+ 5 h12 12 h12h2 = 1 4det(N 0 2) + K3, where K3 = h0(3 h0h12+2 h13) 12 h12h22 + h0(6 h0h1+5 h12)
12 h12h2 > 0. Thus (20) holds for n = 3.
When n = 4, we calculate the determinant of N2∗4−20 . It follows that
det(N60) = 1 4det(N 0 4) + K4, where K4 = h0 9 h0h12h22+ 6 h13h22+ 12 h0h1h23+ 10 h12h23+ 4 h0h24+ 4 h1h24 144 h12h22h32 + h0 18 h0h12h2+ 12 h13h2 + 30 h0h1h22+ 25 h12h22+ 12 h0h23+ 12 h1h23 144 h12h22h3 .
It is clear that (20) holds since h0, h1, h2, h3 > 0.
Step 2:
We would like to prove (20) for general n by induction. For a fixed k > 5, assume
When n = k, consider the matrix form for N2k−20 N2k−20 = " G0k,k Hk,k−20 Q0T k−2,k R 0 k−2,k−2 # .
By the structures of G0,H0,Q0, and R0, we can write the determinant of N2k−20 in terms
of N2(k−1)−20 . More specifically, we interchange some column vectors of N2k−20 , such that the i-th column vector becomes the (i−1)-th column vector for i = k+1, . . . , 2k−1, and the k-th column vector becomes the (2k − 1)-th column vector by some permutation operations. Do the same thing with the row vectors. We then obtain
det(N2k−20 ) = N2k−40 U2k−4,2 V2,2k−4 T2,2 (21)
for some matrices U2k−4,2, V2,2k−4, and T2,2. The matrix in (21) can be transformed into
a lower triangle with some elementary matrix operations. More specifically, the block matrix in (21) can be written as
" N2k−40 U2k−4,2 V2,2k−4 T2,2 # = " N2k−40 O2k−4,2 V2,2k−4 T2,20 # · " I −N2k−40−1 U2k−4,2 O I # .
We then obtain det(N2k−20 ) = det(N2k−40 )·det(T2,20 ). The matrix T2,20 is an upper triangle
matrix, where T2,20 [1, 1] = h−1k−1 and
T2,20 [2, 2] = hk−1+ hk−2 4 − h2 k−2 12 N 0−1 2k−4[2k − 4, 2k − 4] + (2h−1k−2+ 2h−1k−1) ( h2 k−2 12 + hk−2 12 N 0−1 2k−4[k − 1, 2k − 4] ) . (22)
By induction hypothesis, det(N2n−20 ) = 1
4det(N
0
2(n−1)−2) + Kn holds for n = k − 1 and
n = k − 2, and det(N2(k−3)−20 ) > 0, we get N2k−40−1 [2k − 4, 2k − 4] < 4/hk−2 by Lemma 2,
and N2k−40−1 [k − 1, 2k − 4] > −hk−2/3 by Lemma 3. It follows that
T2,20 [2, 2] > hk−1+ hk−2 4 − h2 k−2 12 4 hk−2 + (2h−1k−2+ 2h−1k−1) ( h2 k−2 12 − hk−2 12 hk−2 3 ) > h 2 k−2 18 (2h −1 k−2+ 2h −1 k−1) − hk−2 3 + hk−1+ hk−2 4 = hk−2 36 + h2 k−2 9hk−1 + hk−1 4 > hk−1 4 . Then det(N2k−20 ) = T2,20 [2, 2] 1 hk−1 det(N2k−40 ) = hk−1 4 + T 0 2,2[2, 2] − hk−1 4 !! 1 hk−1 det(N2k−40 ).
Define Kk = T2,20 [2, 2] − hk−1 4 det(N0 2k−4) hk−1 . Since T 0 2,2[2, 2] > hk−1 4 and det(N2k−4) > 0, then det(N2k−20 ) = 1 4det(N 0 2k−4) + Kk > 0 as required.
Lemma 2. If det(N2(k−1)−20 ) = 14det(N2(k−2)−20 ) + Kk−1 is satisfied, then
N2k−40−1 [2k − 4, 2k − 4] < 4
hk−2
for k ≥ 4.
Proof. It is known that the elements of the inverse of N2k−4 can be expressed in terms
of its cofactor matrix.
N2k−40−1 [2k − 4, 2k − 4] = det(cof actor(N
0
2k−4[2k − 4, 2k − 4]))
det(N2k−40 ) .
By the definition of N2k−40 , it is easy to see that
det(cof actor(N2k−40 [2k − 4, 2k − 4])) = 1
hk−2
det(N2(k−2)−20 ).
Since det(N2(k−1)−20 ) = 14det(N2(k−2)−20 ) + Kk−1 is satisfied by induction hypothesis, we
obtain N2k−40−1 [2k − 4, 2k − 4] = 1 hk−2det(N 0 2(k−2)−2) 1 4det(N 0 2(k−2)−2) + Kk−1 = 4 hk−2+ Y < 4 hk−2 as required, since Y = 4∗Kk−1 (hk−2det(N2k−40 )) > 0.
Lemma 3. If det(N2n−20 ) = 14det(N2(n−1)−20 ) + Kn is satisfied for n = k − 1, n = k − 2,
and det(N2n−20 ) > 0 for n = k − 3, then
N2k−40−1 [k − 1, 2k − 4] > −hk−2
3 for k ≥ 5.
Proof. Since N0N0−1 = I, it is obvious that N0[k − 1, ·]N0−1[·, 2k − 4] = 0, where
N0[k − 1, ·] is the (k − 1)-th row of N0 and N0−1[·, 2k − 4] is the (2k − 4)-th column of
N0−1. More specifically, it can be shown that
N2k−40−1 [k − 2, 2k − 4] + N2k−40−1 [k − 1, 2k − 4] = −h 2 k−2 12 N 0−1 2k−4[2k − 4, 2k − 4].
Consider the cofactor of N2k−40 [k − 2, 2k − 4]. By the structure of N2k−40 , we obtain
N2k−40−1 [k − 2, 2k − 4] ∝ det(cof ator(N2k−40 [k − 2, 2k − 4])) = − hk−3
12hk−2
det(A), where A is a (2k − 6) × (2k − 6) matrix and all elements in A are the same as the
elements in N2k−60 except A[2k − 6, k − 3] and A[2k − 6, 2k − 6], where A[2k − 6, k − 3] =
−2h−1k−2−3h−1k−3 < N2k−60 [2k−6, k−3] and A[2k−6, 2k−6] = hk−3
4 + hk−2
6 . Similarly, using
elementary matrix operations, the det(N2k−60 ) and det(A) can be formed by N2(k−3)−20 .
We obtain that det(N2k−60 ) = N2k−80 O2k−8,2 V2,2k−80 X2,2 and det(A) = N2k−80 O2k−8,2 V2,2k−80 Y2,2 ,
where both of X and Y are upper triangle matrices with X[1, 1] = Y [1, 1] = h−1k−3, and
X[2, 2] = hk−2 4 + hk−3 4 − c − N 0 2k−6[2k − 6, k − 3]d, Y [2, 2] = hk−2 6 + hk−3 4 − c − A[2k − 6, k − 3]d, where c = h 2 k−3 12 N 0−1 2k−8[2k − 8, 2k − 8] and d = h2 k−3 12 + hk−3 12 N 0−1 2k−8[k − 4, 2k − 8]. Since
det(N2k−60 ) = det(N2k−80 ) · X[1, 1] · X[2, 2] = 14det(N2k−80 ) + Kk−2 > 0 is satisfied, we
find that Kk−2= h−1k−3 hk−3 4 − c − N 0 2k−6[2k − 6, k − 3]d ! > 0. It follows that hk−3 4 − c − A[2k − 6, k − 3]d > hk−3 4 − c − N 0 2k−6[2k − 6, k − 3]d > 0.
Then det(A) = det(N2k−80 ) · Y [1, 1] · Y [2, 2] > 0 and N2k−40−1 [k − 2, 2k − 4] < 0. Since
N2k−40−1 [2k − 4, 2k − 4] < 4
hk−2 is satisfied by Lemma 2, we obtain the the inequality
N2k−40−1 [k − 1, 2k − 4] > −hk−2
B
Appendix: The Construction of N
2n−20in
Propo-sition 2
When n = 4, we find N2∗4−20 = h0 2+h0 h1 2 −h02 2 h1 0 0 −(h02h1) 12 0 1 h1 1 h1 0 0 h1 12 0 0 h1 2 1 h2 0 h2 12 h2 12 0 0 h1 3 1 h3 0 h3 12 0 −2h 1 − 2 h2 0 0 h1+h2 4 h2 12 0 0 −2h 2 − 2 h3 0 h2 12 h2+h3 4 . If n = 5, then N2∗5−20 = h0 2+h0h1 2 −h02 2 h1 0 0 0 −(h02h1) 12 0 0 1 h1 1 h1 0 0 0 h1 12 0 0 0 h1 2 1 h2 0 0 h2 12 h2 12 0 0 0 h1 3 1 h3 0 0 h3 12 h3 12 0 0 0 h1 4 1 h4 0 0 h4 12 0 −2h 1 − 2 h2 0 0 0 h1+h2 4 h2 12 0 0 0 −2h 2 − 2 h3 0 0 h2 12 h2+h3 4 h3 12 0 0 0 −2h 3 − 2 h4 0 0 h3 12 h3+h4 4 .By some permutation operations, det(N2∗5−20 ) can be expressed in terms of det(N2∗4−20 ).
That is, det(N2∗5−20 ) = h0 2+h0h1 2 −h02 2 h1 0 0 −(h02h1) 12 0 0 0 1 h1 1 h1 0 0 h1 12 0 0 0 0 h1 2 1 h2 0 h2 12 h2 12 0 0 0 0 1 h3 1 h3 0 h3 12 0 h3 12 0 −2h 1 − 2 h2 0 0 h1+h2 4 h2 12 0 0 0 0 −2h 2 − 2 h3 0 h2 12 h2+h3 4 0 h3 12 0 0 0 h1 4 0 0 1 h4 h4 12 0 0 0 −2h 3 − 2 h4 0 h3 12 0 h3+h4 4 = N2∗4−20 U6,2 V2,6 T2,2 .
References
[1] Burden, R. L. and Faires, J. D. (2001). Numerical Analysis, Brooks/Cole, Aus-tralia.
[2] Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation, Numerical Mathematik, 31, 377-403.
[3] de Boor, C. (2001). A Practial Guide to Splines: with 32 figures, Rev. ed., Springer, New York.
[4] Eubank, R. L.(1990). Nonparametric Regression and Spline Smoothing, 2nd ed., Marcel Dekker, New York.
[5] Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models, Chapman and Hall, London.
[6] Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and General-ized Linear Models: a Roughness Penalty Approach, Chapman and Hall, London. [7] Gu, C. (2002). Smoothing Spline ANOVA Models, Springer, New York.
[8] Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis, 2nd ed., Springer, New York.
[9] Rossi, N., Wang, X., and Ramsay, J. O. (2002). Nonparametric item response function estimates with the EM algorithm, Journal of Educational and Behavioral Statistics, 27, 291-317.
[10] Wahba, G. (1991). Spline Models for Observational Data, Philadelphia, Pennsyl-vania/SIAM.
[11] Wang, Z. (2000). An algorithm for generalized monotonic smoothing, Journal of Applied Statistics, 27, 495-507.
[12] Zhang, J. T. (2004). A simple and efficient monotone smoother using smoothing splines, Nonparametric Statistics, 16(5), 779-796.
Table 1: The proportion of Monotony in 10000 repeats n = 50 n = 100 n = 200 α = 1.98 β = 28 77.89% 91.38% 96.2% ˆ pg α = 6.28 β = 17.67 79.43% 93.46% 97.27% a=6.9 β = 1.1 85.54% 93.36% 94.67%
Table 2: Proportion of ASE( ˆpg)
ASE( ˆpm) ≥ 1 n = 50 n = 100 n = 200 α = 1.98 β = 28 39.91% 43.76% 48.34% α = 6.28 β = 17.67 49.27% 52.5% 56.12% α = 6.9 β = 1.1 56.3% 61.45% 66.47%
Table 3: The distribution summary for ASE(ˆpg) and ASE(ˆpm), where Q1 is the first
quartile , Q3 is the third quartile. The unit is 10−3.
Q1 Median Mean Q3 ˆ pg 2.3 4.9 7.0 9.3 n = 50 ˆ pm 2.3 4.9 6.8 9.1 ˆ pg 1.6 2.8 4 5.1 α = 1.98 β = 28 n = 100 pˆm 1.5 2.7 3.7 4.9 ˆ pg 1.0 1.7 2.2 2.8 n = 200 ˆ pm 0.83 1.5 2.0 2.6 ˆ pg 1.2 2.9 4.6 6.1 n = 50 ˆ pm 1.2 2.9 4.6 6.1 ˆ pg 0.71 1.7 2.7 3.6 α = 6.28 β = 17.67 n = 100 pˆm 0.7 1.6 2.5 3.4 ˆ pg 0.52 1.08 1.62 2.14 n = 200 ˆ pm 0.44 0.95 1.39 1.87 ˆ pg 1.4 3.1 4.8 6.3 n = 50 ˆ pm 1.8 3.2 5.6 6.9 ˆ pg 0.91 2.0 2.9 3.8 α = 6.9 β = 1.1 n = 100 pˆm 1.1 2.2 3.0 4.0 ˆ pg 0.54 1.1 1.6 2.1 n = 200 ˆ pm 0.59 1.2 1.6 1.2
Table 4: For sample size n = 100 and 10000 monotone ˆpg, the Distribution
summary of ASE(ˆpg) and ASE(ˆpm), where the Q1 is the first quartile ,
Q3 is the third quartile. The unit is 10−3.
Q1 Median Mean Q3 ˆ pg 0.7 1.6 2.5 3.3 α = 1.98 β = 28 pˆm 0.8 1.7 2.6 3.5 ˆ pg 0.8 1.8 2.8 3.7 α = 6.28 β = 17.67 pˆm 0.7 1.7 2.6 3.5 ˆ pg 1.3 2.5 3.4 4.4 α = 6.9 β = 1.1 pˆm 1.1 2.2 3.1 4.1
Table 5: For sample size n = 100 and 10000 non-monotone ˆpg, the
dis-tribution summary of ASE(ˆpg) and ASE(ˆpm), where the Q1 is the first
quartile , Q3 is the third quartile. The unit is 10−3.
Q1 Median Mean Q3 ˆ pg 5.4 8.1 9.2 11.5 α = 1.98 β = 28 pˆm 0.8 1.8 2.7 3.6 ˆ pg 5.0 7.4 8.4 10.5 α = 6.28 β = 17.67 pˆm 0.8 1.8 2.8 3.7 ˆ pg 4.1 6.4 7.5 9.7 α = 6.9 b=1.1 pˆm 1.1 2.3 3.3 4.3
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Cp P (( Cp ))
Figure 1: Examples of three EC performance curves.
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x p(x) ●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●● ●●● ● ●●●●●●●●●●●● ● ●●●●●● ● ●●●● ● ● ●●●●●●●●● ●● ● ●● ● ● ● ● ●●● ● ●● ●● ● ● ● ●●● ● ● ● ●●● ● ● ● ● ● ● ●●●● ● ●● ● ● ●● ● ●●●● ●●●● ● ● ●●●●●●●●●●●●●● Bernoulli (a) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x p(x) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Bernoulli (b)
Figure 2: Bernoulli data. The solid line is the target function p(x) = 1 − (1 − x4.5)2.5.
The dotted line is the estimated curve with monotone constraint. Dots in the left panel
1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 x λλ (( x )) ●● ● ● ● ● ● ●● ● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ● ● ● ● ● ●●● ● ● ● ●● ●●●● ● ● ●● ●●● ●● ● ● ● ● ● ●● ●●●●● ●●● ● ● ● ● ●● ● ● ●● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ●●● ● ● ● ● ● ●●● ● ● ●● ●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ● ● Poisson (a) 1.0 1.5 2.0 2.5 3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 x λλ (( x )) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Poisson (b)
Figure 3: Poisson data. The solid line is the target function φ(x) = log(x2+ 1). Dotted
line is the estimated curve by GCV method. Dash-dot line is the estimated curve with
smoothing parameter 1.5. Dots in the left panel are points (xi, yi), i = 1, . . . , 200. Dots
in the right panel are binned data.
1.0 1.5 2.0 2.5 3.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0 x λλ (( x )) ● ● ● ● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ●●● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ●●●● ● ●● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ●●● ● ●● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ●● ● ● ●● ● ● ●● ●● ● ● ●● ●●● ● ● ●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● ● Poisson (a) 1.0 1.5 2.0 2.5 3.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0 x λλ (( x )) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Poisson (b)
Figure 4: Poisson data. The solid line is the target function φ(x) = 1
e−e
−x+x−sin(π x) π .
The dotted line is the estimated curve by GCV method. Dash-dot line is the estimated
curve with smoothing parameter 2.5 × 10−5. Dots in the left panel are points (xi, yi),
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x p(x) ●●●●●●●●●●● ●● ● ●●●● ● ● ● ●● ● ●●● ● ● ● ● ● ● ● ●●● ● ●● ●●●●●●● ●● ●●●● ●●●●● ●● ●●●●●●●● ● ● ●●●● ●●● ● ●●● ●●●● ●●● ●●● ●● ●●●●●●● Target Gu Mono (a) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x p(x) ●●● ●●●●● ●●●●● ● ● ● ●● ● ●●● ● ●●●● ●● ●●●●●●●● ●●●●●● ●●● ● ●●●●● ●● ●●●●●●●● ●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●● ●●●● Target Gu Mono (b) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x p(x) ● ●●● ●● ● ●●●●●●●● ●●●●● ●●●●● ● ●●●●●●●●●●●● ●● ●●●● ●●●● ● ● ●●● ●●● ● ● ●● ● ● ● ● ● ●● ●●●●●●● ●●● ● ●●●● ●●●●●●●●●●●●●● ●●●● Target Gu Mono (c) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x p(x) ●●●● ●●●●●●● ● ●●●● ●●●●● ●●●● ●●●● ●●● ●● ●●●●●●●●●●● ●●●● ●●● ● ● ● ●●●● ● ●● ●● ● ●● ● ●●●●●●●● ●●●●●● ● ● ●●●●●●●●●●●●●●●● Target Gu Mono (d) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x p(x) ● ●●●● ●●●● ●●●●●●● ●●●●●●● ● ●●● ●●●●●●●● ● ●●● ●●●●●● ● ●●● ●●●●●●●●●●●● ●●●●●●●●● ● ●●● ● ●●●●● ● ●● ● ●● ● ●● ● ● ●● ● ● ● ●●●● Target Gu Mono (e) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x p(x) ●●●●●●●●● ●●●●●●● ●● ● ●●● ●●●●●●●●●●●●●● ●●●●● ●●●●●●●●● ●● ● ●●●●●●●●●●●●●●●●● ●●●● ● ● ●● ● ● ●● ● ●●● ●●● ● ● ● ● ● ●●● ●●● Target Gu Mono (f)
Figure 5: Illustrative examples. The solid line is the target function p(x) = 1 −
(1 − xα)β. For panels from top to bottom, (α, β) are (1.98,28), (6.28,17.67), and
(6.9,1.1), respectively. The dashed line is the estimated curve ˆpg by Gu(2002). The
dash-dot line is the estimated curve ˆpm with monotone constraint. The dots are data
points. The left three panels shows the cases with monotone ˆpg, while ˆpg in the right