• 沒有找到結果。

建立在伽瑪散度下穩健模型的調整參數之選取

N/A
N/A
Protected

Academic year: 2022

Share "建立在伽瑪散度下穩健模型的調整參數之選取"

Copied!
51
0
0

加載中.... (立即查看全文)

全文

(1)

國立臺灣大學理學院應用數學科學研究所 碩士論文

Institute of Applied Mathematical Sciences College of Science

National Taiwan University Master Thesis

建立在伽瑪散度下穩健模型的調整參數之選取 Robust Model Fitting - Selection of Tuning Parameters in

the Aspect of Gamma Clustering

蕭奕 Yi Hsiao

指導教授:杜憶萍博士 Advisor: I-Ping Tu, Ph.D.

中華民國 107 年 6 月 June, 2018

(2)
(3)

誌謝

感謝指導教授杜憶萍老師,讓我在大學的時候接觸統計這個領域,

並因此對統計產生興趣,走上研究的道路。雖然路途上遭遇許多困難,

但每次和老師討論時,能學習到用各種方式去看待不同問題,並找出 問題的關鍵點,再加上不時的鼓勵,讓我一步一步幸運且順利地完成 論文。

感謝大學及研究所的師長們,讓我在求學生涯中打下扎實的數學基 礎。

感謝陪我唸書、一起討論課業準備考試的同學們,以及協助我完成 論文和準備口試的學長姐,有你們陪伴學習更加有趣。

最後感謝我的父母,對我的決定總是給予支持,讓我更專注於課 業。儘管不常回家,但總會讓我知道家永遠是最好的避風港。

(4)
(5)

摘要

Windham 在 1995 年的論文 Robustifying Model Fitting 提出權重分布 方法,解決在存有異常數據情況下可有效的估計均值,這個權重分布 使用了一個參數,這個參數會影響到均值的估計表現。Windham 同時 提出此參數的選取方法,但我們發現在一些數值模擬例子中,這個參 數選取方法表現並不佳。我們提出另一個選取方法,在數值模擬上有 較佳的表現。除了一般的均值估計,我們也成功地把此方法應用在群 聚分析。

關鍵字: 穩健估計、影響函數、伽瑪散度、群聚分析

(6)
(7)

Abstract

In 1995, Windham came out with an idea of weighted distribution in his thesis, Robustifying Model Fitting, and he used the idea to find a mean esti- mator when there are outliers in the original data. There is a tuning parameter in this estimator, and selecting the parameter will affect the mean estimate in the same data. In the same thesis, he also suggested a criterion of selecting the tuning parameter, but we found out that this criterion wasn’t doing well in some simulations. Considering the problem, we propose another criterion which can derive a better mean estimator. Besides, we can also apply this method to clustering problem.

Keywords: robust estimate, influence function, gamma-divergence, cluster- ing

(8)
(9)

Contents

口試委員會審定書 iii

誌謝 v

摘要 vii

Abstract ix

1 Introduction 1

1.1 What is Robust? . . . 1

1.2 How to get Robust? . . . 3

1.2.1 M-estimator . . . 4

1.2.2 Weighted data . . . 5

1.3 Why need to do this question? . . . 6

2 Literature Review 9 2.1 Normal Robust Model . . . 9

2.2 General Description . . . 10

2.3 Selection Criterion . . . 12

2.4 γ-estimate and Weighted Robustified estimate . . . 14

2.4.1 Some Notes about γ-clust . . . 14

2.4.2 Weighted Distribution . . . 15

3 Our Selection Criterion 17

(10)

3.2 gθ is bivariate normal, θ = µ . . . 19

3.3 gθ is univariate normal, θ = (µ, σ2)T . . . 20

3.4 gθ is qGaussian, θ = µ . . . 21

3.4.1 Some Notes about qGaussian . . . 21

3.4.2 Weighted Distribution . . . 23

3.4.3 Applied to gamma-clust . . . 24

4 Numerical Examples 27 4.1 One Dimension Case . . . 27

4.1.1 One component with outliers . . . 27

4.1.2 Five components . . . 31

4.2 Two Dimension-One Component with Outliers . . . 31

4.3 Variance Unknown . . . 32

4.4 qGaussian Model . . . 33

Bibliography 37

(11)

List of Figures

1.1 median, mean, robustified mean . . . 2

1.2 Score function of estimator, assuming the location parameter is 0. . . 5

4.1 . . . 27

4.2 . . . 29

4.3 . . . 29

4.4 . . . 30

4.5 Red curve is the estimated density. . . 32

4.6 Result of 2-dim data. (a) mean of the first dimension and sIF2. (b) scatter plot and the estimated mean (red triangle). . . 33

4.7 Variance unknown simulation. (a) estimated mean and sIF2. (b) estimated variance and sIF2. . . 34

(12)
(13)

List of Tables

2.1 Estimates for ˆµ and ˆσ2 . . . 10

4.1 Result of First Example . . . 28

4.2 Comparing data in first example . . . 28

4.3 Result of Second Example . . . 30

4.4 Comparing data in second example . . . 31

4.5 Result of 1-dim clustering . . . 32

4.6 Result of qGaussain model . . . 33

4.7 Comparing ˆµ and ˆµtrim . . . 34

(14)
(15)

Chapter 1 Introduction

1.1 What is Robust?

Maximum likelihood estimate (MLE) or method of moments are most common way to estimate parameter. Under some regularity conditions, maximum likelihood estimator is consistent. That is, the probability of the estimator converging to the true parameter is 1, as the number of samples goes to infinity. However, finite sample or outliers may hinder the consistent property of the estimate. We can think this through the fundamental spirit of maximizing the likelihood. The likelihood function is defined as the product of the density of the observed data (assume i.i.d.) including outliers. Considering this, it is unreasonable to maximize the likelihood for those outlier terms. And if there are more outliers or the outliers are further far away from the true data, the influence on the estimator become larger. For example, considering a normal distribution model, the maximum likelihood estimator for mean is the sample mean. This estimator can become arbitrary large if we add distant data points in the sample (Figure 1.1). Thus, we say that sample mean is not robust. Another common mean estimator, median, needs over half of the data points to have effect on the estimator, hence is robust. We may want to know how many portions of data can influence the estimator, or how much a data point influence the estimator, thus following are some measurements often used for quantify robustness.

(16)

Figure 1.1: median, mean, robustified mean

Measurement of Robustness

• Influence function (Hampel, 1968) is defined as

IF (x; T, F ) = lim

ε→0

T{(1 − ε)F + εδx} − T (F )

ε ,

where F is the distribution function, T (F ) is the parameter estimator, and δx(u) is a heaviside step function, which is 0 when u < x, and is 1 otherwise.

This definition means that we perturbate the distribution at a point x, and have in- terest in the variation rate of the estimator.

• Gross-error sensitivity is defined as

γ(T, F ) = sup

x |IF (x; T, F )|,

γ(T, F ) is the supreme of the absolute value of influence function, which measures the worst influence a data point can make on the estomator.

(17)

• Breakdown point is defined as

ε = inf{ε : sup

x |T (F ) − T (Fε)| = ∞}, where Fε = (1− ε)F + εδx.

Let’s give an example. Suppose F is the distribution, then sample mean can be expressed as T (F ) =!

udF . Hence the influence function of T (F ) is

IF (x; T, F ) = lim

ε→0

T{(1 − ε)F + εδx} − T (F ) ε

= lim

ε→0

! ud((1− ε)F + εδx)−! udF ε

= lim

ε→0

ε!

udδx− ε! udF

ε = x− T (F )

The gross-error sensitivity is

γ(T, F ) = sup

x ∥IF (x; T, F )∥ = sup

x ∥x − T (F )∥ = ∞, which is the worst case in the view of robustness.

And the breakdown point of sample mean is

ε = inf{ε : sup

x |T (F ) − T (Fε)| = ∞} = inf{ε : sup

x |ε(x − T (F ))| = ∞} = 0, which is also the worst case.

Now, we figure out that sample mean is not robust, so next section is some ideas to get robust.

1.2 How to get Robust?

One way to get robust is to improve maximum likelihood estimator, since it is maximum likelihood type, we called it M-estimator.

(18)

1.2.1 M-estimator

• MLE: min"

− log f(xi; θ), where f is the density function.

Generalized MLE: min"

p(xi; θ), where p can be any function.

• If p(x; θ) is differentiable with respect to θ, then the M-estimator is to solve

EF[ψ(x; θ)] = 0,

where ψ(x; θ) = ∂p(x;θ)∂θ is the score function.

• The influence function of the M-estimator

Let T (F ) be the M-estimator with EF[ψ(x; T (F ))] = 0, for all distributions F . Hence,

EFε[ψ(x; T (Fε))] = 0

To derive influence function, take derivative with respect to ε,

∂εEFε[ψ(x; T (Fε))] = 0

∂ε !

ψ(u; T (Fε))d((1− ε)F (u) + εδx) = 0

If the integration and differentiation can be interchanged, then

⇒!

∂ε{ψ(u; T (Fε))d((1− ε)F (u) + εδx)} = 0

⇒! #

∂ε{ψ(u; T (Fε))}d((1 − ε)F (u) + εδx) + ψ(u; T (Fε))d(∂ε{(1 − ε)F (u) + εδx})$

= 0

⇒! #

∂θ[ψ(u; θ)]θ=T (Fε)dFε(u)·∂ε[T (Fε)] + ψ(u; T (Fε))d(δx− F (u))$

= 0

⇒! #

∂θ[ψ(u; θ)]θ=T (F )dF (u)· ∂ε[T (Fε)]ε=0+ ψ(u; T (F ))d(δx− F (u))$

= 0 Note that ∂ε[T (Fε)]ε=0 = IF (x; T, F ) and

! ψ(u; T (F ))d(δx− F (u)) = ψ(x; T (F )), we have IF (x; T, F ) = ψ(x;T (F ))

!

∂θ[ψ(u,θ)]θ=T (F )dF (u)

Figure 1.2 shows various score functions of location estimators θ, and θ = 0.

Huber estimator (1964) is an M-estimator with score function ψk(x; θ) = max(−k, min(k, x−

θ)), the black line is for k = 1. And Windham estimator (1995) is also an M-estimator with ψc(x; θ) = √

c + 1 exp(−c(x− θ)2)(x− θ), the red curve is for c = 0.5 and blue

(19)

curve is for c = 1.

For different function p(x; θ) or ψ(x; θ), the estimator derived has different robustness.

This property can be implicated by Figure 1.2, traditional score function of MLE ψ(x; θ) = x− θ, the dashed line, counts every point x and going to arbitrary large when x goes to infinity, while other robustified estimators don’t.

Figure 1.2: Score function of estimator, assuming the location parameter is 0.

1.2.2 Weighted data

Another way to get robust is to weight the data such that the point more likely to happen has higher weight, and less likely to happen has lower weight. So, thinking a weight with such property, one can immediately came up with density function, and Windham gave an extra exponent c to the power of the density function. For example, data {x1, x2, ..., xn} are from normal distribution F with mean 0 and unknown variance θ, then its weight function is

w(xi; θ) = Kφc(xi; θ),

where φ is the normal density function with mean 0 in the model, c is a fixed exponent term, K is a constant so that the sum of weight is 1.

(20)

Hence, the weighted sample variance is

v =

%n i

w(xi; θ)x2i

And the weighted population variance is

v =

&

x2w(x; θ)φ(x; θ)dx,

where

w(x; θ) = Kφc(x; θ), (1.1) K = [!

φc(x; θ)dF ]−1 = EFc(x; θ)]−1 is a constant. In this model dF = φ(x; θ)dx, hence!

w(x; θ)dF = !

w(x; θ)φ(x; θ)dx = 1. Notice that w(x; θ)φ(x; θ), called the weighted density function, and is same as the normal density with mean 0 and variance v = θ/(c + 1). If the model is correct, v and v should be close. Thus, using these relations, we can derive the robust estimate of parameter θ.

Also, there is a exponent term c, which can adjust weight and hence called tuning param- eter. Windham (1995)[3] studied in how to select this parameter and called this method Robust Model Fitting. In this thesis, we propose another method of selection and do re- search with it.

1.3 Why need to do this question?

Past robust model, such as Huber estimate (1964)[2], Windham (1995)[3] estimate, and Akifumi Notsu, Shinto Eguchi (2016)[1] Gamma-clust robust estimate, there are tuning parameters. Tuning parameters will affect the robustness of the model (Figure 1.2). In Windham’s model, parameter c determines the weight of each point. Selection of c is im- portant when doing clustering analysis, since larger the c becomes, more robust the model will be, and number of clusts will closer to the number of data points; on the contrary, smaller the c becomes, less robust the model will be, and number of clusts will go to 1.

In general, there is a reasonable interval to choose the parameter, and usually is selected

(21)

subjectively. By using influence function and Fisher information, we give a objective selection criterion, and more details can be seen in chapter 3.

(22)
(23)

Chapter 2

Literature Review

2.1 Normal Robust Model

We use a simple example to elaborate Windham’s iterative algorithm estimates. Assume there are 400 samples {x1, x2, ..., x400}, and among of them, there are 340 samples are from N(µ, σ2), µ = 0, σ2 = 1 and the rest are outliers around µ = 7. We want to estimate µ and variance σ2.

• First step : Find initials.

Use sample mean and sample variance (MLE in normal distribution) as initials.

ˆ

µ = ¯x = 0.99, ˆσ2 = 4001 "

(xi− ¯x)2 = 7.19

• Second step : Weight data.

wi = w(xi, ˆµ, ˆσ2) = Kφc(xi; ˆµ, ˆσ2), c is chosen to be 0.37, φ is normal density function, and K is a constant such that"400

i=1wiis 1.

Similarly, we use MLE on the weighted data, and get, ˆ

m ="

wixi = 0.45, ˆs2 ="

wi(xi− ˆm)2 = 3.93

• Third step : Find the non-weighted parameters such that we can get ˆm and ˆs2 in second step after weighting.

That is, if the model is correct, which is satisfied normal distribution with den-

(24)

Iteration µˆ σˆ2

1 0.993392 7.191821

2 0.4594806 5.380773

3 0.2499072 3.751013

4 0.09685292 2.415905

5 -0.01083447 1.445184

6 -0.05552865 1.044782

7 -0.06715192 0.9457484 8 -0.07040669 0.9195676

9 -0.07137085 0.912215

Table 2.1: Estimates for ˆµ and ˆσ2.

is φ(x; µ, σ2/(c + 1)). Thus, we can revise the new parameters to ˆµ+ = ˆm = 0.45, ˆσ+2 = 1.37ˆs2 = 5.38.

• Fourth step : Repeat step 1 to 3, until the estimates converge. In other words, replace ˆ

µ and ˆσ2by new estimates ˆµ+and ˆσ+2 until they converge (Table 2.1).

2.2 General Description

We now use mathematical way to describe the example above. In the beginning, we ob- serve that there are n one dimension data, {x1, x2, ..., xn} from the distribution F , and its empirical distribution is ˆF . T ( ˆF ) is a parameter estimator, and T ( ˆF ) = (¯x,n1 "

(xi− ¯x)2) is the maximum likelihood case. And note that T (F ) satisfies EF[ψ(x; T (F ))] = 0, where ψ is the score function of T (F ). Moreover, Windham defined weighted distribution as dFc,t(x) = w(x; t)dF (x), where w(x; t) = Kgc(x; t), g is the density in the model, K = {EF[gc(x; t)]}−1, and here we assume g is normal. If we plug ˆF into F , then d ˆFc,t(x) = ˆw(x; t)d ˆF (x), where ˆw(x; t) = ˆKgc(x; t), and ˆK ={Eˆ[gc(x; t)]}−1.

(25)

Hence, the second step is,

T ( ˆFc,t) = ( ˆm, ˆs2) = (

&

xd ˆFc,t,

&

(x− ˆm)2d ˆFc,t)

= (

&

ˆ

w(x; t)xd ˆF ,

&

ˆ

w(x− ˆm)2d ˆF )

= (1 n

% gc(xi; t)

1 n

"

gc(xj; t)xi, 1 n

% gc(xi; t)

1 n

"

gc(xj; t)(xi − ˆm)2)

= (%

wixi,%

wi(xi− ˆm)2),

where wi = w(xi; t) = "gcg(xc(xi;t)j;t).

Next, we want to know the difference of the estimates between the weighted model and non-weighted one. Hence, assume that the normal density of F is g(x; µ, σ2), We have,

&

w(x; µ, σ2)xdF =

! gc+1(x; µ, σ2)xdx

! gc+1(x; µ, σ2)dx = µ

&

w(x; µ, σ2)(x− ˆm)2dF =

&

w(x; µ, σ2)(x− µ)2dF

=

! gc+1(x; µ, σ2)(x− µ)2dx

! gc+1(x; µ, σ2)dx

= σ2 (c + 1)

From the formula above, we know that mean is unchanged and the variance is multiplied by (c+1)1 in the weighted model. Therefore, a function τ is defined to transform the esti- mator between weighted and non-weighted distribution,

( ˆm, ˆs2) = τ (ˆµ+, ˆσ2+) = (ˆµ+, ˆσ2+/(1 + c))

(ˆµ+, ˆσ2+) = τ−1( ˆm, ˆs2) = ( ˆm, (1 + c)ˆs2) For N ≥ 0, t(0) = T ( ˆF ),

t(N +1) = τ−1{T ( ˆF )}

(26)

is an iterative procedure. And the Windham robustified estimate (WRE) Tc( ˆF ) is hence defined as Tc( ˆF ) = ˆθ, where ˆθ is the estimate from the last iteration.

2.3 Selection Criterion

Tcis a solution of EF[w(x; θ)ψ(x; τ (θ))] = 0, hence is an M-estimator, the corresponding score function is

ψc(x; θ) = w(x; θ)ψ(x, τ (θ)) (2.1) Let h(t) = τ−1{T (Fc,t)}, if the process converges, we will have h(θ) = θ. Hence,

EFc,t[ψ[x; τ{h(t)}]] = 0

Take derivative with respect to t, and derive the convergence rate h(t).

∂tEFc,t[ψ[x; τ{h(t)}]] = 0

⇒ ∂

∂tEF[w(x; t)ψ[x; τ{h(t)}]] = 0

⇒ EF[ '∂

∂tw(x; t) (

ψ[x; τ{h(t)}] + w(x; t) '∂

∂tψ[x; τ{h(t)}]

( ] = 0

⇒ EF[cw(x; t)(∂

∂tlog g(x; t))ψ(x; τ(h(t)))] + EF[w(x; t)∂ψ(x; τ (h(t)))

∂τ (h(t))

∂τ (h(t))

∂h(t)

∂h(t)

∂t ] = 0 where ∂tw(x; t) = cw(x; t)(∂t log g(x; t))

When it is about to converge, i.e. h(t) = t,

h(t) =−c{EFc,t[∂ψ(x; τ (t))

∂τ (t) ]∂τ (t)

∂t }−1EFc,t[ψ(x; τ (t))∂

∂t(log g(x; t))]

=−cEF[∂t(log g(x; t))ψc(x; t)]

EF[w(x; t)∂ψ(x;τ (t))

∂τ (t)

∂τ (t)

∂t ]

= −cEF[∂t(log g(x; t))ψc(x; t)]

EFc(x; t)]− EF[cw(x; t)∂t(log g(x; t))ψ(x; τ (t))]

= −cEF[∂t(log g(x; t))ψc(x; t)]EFc(x; t)]−1 I− cEF[∂t(log g(x; t))ψc(x; t)]EFc(x; t)]−1

= cB(t){I + cB(t)}−1,

(27)

Since

ψc(x; t) = ∂

∂t(ψc(x; t)) = cw(x; t)(∂

∂tlog g)ψ(x; τ(t)) + w(x; t)∂ψ(x; τ (t))

∂τ (t) τ, (2.2) where B(t) = −EF[∂t(log g(x; t))ψc(x; t)]EFc(x; t)]−1, which is related to influence function. In fact,

B{Tc(F )} = EF[−EFc(x; t)|t=Tc(F )]−1ψc(x; Tc(F ))· ∂

∂t(log g(x; t))|t=Tc(F )]

= EF[IF (x; Tc, F )s(x; Tc(F ))],

where IF (x; Tc, F ) =−EFc(x; t)|t=Tc(F )]−1ψc(x; Tc(F )) is the influence function and s(x; Tc(F )) = ∂t(log g(x; t))|t=Tc(F )is the score function.

According to Cauchy–Schwarz inequality,

{EF[s2(x; Tc(F ))]EF[IF2(x; Tc, F )]}−1 ≤ B−2{Tc(F )}

And Windham noted that the left hand side is just the asymptotic relative efficiency, which is the reciprocal of the Fisher information times the asymptotic variance. Hence, he used B−2as a criterion for choosing c, which is

ρ(c) = B−2{Tc(F )}

By h(t) = cB(t){I + cB(t)}−1,

ρ(c) = (c/h(t)− c)2

Thus, the tuning parameter c is chosen by

ˆ

c = arg max

c ρ(c)

(28)

In simulation, the convergence rate h(t) can be estimated by|t|t(N(N )−1)−t−t(N(N−1)−2)||, and this is the method Windham used.

2.4 γ-estimate and Weighted Robustified estimate

2.4.1 Some Notes about γ-clust

• γ-divergence

Like K-L divergence, is a measure of the difference between two probability distri- butions.

• γ-cross entropy

Suppose {x1, x2, ..., xn} are from the distribution with density f. And there is a density gθ we assumed that it is the model density with θ unknown. The γ-cross entropy dγ(f, gθ) is defined as

dγ(f, gθ) = −1 γ log{

&

g(x; θ)γf (x)dx} + 1 1 + γ log

&

g(x; θ)1+γdx

dγ(f, gθ) can be empirically estimated by

dγ( ˆf , gθ) =−1 γ log{1

n

%n i=1

g(xi; θ)γ} + 1 1 + γ log

&

g(x; θ)1+γdx,

where ˆf is the empirical pdf.

The small value of γ-cross entropy means two distributions are close, so the robust estimator ˆθγcan be defined by minimizing dγ( ˆf , gθ), i.e.

θˆγ = arg min

θ dγ( ˆf , gθ)

We substitute gθ to g(x; µ, Σ), where g is a p-variate normal density function.

dγ( ˆf , g(x; µ, Σ)) =−1 γ log{1

n

%n

g(xi; µ, Σ)γ} + 1 1 + γ log

&

g(x; µ, Σ)1+γdx

(29)

⇒ e−dγ( ˆf ,g(x;µ,Σ)) ={1 n

%n

g(xi; µ, Σ)γ}1/γ{

&

g(x; µ, Σ)1+γdx}1+γ1

Note that

&

g(x; µ, Σ)1+γdx =

&

{(2π)p2(detΣ)12 exp (−1

2(x− µ)TΣ−1(x− µ))}1+γdx

= (2π)2 (1 + γ)12(detΣ)γ2

⇒ e−dγ( ˆf ,g(x;µ,Σ))γ = 1 n

%g(xi; µ, Σ)γ{

&

g(x; µ, Σ)1+γdx}1+γγ

= 1 n

%g(xi; µ, Σ)γ{(2π)2 (1 + γ)12(detΣ)γ2}1+γγ

∝%

g(xi; µ, Σ)γ(detΣ)2(1+γ)γ2

Thus, we derive a function of γ which is likelihood function in γ-divergence sense.

• γ-utility function

Lγ(µ, Σ) = (detΣ)2(1+γ)γ2

%n i=1

g(xi; µ, Σ)γ

Hence, the robust estimator is to maximize Lγ(µ, Σ)

2.4.2 Weighted Distribution

To maximize γ-utility function, we take derivatives with respect to µ and set it be 0. Here, g(x; µ, σ2) is univariate normal density.

∂ ∂ %

(30)

⇒ ∂"

gγ(xi; µ, σ2)

∂µ = 0

⇒ γ%

gγ−1(xi; µ, σ2)∂g(xi; µ, σ2)

∂µ = 0

⇒%

gγ(xi; µ, σ2)(xi− µ) = 0

⇒ ˆµ =

"

gγ(xi; µ, σ2)xi

"

gγ(xi; µ, σ2)

Similarly, take derivatives w.r.t. σ2,

∂σ2Lγ(µ, σ2) = ∂

∂σ2

%gγ(xi; µ, σ2)(σ2)2(1+γ)γ2 = 0

⇒ γ2

2(1 + γ)(σ2)2(1+γ)γ2 −1%

gγ(xi; µ, σ2) + (σ2)2(1+γ)γ2 %

γgγ−1(xi; µ, σ2)∂g(xi; µ, σ2)

∂σ2 = 0

⇒ γ

2(1 + γ)

%gγ(xi; µ, σ2) + σ2%

gγ−1(xi; µ, σ2)∂g(xi; µ, σ2)

∂σ2 = 0

and since

∂g(x; µ, σ2)

∂σ2 = ∂

∂σ2(2πσ2)12e(x2σ2−µ)2

= (2π)12(−1

2)(σ2)32e(x2σ2−µ)2 + (2π)122)12(x− µ)2

2(σ2)2 e(x2σ2−µ)2

=−1

2(σ2)−1g(x; µ, σ2) + g(x; µ, σ2)(x− µ)2 2(σ2)2 We have,

γ 2(1 + γ)

%gγ(xi; µ, σ2) +%

gγ(xi; µ, σ2)(−1

2 +(xi− µ)22 ) = 0

⇒ 1

1 + γ

%gγ(xi; µ, σ2) = %

gγ(xi; µ, σ2)(xi− µ)2 σ2

⇒ ˆσ2 = (1 + γ)

"

gγ"(xi; µ, σ2)(xi− µ)2 gγ(xi; µ, σ2)

We figure out that the estimator we derived from the aspect of γ-divergence is exactly a Windham robustified estimate.

(31)

Chapter 3

Our Selection Criterion

In this chapter, we propose another criterion to choose c. From chapter two, we knew that the criterion B−2 is related to convergence rate of the estimates and hence can be easily calculated. However, if we look into it carefully, the true criterion is {EF(s2)EF(IF2)}−1 rather than its upper bound B−2. Therefore, our selection criterion of the tuning parameter c is,

ˆ

c = arg max

c {EF(s2(x; Tc))EF(IF2(x; Tc, F ))}−1

The following are some reasons that we prefer using {EF(s2EF(IF2))}−1(denoted sIF2) as criterion:

• The true criterion is {EF(s2)EF(IF2)}−1rather than its upper bound B−2.

• In the same model, the selection of c is stable, since we calculated the criterion directly.

• The MSE of the estimator is smaller.

• If c is large, then ρ(c) = (c/h(t)− c)2 becomes large. That is, ρ tends to choose larger c.

• Although the convergence rate is easy to compute, it’s unstable.

In each section we discuss and calculate the criterion sIF2 under various model distri-

(32)

3.1 g

θ

is univariate normal, θ = µ

The distribution density function gθ(x) = g(x; µ) in this model is univariate normal with unknown mean µ and variance 1.

From eq. (1.1) w(x; µ) = Kgc(x; τ (µ)) and from eq. (2.1) ψc(x; µ) = w(x; µ)ψ(x; τ (µ)), where τ(µ) = µ and ψ(x; µ) = x − µ

First, the score function is

s(x; µ) = ∂

∂µ{log g(x; µ)}

= ∂

∂µ{log 1

√2π − (x− µ)2

2 } = x − µ The Fisher information is

EFˆ(s2(x; µ)) = 1 n

%(xi− µ)2

Second, since Tc(F ) is an M-estimator, its influence function can be written as

IF (x; Tc, F ) =−EFc(x; µ)|µ=Tc(F )]−1ψc(x; Tc(F ))

=−EF[cw(x; Tc(F ))(x− Tc(F ))2− w(x; Tc(F ))]−1w(x; Tc(F ))(x− Tc(F ))

= w(x; Tc(F ))(x− Tc(F )) 1− cEF[w(x; Tc(F ))(x− Tc(F ))2]

⇒ IF2(x; Tc, F ) = w∗2(x; Tc(F ))(x− Tc(F ))2 (1− cEF[w(x; Tc(F ))(x− Tc(F ))2])2 From eq. (2.2), ψc(x; µ) = cw(x; µ)(x− µ)2− w(x; µ) Therefore, the asymptotic variance is

EF(IF2(x; Tc, F )) = EF[w∗2(x; Tc(F ))(x− Tc(F ))2] (1− cEF[w(x; Tc(F ))(x− Tc(F ))2])2

EFˆ(IF2(x; Tc, ˆF )) = n"

w2(xi; Tc( ˆF ))(xi− Tc( ˆF ))2 (1− c"

w(x; T ( ˆF ))(x − T ( ˆF ))2)2

(33)

And the asymptotic relative efficiency is

{EFˆ(s2(x; Tc( ˆF )))EFˆ(IF2(x; Tc, ˆF ))}−1

={1−%

w(xi; Tc( ˆF ))(xi−Tc( ˆF ))}2/{%

(xi−Tc( ˆF ))2%

w2(x; Tc( ˆF ))(xi−Tc( ˆF ))2} Hence, we derived the criterion sIF2 in the model of distribution gθ(x) is univariate normal with unknown mean µ and variance 1.

3.2 g

θ

is bivariate normal, θ = µ

The distribution density function gθ(x) = g(x, µ) in this model is bivariate normal with unknown mean µ and variance I2, where I2is a 2 by 2 identity matrix.

The Fisher information matrix is

EFˆ(s2(x; µ)) = EFˆ[(x− µ)(x − µ)T] = 1n"n

i=1(xi− µ)(xi− µ)T Robust Model Fitting part :

EFc(x; µ)] = EF[cw(x; µ)(x− µ)(x − µ)T + w(x; µ)(−I2)] (from eq. (2.2))

= cEF[w(x; µ)(x− µ)(x − µ)T]− I2

EFˆc(x; µ)] = c"

w(xi, µ)(xi− µ)(xi− µ)T − I2

And the influence function is

IF (x; Tc, F ) =−EFc(x; µ)|µ=Tc(F )]−1ψc(x; Tc(F )) Hence,

IF2(x; Tc, F ) =−EFc(x; µ)|µ=Tc(F )]−1ψc(x; µ)(−EFc(x; µ)|µ=Tc(F )]−1ψc(x; Tc(F )))T

= EFc(x; µ)|µ=Tc(F )]−1ψc(x; Tc(F )ψc(x; Tc(F ))TEFc(x; µ)|µ=Tc(F )]−1 Note that

EFc(x; Tc(F )ψc(x; Tc(F ))T] = EF[w(x− Tc(F ))(w(x− Tc(F )))T] EFˆc(x; Tc( ˆF )ψc(x; Tc( ˆF ))T] = "

wi2(xi− Tc( ˆF ))(xi− Tc( ˆF ))T, where wi is the weight for the i-th data. We have

EFˆ[IF2(x; Tc, ˆF )]

(34)

(c"

wi(xi− Tc( ˆF ))(xi− Tc( ˆF ))T − I2)−1

Finally, we derive the formula of {EFˆ[s2]EFˆ[IF2]}−1 and we choose the reciprocal of the product of the largest eigenvalue of EFˆ[s2(x; Tc( ˆF ))] and EFˆ[IF2(x; Tc, ˆF )] as our criterion.

3.3 g

θ

is univariate normal, θ = (µ, σ

2

)

T

The distribution density function gθ(x) = g(x, (µ, σ2)T) in this model is univariate nor- mal with unknown mean µ and variance σ2.

The Fisher information matrix is

EF[s2(x; θ)]ij = EF[s(x; θ)(s(x; θ)T)]ij = EF[(∂θilog g(x; θ))(∂θ

j log g(x; θ))]

Here, θ = (µ, σ2)T Hence,

EF(s(x; θ)(s(x; θ))T) = EF

⎢⎣

(x−µ)2 σ4

(x−µ)

σ2 (−12 +(x−µ)4 2)

(x−µ)

σ2 (−12 +(x−µ)42) (−12 +(x−µ)42)2

⎥⎦

ψ(x; θ) =

⎢⎣ x− µ (x− µ)2− σ2

⎥⎦

Robust Model Fitting part :

w(x; θ) = c+11 exp(−c2(x− µ)2)

ψc(x; θ) = w(x; θ)ψ(x; τc(θ)) = w(x; θ)

⎢⎣ x− µ (x− µ)2c+1σ2

⎥⎦

EFc(x; θ)]

= EF

⎢⎣

∂µw(x; θ)(x− µ) ∂σ2w(x; θ)(x− µ)

∂µw(x; θ)[(x− µ)2c+1σ2 ] ∂σ2w(x; θ)[(x− µ)2c+1σ2 ]

⎥⎦

= EF

⎢⎣A B C D

⎥⎦

(35)

We know

∂µw(x; θ) = cw(x; θ)x− µ σ2 and

∂σ2w(x; θ) = cw(x; θ)(x− µ)24 Thus,

A = ∂µ w(x; θ)(x− µ) = cw(x; θ)(x−µ)σ2 2 − w(x; θ) B = ∂σ2w(x; θ)(x− µ) = cw(x; θ)(x−µ)43

C = ∂µ w(x; θ)[(x− µ)2c+1σ2 ] = cw(x; θ)((x−µ)σ2 3xc+1−µ)− 2w(x; θ)(x− µ) D = ∂σ2w(x; θ)[(x− µ)2c+1σ2 ] = cw(x; θ)((x−µ)44(x−µ)2(c+1)2 )− wc+1(x;θ) Hence, from the formula before

IF2(x; Tc, F ) =−EFc(x; θ)|θ=Tc(F )]−1ψc(x; Tc(F ))(−EFc(x; θ)|θ=Tc(F )]−1ψc(x; Tc(F )))T

= EFc(x; θ)|θ=Tc(F )]−1ψc(x; Tc(F ))ψc(x; Tc(F ))TEFc(x; θ)|θ=Tc(F )]−1

Thus, we derive the formula of {EF[s2]EF[IF2]}−1and choose the reciprocal of the prod- uct of the largest eigenvalue of EF[s2(x; Tc(F ))] and EF[IF2(x; Tc, F )] as our criterion.

3.4 g

θ

is qGaussian, θ = µ

3.4.1 Some Notes about qGaussian

• β-power model pdf (qGaussian):

fβ(x; µ, Σ) = cβ(det2πΣ)12{1 − β

2 + (p + 2)β(x− µ)TΣ−1(x− µ)}1/β+

,where β > −2/(p + 2), x ∈ Rp

As β → 0, fβ converges to normal density.

f0(x; µ, Σ) := limβ→0fβ(x; µ, Σ)

= limβ→0cβ∥2πΣ∥12{1 − 2/β+(p+2)1 (x− µ)TΣ−1(x− µ)}1/β+

= c0∥2πΣ∥12 exp(−12(x− µ)TΣ−1(x− µ)), which is N (µ, Σ)

(36)

• γ-cross entropy

In section 2.4.1, we have given the form of empirical γ-cross entropy dγ( ˆf , gθ)

dγ( ˆf , gθ) = −1 γ log{1

n

%n i=1

g(xi; θ)γ} + 1 1 + γlog

&

g(x; θ)1+γdx

,where ˆf is the empirical pdf.

Hence, we substitute gθ to fβ,

dγ( ˆf , fβ(x; µ, Σ))

=−1 γ log{1

n

%n

fβ(xi; µ, Σ)γ} + 1 1 + γ log

&

fβ(x; µ, Σ)1+γdx

⇒ e−dγ( ˆf ,fβ)={1 n

%n

fβγ(x; µ, Σ)}1/γ{

&

fβ(x; µ, Σ)1+γdx}1+γ1

note that

&

fβ(x; µ, Σ)1+γdx

=

&

c1+γβ (det2πΣ)1+γ2 {1 − β

2 + p + 2β(x− µ)TΣ−1(x− µ)}

1+γ β

+ dx

=

& c1+γβ cβ/(1+γ)

(det2πΣ)−γ/2kp/2cβ/(1+γ)(det2πkΣ)−1/2 {1 − β/(1 + γ)

2 + (p + 2)β/(1 + γ)(x− µ)T(kΣ)−1(x− µ)}

1+γ β

+ dx

= c1+γβ

cβ/(1+γ)(det2πΣ)−γ/2kp/2

&

fβ/(1+γ)(x; µ, kΣ)dx

= c1+γβ cβ/(1+γ)

(det2πΣ)−γ/2kp/2,

(37)

where k = 2+(p+2)β/(1+γ)β/(1+γ) β 2+(p+2)β

= 2(1+γ)/β+(p+2)2/β+(p+2)

⇒ e−dγ( ˆf ,fβ ={1 n

%n

fβγ(x; µ, Σ)}{

&

fβ(x; µ, Σ)1+γdx}1+γγ

= 1 n

%n

fβγ{ c1+γβ cβ/(1+γ)

(detΣ)−γ/2( k

γ)p/2}1+γγ

∝%

fβγ(xi; µ, Σ)(detΣ)2(1+γ)γ2

= Lβ,γ(µ, Σ)

Hence, the robust estimator is to maximize Lβ,γ(µ, Σ)

• γ-utility function (similar to likelihood function)

Lβ,γ(µ, Σ) = (detΣ)2(1+γ)γ2

%n i=1

fβ(xi; µ, Σ)γ

3.4.2 Weighted Distribution

To maximize this, take derivatives with respect to µ, and consider x ∈ Rp, where p = 1.

∂µ '

2)2(1+γ)γ2 %

fβγ(xi; µ, σ2)(xi; µ, σ2) (

= (σ2)2(1+γ)γ2 γ%

fβγ−1(xi; µ, σ2) ∂

∂µfβ(xi; µ, σ2)

= (σ2)2(1+γ)γ2 γ%

fβγ−1(xi; µ, σ2)(cβ

√ 1 2πσ2

1

β{1 − β 2 + 3β

(xi− µ)2 σ2 }

1 β−1 +

β 2 + 3β

2

σ2(xi− µ))

= (σ2)2(1+γ)γ2 γ%

fβγ−1(xi; µ, σ2)(cβ

√ 1

2πσ2{1 − β 2 + 3β

(xi− µ)2 σ2 }

1 β

+)1−βcββ( 1

√2πσ2)β 2 2 + 3β

xi− µ σ2

= (σ2)2(1+γ)γ2 γ%

fβγ−β(xi; µ, σ2)(xi− µ) 2cββ

(2 + 3β)σ2( 1

√2πσ2)β

Let the result equal to 0, we have

"

fβγ−β(xi; µ, σ2)xi

(38)

Hence, we can define a weighted distribution

dFβ,γ,t(x) = wβ,γ (x; t)dF (x) with wβ,γ (x; t) = fβγ−β(x; t)/EF[fβγ−β(x; t)]

Then we calculate the weighted variance

&

fβγ−β(x; µ, σ2)(x− µ)2dF

=

&

(cβ

√ 1

2πσ2{1 − β 2 + (2 + p)β

(x− µ)2

(σ)2 }1/β+ )γ−β+1dx

=

&

(cβ

√ 1

2πσ2)γ−β+1{1 − β 2 + (2 + p)β

(x− µ)2

(σ)2 }+−β+1)/βdx

=

&

Constant{1 − β 2 + (2 + p)β

(x− µ)2

(σ)2 }(γ−β+1)/β+ dx

( β

2 + (2 + p)β = 1

2/β + (2 + p) → 1

2(γ− β + 1)/β + (2 + p) = β 2γ + 2 + pβ)

= 2 + (2 + p)β 2γ + 2 + pβσ2

⇒ ˆσ = 2γ + 2 + pβ 2 + (2 + p)β

"

fβγ−β(x; µ, σ2)(x− µ)2

"

fβγ−β(x; µ, σ2) }

This result is same as in Local Fixed-Point Algortithm (Akifumi Notsu, Shinto Eguchi, 2016)[1].

3.4.3 Applied to gamma-clust

We applied the result to gamma-clust, and assume that the distribution density function gθ(x) = fβ(x; µ) in this model is univariate qGaussian with unknown mean µ and variance 1.

(39)

First, the score function is,

s(x; µ) = ∂

∂µ{log fβ(x; µ)}

= ∂

∂µ{log cβ

√2π + 1

β log{1 − β

2 + 3β(x− µ)2}+}

= 1 β

2+3β(x− µ) {1 −2+3ββ (x− µ)2)}+

= 2

2 + 3β

(x− µ) {1 −2+3ββ (x− µ)2}+

Hence the Fisher information is,

EFˆ(s2(x; µ)) = 1 n( 2

2 + 3β)2%

i

(xi − µ)2 {1 − 2+3ββ (xi− µ)2}2+

And the influence function is,

IF (x; Tc, F ) =−EFc(x; µ)|µ=Tc(F )]−1ψc(x; Tc(F ))

=−EF[((w(x; µ))ψ(x; µ) + w(x; µ)ψ(x; µ))|µ=Tc(F )]−1w(x; Tc(F ))ψ(x; Tc(F ))

=−EF[(cw(x; µ)( ∂

∂µlog fβ)(x− µ) − w(x; µ))|µ=Tc(F )]−1w(x; Tc(F ))(x− Tc(F ))

= w(x; Tc(F ))(x− Tc(F )) 1− 2+3β2c EF[w(x; Tc(F )) (x−Tc(F ))2

{1−2+3ββ (x−Tc(F ))2}+] Similar to Section 3.1, the empirical asymptotic relative efficiency is,

{EFˆ(s2(x; Tc( ˆF )))EFˆ(IF2(x; Tc, ˆF ))}−1

={( 2

2 + 3β)2% (xi− Tc( ˆF ))2 {1 − 2+3ββ (xi− Tc( ˆF ))2}2+

(

"

w2(xi; Tc( ˆF ))(xi− Tc( ˆF ))2 (1− 2+3β2c 1n"

w(xi; Tc( ˆF )) (xi−Tc( ˆF ))2

{1−2+3ββ (xi−Tc( ˆF ))2}+)2)}−1 Thus, we derived the criterion sIF2 in the model of distribution gθ(x) is univariate qGaus-

sian with unknown mean µ and variance 1.

(40)
(41)

Chapter 4

Numerical Examples

4.1 One Dimension Case

4.1.1 One component with outliers

In this subsection, we compare two criteria by presenting two examples.

For first example, assume there are 280 samples from N (0, 1) which is the main distri- bution we are going to estimate and 120 samples from N (7, 1) which is seemed to be outliers, and we already know that the variance of the main distribution is 1. The follow- ing is the result: The y-axis of Figure 4.1(a) means the value of {EFˆ(s2)EFˆ(IF2)}−1, and

(a) sIF2 criterion (b) rho criterion

數據

Figure 1.1: median, mean, robustified mean
Figure 1.2: Score function of estimator, assuming the location parameter is 0.
Table 2.1: Estimates for ˆ µ and ˆ σ 2 .
Figure 4.5: Red curve is the estimated density.
+3

參考文獻

相關文件

• helps teachers collect learning evidence to provide timely feedback &amp; refine teaching strategies.. AaL • engages students in reflecting on &amp; monitoring their progress

Teachers may consider the school’s aims and conditions or even the language environment to select the most appropriate approach according to students’ need and ability; or develop

Robinson Crusoe is an Englishman from the 1) t_______ of York in the seventeenth century, the youngest son of a merchant of German origin. This trip is financially successful,

fostering independent application of reading strategies Strategy 7: Provide opportunities for students to track, reflect on, and share their learning progress (destination). •

Strategy 3: Offer descriptive feedback during the learning process (enabling strategy). Where the

How does drama help to develop English language skills.. In Forms 2-6, students develop their self-expression by participating in a wide range of activities

 Register, tone and style are entirely appropriate to the genre and text- type.  Text

If a contributor is actively seeking an appointment in the aided school sector but has not yet obtained an appointment as a regular teacher in a grant/subsidized school, or he