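Implementation of Rule Extraction Using Preimage Analysis

National Science Council, Executive Yuan: Research Project Report

Research Report (Condensed Version)

Project type: Individual project

Project number: NSC 97-2410-H-004-117-

Project period: August 1, 2008 to July 31, 2009 (ROC years 97 to 98)

Executing institution: Department of Management Information Systems, National Chengchi University (國立政治大學資訊管理學系)

Principal investigator: 蔡瑞煌

Project participants: bachelor-level full-time assistant: 沈軒豪

Report attachments: report on attending an international conference and the paper presented

Handling: this project may be publicly searched

July 17, 2009 (ROC year 98)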

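National Science Council, Executive Yuan: Subsidized Research Project

■ Final report

□ Interim progress report

Project type: ■ Individual project □ Integrated project

Project number: NSC 97-2410-H-004-017-

Project period: August 1, 2008 to July 31, 2009 (ROC years 97 to 98)

Principal investigator: 蔡瑞煌

Co-principal investigator:

Project participant: 沈軒豪

Report type (submitted according to the approved budget list): ■ Condensed report □ Full report

This report includes the following required attachments:

□ One report on overseas travel or study

□ One report on travel or study in mainland China

■ One report on attending an international academic conference and one copy of the paper presented

□ One overseas research report for an international cooperative research project

Handling: except for industry-academia cooperative research projects, projects for upgrading industrial technology and personnel training, controlled projects, and the cases below, the report may be publicly searched immediately

□ Involves patents or other intellectual property rights; may be publicly searched after □ one year □ two years

Executing institution: Department of Management Information Systems, National Chengchi University (國立政治大學資訊管理學系)

July 15, 2009 (ROC year 98)

Report Content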

ARTIFICIAL NEURAL NETWORKS AS A FEATURE DISCOVERY TOOL

This study proposes preimage analysis and an associated belief justification process for applying continuous-valued single-hidden-layer feed-forward neural networks (SLFNs) to the discovery of features, that is, certain relationships between explanatory (input) and observed (output) variables, embedded in the (training) data. The preimage analysis explicitly specifies the preimage of the network and discloses its preimage-related properties. The preimage of a given output of an SLFN is the collection of all inputs that generate that output.

The seminal publication of Rumelhart & McClelland (1986) states that Artificial Neural Networks (ANNs) can be trained primarily through examples, can perform general pattern recognition, and can learn general rules of optimal behavior. Since then, these claims have stimulated studies in many fields that develop various ANNs as modeling tools in order to check the validity of the claims. Many experimental results are positive. For instance, Sgroi & Zizzo (2007) state that ANNs “are consistent with observed laboratory play in two very important senses. Firstly, they select a rule for behavior which appears very similar to that used by laboratory subjects. Secondly, using this rule they perform optimally only approximately 60%

The results of these studies imply that well-trained ANNs capture features embedded in the (training) data. Thus, some practitioners may go one step further and address the issue of extracting those features from well-trained ANNs. For instance, based upon empirical data, one may want to identify the relative influence of the factors that determine the pricing of newly issued securities. The practitioner first obtains several well-trained networks and then extracts certain features from them. Rule (1) is one example of such a feature. Ideally, the extracted rules depict features embedded in the data and help identify the significant factors.

Rule: If the input sample is in some sub-region of the input space, then the predicted price

value is given by a corresponding linear regression equation. (1)

Note that the feature-discovery practitioner (or researcher) may be interested merely in any tool (ANN or statistical) that helps analyze the data to discover something interesting or significant, rather than in the interaction of human cerebral activities and its explanation, from which the fundamentals of ANNs derive. Regardless, the task faced by a feature-discovery practitioner is no easier, because an ANN simply behaves as a black box, i.e., a system that produces “certain outputs from certain inputs without explaining why or how” (Rabuñal, Dorado, Pazos, Pereira, & Rivero, 2004). Nevertheless, ANN researchers have spent substantial

from trained ANNs (rule extraction), and utilize ANNs to refine existing rule bases (rule refinement).” (Andrews, Diederich, & Tickle, 1995, page 373) A non-exhaustive list of studies with these purposes includes Andrews, Diederich, & Tickle (1995); Setiono & Liu (1997); Tickle, Andrews, Golea & Diederich (1998); Tsaih, Hsu, & Lai (1998); Taha & Ghosh (1999); Zhou, Chen, & Chen (2000); Saito & Nakano (2002); Setiono, Leow, & Zurada (2002); and Baesens, Setiono, Mues, & Vanthienen (2003).

Despite these studies, the black-box nature of ANNs persists, and any feature-discovery effort has to cope with the following issues. First, the approach should disclose true properties of the ANN. Most of the above studies do not seem to meet this requirement, since they explore the network only through a limited amount of data, so that the extracted rules are dubious and unlikely to be accepted as patterns of knowledge embedded in the data. Consider the rule stated in (1). To identify its premise, most works use either training data or data generated by the trained network itself. Such an approach is data sensitive and requires an extensive amount of data to be accurate. The limited number of (training or generated) data instances casts doubt on how well rule (1) generalizes when interpolating or extrapolating to unexplored data values.

Second, since features learned by an ANN distribute over the entire network as weight values,

simultaneously investigate several kinds of features to get a better understanding of the (training) data. The above studies do not seem to meet this requirement, since they rely on predefined schemes such as rule (1) for extracting rules.

Third, a sub-perfect learning result is possible since, in most studies (e.g., financial market tests), it is difficult to determine the best SLFN architecture. Besides, noise in the training samples prevents networks from fitting perfectly. The possibility of a sub-perfect learning result calls for a conservative attitude toward the obtained preimage-related properties: a practitioner who understands this will be cautious about acquiring features directly from them. The above studies do not seem to address such conservatism.

Fourth, the practitioner normally holds some personal beliefs when conducting the feature-discovery experiment. Any such belief, if available, takes the form of tacit knowledge about the relationship between the explanatory and observed variables. However, drawing parallels between the beliefs and the observed features requires professional interpretation, owing to the tacit nature of the former and the complex nature of the latter. That is, the practitioner has to contrast the similarities and differences between beliefs and observed features and then infer properly. Even if the beliefs and observed features suggest different views, the practitioner may

justification is a kind of knowledge internalization stated by Nonaka & Takeuchi (1995). The

above studies do not seem to provide such discussions.

This study addresses these issues. Specifically, for feature discovery via well-trained SLFNs, this study proposes the preimage analysis, which explicitly specifies the preimage of the network in order to disclose its preimage-related properties. For complex preimage-related properties, this study then proposes the following belief justification process, in which the practitioner’s (prior) beliefs are refined based upon the results of examining the preimage-related properties. The beliefs lead to the propositions of the experiment. Based upon the propositions, the practitioner first selects relevant explanatory and observed variables and collects the sample accordingly. Then he trains SLFNs and, after the training, applies the preimage analysis to the selected SLFNs. For each belief, the practitioner inspects the relevant preimage-related properties. Such inspection leads to a belief justification process. If there is no such prior belief, rule (2) and the observed preimage-related properties alone make statements about the features embedded in the data.

Research findings of this study are summarized as follows:

(I) the preimage analysis is not data intensive and the inspected preimage-related properties

(III) the inspected preimage-related properties provide further insights about rule (2) and thus about features embedded in data. In rule (2), $\mathbf{x}$ is the vector of explanatory variables; $y$ is the network’s response; $f: X \to Y$ is the function of the SLFN and $y \equiv f(\mathbf{x})$; $y'$ is a constant; and the preimage $f^{-1}$ is the inverse function of $f$.

Rule: If ($\mathbf{x} \in$ the $f^{-1}(y')$ region), then ($y = y'$). (2)

(IV) several kinds of features can be simultaneously investigated through inspecting

preimage-related properties.

The remainder of this paper is organized as follows. Section II starts with the list of notations used in the study and then gives the proposed preimage analysis. From the preimage analysis, we find that rank($\mathbf{W}^H$) determines the characteristics of the preimage-related properties, in which $\mathbf{W}^H$ is the matrix of weights between the input variables and the hidden nodes and rank($\mathbf{D}$) is the rank of matrix $\mathbf{D}$. Hereafter, SLFN-p denotes an SLFN whose rank($\mathbf{W}^H$) equals p. Section III shows the application of preimage analysis to two SLFN-1 network solutions of the 3-bit parity problem. Some implications regarding the feature-discovery application via SLFN-1 networks are offered in Section IV. It is readily seen that the preimage-related properties of SLFN-1 are easy to understand. However, learning algorithms adopted in most studies are likely to result in SLFN-p networks with p ≥ 2, and these SLFN networks own complex preimages and

illustrated through the feature-discovery application to the bond pricing experiment, which yields an SLFN-3 network. Some further discussions and future work are presented at the end.

THE PROPOSED PREIMAGE ANALYSIS

List of notations used in mathematical representations: Characters in bold represent column vectors, matrices or sets; $(\cdot)^T$ denotes the transpose of $(\cdot)$.

$I$ ≡ the number of input nodes;

$J$ ≡ the number of hidden nodes;

$\mathbf{x} \equiv (x_1, x_2, \ldots, x_I)^T$: the input vector, in which $x_i$ is the $i$th input component, with $i$ from 1 to $I$;

$\mathbf{a} \equiv (a_1, a_2, \ldots, a_J)^T$: the hidden activation vector, in which $a_j$ is the activation value of the $j$th hidden node, with $j$ from 1 to $J$;

$y$ ≡ the activation value of the output node, with $y = f(\mathbf{x})$ and $f$ being the function mapping $\mathbf{x}$ to $y$;

$w^H_{ji}$ ≡ the weight between the $i$th input variable and the $j$th hidden node, in which the superscript $H$ throughout the paper refers to quantities related to the hidden layer;

$\mathbf{w}^H_j \equiv (w^H_{j1}, w^H_{j2}, \ldots, w^H_{jI})^T$;

$\mathbf{W}^H \equiv (\mathbf{w}^H_1, \mathbf{w}^H_2, \ldots, \mathbf{w}^H_J)^T$, the $J \times I$ matrix of weights between the input variables and the hidden nodes;

$w^H_{j0}$ ≡ the bias value of the $j$th hidden node;

$w^O_j$ ≡ the weight between the $j$th hidden node and the output node, in which the superscript $O$ throughout the paper refers to quantities related to the output layer;

$\mathbf{w}^O \equiv (w^O_1, w^O_2, \ldots, w^O_J)^T$; and

$w^O_0$ ≡ the bias value of the output node.

Without any loss of generality, assume the tanh activation function is adopted in all hidden nodes. Denote the collection of $w^H_{j0}$, $\mathbf{w}^H_j$, $\mathbf{w}^O$, and $w^O_0$ by $\theta$. Given $\theta$, the resulting $f$ of the SLFN is the composite of the following mappings: the activation mapping $\Phi_A: \Re^I \to (-1, 1)^J$ that maps an input $\mathbf{x}$ to an activation value $\mathbf{a}$ (i.e., $\mathbf{a} = \Phi_A(\mathbf{x})$); and the output mapping $\Phi_O: (-1, 1)^J \to \left(w^O_0 - \sum_{j=1}^{J} |w^O_j|,\; w^O_0 + \sum_{j=1}^{J} |w^O_j|\right)$ that maps an activation value $\mathbf{a}$ to an output $y$ (i.e., $y = \Phi_O(\mathbf{a})$). Note that, since the range of $\Phi_A$ and the domain of $\Phi_O$ are set as $(-1, 1)^J$, the range in the output space $\Im \equiv \left(w^O_0 - \sum_{j=1}^{J} |w^O_j|,\; w^O_0 + \sum_{j=1}^{J} |w^O_j|\right)$ contains all achievable output values. For ease of reference in later discussion, we also call $\Re^I$ the input space and $(-1, 1)^J$ the activation space.
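The composite mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the report's implementation: the function names (`phi_a`, `phi_o`, `slfn`, `output_range`) are invented here, and the weights in the example are the ones the report lists later in Table 3 for a two-hidden-node solution of the 3-bit parity problem. The output range is computed with absolute values of the output weights, which is what makes the interval contain every achievable output when some weights are negative.

```python
import math
from itertools import product

def phi_a(x, W_H, b_H):
    """Activation mapping: a_j = tanh(sum_i w^H_ji * x_i + w^H_j0)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_H, b_H)]

def phi_o(a, w_O, b_O):
    """Output mapping: y = w^O_0 + sum_j w^O_j * a_j."""
    return b_O + sum(w * aj for w, aj in zip(w_O, a))

def slfn(x, W_H, b_H, w_O, b_O):
    """f = phi_o composed with phi_a."""
    return phi_o(phi_a(x, W_H, b_H), w_O, b_O)

def output_range(w_O, b_O):
    """Open interval that contains every achievable output value."""
    spread = sum(abs(w) for w in w_O)
    return (b_O - spread, b_O + spread)

# Weights of the Table 3 network: w^H_j = alpha_j * (1, 1, 1), zero biases.
W_H = [[0.4, 0.4, 0.4], [0.2, 0.2, 0.2]]
b_H = [0.0, 0.0]
w_O = [18.58899737, -30.7174688]
b_O = 0.0

# Evaluate the network on all eight 3-bit inputs coded as -1 / +1.
parity_outputs = {x: slfn(x, W_H, b_H, w_O, b_O)
                  for x in product((-1, 1), repeat=3)}
for x, y in parity_outputs.items():
    print(x, round(y, 4))
```

Running the loop shows an output near 1 for inputs with an odd number of -1s and near -1 otherwise, which is the parity target stated in the report.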

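Thus, $f^{-1}(y) \equiv \Phi_A^{-1}(\Phi_O^{-1}(y))$ with

$\Phi_O^{-1}(y) \equiv \{\mathbf{a} \in (-1, 1)^J \mid \sum_{j=1}^{J} w^O_j a_j = y - w^O_0\}$, (3)

$\Phi_A^{-1}(\mathbf{a}) \equiv \bigcap_{j=1}^{J} \{\mathbf{x} \in \Re^I \mid \sum_{i=1}^{I} w^H_{ji} x_i = \tanh^{-1}(a_j) - w^H_{j0}\}$, (4)

where $\tanh^{-1}(x) \equiv 0.5 \ln\left(\frac{1+x}{1-x}\right)$. Formally, the following are defined for every given $\theta$: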

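(a) A value $y \in \Re$ is void if $y \notin f(\Re^I)$, i.e., for all $\mathbf{x} \in \Re^I$, $f(\mathbf{x}) \neq y$. Otherwise, $y$ is non-void.

(b) A point $\mathbf{a} \in (-1, 1)^J$ is void if $\mathbf{a} \notin \Phi_A(\Re^I)$, i.e., for all $\mathbf{x} \in \Re^I$, $\Phi_A(\mathbf{x}) \neq \mathbf{a}$. Otherwise, $\mathbf{a}$ is non-void. The set of all non-void points $\mathbf{a}$ is called the non-void set.

(c) The image of an input $\mathbf{x} \in \Re^I$ is $y \equiv f(\mathbf{x})$ for $y \in \Im$.

(d) The preimage of a non-void output value $y$ is $f^{-1}(y) \equiv \{\mathbf{x} \in \Re^I \mid f(\mathbf{x}) = y\}$. The preimage of a void value $y$ is the empty set.

(e) The internal-preimage of a non-void output value $y$ is the intersection of $\Phi_O^{-1}(y)$ and the non-void set on the activation space.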

Given θ, the preimage analysis is conducted in the following four steps to specify the preimage:

Step 1: Derive the expression of ΦO-1(y);

Step 2: Derive the expression of the non-void set;

Step 3: Derive the expression of internal-preimage of a non-void output value y; and

Step 4: Derive the expression of preimage f -1(y).

From Eq. (3), with the given $\theta$, $\Phi_O^{-1}(y)$ is a hyperplane in the activation space. As $y$ changes, the sets $\Phi_O^{-1}(y)$ form parallel hyperplanes in the activation space; for any change of the same magnitude in $y$, the corresponding hyperplanes are spaced by the same distance. The activation space is entirely covered by these parallel $\Phi_O^{-1}(y)$ hyperplanes, ordered by the values of $y$. These parallel hyperplanes form a (linear) scalar field (Tsaih, 1998). That is, for each point $\mathbf{a}$ of the activation space, there is only one output value $y$ whose $\Phi_O^{-1}(y)$ hyperplane passes point

From Eq. (4), $\Phi_A^{-1}(\mathbf{a})$ is a separable function such that each of its components lies along a dimension of the activation space. Moreover, $\Phi_{Aj}^{-1}(a_j) \equiv \{\mathbf{x} \in \Re^I \mid \sum_{i=1}^{I} w^H_{ji} x_i = \tanh^{-1}(a_j) - w^H_{j0}\}$ is a monotone bijection that defines a one-to-one mapping between the activation value $a_j$ and a hyperplane of inputs $\mathbf{x}$. For each $a_j$ value, $\Phi_{Aj}^{-1}(a_j)$ defines an activation hyperplane in the input space. Activation hyperplanes associated with all possible $a_j$ values are parallel and form a (linear) scalar activation field in the input space. That is, for each point $\mathbf{x}$ of the input space, there is only one activation value $a_j$ whose $\Phi_{Aj}^{-1}(a_j)$ hyperplane passes point $\mathbf{x}$; all points on the $\Phi_{Aj}^{-1}(a_j)$ hyperplane are associated with the activation value $a_j$. Each hidden node gives rise to an activation field, and $J$ hidden nodes set up $J$ independent activation fields in the input space.

Thus, with the given $\theta$, the preimage of an activation value $\mathbf{a}$ by $\Phi_A^{-1}$ is the intersection of $J$ specific hyperplanes. The intersection $\bigcap_{j=1}^{J} \{\mathbf{x} \in \Re^I \mid \sum_{i=1}^{I} w^H_{ji} x_i = \tanh^{-1}(a_j) - w^H_{j0}\}$ can be represented as $\{\mathbf{x} \mid \mathbf{W}^H \mathbf{x} = \boldsymbol{\omega}(\mathbf{a})\}$, where $\omega_j(a_j) \equiv \tanh^{-1}(a_j) - w^H_{j0}$ for all $1 \le j \le J$, and $\boldsymbol{\omega}(\mathbf{a}) \equiv (\omega_1(a_1), \omega_2(a_2), \ldots, \omega_J(a_J))^T$.

Given $\theta$ and an arbitrary point $\mathbf{a}$, $\boldsymbol{\omega}(\mathbf{a})$ is simply a $J$-dimensional vector of known component values, and the conditions that relate $\mathbf{a}$ to $\mathbf{x}$ can be represented as

$\mathbf{W}^H \mathbf{x} = \boldsymbol{\omega}(\mathbf{a})$, (5)

which is a system of $J$ simultaneous linear equations with $I$ unknowns.

Let rank($\mathbf{D}$) be the rank of matrix $\mathbf{D}$ and $(\mathbf{D}_1 \mid \mathbf{D}_2)$ be the augmented matrix of two matrices

equations if rank($\mathbf{W}^H \mid \boldsymbol{\omega}(\mathbf{a})$) = rank($\mathbf{W}^H$) + 1 (cf. Murty, 1983). In this case, the corresponding point $\mathbf{a}$ is void. Otherwise, $\mathbf{a}$ is non-void. Note that, for a non-void $\mathbf{a}$, the solution of Eq. (5) defines an affine space of dimension $I$ - rank($\mathbf{W}^H$) in the input space. This discussion establishes Lemma 1 below.

Lemma 1: (a) An activation point $\mathbf{a}$ in the activation space is non-void if its corresponding rank($\mathbf{W}^H \mid \boldsymbol{\omega}(\mathbf{a})$) equals rank($\mathbf{W}^H$). (b) The set of input values $\mathbf{x}$ mapped onto a non-void $\mathbf{a}$ forms an affine space of dimension $I$ - rank($\mathbf{W}^H$) in the input space.

By definition, the non-void set equals $\{\mathbf{a} \in (-1, 1)^J \mid a_j = \tanh(\sum_{i=1}^{I} w^H_{ji} x_i + w^H_{j0})$ for $1 \le j \le J$, $\mathbf{x} \in \Re^I\}$. Recall that $\mathbf{W}^H$ is a $J \times I$ matrix. If rank($\mathbf{W}^H$) = $J$, Lemma 1 says that no activation point $\mathbf{a}$ can be void, which leads to Lemma 2 below. For rank($\mathbf{W}^H$) < $J$, Lemma 3 characterizes the non-void set, which requires the concept of a manifold. A p-manifold is a Hausdorff space $X$ with a countable basis such that each point $x$ of $X$ has a neighborhood that is homeomorphic to an open subset of $\Re^p$ (Munkres, 1975). A 1-manifold is often called a curve, and a 2-manifold is called a surface. For our purpose, it suffices to consider Euclidean spaces, the most common members of the family of Hausdorff spaces.

Lemma 2: If rank($\mathbf{W}^H$) equals $J$, then the non-void set covers the entire activation space.

Lemma 3: If rank($\mathbf{W}^H$) is less than $J$, then the non-void set in the activation space is a

$A(y)$, the intersection of $\Phi_O^{-1}(y)$ and the non-void set in the activation space, is the internal-preimage of $y$. Mathematically, for each non-void $y$, $A(y) \equiv \{\mathbf{a} \mid$ rank($\mathbf{W}^H \mid \boldsymbol{\omega}(\mathbf{a})$) = rank($\mathbf{W}^H$), $\mathbf{a} \in \Phi_O^{-1}(y)\}$. Consider first rank($\mathbf{W}^H$) = $J$. In this case, Lemma 2 says that the non-void set is the entire activation space. Thus, $A(y)$ equals $\Phi_O^{-1}(y)$. If rank($\mathbf{W}^H$) < $J$, then $A(y)$ is a subset of $\Phi_O^{-1}(y)$. Thus, we have the following Lemma 4. Furthermore, the sets $A(y)$ are aligned in order according to $\Phi_O^{-1}(y)$, and all non-empty sets $A(y)$ form an internal-preimage field in the activation space. That is, there is one and only one $y$ such that a non-void $\mathbf{a} \in A(y)$; and for any $\mathbf{a}$ on $A(y)$, its output value is equal to $y$.

Lemma 4: For each non-void output value $y$, all points in the set $A(y)$ lie on the same hyperplane.

Now the preimage of any non-void output value $y$, $f^{-1}(y)$, equals $\{\mathbf{x} \in \Re^I \mid \mathbf{W}^H \mathbf{x} = \boldsymbol{\omega}(\mathbf{a})$ with all $\mathbf{a} \in A(y)\}$. If rank($\mathbf{W}^H$) = $J$, then, from Lemma 2 and Lemma 1(b), the preimage $f^{-1}(y)$ is an $(I-1)$-manifold in the input space. For rank($\mathbf{W}^H$) < $J$, from Lemma 3 and Lemma 1(b),

1. if rank($\mathbf{W}^H$) = 1 and $A(y)$ is a single point, then $f^{-1}(y)$ is a single hyperplane;

2. if rank($\mathbf{W}^H$) = 1 and $A(y)$ consists of several points, then $f^{-1}(y)$ may consist of several disjoint hyperplanes;

3. if 1 < rank($\mathbf{W}^H$) < $J$ and $A(y)$ is a single (rank($\mathbf{W}^H$)-1)-manifold, then $f^{-1}(y)$ is a single $(I-1)$-manifold; and

4. if 1 < rank($\mathbf{W}^H$) < $J$ and $A(y)$ consists of several disjoint (rank($\mathbf{W}^H$)-1)-manifolds, then $f^{-1}(y)$ consists of several disjoint $(I-1)$-manifolds.

Table 1 summarizes the relationship between the internal-preimage A(y) and the preimage f -1(y)

of a non-void output value y.

Table 1. The relationship between the internal-preimage $A(y)$ and the preimage $f^{-1}(y)$ of a non-void output value $y$.

The nature of $A(y)$      | A single intersection-segment | Multiple disjoint intersection-segments
The nature of $f^{-1}(y)$ | A single $(I-1)$-manifold     | Multiple disjoint $(I-1)$-manifolds

The input space is entirely covered by a grouping of preimage manifolds that forms a

preimage field. That is, there is one and only one preimage manifold passing through each x;

and the corresponding output value is the y value associated with this preimage manifold. Note

that the preimage manifolds are aligned orderly because A(y)’s are aligned orderly according to

ΦO-1(y)’s and the mapping of ΦA-1 is a monotone bijection that defines a one-to-one mapping

between an activation vector and an affine space.

Notice that rank($\mathbf{W}^H$) determines the characteristic of the non-void set and thus the characteristic of the internal-preimage. For an SLFN-1 network, we can assume $\mathbf{w}^H_j \equiv \alpha_j \mathbf{w}$ for all $j$, in which $\mathbf{w}$ is a non-zero vector and the $\alpha_j$ are constants; as for an SLFN-p network with p > 1, we can assume that the vectors in the set $\{\mathbf{w}^H_1, \mathbf{w}^H_2, \ldots, \mathbf{w}^H_p\}$ are linearly independent and $\mathbf{w}^H_j \equiv \sum_{k=1}^{p} \gamma_{jk} \mathbf{w}^H_k$ for all $j > p$.

APPLICATION TO SOME EXAMPLES OF SLFN-1 NETWORK

In this section, we show the application of the preimage analysis to the following two kinds of

SLFN-1 network solutions of the 3-bit parity problem, in which the target output is 1 if the

input vector contains an odd number of -1s and -1 otherwise: (1) the SLFN network solution

with seven effective hidden nodes shown in Table 2, constructed by Huang & Babri (1998), and

(2) the SLFN network solution with two hidden nodes shown in Table 3.

Table 2. An SLFN network solution of the 3-bit parity problem constructed by Huang & Babri (1998), in which $w^O_0$ = 0.0 and $\mathbf{w}$ = (0.4, 0.5, 0.7)$^T$.

j | $w^O_j$      | $w^H_{j0}$   | $\mathbf{w}^H_j$
1 | -239.9515868 | 0.651132681  | 0.0 $\mathbf{w}$
2 | 65.96703854  | 0.283261412  | -0.459839086 $\mathbf{w}$
3 | 15.8645681   | -0.452481126 | -1.839356344 $\mathbf{w}$
4 | -369.5491494 | 0.467197046  | -0.919678172 $\mathbf{w}$
5 | 465.7997072  | 0.835068315  | -0.919678172 $\mathbf{w}$
6 | -45.18900519 | 1.202939584  | -0.919678172 $\mathbf{w}$
7 | -110.3929208 | 2.122617756  | -1.839356344 $\mathbf{w}$
8 | 128.6377379  | 1.386875218  | -0.459839086 $\mathbf{w}$

Table 3. An SLFN network solution of the 3-bit parity problem, in which $w^O_0$ = 0.0 and $\mathbf{w}$ = (1, 1, 1)$^T$.

j | $w^O_j$     | $w^H_{j0}$ | $\mathbf{w}^H_j$
1 | 18.58899737 | 0.0        | 0.4 $\mathbf{w}$
2 | -30.7174688 | 0.0        | 0.2 $\mathbf{w}$

For the SLFN-1 network with $\mu \equiv 0.4x_1 + 0.5x_2 + 0.7x_3$ shown in Table 2, the preimage analysis states that its non-void set equals $\{\mathbf{a} \in (-1, 1)^8 \mid a_1 = \tanh(0.651132681)$, $a_2 = \tanh(0.283261412 - 0.459839086\mu)$, $a_3 = \tanh(-0.452481126 - 1.839356344\mu)$, $a_4 = \tanh(0.467197046 - 0.919678172\mu)$, $a_5 = \tanh(0.835068315 - 0.919678172\mu)$, $a_6 = \tanh(1.202939584 - 0.919678172\mu)$, $a_7 = \tanh(2.122617756 - 1.839356344\mu)$, $a_8 = \tanh(1.386875218 - 0.459839086\mu)$, $\mu \in \Re\}$, which is a 1-manifold in $(-1, 1)^8$.

$A(y)$ equals the set of points $\mathbf{a} \in (-1, 1)^8$ that satisfy the same eight conditions on $a_1, \ldots, a_8$ for some $\mu \in \Re$ together with $65.96703854a_2 + 15.8645681a_3 - 369.5491494a_4 + 465.7997072a_5 - 45.18900519a_6 - 110.3929208a_7 + 128.6377379a_8 = y + 239.9515868\tanh(0.651132681)$, which may consist of one or several 1-manifold segments in $(-1, 1)^8$.

$f^{-1}(y)$ equals $\{\mathbf{x} \in \Re^3 \mid 0.4x_1 + 0.5x_2 + 0.7x_3 = \mu$, $65.96703854\tanh(0.283261412 - 0.459839086\mu) + 15.8645681\tanh(-0.452481126 - 1.839356344\mu) - 369.5491494\tanh(0.467197046 - 0.919678172\mu) + 465.7997072\tanh(0.835068315 - 0.919678172\mu) - 45.18900519\tanh(1.202939584 - 0.919678172\mu) - 110.3929208\tanh(2.122617756 - 1.839356344\mu) + 128.6377379\tanh(1.386875218 - 0.459839086\mu) = y + 239.9515868\tanh(0.651132681)$, $\mu \in \Re\}$, which may consist of one or several 2-manifold segments in $\Re^3$.

For the SLFN-1 network with $\mu \equiv x_1 + x_2 + x_3$ shown in Table 3, the preimage analysis states that its non-void set equals $\{\mathbf{a} \in (-1, 1)^2 \mid a_1 = \tanh(0.4\mu)$, $a_2 = \tanh(0.2\mu)$, $\mu \in \Re\}$, which is a 1-manifold in $(-1, 1)^2$; $A(y)$ equals $\{\mathbf{a} \in (-1, 1)^2 \mid a_1 = \tanh(0.4\mu)$, $a_2 = \tanh(0.2\mu)$, $18.58899737a_1 - 30.7174688a_2 = y$, $\mu \in \Re\}$, which may consist of one or several 1-manifold segments in $(-1, 1)^2$; and $f^{-1}(y)$ equals $\{\mathbf{x} \in \Re^3 \mid x_1 + x_2 + x_3 = \mu$, $18.58899737\tanh(0.4\mu) - 30.7174688\tanh(0.2\mu) = y$, $\mu \in \Re\}$, which may consist of one or several 2-manifold segments in $\Re^3$.
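For the Table 3 network, the preimage of a given output $y$ is a union of planes $x_1 + x_2 + x_3 = \mu^*$, one for each solution $\mu^*$ of $18.58899737\tanh(0.4\mu) - 30.7174688\tanh(0.2\mu) = y$. The following Python sketch locates those solutions numerically. It is an illustration only: the function names, the search interval [-10, 10], the grid size and the bisection tolerance are assumptions of this sketch and do not come from the report.

```python
import math

def g(mu):
    """Output of the Table 3 network as a function of mu = x1 + x2 + x3."""
    return 18.58899737 * math.tanh(0.4 * mu) - 30.7174688 * math.tanh(0.2 * mu)

def preimage_levels(y, lo=-10.0, hi=10.0, steps=2000, tol=1e-10):
    """Values of mu in [lo, hi) with g(mu) = y, found by a sign-change scan
    over a uniform grid followed by bisection inside each bracketing cell."""
    roots = []
    width = (hi - lo) / steps
    for k in range(steps):
        left = lo + k * width
        right = lo + (k + 1) * width
        f_left = g(left) - y
        f_right = g(right) - y
        if f_left == 0.0:
            roots.append(left)  # a grid point that is itself a solution
            continue
        if f_left * f_right < 0.0:
            # f_left keeps the sign of the current left end throughout.
            while right - left > tol:
                mid = 0.5 * (left + right)
                if (g(mid) - y) * f_left <= 0.0:
                    right = mid
                else:
                    left = mid
            roots.append(0.5 * (left + right))
    return roots

# Each returned level mu* corresponds to one plane x1 + x2 + x3 = mu*.
for level in preimage_levels(0.5):
    print(round(level, 6), round(g(level), 6))
```

With $y = 0.5$ the scan returns three levels, so the preimage consists of three disjoint parallel planes; with $y = 2$ it returns a single level. This matches the statement that $f^{-1}(y)$ may consist of one or several 2-manifold segments.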

Fig. 1 shows the relationship between the value of µ and the output value y, regarding the SLFN-1 networks shown in Table 2 and Table 3. The relationship between the preimage f -1(y)

and the output value y of these two SLFN-1 networks can be observed from Fig. 1. The y-µ graph also indicates the generalization of these two SLFN-1 networks.

Figure 1: The relationship between the value of µ and the output value y, regarding the SLFN-1 networks shown in Table 2 and Table 3.

IMPLICATIONS OF THE FEATURE-DISCOVERY APPLICATION OF THE SLFN-1 NETWORKS

In general, for any SLFN-1 network with $\mu \equiv \mathbf{w}^T\mathbf{x}$ and $\mathbf{w}^H_j \equiv \alpha_j \mathbf{w}$ for all $j$, the preimage analysis states that its non-void set equals $\{\mathbf{a} \in (-1, 1)^J \mid a_j = \tanh(\alpha_j \mu + w^H_{j0})\ \forall j$, $\mu \in \Re\}$, which is a 1-manifold in $(-1, 1)^J$; $A(y)$ equals $\{\mathbf{a} \in (-1, 1)^J \mid \sum_{j=1}^{J} w^O_j \tanh(\alpha_j \mu + w^H_{j0}) = y - w^O_0$, $a_j = \tanh(\alpha_j \mu + w^H_{j0})\ \forall j$, $\mu \in \Re\}$, which may consist of one or several 1-manifold segments in $(-1, 1)^J$; and $f^{-1}(y)$ equals $\{\mathbf{x} \in \Re^I \mid \mathbf{w}^T\mathbf{x} = \mu$, $\sum_{j=1}^{J} w^O_j \tanh(\alpha_j \mu + w^H_{j0}) = y - w^O_0$, $a_j = \tanh(\alpha_j \mu + w^H_{j0})\ \forall j$, $\mu \in \Re\}$, which may consist of one or several $(I-1)$-manifold segments in $\Re^I$. These establish the following

orientation of the activation hyperplane in the input space corresponding to the jth hidden node.

Thus, we have Lemma 6.

Lemma 5: For SLFN-1, the preimage field is formed from a collection of preimage hyperplanes.

Lemma 6: For SLFN-1, the activation hyperplanes in the input space corresponding to all

hidden nodes are parallel, and the preimage hyperplane is parallel with the activation

hyperplane.

Outcomes of the preimage analysis lead to an understanding of the SLFN-1 itself and further provide the following four insights about the usage of the network and about the patterns embedded in the (training) data. First, SLFN-1 networks possess the hyperplane-preimage property, which characterizes their generalization. Therefore, the SLFN-1 should be used in experiments in which a hyperplane-preimage relationship is desired.

Second, adopting the SLFN-1 architecture at the learning stage already imposes the hyperplane-preimage assumption and inserts this feature into the network. Third, when one obtains an SLFN-1 from training, one can infer that the empirical data bear the hyperplane-preimage relationship. Fourth, the hyperplane-preimage relationship states that the observed variable of interest is a function of a certain factor obtained from some linear combination of the explanatory variables. With such an insight, the practitioner may adopt a common regression method or another suitable tool for data analysis after properly transforming the explanatory

application problem.

THE BOND PRICING EXPERIMENT

In this section, we illustrate the feature-discovery process of a practitioner who knows the bond-pricing mechanism well (but less than perfectly). Because bond pricing has been well studied in the literature, the purpose of this experiment is to illustrate the belief justification process, not to discover additional features of bond pricing.

Before conducting the experiment, the practitioner holds some personal beliefs and propositions about the experiment. Based upon the propositions, he first selects relevant explanatory and observed variables and collects the sample accordingly. Below are the details of his experimental design.

Let the theoretical bond price $p_c$ at time $c$ be derived from (11), which serves as an example of knowledge regarding the data.

$p_c \equiv \sum_{k=1}^{T_0} \frac{FR}{(1 + r_c)^{k - c}} + \frac{F}{(1 + r_c)^{T_0 - c}}$, (11)

where $r_c$ is the market rate of interest at time $c$; $F$ = 100 is the face value of the bond; $T_0$ is the term to maturity at time $c$ = 0; $R$ is the coupon rate; and $FR$ is the periodic coupon payment.

Then garbled bond prices yc are generated and used to simulate the set of data that may be

and variance (0.2)2, for all time c and bonds.

As depicted in Table 4, there are 18 hypothetical combinations of term to maturity and contractual interest rate; for each combination, a set of price data is generated through (11) with $c$ = 1/80, 2/80, …, 80/80. The rate $r_c$ is drawn from a normal random number generator N(2%, (0.1%)$^2$). Accordingly, there are 1,440 training samples with input variables $T_c$, $R$ and $r_c$, and the desired output variable $y_c$, where $T_c \equiv (T_0 - c)$ is the term to maturity at time $c$.
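The data generation just described can be sketched in Python as follows. Several points are assumptions of this sketch rather than statements of the report: Eq. (11) is read as the present value of coupons $FR$ at $k = 1, \ldots, T_0$ plus the face value at $T_0$, each discounted by $(1 + r_c)$ raised to the remaining time; the price noise is taken to be zero-mean normal with standard deviation 0.2; and the 18 combinations of $T_0$ and $R$ are hypothetical placeholders, since Table 4 is not reproduced here. All function and variable names are illustrative.

```python
import random

FACE_VALUE = 100.0

def bond_price(T0, c, R, r, F=FACE_VALUE):
    """Assumed reading of Eq. (11): discounted coupons plus discounted face value."""
    coupons = sum(F * R / (1.0 + r) ** (k - c) for k in range(1, T0 + 1))
    return coupons + F / (1.0 + r) ** (T0 - c)

def make_training_samples(combos, seed=0, noise_sd=0.2):
    """Build ((T_c, R, r_c), y_c) pairs for c = 1/80, ..., 80/80 per combination."""
    rng = random.Random(seed)
    samples = []
    for T0, R in combos:
        for step in range(1, 81):
            c = step / 80.0
            r = rng.gauss(0.02, 0.001)        # r_c ~ N(2%, (0.1%)^2)
            noise = rng.gauss(0.0, noise_sd)  # assumed zero-mean price noise
            samples.append(((T0 - c, R, r), bond_price(T0, c, R, r) + noise))
    return samples

# Hypothetical stand-in for the 18 combinations of Table 4.
combos = [(T0, R) for T0 in (2, 4, 6, 8, 10, 12) for R in (0.01, 0.02, 0.03)]
samples = make_training_samples(combos)
print(len(samples))  # 1440
```

Under this reading, a bond whose coupon rate equals the market rate is priced at its face value at $c = 0$, a higher coupon rate gives a premium bond, and a lower one gives a discount bond.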

To examine the generalization of the trained networks, the practitioner also generates 1,440 test samples by similar means, except that $T_0$, $c$, $R$ and $r_c$ are randomly and independently generated from {1, 2, …, 20} with a probability of 1/20 for each, {1/80, 2/80, …, 80/80} with a probability of 1/80 for each, [0.0%, 3.0%] with a probability density function $f(R)$ = 1/0.03, and N(2%, (0.1%)$^2$), respectively. This setting results in varying instances among the test samples.

The Back Propagation learning algorithm of Rumelhart et al. (1986) is used to train 1,000

SLFNs, each of which has 4 hidden nodes and random initial weights and biases. Among these

1,000 SLFNs, the practitioner picks the three with the smallest mean square error (hereafter,

MSE) for the test samples. Table 5 shows the (final) weights and biases of these three networks,

hereafter named network I, II and III, respectively. The corresponding MSEs for the training

samples are 0.414, 0.404 and 0.451, respectively, and the corresponding MSEs for the test

samples are 0.429, 0.432 and 0.445, respectively. The average absolute deviation is

rank($\mathbf{W}^H$) of all networks I, II and III are 3.

Take network I to illustrate the result of applying the preimage analysis to these three networks. $\Phi_O^{-1}(y) = \{\mathbf{a} \mid 15.1206a_1 - 34.366a_2 + 5.6589a_3 - 21.9999a_4 = y - 100.4744\}$, which is in the form of a linear equation. Thus, for each non-void value $y$, $\Phi_O^{-1}(y)$ is a hyperplane in $(-1, 1)^4$. Now

$\mathbf{W}^H = \begin{pmatrix} -0.0347 & -32.7223 & 18.8396 \\ 0.0544 & -36.8286 & 28.9511 \\ 0.0988 & 43.3354 & 16.8267 \\ -0.0643 & -36.4188 & 53.6646 \end{pmatrix}$ (12)

and $\boldsymbol{\omega}(\mathbf{a}) = (\tanh^{-1}(a_1) + 0.1689,\ \tanh^{-1}(a_2) + 1.3535,\ \tanh^{-1}(a_3) + 2.1615,\ \tanh^{-1}(a_4) - 1.1698)^T$. Thus the $\mathbf{a}$ vector satisfying the requirement of (13) corresponds to a non-void point; otherwise, a void point. Moreover, for each non-void $\mathbf{a}$, the system of simultaneous linear equations $\mathbf{W}^H\mathbf{x} = \boldsymbol{\omega}(\mathbf{a})$ defines a point in the input space.

$\tanh^{-1}(a_4) = 2.646686748 + 3.248238694\tanh^{-1}(a_1) - 0.801390022\tanh^{-1}(a_2) + 0.931270242\tanh^{-1}(a_3)$. (13)
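Condition (13) expresses that the fourth row of $\mathbf{W}^H$ is a linear combination of the first three, which is why rank($\mathbf{W}^H$) is 3 and the non-void set is a 3-manifold. The Python sketch below checks this numerically. It rests on an assumption that should be read carefully: the signs of the entries of $\mathbf{W}^H$ used here are the assignment under which the fourth row matches the coefficients of (13), with columns ordered as ($T_c$, $R$, $r_c$); the hidden biases are read off from $\boldsymbol{\omega}(\mathbf{a})$. The variable and function names are illustrative.

```python
import math

# Hidden-layer weights of network I, columns (T_c, R, r_c).
# Signs are an assumption chosen to be consistent with Eq. (13).
W_H = [
    [-0.0347, -32.7223, 18.8396],
    [0.0544, -36.8286, 28.9511],
    [0.0988, 43.3354, 16.8267],
    [-0.0643, -36.4188, 53.6646],
]
# Hidden biases w^H_j0, read off from omega_j(a_j) = atanh(a_j) - w^H_j0.
b_H = [-0.1689, -1.3535, -2.1615, 1.1698]
# Coefficients of atanh(a_1), atanh(a_2), atanh(a_3) in Eq. (13).
coef = [3.248238694, -0.801390022, 0.931270242]

# Row 4 should equal the coef-weighted sum of rows 1 to 3.
combo = [sum(c * W_H[j][i] for j, c in enumerate(coef)) for i in range(3)]
residual = max(abs(combo[i] - W_H[3][i]) for i in range(3))

# The same dependence applied to omega gives the constant term of Eq. (13).
constant = b_H[3] - sum(c * b_H[j] for j, c in enumerate(coef))

def a4_on_non_void_set(a1, a2, a3):
    """a_4 implied by Eq. (13) for given a_1, a_2, a_3 in (-1, 1)."""
    s = constant + sum(c * math.atanh(a) for c, a in zip(coef, (a1, a2, a3)))
    return math.tanh(s)

print(residual)
print(constant)
```

The residual is small relative to the four-decimal rounding of the weights, and the computed constant agrees with the 2.646686748 of (13) to the same precision, so any choice of $(a_1, a_2, a_3)$ determines the single $a_4$ that keeps the point non-void.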

Thus, the non-void set equals $\{\mathbf{a} \mid \tanh^{-1}(a_4) = 2.646686748 + 3.248238694\tanh^{-1}(a_1) - 0.801390022\tanh^{-1}(a_2) + 0.931270242\tanh^{-1}(a_3)\}$, which is a 3-manifold in $(-1, 1)^4$. $A(y)$ equals $\{\mathbf{a} \mid 15.1206a_1 - 34.366a_2 + 5.6589a_3 - 21.9999a_4 = y - 100.4744$, $\tanh^{-1}(a_4) = 2.646686748 + 3.248238694\tanh^{-1}(a_1) - 0.801390022\tanh^{-1}(a_2) + 0.931270242\tanh^{-1}(a_3)\}$,

$\tanh^{-1}(a_2) - 2.216810863\tanh^{-1}(a_3)$, $R = 0.012291155 + 0.01700793\tanh^{-1}(a_1) - 0.02138228\tanh^{-1}(a_2) + 0.017746672\tanh^{-1}(a_3)$, $r_c = 0.046189981 + 0.052432722\tanh^{-1}(a_1) - 0.015121128\tanh^{-1}(a_2) + 0.026740939\tanh^{-1}(a_3)$, $5.6589a_3 - 21.9999\tanh(2.646686748 + 3.248238694\tanh^{-1}(a_1) - 0.801390022\tanh^{-1}(a_2) + 0.931270242\tanh^{-1}(a_3)) = y - 100.4744 - 15.1206a_1 + 34.366a_2$, $-1 < a_1 < 1$, $-1 < a_2 < 1$, $-1 < a_3 < 1\}$, which may consist of one or several 2-manifold segments in $\Re^3$.

As shown in Fig. 2, the preimage $f^{-1}$ is a complex 2-manifold. Guided by his beliefs, the practitioner inspects the relevant preimage-related properties. Take the following three beliefs as illustrations. First, the practitioner knows that the type of a bond, premium or discount, can be determined by comparing the market interest rate with the contractual coupon rate. Specifically, if the coupon rate is greater than the market interest rate, then the bond is priced at a premium; otherwise, at a discount. This belief leads to the insight that the preimage of each reliable network should be parallel to the plane $r_c = R$. As shown in Fig. 3, the preimages of all three networks show the tendency predicted by this insight. Thus, he gives this belief a high credibility.

Second, the practitioner understands that a bond with a greater coupon rate than another should be priced higher at a given interest rate. From the preimage of each network in Fig. 3, he observes a positive relationship between coupon rate and interest rate. Namely, the

a high credibility and conjectures further that a high interest rate results in a low bond price.

Third, from Fig. 4, the practitioner observes different curvatures for the premium bonds and the discount bonds in networks I and II, but not in network III. Namely, in the $r_c$ and $T_c$ coordinates, the preimages for premium bonds appear to be concave and those for discount bonds appear to be convex for networks I and II. Thus, the practitioner gives low credibility to the insight that, when the bond price is held constant, the rate of increase (respectively decrease) in the interest rate of a premium (respectively discount) bond increases as the maturity of the bond gets shorter.

IMPLICATIONS AND FUTURE WORK

The bond pricing experiment shows that the practitioner needs domain knowledge to set up challenging propositions and collect the sample for the network’s training, as well as SLFN knowledge to acquire reliable networks for feature discovery. Moreover, complex preimage-related properties require the practitioner to conduct the belief justification process. For SLFN-p with p ≤ 3, the inspection of preimage-related properties could be conducted through the y-µ graph like Fig. 1 or the preimage graph in the input space like Fig. 2. For SLFN-p with p > 4, however, the inspection of preimage-related properties relies on certain skills and experiences of nonlinear

Other possible future avenues of further enquiry may be the application of the proposed

preimage analysis to real world data and the externalization of belief into explicit knowledge

through SLFNs.

ACKNOWLEDGMENTS

This study is supported by the National Science Council of the R.O.C. under Grants NSC

92-2416-H-004-004, NSC 93-2416-H-004-015, and NSC 43028F.

REFERENCES

Andrews, R., Diederich, J., & Tickle, A. B. (1995). Survey and critique of techniques for

extracting rules from trained artificial neural networks. Knowledge-Based Systems, 8(6),

373-389.

Arslanov, M. Z., Ashigaliev, D. U., & Ismail, E. E. (2002). N-bit parity ordered neural

networks. Neurocomputing, 48, 1053-1056.

Baesens, B., Setiono, R., Mues, C., & Vanthienen, J. (2003). Using neural network rule

extraction and decision tables for credit-risk evaluation. Management Science, 49(3), 312-329.

Hohil, M. E., Liu, D. R., & Smith, S. H. (1999). Solving the N-bit parity problem using neural

networks. Neural Networks, 12(11), 1321-1323.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation.

Huang, G., & Babri, H. (1998). Upper bounds on the number of hidden neurons in feedforward

networks with arbitrary bounded nonlinear activation functions. IEEE Transactions on Neural

Networks, 9, 224-229.

Iyoda, E. M., Nobuhara, H., & Hirota, K. (2003). A solution for the N-bit parity problem using

a single translated multiplicative neuron. Neural Processing Letters, 18(3), 213-218.

Lavretsky, E. (2000). On the exact solution of the Parity-N problem using ordered neural

networks. Neural Networks, 13(8), 643-649.

Liu, D. R., Hohil, M. E., & Smith, S. H. (2002). N-bit parity neural networks: new solutions

based on linear programming. Neurocomputing, 48, 477-488.

Munkres, J. (1975). Topology: a first course. Englewood Cliffs, NJ: Prentice-Hall.

Murty, K. (1983). Linear Programming. New York, NY: John Wiley & Sons.

Nonaka, I., & Takeuchi, H. (1995). The knowledge-creating company. Oxford: Oxford

University Press.

Rabuñal, J., Dorado, J., Pazos, A., Pereira, J., & Rivero, D. (2004). A new approach to the

extraction of ANN rules and to their generalization capacity through GP. Neural Computation,

16, 1483-1523.

Rumelhart, D. E., & McClelland, J. L. (1986). Parallel distributed processing: explorations in

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representation by

error propagation. In D. E. Rumelhart, and J. L. McClelland (Eds.), Parallel distributed

processing: explorations in the microstructure of cognition, vol. 1: foundation. Cambridge, MA:

MIT Press, 318-362.

Saito, K., & Nakano, R. (2002). Extracting regression rules from neural networks. Neural

Networks, 15(10), 1279-1288.

Setiono, R., Leow, W. K., & Zurada, J. M. (2002). Extraction of rules from artificial neural

networks for nonlinear regression. IEEE Transactions on Neural Networks, 13(3), 564-577.

Setiono, R., & Liu, H. (1997). NeuroLinear: From neural networks to oblique decision rules.

Neurocomputing, 17(1), 1-24.

Setiono, R. (1997). On the solution of the parity problem by a single hidden layer feedforward

neural network. Neurocomputing, 16, 225-235.

Sgroi, D., & Zizzo, D. (2007). Neural Networks and bounded rationality. Physica A, 375,

717-725.

Taha, I. A., & Ghosh, J. (1999). Symbolic interpretation of artificial neural networks. IEEE

Transactions on Knowledge and Data Engineering, 11(3), 448-463.

Tickle, A. B., Andrews, R., Golea, M., & Diederich, J. (1998). The truth will come to light:

directions and challenges extracting the knowledge embedded within trained artificial neural

Tsaih, R., Hsu, Y., & Lai, C. (1998). Forecasting S&P 500 stock index futures with the hybrid

AI system. Decision Support Systems, 23(2), 161-174.

Tsaih, R. (1998). An explanation of reasoning neural networks. Mathematical and Computer

Modelling, 28, 37-44.

Urcid, G., Ritter, G. X., & Iancu, L. (2004). Single layer morphological Perceptron solution to

the N-bit parity problem. Lecture Notes in Computer Science, 3287, 171-178.

Zhou, R. R., Chen, S. F., & Chen, Z. Q. (2000). A statistics based approach for extracting

priority rules from trained neural networks. Proceedings of the IEEE-INNS-ENNS International

Table 4: The 18 hypothetical bonds with different combinations of term to maturity (T0) and contractual interest rate^a (R).

Bond No.   T0   R       Bond No.   T0   R       Bond No.   T0   R
1          2    0.0%    7          2    1.5%    13         2    3.0%
2          4    0.0%    8          4    1.5%    14         4    3.0%
3          7    0.0%    9          7    1.5%    15         7    3.0%
4          10   0.0%    10         10   1.5%    16         10   3.0%
5          15   0.0%    11         15   1.5%    17         15   3.0%
6          20   0.0%    12         20   1.5%    18         20   3.0%

^a

Table 5: Final weights and biases of networks I, II and III, respectively.

Network  w^O_0      j   w^H_j0    w^O_j      w^H_j1    w^H_j2     w^H_j3
I        100.4744   1   -0.1689    15.1206   -0.0347   -32.7223    18.8396
                    2   -1.3535   -34.3660    0.0544   -36.8286    28.9551
                    3   -2.1615     5.6589    0.0988    43.3354    16.8267
                    4    1.1698   -21.9999   -0.0643   -36.4188    53.6646
II       93.6583    1    0.4510   -23.3874   -0.0571   -33.4648    71.8090
                    2    0.8413    36.9871   -0.0467    32.9078   -10.5926
                    3   -1.1572   -10.4621    0.0699    45.5792    43.2855
                    4    1.2874    -9.2684   -0.0458    17.5685   -87.3147
III      104.8248   1    0.7832   -14.0352   -0.0519   -27.1299    49.3836
                    2    1.3108   -16.7297   -0.0571   -27.3748    14.5874
                    3   -1.5287   -30.1819    0.0631   -37.1108    33.3026
                    4   -0.6010    13.1504   -0.0524   -34.9042    36.9149

Figure 2: The preimage graphs of Network I. The numbers within the parentheses are values of y.

Figure 3: Preimage graphs along the rc and R plane for networks I, II, III (from top to bottom),

Figure 4: Preimage graphs along the plane of rc vs. Tc for networks I, II, III (from top to

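Self-evaluation of project results:

The research content conforms closely to the original proposal, and the intended research objectives have been achieved. However, both interesting aspects and difficulties for follow-up research were also found. This report will be revised and then submitted to an academic journal for publication.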

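Report on Attendance at an International Academic Conference

Project number: NSC 97-2410-H-004-017
Project title: Implementation of Rule Extraction Using Preimage Analysis
Traveler's name, affiliation and title: 蔡瑞煌, Professor, Department of Management Information Systems, National Chengchi University

Conference dates and venue: 14-19 June 2009, Atlanta, Georgia

Conference name: 2009 International Joint Conference On Neural Networks (IJCNN2009)

Title of paper presented: Knowledge-Internalization Process for Neural-Network Practitioners

1. Conference participation

I arrived in Atlanta in the early hours of 16/06/2009, attended several plenary talks and many paper presentations at the venue, presented my paper on 16/06/2009, and left Atlanta on the evening of 17/06/2009 to return home. The attachment is the paper I presented.

2. Reflections on the conference

The plenary talks featured many well-known scholars in the neural-networks community, such as John Hopfield, Bernard Widrow, John Taylor, and Walter Freeman, and I benefited considerably from them. After returning home, I will revise the paper I presented and submit it to a journal.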

Knowledge-Internalization Process for Neural-Networks Practitioners

Abstract: This study explores the knowledge-internalization process within which a neural-network practitioner embodies the explicit knowledge obtained from extracting the network’s preimage, the set of input values for a given output value, into his/her tacit knowledge. With a number of well-trained single-hidden layer feed-forward neural networks, the practitioner first extracts the (nonlinear) preimage of each trained network. The practitioner then internalizes the explicit outcomes and the insights obtained from the preimage-extracting process into his/her tacit knowledge bases. We use the experiment of bond-pricing analysis to illustrate the knowledge-internalization process. This study adds to the literature by introducing the knowledge-internalization process. Moreover, in contrast to the data analyses in previous studies, this study uses mathematical analyses to identify networks’ preimages.

I. BLACK-BOX DILEMMA AND KNOWLEDGE ACQUISITION

WHEN practitioners apply Artificial Neural Networks (ANN) to resolving social science issues, there is a dynamic human process of justifying personal belief toward the “truth”. Reference [9] stated that ANNs can be trained (just as human children are taught), ANNs can learn primarily through example (as is often the case with humans), and ANNs can create general pattern-recognizing algorithms, learning general rules of optimal behavior. Since then, varieties of ANN have been developed and applied in many fields as modeling tools to see whether the ANN provides a model of human behavior and approximates likely patterns of human behavior. At the beginning stage, a large number of experiments were conducted to see whether the performance of the trained ANN was acceptable. Most experimental results were positive. For instance, [13] stated that ANNs “are consistent with observed laboratory play in two very important senses. Firstly, they select a rule for behavior which appears very similar to that used by laboratory subjects. Secondly, using this rule they perform optimally only approximately 60% of the time.” (p. 717) Later, the excitement shifted to applying ANN to resolving challenging domain issues. For instance, through extracting rules or features from a well-trained ANN, one tries to identify risk factors for newly issued securities, which have a prohibitively small number of observations. There are several concerns, however, with such applications.

Manuscript received Feb 12, 2009. This work was supported in part by the National Science Council of the R.O.C. under Grants NSC 92-2416-H-004-004, NSC 93-2416-H-004-015, and NSC 43028F.

On the one hand, the practitioner has to cope with the black box¹ image of ANN to obtain a better understanding of relations between the input to ANN and its output. Reading or understanding the knowledge in ANN is difficult because the knowledge is distributed over the entire network and the relation between the input to ANN and its output is multivariate and nonlinear. Nevertheless, there is a huge amount of work that explores various “mechanisms, procedures, and algorithms designed to insert knowledge into ANNs (knowledge initialization), extract rules from trained ANNs (rule extraction), and utilize ANNs to refine existing rule bases (rule refinement).” [1, p. 373] Some, but not exhaustive, recent studies can refer to [1]-[2], [10]-[12], [14]-[15], [17]-[18]. These studies are contrived by the engineering design with data analysis and approximation.

Rule: If the input sample is in some sub-region of the input space, Then the predicted value is given by a corresponding linear regression equation. (1)

For instance, to identify the premise of a single rule stated in (1), most studies use either training data or generated data, which the trained network itself yields. Due to the finite number of (training or generated) data instances, however, such a data analysis covers only finitely many points in the (presumed) region of the rule premise, instead of the entire region. Reference [11] implemented a piecewise linear approximation on each hidden node to divide the input space into sub-regions. In each sub-region, a corresponding linear equation that approximates the network’s output is defined as the consequent of the extracted rule, to ensure that the predicted value can be calculated from a comprehensible multivariate polynomial representation. Reference [3] solved the inversion problem through the back-propagation of a union of polyhedra, which approximates (arbitrarily well) any reasonable set.

On the other hand, instead of a knowledge acquisition process, practitioners conduct a knowledge internalization process within which they embody the explicit outcomes and the insights obtained from the experiment into their tacit knowledge. Knowledge is normally tacit -- highly personal and hard to formalize. Subjective insights, intuitions and hunches are commonly heard in discussions and are sometimes difficult to replicate, as the validation process depends on certain skills and experience. It is not trivial to conduct such knowledge internalization even when some explicit outcome is extracted from ANN. Furthermore, in most social science applications (e.g., financial market tests), the knowledge internalization process needs to cope with the unlikely perfect learning due to the defective design of the architecture of ANN² and the garbled data. In the literature, however, there is no discussion of such a knowledge internalization process.

¹ A black box refers to a system that produces “certain outputs from certain inputs without explaining why or how.” [7, pg. 1483] The black box image for many years has gradually discouraged the study or application of the ANN.

II. THE KNOWLEDGE INTERNALIZATION

This study explores such a knowledge internalization process. Specifically, the ANN used here is the real-valued single-hidden layer feed-forward neural network (hereafter also referred to as SLFN) with one output node. Furthermore, the following assumption is set to help average out noise in estimates from individual SLFNs and to serve as a stabilization measure for the knowledge internalization process: The practitioner should have a number of SLFNs, each of which the practitioner perceives as well-trained, although it does not necessarily provide a globally optimal learning result.

To avoid the criticisms associated with data analysis and approximation, mathematical analysis is adopted here to explicitly specify the (nonlinear) preimage of each SLFN’s mapping and thus the rule (2):

Rule: If (x ∈ the $f^{-1}(y')$ region), then (y = y’), (2)

where x is the vector of explanatory variables; y is the network’s response; f : X → Y is the function of the trained SLFN and y ≡ f(x); y’ is a constant; and the preimage $f^{-1}$ is the inverse function of f. The preimage $f^{-1}(y')$ also represents the collection of inputs of the given output value y’.

The function representation f and preimage f -1 of the obtained SLFN are instances of explicit outcomes that can be “easily communicated and shared in the form of hard data, scientific formula, codified procedures, or universal principles” [6, p. 8]. When the practitioner conducts the experiment without domain expertise, the explicit outcomes and the insights obtained within the extracting process make some statement exclusively. But when there is certain prior belief, the practitioner focuses on the credibility of belief, the extent to which the belief can be generalized within the preimage-extracting process and the subsequent examination process. At the end of knowledge internalization process, the practitioner can have a posterior belief that has “high credibility” if the explicit outcomes and the insights obtained within the extracting process corroborate the belief; and “low credibility” if some corresponding result contradicts and weakens the belief. That is, when the explicit outcomes and the insights obtained within the extracting process are totally consistent with the practitioner’s prior belief, he/she may accord his/her belief a high credibility. Conversely, an inconsistent result triggers the following examinations of SLFNs and belief instead of an immediate rating:

(i) Investigating whether there exist factors or noise leading to a defective design of the SLFN such that none of the well-trained SLFNs is suitable for the purpose of rating the credibility of belief.

(ii) Examining whether some of the obtained SLFNs are optimal to the extent that they are suitable for the purpose of rating the credibility of belief.

(iii) Consolidating the explicit outcomes and the obtained insights amongst all reliable SLFNs.

(iv) Contrasting the belief with the consolidated outcome of reliable SLFNs.

Only when the practitioner feels certain that he/she can eliminate the first possibility, should he/she rate the credibility of belief. Furthermore, the practitioner would conservatively follow the explicit outcomes and the obtained insights.

Section III uses the experiment of bond-pricing analysis, which in literature is a nonlinear regression problem with continuous variables, to illustrate the knowledge internalization process. At the end, conclusion and future work are offered.

III. THE BOND PRICING EXPERIMENT

In order to simulate the set of data that may be observed by a representative practitioner, who knows the bond-pricing mechanism well (but less than perfectly), garbled training samples of bond price $y_t = p_t + \varepsilon_t$ are generated and used. Here $p_t$ is the theoretic value of the bond at time t and is derived from (3), which serves as an example of complete domain knowledge with respect to the bond pricing model, and $\varepsilon_t$ is a white error term provided by a normal random number generator of $N(0, (0.2)^2)$. Namely, $y_t$ is perturbed by white noise.

$$p_t \equiv \sum_{k=1}^{T_0} \frac{C}{(1+r_t)^{k-t}} + \frac{F}{(1+r_t)^{T_0-t}} \qquad (3)$$

According to (3), $p_t$ is determined by (i) $r_t$, the market rate of interest at time t; (ii) F, the face value of the bond, which generally equals 100; (iii) $T_0$, the term to maturity at time t = 0; and (iv) C, the periodic coupon payment, which equals $F \times r_c$. As depicted in Table 1, we use 18 hypothetical combinations of term to maturity and contractual interest rate and generate a set of price data with t = 1/80, 2/80, …, 80/80 through (3). The rate $r_t$ is derived from a normal random number generator of $N(2\%, (0.1\%)^2)$. Accordingly, we have 1,440 training samples with input variables $T_t$, $r_c$ and $r_t$, and the desired output variable $y_t$, where $T_t \equiv (T_0 - t)$ is the term to maturity at time t.
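As an illustration only, the sampling scheme described above can be sketched in Python. The sketch assumes that the 18 combinations are the six terms to maturity (2, 4, 7, 10, 15, 20) crossed with the three contractual rates (0.0%, 1.5%, 3.0%) listed in the table of hypothetical bonds, and that a fresh $r_t$ and $\varepsilon_t$ are drawn for every sample; the text does not say whether $r_t$ is shared across bonds at a given t. The names `bond_price` and `make_training_samples` are hypothetical.

```python
import numpy as np

FACE = 100.0  # face value F, as stated in the text


def bond_price(r_t, T0, r_c, t, face=FACE):
    """Theoretic price p_t from (3): discounted coupons plus discounted face value."""
    k = np.arange(1, T0 + 1)          # coupon dates k = 1, ..., T0
    coupon = face * r_c               # C = F x r_c
    coupons = np.sum(coupon / (1.0 + r_t) ** (k - t))
    return coupons + face / (1.0 + r_t) ** (T0 - t)


def make_training_samples(seed=0):
    """Return an array with columns (T_t, r_c, r_t, y_t), one row per sample."""
    rng = np.random.default_rng(seed)
    terms = [2, 4, 7, 10, 15, 20]         # assumed T0 values (hypothetical bonds)
    coupon_rates = [0.0, 0.015, 0.03]     # assumed contractual rates
    rows = []
    for r_c in coupon_rates:
        for T0 in terms:
            for step in range(1, 81):     # t = 1/80, 2/80, ..., 80/80
                t = step / 80.0
                r_t = rng.normal(0.02, 0.001)   # N(2%, (0.1%)^2), drawn per sample (assumption)
                eps = rng.normal(0.0, 0.2)      # N(0, (0.2)^2) white noise
                rows.append((T0 - t, r_c, r_t, bond_price(r_t, T0, r_c, t) + eps))
    return np.array(rows)


samples = make_training_samples()
print(samples.shape)  # 18 bonds x 80 time points
```

Drawing the noise inside the loop keeps each observed price $y_t$ independently perturbed, which is one reading of the "white error term" in the text.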

To examine the generalization of trained networks, we also generate 1,440 test samples by similar means, except that $T_0$, t, $r_c$ and $r_t$ are randomly and independently generated from {1, 2, …, 20} with a probability of 1/20 for each, {1/80, 2/80, …, 80/80} with a probability of 1/80 for each, [0.0%, 3.0%] with a probability density function $f(r_c) = 1/0.03$, and $N(2\%, (0.1\%)^2)$, respectively. This setting results in varying instances among the test samples.

We, as the representative practitioner, adopt the Back Propagation learning algorithm [8] to train 1,000 SLFNs, each of which has 4 hidden nodes and random initial weights and biases. Among the 1,000 SLFNs, we pick the three with the smallest mean square error (hereafter, MSE) for the test

respectively. The corresponding MSEs for the training samples are 0.414, 0.404 and 0.451, respectively; and the corresponding MSEs for the test samples are 0.429, 0.432 and 0.445, respectively. The average absolute deviation is approximately 0.6, which deviates from the specified error term standard deviation of 0.2. The pricing error is unrelated to the theoretic prices $p_t$ but related to the observed prices $y_t$.

By definition, for each SLFN, the hth activation value is $a_h = \tanh\bigl(b_h^H + \sum_{i=1}^{3} w_{hi}^H x_i\bigr)$, h = 1, ..., 4; the output is $y = b^o + \sum_{h=1}^{4} w_h^o a_h$; and the function f is

$$f(x) = b^o + \sum_{h=1}^{4} w_h^o \tanh\Bigl(b_h^H + \sum_{i=1}^{3} w_{hi}^H x_i\Bigr).$$

Hereafter, let $(\cdot)^T$ be the transpose of $(\cdot)$ when $(\cdot)$ is a vector or a matrix. Furthermore,

$w_h^o$ ≡ the weight of the hth activation value for the output, where the superscript o throughout the paper indicates quantities related to the output layer;

$b^o$ ≡ the bias of the output node;

$w_{hi}^H$ ≡ the weight of the ith input for the hth hidden node, where the superscript H throughout the paper indicates quantities related to the hidden layer;

$w_{h\cdot}^H \equiv (w_{h1}^H, w_{h2}^H, w_{h3}^H)^T$, the 3x1 vector of weights between the hth hidden node and the input layer;

$W^H \equiv (w_{1\cdot}^H, w_{2\cdot}^H, w_{3\cdot}^H, w_{4\cdot}^H)^T$, the 4x3 matrix of weights between the hidden nodes and the input layer; and

$b_h^H$ ≡ the bias of the hth hidden node.
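To make the definitions concrete, the following minimal sketch evaluates f for Network I. It assumes the weights-and-biases table is read in the column order (hidden bias, output weight, three input weights) and that the inputs are ordered $x = (T_t, r_c, r_t)$; both are readings of the text rather than statements in it. The names `activations` and `slfn` are hypothetical.

```python
import numpy as np

# Network I parameters as read from the weights-and-biases table
# (assumed column order: hidden bias, output weight, three input weights).
B_O = 100.4744                                             # b^o
W_O = np.array([15.1206, -34.3660, 5.6589, -21.9999])      # w_h^o
B_H = np.array([-0.1689, -1.3535, -2.1615, 1.1698])        # b_h^H
W_H = np.array([                                           # W^H (4x3)
    [-0.0347, -32.7223, 18.8396],
    [0.0544, -36.8286, 28.9551],
    [0.0988, 43.3354, 16.8267],
    [-0.0643, -36.4188, 53.6646],
])


def activations(x):
    """a_h = tanh(b_h^H + sum_i w_hi^H x_i) for h = 1..4."""
    return np.tanh(B_H + W_H @ np.asarray(x, dtype=float))


def slfn(x):
    """f(x) = b^o + sum_h w_h^o a_h."""
    return B_O + W_O @ activations(x)


# Example input, assumed order (T_t, r_c, r_t): 10 years, 1.5% coupon, 2% market rate.
x = (10.0, 0.015, 0.02)
print(activations(x), slfn(x))
```

Because every $a_h$ lies in (-1, 1), any output of this function differs from $b^o$ by less than $\sum_h |w_h^o|$.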

For ease of reference in later discussions, we also call $\Re^3$ the input space and $(-1, 1)^4$ the activation space. For each SLFN, $f^{-1}(y) = \Phi_{\tanh}^{-1} \circ \Phi_o^{-1}(y)$, with

$$\Phi_o^{-1}(y) \equiv \Bigl\{a \in (-1, 1)^4 \Bigm| \sum_{h=1}^{4} w_h^o a_h = y - b^o\Bigr\}, \qquad (4)$$

$$\Phi_{\tanh}^{-1}(a) \equiv \bigcap_{h=1}^{4} \bigl\{x \in \Re^3 \bigm| x^T w_{h\cdot}^H = \tanh^{-1}(a_h) - b_h^H\bigr\}, \qquad (5)$$

where Ω is a subset of ℜ and $\tanh^{-1}(x) \equiv 0.5 \ln\bigl(\frac{1+x}{1-x}\bigr)$. Formally, the following are defined for every SLFN:

(i) A value y ∈ ℜ is void if $y \notin f(\Re^3)$, i.e., for all $x \in \Re^3$, f(x) ≠ y. Otherwise, y is non-void.

(ii) A point $a \in (-1, 1)^4$ is void if $a \notin \Phi_{\tanh}(\Re^3)$, i.e., for all $x \in \Re^3$, $\Phi_{\tanh}(x) \neq a$. Otherwise, a is non-void. The set of all non-void a’s in the activation space is named the non-void set.

(iii) The image of an input $x \in \Re^3$ is y ≡ f(x) for y ∈ Ω.

(iv) The preimage of a non-void output value y is the set $f^{-1}(y) \equiv \{x \in \Re^3 \mid f(x) = y\}$. The preimage of a void value y is the empty set.

(v) The internal-preimage of a non-void output value y is the set $\{a \in (-1, 1)^4 \mid \Phi_o(a) = y\}$ on the activation space.

Given the weights and biases of each SLFN, the preimage-extracting phase conducts the following steps, where rank(D) is the rank of the matrix D, $[D_1\ D_2]$ is the augmented matrix of two matrices $D_1$ and $D_2$ (with the same number of rows), and $\omega(a)$ is the vector with components $\tanh^{-1}(a_h) - b_h^H$:

Step 1: Derive the expression of $\Phi_o^{-1}(y)$;

Step 2: Derive the expression of the non-void set, which is defined as $\{a \mid \mathrm{rank}([W^H\ \omega(a)]) = \mathrm{rank}(W^H)\}$;

Step 3: Derive the expression of A(y), which is defined as $\{a \mid a \in \Phi_o^{-1}(y) \text{ and } \mathrm{rank}([W^H\ \omega(a)]) = \mathrm{rank}(W^H)\}$; and

Step 4: Derive the expression of $f^{-1}(y)$, which is defined as $\{x \mid W^H x = \omega(a) \text{ for some } a \in A(y)\}$, i.e., the union of the solution sets over all a ∈ A(y).
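Steps 1 to 4 derive the preimage analytically. As a purely numerical cross-check, and not the procedure of this paper, individual points of $f^{-1}(y)$ can be traced by fixing two inputs and solving f(x) = y for the third by bisection. The sketch below does this for Network I, assuming the table column order (hidden bias, output weight, three input weights), the input order $(T_t, r_c, r_t)$, and a search bracket of [0, 0.05] for $r_t$; the function names are hypothetical.

```python
import numpy as np

# Network I parameters (assumed reading of the weights-and-biases table).
B_O = 100.4744
W_O = np.array([15.1206, -34.3660, 5.6589, -21.9999])
B_H = np.array([-0.1689, -1.3535, -2.1615, 1.1698])
W_H = np.array([
    [-0.0347, -32.7223, 18.8396],
    [0.0544, -36.8286, 28.9551],
    [0.0988, 43.3354, 16.8267],
    [-0.0643, -36.4188, 53.6646],
])


def slfn(x):
    """f(x) = b^o + sum_h w_h^o tanh(b_h^H + w_h.^H x)."""
    return B_O + W_O @ np.tanh(B_H + W_H @ np.asarray(x, dtype=float))


def solve_rate(y_target, T_t, r_c, lo=0.0, hi=0.05, tol=1e-10):
    """Find r_t in [lo, hi] with f(T_t, r_c, r_t) = y_target by bisection.

    Returns None when f - y_target does not change sign on the bracket,
    i.e. when bisection cannot certify a preimage point there.
    """
    def g(r):
        return slfn((T_t, r_c, r)) - y_target

    g_lo, g_hi = g(lo), g(hi)
    if g_lo * g_hi > 0:
        return None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if g_lo * g_mid <= 0:
            hi = mid                 # root lies in [lo, mid]
        else:
            lo, g_lo = mid, g_mid    # root lies in [mid, hi]
    return 0.5 * (lo + hi)


# Trace one level set: for a fixed output y, the r_t matching each coupon rate.
y = slfn((10.0, 0.015, 0.02))
for r_c in (0.0, 0.015, 0.03):
    print(r_c, solve_rate(y, 10.0, r_c))
```

Such a trace only samples the level set at finitely many points, which is exactly the limitation of data-based analysis that the analytical Steps 1 to 4 are meant to avoid; it is useful mainly for sanity-checking a derived expression.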

Take Network I to illustrate the explicit outcomes and the insights obtained within the extracting process. $\Phi_o^{-1}(y) = \{a \mid 15.1206a_1 - 34.366a_2 + 5.6589a_3 - 21.9999a_4 = y - 100.4744\}$. $\Phi_o^{-1}(y)$ is in the form of a linear equation. Thus, for each non-void value y, $\Phi_o^{-1}(y)$ is a hyperplane in $(-1, 1)^4$. As y changes, $\Phi_o^{-1}(y)$ forms parallel hyperplanes in $(-1, 1)^4$; for any changes in y of the same magnitude, the corresponding hyperplanes are spaced by the same distance. The activation space is entirely covered by these parallel $\Phi_o^{-1}(y)$ hyperplanes, ordered by the values of non-void y. Furthermore, the center of these parallel hyperplanes is the $\Phi_o^{-1}(100.4744)$ hyperplane. These parallel hyperplanes form a (linear) scalar field: For each point a of the activation space, there is only one output value y whose $\Phi_o^{-1}(y)$ hyperplane passes point a; all points on the same $\Phi_o^{-1}(y)$ hyperplane are associated with the same y value.

Note that the equation $x^T w_{h\cdot}^H = \tanh^{-1}(a_h) - b_h^H$ within the hth component on the right-hand side of (5) is separable from those within the other components. Given an activation value $a_h$, $\{x \in \Re^3 \mid x^T w_{h\cdot}^H = \tanh^{-1}(a_h) - b_h^H\}$ defines a hyperplane in the input space, since $w_{h\cdot}^H$ and $b_h^H$ are given constants. For the hth hidden node, the hyperplanes associated with various $a_h$ values are parallel and form a (linear) scalar activation field in the input space [16]: For each point x of the input space, there is only one activation value $a_h$ whose corresponding hyperplane passes point x; all points on this hyperplane are associated with the same $a_h$ value. Furthermore, each hidden node gives rise to an activation field in the input space, and four hidden nodes set up four independent activation fields in the input space.
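The equal spacing of the parallel $\Phi_o^{-1}(y)$ hyperplanes can be made explicit with a short worked calculation, offered here as an illustration using the Network I output weights quoted above. The distance between the hyperplanes for y and y + Δy is |Δy| divided by the Euclidean norm of the output weight vector, and the boundedness of the activations gives a necessary (not sufficient) range for non-void y:

```latex
\[
d\bigl(\Phi_o^{-1}(y),\, \Phi_o^{-1}(y+\Delta y)\bigr)
  = \frac{|\Delta y|}{\lVert w^o \rVert_2},
\qquad
\lVert w^o \rVert_2
  = \sqrt{15.1206^2 + 34.366^2 + 5.6589^2 + 21.9999^2} \approx 43.88,
\]
\[
|y - b^o| = \Bigl|\sum_{h=1}^{4} w_h^o a_h\Bigr|
  < \sum_{h=1}^{4} |w_h^o| = 77.1454
\;\Longrightarrow\;
23.3290 < y < 177.6198 .
\]
```

So a unit change in y shifts the hyperplane by roughly 0.0228 in the activation space, and any y outside the interval above is void for Network I regardless of the hidden-layer weights.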

The set $\bigcap_{h=1}^{4} \{x \mid x^T w_{h\cdot}^H = \tanh^{-1}(a_h) - b_h^H\}$ in (5) can be denoted by $\{x \mid W^H x = \omega(a)\}$, where $\omega(a) \equiv (\omega_1(a_1), \omega_2(a_2), \omega_3(a_3), \omega_4(a_4))^T$ and $\omega_h(a_h) \equiv \tanh^{-1}(a_h) - b_h^H$ for all 1 ≤ h ≤ 4. Given the activation values of a, $\omega(a)$ is simply a vector of known component values and the representation

$$W^H x = \omega(a) \qquad (6)$$

is a system of four simultaneous linear equations with three unknowns. Furthermore, $W^H x = \omega(a)$ is a set of inconsistent simultaneous equations if $\mathrm{rank}([W^H\ \omega(a)]) = \mathrm{rank}(W^H) + 1$ [5, p. 108], and thus the corresponding point a is void. The discussion establishes Lemma 1 below.

Lemma 1. An activation value a is void if $\mathrm{rank}([W^H\ \omega(a)]) = \mathrm{rank}(W^H) + 1$; otherwise, a is non-void.
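The rank test of Lemma 1 can be tried numerically. The following sketch uses the Network I parameters (assumed table column order: hidden bias, output weight, three input weights) and NumPy's `matrix_rank` with an explicit tolerance, since floating-point rank is only meaningful up to a threshold; the tolerance value and the function names are assumptions made for illustration.

```python
import numpy as np

# Network I hidden-layer parameters (assumed reading of the table).
B_H = np.array([-0.1689, -1.3535, -2.1615, 1.1698])
W_H = np.array([
    [-0.0347, -32.7223, 18.8396],
    [0.0544, -36.8286, 28.9551],
    [0.0988, 43.3354, 16.8267],
    [-0.0643, -36.4188, 53.6646],
])


def omega(a):
    """omega_h(a_h) = arctanh(a_h) - b_h^H for each hidden node."""
    return np.arctanh(np.asarray(a, dtype=float)) - B_H


def is_void(a, tol=1e-8):
    """Lemma 1: a is void iff rank([W^H omega(a)]) = rank(W^H) + 1."""
    augmented = np.column_stack([W_H, omega(a)])
    rank_w = np.linalg.matrix_rank(W_H, tol=tol)
    rank_aug = np.linalg.matrix_rank(augmented, tol=tol)
    return bool(rank_aug == rank_w + 1)


# A point produced by a real input is non-void by construction ...
x = np.array([10.0, 0.015, 0.02])            # assumed order (T_t, r_c, r_t)
a_reachable = np.tanh(B_H + W_H @ x)
print(is_void(a_reachable))

# ... whereas an arbitrary point of (-1, 1)^4 is generally void, because
# four equations in three unknowns are usually inconsistent.
print(is_void(np.zeros(4)))
```

With four hidden nodes and three inputs, the non-void set is a three-dimensional surface inside the four-dimensional activation space, which is why an arbitrarily chosen activation vector almost never passes the test.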
