The Evolution of Internal Representation


Mathematical and Computer Modelling 38 (2003) 339-350. Pergamon. www.elsevier.com/locate/mcm

RAY TSAIH
Department of Management Information Systems, National Chengchi University
No. 64, Chih-nan Road, Sect. 2, Wenshan 11623, Taipei, R.O.C.
tsaih@mis.nccu.edu.tw

(Received September 2002; revised and accepted February 2003)

Abstract: To develop an appropriate internal representation, a deterministic learning algorithm that can adjust not only weights but also the number of adopted hidden nodes is proposed. The key mechanisms are (1) the recruiting mechanism that recruits proper extra hidden nodes, and (2) the reasoning mechanism that prunes potentially irrelevant hidden nodes. This learning algorithm can make use of external environmental clues to develop an internal representation appropriate for the required mapping. The encoding problem and the parity problem are used to demonstrate the performance of the proposed algorithm. The experimental results are clearly positive. © 2003 Elsevier Ltd. All rights reserved.

Keywords: Internal representation, Recruiting mechanism, Pruning mechanism, Generalized delta rule.

1. INTRODUCTION

In modern finance, derivatives such as futures and options play increasingly prominent roles in risk management and price speculation. Owing to the high leverage involved in derivative trading, investors can gain enormous profits with a small amount of capital if they can accurately predict the market’s direction. Financial markets, however, can be influenced by many factors, such as political events, general economic conditions, and traders’ expectations. Generally, predicting the financial market’s movements is considered more difficult than expected. Movements in market prices are not random. Rather, they behave in a highly nonlinear, dynamic manner.
The standard random walk assumption of futures prices may merely be a veil of randomness that shrouds a messy nonlinear process (see, for example, [1-3]). To make the forecasting of futures prices more reliable, the application of artificial neural networks (ANN), especially the layered feed-forward network [4], has received extensive attention [3,5,6]. Instead of directly deriving the nonlinear equation, the layered feed-forward network tries to develop an appropriate internal representation for such a forecasting problem.

Research supported by National Science Council of R.O.C. under Grant No. NSC 91-2416-H-004-009. Thanks are also due to W. Ke of National Chengchi University and two anonymous reviewers for useful comments on an earlier draft of this paper.
0895-7177/03/$ - see front matter © 2003 Elsevier Ltd. All rights reserved. doi: 10.1016/S0895-7177(03)00225-5

In general, a nonlinear forecasting problem is like the problem of finding a nonlinear equation to capture the general pattern of a relationship between the independent variables $x_j$ and the dependent variables $y_l$. The form of the equation is $y_l = F_l(\mathbf{x})$, where $\mathbf{x}$ is the vector of independent variables $x_j$, and $F_l$ is a nonlinear function derived from a given data set of samples $\{({}_1\mathbf{x}, {}_1t_l), \ldots, ({}_N\mathbf{x}, {}_Nt_l)\}$ with ${}_ct_l$ the observed value of $y_l$ corresponding to ${}_c\mathbf{x}$.

[Figure 1. An ANN with a layer of hidden nodes: input pattern, internal representation units, output patterns.]

In the context of a layered feed-forward network as shown in Figure 1, the information $\mathbf{x}$ coming to the input nodes is recoded into an internal representation $\mathbf{h} \equiv (h_1, h_2, \ldots, h_p)^T$, and the output $O_l$, the estimated value of $y_l$, is generated by the internal representation $\mathbf{h}$ rather than by the original pattern $\mathbf{x}$. “Input patterns can always be encoded, if there are enough hidden units, in a form so that the appropriate output pattern can be generated from any input pattern,” as mentioned in [4, p. 320].

Let us take as an illustration a layered feed-forward network with the hyperbolic tangent (tanh) function [4,7] used in all output and hidden nodes, where $\tanh(x) \equiv (e^{x} - e^{-x})/(e^{x} + e^{-x})$. Given the cth stimulus ${}_c\mathbf{x}$, the activation value of the ith hidden node $h({}_c\mathbf{x}, {}_2\mathbf{w}_i)$ and the activation value of the lth output node $O({}_c\mathbf{x}, {}_3\mathbf{w}_l, {}_2\mathbf{w})$ are as follows:

$$h({}_c\mathbf{x}, {}_2\mathbf{w}_i) \equiv \tanh\Big({}_2w_{i0} + \sum_{j=1}^{m} {}_2w_{ij}\, {}_cx_j\Big), \tag{1}$$

$$O({}_c\mathbf{x}, {}_3\mathbf{w}_l, {}_2\mathbf{w}) \equiv \tanh\Big({}_3w_{l0} + \sum_{i=1}^{p} {}_3w_{li}\, h({}_c\mathbf{x}, {}_2\mathbf{w}_i)\Big), \tag{2}$$

where m, p, and q are the numbers of input, hidden, and output nodes, respectively; ${}_2w_{i0}$ is the bias of the ith hidden node, ${}_2w_{ij}$ is the weight of the connection between the jth input node and the ith hidden node, ${}_3w_{l0}$ is the bias of the lth output node, and ${}_3w_{li}$ is the weight of the connection between the ith hidden node and the lth output node. A character in bold represents a column vector and the superscript T indicates transposition: ${}_2\mathbf{w}_i^T \equiv ({}_2w_{i0}, {}_2w_{i1}, \ldots, {}_2w_{im})$, ${}_2\mathbf{w}^T \equiv ({}_2\mathbf{w}_1^T, {}_2\mathbf{w}_2^T, \ldots, {}_2\mathbf{w}_p^T)$, ${}_3\mathbf{w}_l^T \equiv ({}_3w_{l0}, {}_3w_{l1}, \ldots, {}_3w_{lp})$, ${}_3\mathbf{w}^T \equiv ({}_3\mathbf{w}_1^T, {}_3\mathbf{w}_2^T, \ldots, {}_3\mathbf{w}_q^T)$, and $\mathbf{w}^T \equiv ({}_2\mathbf{w}^T, {}_3\mathbf{w}^T)$.

Given a set of training samples $\{({}_1\mathbf{x}, {}_1\mathbf{t}), \ldots, ({}_N\mathbf{x}, {}_N\mathbf{t})\}$, the goal of learning is to seek a $(\mathbf{w}, p)$ that renders $|O({}_c\mathbf{x}, {}_3\mathbf{w}_l, {}_2\mathbf{w}) - {}_ct_l| < \varepsilon$ for all $c \in \{1, \ldots, N\}$ and all l, where $\varepsilon$ is a given acceptable tolerance, say $10^{-6}$. In general, the learning can be recognized as a minimization of the sum of residual squares $E(\mathbf{w}, p)$; thus,

$$\min_{(\mathbf{w},p)} E(\mathbf{w}, p) \equiv \min_{(\mathbf{w},p)} \sum_{c=1}^{N} \sum_{l=1}^{q} \big(O({}_c\mathbf{x}, {}_3\mathbf{w}_l, {}_2\mathbf{w}) - {}_ct_l\big)^2. \tag{3}$$

This is a complicated unconstrained nonlinear programming problem. The process of an optimization algorithm applied to problem (3) is similar to the process of searching along the surface defined by the sum of residual squares in the $\{\mathbf{w}, p\}$ space composed of all possible $(\mathbf{w}, p)$. Instead of insisting on a globally optimal solution $(\mathbf{w}^*, p^*)$ of problem (3), many researchers are motivated to develop a reasonable algorithm to explore an acceptable learning solution $(\mathbf{w}, p)$ that renders $|O({}_c\mathbf{x}, {}_3\mathbf{w}_l, {}_2\mathbf{w}) - {}_ct_l| < \varepsilon$ for all $c \in \{1, \ldots, N\}$ and all l. Developing such an algorithm is not easy due to the necessity of coping with the following characteristics: (1) the $\{\mathbf{w}, p\}$ space is unbounded since the number of possible $(\mathbf{w}, p)$ is infinite; (2) the surface defined by values of $E(\mathbf{w}, p)$ over the $\{\mathbf{w}, p\}$ space is nondifferentiable since,
for example, changes in the value of p are discrete and can have discontinuous effects on the value of $E(\mathbf{w}, p)$; (3) the surface is nonanalyzable since the mapping from $(\mathbf{w}, p)$ to the value of $E(\mathbf{w}, p)$ is not yet analyzable; (4) the surface is complex and deceptive since values of $E(\mathbf{w}, p)$ with similar $(\mathbf{w}, p)$ may be dramatically different, and with quite different $(\mathbf{w}, p)$ may be very similar. Due to these characteristics, there is hardly any solid theoretical support for developing a reasonable learning algorithm.

In the context of internal representation, the activation functions adopted for all hidden nodes are predefined and fixed. Thus, the internal representation evolves when the values of p and ${}_2\mathbf{w}$ are altered. Because of the linear nature of computing the net input value and the semilinear characteristic of the activation function adopted in equations (1) and (2), there is a level-adjacent mapping from the $\{\mathbf{x}\}$ space to the $\{\mathbf{h}\}$ space, and from the $\{\mathbf{h}\}$ space to the output layer $\{O\}$ space. Level-adjacent mapping means that level-adjacent points in the previous-layer space are mapped to neighboring points in the latter-layer space [7]. While the proximity of two points in the latter-layer space is measured by their (direct) distance, the level-adjacency between two points in the previous-layer space is measured by the difference between their associated activation levels.

In the learning stage, the positions of the training stimuli $\{{}_1\mathbf{x}, \ldots, {}_N\mathbf{x}\}$ in the $\{\mathbf{x}\}$ space are given and fixed; however, $\{\mathbf{h}({}_1\mathbf{x}, {}_2\mathbf{w}), \ldots, \mathbf{h}({}_N\mathbf{x}, {}_2\mathbf{w})\}$ are determined by p and ${}_2\mathbf{w}$, and each $O({}_c\mathbf{x}, {}_3\mathbf{w}_l, {}_2\mathbf{w})$ is determined by $\mathbf{h}({}_c\mathbf{x}, {}_2\mathbf{w})$ and ${}_3\mathbf{w}_l$. As stated in [7], no matter what kind of learning algorithm is used, the resulting mapping between two consecutive layers is always a level-adjacent mapping. Furthermore, the crux of learning is to adjust p and ${}_2\mathbf{w}$ to render the internal representation appropriate for the learning task. An internal representation is appropriate for

the task if there is a ${}_3\mathbf{w}$ such that $|O(\mathbf{h}({}_c\mathbf{x}, {}_2\mathbf{w}), {}_3\mathbf{w}_l) - {}_ct_l| < \varepsilon$ for all $c \in \{1, \ldots, N\}$ and all l. Currently, however, the development of the internal representation is related to the current values of p and $\mathbf{w}$.

In the literature, there are two categories of learning algorithms for layered feed-forward networks: evolutionary ANN (EANN) algorithms, which are stochastic, and weight-and-structure-change learning algorithms, which are deterministic.

Over the past years, researchers have applied evolutionary computation to problems whose solution space is so large and complex that it is difficult to employ conventional optimization procedures to search for a global optimum. Evolutionary computation refers to a collection of stochastic searching algorithms whose designs are based upon the ideas of genetic inheritance and the Darwinian principle of the survival of the fittest (natural selection). There are several distinguished styles of evolutionary algorithms:

- evolutionary strategies (ES),
- evolutionary programming (EP),
- genetic algorithms (GA), and
- genetic programming (GP).

All of them model the search process over the solution space by mimicking biological evolution. They differ mainly in the evolution operators involved and the representation of the solution space. Most researchers believe that evolutionary computation should not be considered as a kind of optimization technique to compete with other alternative techniques, but as an optimization principle to be incorporated into existing techniques. Thus, they propose to apply EP, GA, and GP to the determination of the network structure of an ANN. This gives rise to three classes of ANN: EPNN, GANN, and GPNN. All of these classes are subsets of EANN. The most promising EANNs involve a global search algorithm that is stochastic [8].
In contrast, all weight-and-structure-change learning algorithms adjust $\mathbf{w}$ and p in a deterministic way; for example, the tiling algorithm [9], the cascade-correlation (CC) algorithm [10], the upstart algorithm [11], the W&S algorithm [12], the CTN algorithm [13], and the softening algorithm [7,14,15]. They adjust the network structure in one of the following ways.

(1) Destructively: using excess hidden nodes initially and pruning (removing) the least effective hidden nodes during the learning process; e.g., the W&S algorithm and the CTN algorithm.
(2) Constructively: using fewer hidden nodes initially and recruiting (adding) more hidden nodes during the learning process; e.g., the tiling algorithm, the CC algorithm, and the upstart algorithm.
(3) Aggregately: using only one hidden node initially, and recruiting as well as pruning hidden nodes during the learning process; e.g., the softening algorithm.

These deterministic learning algorithms have the ability to build up an appropriate internal representation via recruiting/pruning hidden nodes and altering the associated weights during the learning process.

Here, I introduce a deterministic learning algorithm that makes use of sequentially presented training samples to adjust the values of $\mathbf{w}$ and p to develop an internal representation appropriate for the required mapping. Moreover, this learning algorithm guarantees an acceptable learning result. Recall that a learning result is acceptable if $|O({}_c\mathbf{x}, {}_3\mathbf{w}_l, {}_2\mathbf{w}) - {}_ct_l| < \varepsilon$ for all $c \in \{1, \ldots, N\}$ and all l, where $\varepsilon$ is a given acceptable tolerance.

The remainder of this paper is organized as follows. The proposed learning algorithm and its theoretical justification are introduced in Section 2. Empirical justification of the proposed algorithm is given in Section 3. The encoding problem and the parity problem presented in [16] will be used here to demonstrate the performance of the proposed algorithm. Finally, conclusions and future work are presented in Section 4.
For simplicity of presentation, all theoretical proofs are given in the Appendix.
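As a rough illustration of equations (1) and (2) and of the acceptability criterion recalled above, the following sketch evaluates a one-hidden-layer tanh network with a single output node (q = 1). The function names and the list-based weight layout (bias first, then the weights) are assumptions made for this example only; they do not come from the paper.

```python
import math

def hidden_activations(x, W2):
    """Equation (1): W2[i] = [bias, w_i1, ..., w_im] for the i-th hidden node."""
    return [math.tanh(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))) for w in W2]

def output(x, W2, w3):
    """Equation (2) with q = 1: w3 = [bias, w_1, ..., w_p]."""
    h = hidden_activations(x, W2)
    return math.tanh(w3[0] + sum(wi * hi for wi, hi in zip(w3[1:], h)))

def is_acceptable(samples, W2, w3, eps=1e-6):
    """True when |O(x, w) - t| < eps holds for every sample (x, t)."""
    return all(abs(output(x, W2, w3) - t) < eps for x, t in samples)
```

For instance, with `W2 = [[0.0, 1.0, -1.0]]` and `w3 = [0.0, 1.0]`, the input `[0.5, 0.5]` gives a zero net input at the hidden node and therefore a zero output.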

2. THE PROPOSED LEARNING ALGORITHM

Without losing generality, let q = 1 in the explanation of our design. Table 1 presents the general procedure and Figure 2 displays the flow chart of the proposed algorithm. The key mechanisms are (1) the recruiting mechanism that effectively recruits proper extra hidden nodes, and (2) the reasoning mechanism that effectively prunes potentially irrelevant hidden nodes. The details of the proposed algorithm at each step are listed below.

Table 1. The proposed deterministic algorithm.

Step 0: Set one hidden node with weights assigned randomly; set k = 1.
Step 1: If it is the end of the sample input sequence, STOP.
Step 2: Present the kth given sample $({}_k\mathbf{x}, {}_kt)$.
Step 3: If $|O({}_k\mathbf{x}, \mathbf{w}) - {}_kt| \ge \varepsilon$, then
  Step 3.1: Store the weights.
  Step 3.2: Apply the weight-tuning mechanism to adjust weights until one of the following cases occurs:
    (1) If $|O({}_c\mathbf{x}, \mathbf{w}) - {}_ct| < \varepsilon$, $\forall c \in I(k)$, then go to Step 4.
    (2) If an unacceptable result is obtained, then
      (a) Set $\lambda = 1$.
      (b) Restore the weights.
      (c) $p + 2 \to p$, and recruit two extra hidden nodes with
          ${}_2\mathbf{w}_{p-1}^T = (\zeta - \lambda\boldsymbol{\alpha}^T {}_k\mathbf{x},\ \lambda\boldsymbol{\alpha}^T)$,
          ${}_2\mathbf{w}_{p}^T = (\zeta + \lambda\boldsymbol{\alpha}^T {}_k\mathbf{x},\ -\lambda\boldsymbol{\alpha}^T)$,
          $\zeta = 10^{-6} \min_{c \in I(k-1)} |\boldsymbol{\alpha}^T({}_k\mathbf{x} - {}_c\mathbf{x})|$,
          ${}_3w_{p-1} = {}_3w_p = \big(\tanh^{-1}({}_kt) - {}_3w_0 - \sum_{i=1}^{p-2} {}_3w_i\, h({}_k\mathbf{x}, {}_2\mathbf{w}_i)\big) / (2\tanh(\zeta))$,
          where the length of vector $\boldsymbol{\alpha}$ is one and $\boldsymbol{\alpha}^T({}_k\mathbf{x} - {}_c\mathbf{x}) \ne 0$, $\forall c \in I(k-1)$.
      (d) Apply the weight-tuning mechanism to adjust weights until one of the following cases occurs:
        (i) If $|O({}_c\mathbf{x}, \mathbf{w}) - {}_ct| < \varepsilon$, $\forall c \in I(k)$, then go to Step 4.
        (ii) If an unacceptable result is obtained, let $\lambda \cdot 2 \to \lambda$ and $p - 2 \to p$, then go to (b).
Step 4: Prune all potentially irrelevant hidden nodes.
Step 5: $k + 1 \to k$; go to Step 1.

The pairs $({}_c\mathbf{x}, {}_ct)$ are presented sequentially. At the kth stage, the stage when the kth sample $({}_k\mathbf{x}, {}_kt)$ is processed, the goal is to get values of $(\mathbf{w}, p)$ that render

$$|O({}_c\mathbf{x}, \mathbf{w}) - {}_ct| < \varepsilon, \quad \forall c \in I(k) \equiv \{1, \ldots, k\}. \tag{4}$$
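The control flow of Table 1 can be sketched as a small driver that takes the individual mechanisms as callables. This is only a skeleton under stated assumptions: `accepts`, `tune`, `recruit`, and `prune` are hypothetical placeholders for the goal check, the weight-tuning mechanism, the recruiting mechanism, and the reasoning mechanism, and `net` is whatever object the caller uses to hold the weights.

```python
import copy

def run_stages(samples, net, accepts, tune, recruit, prune):
    """Skeleton of Table 1.
    accepts(net, subset) -> bool      goal check on a list of samples
    tune(net, seen) -> (net, ok)      weight tuning; ok is False on an unacceptable result
    recruit(net, seen, lam) -> net    returns net with two extra hidden nodes
    prune(net, seen) -> net           removes potentially irrelevant hidden nodes
    """
    for k in range(1, len(samples) + 1):              # Steps 1, 2 and 5
        seen = samples[:k]                            # samples indexed by I(k)
        if not accepts(net, [samples[k - 1]]):        # Step 3
            stored = copy.deepcopy(net)               # Step 3.1
            net, ok = tune(net, seen)                 # Step 3.2
            lam = 1.0                                 # Action (a)
            while not ok:
                # Actions (b) and (c): restore the stored weights, then recruit.
                net = recruit(copy.deepcopy(stored), seen, lam)
                net, ok = tune(net, seen)             # Action (d)
                if not ok:
                    lam *= 2.0                        # Action (d)(ii)
        net = prune(net, seen)                        # Step 4
    return net
```

Passing the mechanisms in as arguments keeps the sketch independent of any particular weight-tuning or pruning implementation.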
The learning proceeds by evolving the internal representation to render it appropriate for accomplishing goal (4). The internal representation that can accomplish goal (4) is an appropriate one. Note that accomplishing goal (4) is a sufficient, but not a necessary, condition to obtain an appropriate internal representation. However, the effort of regularly checking whether or not the current internal representation is appropriate is more complicated than that of regularly checking if goal (4) is currently accomplished. This argument accounts for adopting goal (4) in the arrangement of this learning algorithm. The algorithm ensures that goal (4) is attained at the end of each stage. Consequently, it guarantees an acceptable learning result at the end.

[Figure 2. The flow chart of the proposed learning algorithm.]

Specifically, when the kth sample $({}_k\mathbf{x}, {}_kt)$ is processed, we first check if goal (4) is accomplished. If so, there is only a reasoning effort involved. Then the next given sample is processed. If not, in our next step (Step 3), the weight-tuning mechanism implementing the momentum version of the generalized delta rule [4] with automatic adjustment of the learning rate [15] is applied to $\min_{\mathbf{w}} E_k(\mathbf{w})$ to adjust weights, where

$$E_k(\mathbf{w}) \equiv \sum_{c \in I(k)} \Big(\tanh\Big({}_3w_0 + \sum_{i=1}^{p} {}_3w_i \tanh\Big({}_2w_{i0} + \sum_{j=1}^{m} {}_2w_{ij}\, {}_cx_j\Big)\Big) - {}_ct\Big)^2.$$

Namely, the objective function used in the optimization process at the kth stage, $E_k(\mathbf{w})$, is now defined as the current sum of residual squares, and parameter p remains the same. Such a weight-tuning mechanism attempts to achieve goal (4). Rumelhart et al. in [4] have argued that a mechanism that implements the generalized delta rule can learn internal representations by error propagation.

Unfortunately, this weight-tuning mechanism has the power to alter the weights, yet no power to add or delete hidden nodes. Moreover, this mechanism may converge on the neighborhood of an undesired attractor of $\min_{\mathbf{w}} E_k(\mathbf{w})$ in which $\nabla_{\mathbf{w}} E_k(\mathbf{w}) = \mathbf{0}$; for example, a relatively optimal solution or a saddle point solution. Another possible failure is the case that the current network structure is a defective one. All of these situations lead to an unacceptable result. These are indicated in Figure 3. Path B indicates the situation when the result of implementing the weight-tuning mechanism is an unacceptable one. A perfect weight-tuning mechanism that can avoid the predicament of converging on an undesired attractor is desirable, because the defective network structure will then be the only cause of an unacceptable result. Unfortunately, there is currently no such perfect weight-tuning mechanism.
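To make the stage objective concrete, the sketch below computes $E_k(\mathbf{w})$ for a single-output network and shows one plain momentum update of the kind used by the generalized delta rule. The weight layout and function names are assumptions for this example, the step size `eta` and momentum coefficient `mu` are fixed illustrative values, and the automatic learning-rate adjustment of [15] is not modelled.

```python
import math

def net_output(x, W2, w3):
    """Single-output tanh network: W2[i] = [bias, w_i1, ..., w_im], w3 = [bias, w_1, ..., w_p]."""
    h = [math.tanh(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))) for w in W2]
    return math.tanh(w3[0] + sum(wi * hi for wi, hi in zip(w3[1:], h)))

def stage_objective(samples, k, W2, w3):
    """E_k(w): sum of squared residuals over the first k samples, i.e. over I(k)."""
    return sum((net_output(x, W2, w3) - t) ** 2 for x, t in samples[:k])

def momentum_step(w, grad, prev_delta, eta=0.1, mu=0.9):
    """One momentum update on a flat weight list: delta = -eta * grad + mu * prev_delta.
    Returns the updated weights and the delta to carry into the next step."""
    delta = [-eta * g + mu * d for g, d in zip(grad, prev_delta)]
    return [wi + di for wi, di in zip(w, delta)], delta
```

With all weights set to zero the network outputs zero for every input, so $E_k$ reduces to the sum of the squared targets of the first k samples.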

[Figure 3. The flow chart of the weight-tuning mechanism implementing the momentum version of the generalized delta rule with automatic adjustment of the learning rate. $\varepsilon'$ and $\delta'$ are given tiny numbers.]

Under this constraint as well as with the consideration of computing complexity, the current weight-tuning mechanism is adopted. The consideration of computing complexity is important because the weight-tuning mechanism will be triggered frequently during the learning process.

Extra hidden nodes are recruited to handle the predicament in Path B of Figure 3 in the following manner. Action (b) in Step 3.2 restores the weights stored in Step 3.1. Assume the goal of the previous stage is accomplished at the end of the previous stage. Then, by restoring the weights

stored in Step 3.1, we return to the internal representation that renders $|O({}_c\mathbf{x}, \mathbf{w}) - {}_ct| < \varepsilon$ for all $c \in I(k-1)$ and $|O({}_k\mathbf{x}, \mathbf{w}) - {}_kt| \ge \varepsilon$.

For Action (c) in Step 3.2, there is a mechanism arranged to recruit two extra hidden nodes with a gain parameter $\lambda$ whose value is initially set to one by Action (a) in Step 3.2. These two newly-added hidden nodes, the (p-1)th and pth ones, have weights ${}_2\mathbf{w}_{p-1}^T = (\zeta - \lambda\boldsymbol{\alpha}^T {}_k\mathbf{x},\ \lambda\boldsymbol{\alpha}^T)$, ${}_2\mathbf{w}_{p}^T = (\zeta + \lambda\boldsymbol{\alpha}^T {}_k\mathbf{x},\ -\lambda\boldsymbol{\alpha}^T)$, $\zeta = 10^{-6} \min_{c \in I(k-1)} |\boldsymbol{\alpha}^T({}_k\mathbf{x} - {}_c\mathbf{x})|$, and ${}_3w_{p-1} = {}_3w_p = \big(\tanh^{-1}({}_kt) - {}_3w_0 - \sum_{i=1}^{p-2} {}_3w_i\, h({}_k\mathbf{x}, {}_2\mathbf{w}_i)\big)/(2\tanh(\zeta))$, where the length of vector $\boldsymbol{\alpha}$ is one and $\boldsymbol{\alpha}^T({}_k\mathbf{x} - {}_c\mathbf{x}) \ne 0$, $\forall c \in I(k-1)$. For Action (d) in Step 3.2, the $\lambda$ value is fixed and the weight-tuning mechanism is applied to render goal (4) accomplished. If an unacceptable result is obtained, we multiply the $\lambda$ value by 2, and repeat Actions (b)-(d). Actions (b)-(d) are repeated until goal (4) is achieved.

Recruiting two extra hidden nodes introduces two extra dimensions in the $\{\mathbf{h}\}$ space, and thus two extra dimensions in the internal representation. The arrangement of ${}_2\mathbf{w}_{p-1}$ and ${}_2\mathbf{w}_p$ puts the new corresponding point $\mathbf{h}({}_k\mathbf{x}, {}_2\mathbf{w})$ on the positive side of each of these two newly-added dimensions, and all other new corresponding points $\mathbf{h}({}_c\mathbf{x}, {}_2\mathbf{w})$ on the positive side of one newly-added dimension and on the negative side of the other newly-added dimension. A large $\lambda$ value makes the behavior of the activation functions of the newly recruited (p-1)th and pth hidden nodes similar to the behavior of a threshold function, and results in the phenomenon that $h({}_c\mathbf{x}, {}_2\mathbf{w}_{p-1})$ and $h({}_c\mathbf{x}, {}_2\mathbf{w}_p)$ numerically equal 1 or -1 for almost all $c \in I(k-1)$.
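The weight assignment of Action (c) can be tried out numerically. The sketch below builds the two recruited nodes for a single-output network. The list-based weight layout and the function names are assumptions for this example, and the direction `alpha` is supplied by the caller, since the text only states that a suitable unit vector exists. It also assumes every target satisfies |t| < 1 so that the inverse tanh is defined.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def net_input(x, W2, w3):
    """Net input of the output node; W2[i] = [bias, weights...], w3 = [bias, w_1, ..., w_p]."""
    h = [math.tanh(w[0] + dot(w[1:], x)) for w in W2]
    return w3[0] + dot(w3[1:], h)

def recruit_two_nodes(seen, W2, w3, alpha, lam):
    """Return (W2, w3) extended by the two nodes of Action (c).
    `seen` holds the first k samples (x, t); the last entry is the k-th sample."""
    *earlier, (xk, tk) = seen
    if not earlier:
        raise ValueError("zeta is defined through at least one earlier sample")
    norm = math.sqrt(dot(alpha, alpha))
    a = [ai / norm for ai in alpha]                       # rescale alpha to unit length
    gaps = [abs(dot(a, xk) - dot(a, xc)) for xc, _ in earlier]
    if min(gaps) == 0.0:
        raise ValueError("alpha does not separate the k-th input from an earlier one")
    zeta = 1e-6 * min(gaps)
    node_a = [zeta - lam * dot(a, xk)] + [lam * ai for ai in a]     # node p-1
    node_b = [zeta + lam * dot(a, xk)] + [-lam * ai for ai in a]    # node p
    # Shared output weight that moves the k-th output onto its target.
    gamma = (math.atanh(tk) - net_input(xk, W2, w3)) / (2.0 * math.tanh(zeta))
    return W2 + [node_a, node_b], w3 + [gamma, gamma]
```

In a quick check with two one-dimensional samples and `lam=64.0`, the new sample is fitted while the output for the earlier sample stays where it was; with a small `lam` the earlier outputs can be disturbed, which is why Table 1 doubles the gain and retries.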
Thus, as stated in Lemma 1, the arrangement of two such newly-added hidden nodes with a large $\lambda$ value puts the new corresponding point $\mathbf{h}({}_k\mathbf{x}, {}_2\mathbf{w})$ near the $(\tanh(\zeta), \tanh(\zeta))$ position of these two newly-added dimensions, and all other new corresponding points $\mathbf{h}({}_c\mathbf{x}, {}_2\mathbf{w})$ near the $(-1, 1)$ or $(1, -1)$ corner of these two newly-added dimensions. Therefore, as mentioned in Lemma 2, if the internal representation renders $|O({}_c\mathbf{x}, \mathbf{w}) - {}_ct| < \varepsilon$ for all $c \in I(k-1)$ and $|O({}_k\mathbf{x}, \mathbf{w}) - {}_kt| \ge \varepsilon$, the new internal representation can be appropriate for all k training samples after such two hidden nodes are recruited. In fact, Lemma 2 reveals that goal (4) can be accomplished immediately merely by recruiting extra hidden nodes with proper weights and $\lambda$, and that only two of these extra hidden nodes are needed. Moreover, there is no infinite loop in Step 3.2. Therefore, Step 3.2 ensures that the goal of each stage is achieved at the end of each stage, thus guaranteeing an acceptable learning result in the end.

The recruiting mechanism handles the occurrence of an unacceptable result without regard to its cause. A defective network structure triggers the recruiting mechanism; convergence on an undesired attractor also triggers the recruiting mechanism. The triggering of the recruiting mechanism due to the convergence on an undesired attractor may recruit excess hidden nodes that become irrelevant later. At each stage, a hidden node is irrelevant if goal (4) is still accomplished with this hidden node deleted. The irrelevant hidden nodes are useless with respect to goal (4); furthermore, they may significantly affect the performance of the network and result in poor generalization. In addition, more samples typically produce more concise information about the appropriate internal representation, and thus, fewer hidden nodes are required.
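The definition of an irrelevant hidden node given above can be checked directly: delete one node at a time and test whether goal (4) still holds, with no re-tuning of the remaining weights. The sketch below does this for a single-output network; the weight layout and function names are assumptions for this example.

```python
import math

def output(x, W2, w3):
    """Single-output tanh network: W2[i] = [bias, w_i1, ..., w_im], w3 = [bias, w_1, ..., w_p]."""
    h = [math.tanh(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))) for w in W2]
    return math.tanh(w3[0] + sum(wi * hi for wi, hi in zip(w3[1:], h)))

def irrelevant_nodes(seen, W2, w3, eps=1e-6):
    """Indices (0-based) of hidden nodes whose deletion alone leaves goal (4) satisfied."""
    found = []
    for i in range(len(W2)):
        W2_without = W2[:i] + W2[i + 1:]
        # Hidden node i owns output weight w3[i + 1]; w3[0] is the output bias.
        w3_without = w3[:i + 1] + w3[i + 2:]
        if all(abs(output(x, W2_without, w3_without) - t) < eps for x, t in seen):
            found.append(i)
    return found
```

A node whose output weight is zero, for example, is reported as irrelevant because removing it leaves every output unchanged.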
It is accordingly necessary to prune irrelevant hidden nodes, and an internal representation is better if it reaches goal (4) with a smaller number of adopted hidden nodes. In Step 4, a reasoning mechanism is arranged to prune all potentially irrelevant hidden nodes. At the kth stage, a hidden node is potentially irrelevant if, after it is deleted, goal (4) can still be accomplished by applying the weight-tuning mechanism. In Step 4, every hidden node is checked to see whether it is potentially irrelevant. Each potentially irrelevant hidden node is deleted after being identified.

3. THE PERFORMANCE AND ANALYSIS OF THE PROPOSED ALGORITHM

Here, I use two popular examples to examine how the current arrangements for the recruiting

and reasoning mechanisms work: the encoding problem [16,17] and the parity problem [16]. Ackley et al. in [17] have posed the encoding problem where a set of N orthogonal input patterns is mapped to a set of N orthogonal output patterns through a small set of hidden nodes. Such a problem requires a rather efficient way of encoding an N-bit pattern into a small set of hidden nodes and then decoding this (internal) representation into the output pattern. Rumelhart et al. in [16] have proposed that a set of N orthogonal input patterns is mapped to a set of N orthogonal output patterns through a small set of $\log_2 N$ hidden nodes. The reason behind such a design is that if the hidden nodes take on binary values, the hidden nodes must form a binary number to encode each input pattern. The authors of [16] present an encoding problem with eight input patterns, eight output patterns, and three hidden nodes, and find the learning system develops solutions that use the intermediate values, as shown in Table 2.

The result of simulation is encouraging: the proposed algorithm employs the intermediate values in a more efficient way. Table 3 shows the mapping generated by the proposed algorithm. The proposed algorithm also utilizes intermediate values to gain an appropriate internal representation. The appropriate internal representations shown in Tables 2 and 3 are similar, except that the former uses three hidden nodes and the latter two hidden nodes. It seems that the reasoning mechanism effectively prunes a potentially irrelevant hidden node, which is likely to be the middle one in Table 2.

[Table 2. The mapping of the encoding problem generated in [16]. Columns: Input Pattern, Hidden Node Pattern, Output Pattern.]

Table 3. The mapping of the encoding problem generated by the proposed algorithm.

Input Pattern      Hidden Node Pattern      Output Pattern
...
00000001      ->   -0.6150  -0.8390     ->  00000001

Table 4 shows the mapping of the four-bit parity problem obtained in [16] and generated by the proposed algorithm. Rumelhart et al. in [16] have proposed m hidden nodes required for the m-bit parity problem, and noted that “the internal representation created by the learning rule is to arrange that the number of hidden units that come on is equal to the number of zeros in the input and that the particular hidden units that come on depend only on the number, not on which input units are on”; see [16, p. 335]. By contrast, as shown in Table 4, the appropriate internal representations developed in [16] and by the proposed algorithm look similar, except that the former uses four hidden nodes and the latter uses three hidden nodes.

Table 4. The mapping of the four-bit parity problem generated in [16] and by the proposed algorithm.

Input        Hidden Node Pattern    Hidden Node Pattern Generated       Output
Pattern      Generated in [16]      by the Proposed Algorithm           Pattern
0000   ->    1111                    0.8934   -0.9999   -0.9998    ->   0
1000   ->    1011                    0.5254   -0.9371   -0.9972    ->   1
0100   ->    1011                    0.5252   -0.9369   -0.9972    ->   1
0010   ->    1011                    0.5252   -0.9367   -0.9972    ->   1
0001   ->    1011                    0.5251   -0.9366   -0.9972    ->   1
1100   ->    1010                   -0.2646    0.9338   -0.9630    ->   0
1010   ->    1010                   -0.2647    0.9340   -0.9630    ->   0
1001   ->    1010                   -0.2648    0.9341   -0.9630    ->   0
0110   ->    1010                   -0.2649    0.9343   -0.9630    ->   0
0101   ->    1010                   -0.2650    0.9344   -0.9630    ->   0
0011   ->    1010                   -0.2651    0.9345   -0.9630    ->   0
1110   ->    0010                   -0.8097    0.9999   -0.5882    ->   1
1101   ->    0010                   -0.8097    0.9999   -0.5881    ->   1
1011   ->    0010                   -0.8097    0.9999   -0.5881    ->   1
0111   ->    0010                   -0.8098    0.9999   -0.5879    ->   1
1111   ->    0000                   -0.9626    1.0000    0.5627    ->   0

In sum, it seems that the current arrangement of the recruiting and reasoning mechanisms in the proposed algorithm works so that the internal representation evolves in a better way than the representation developed by the generalized delta rule, a result that is based upon the reasoning that an appropriate internal representation with a smaller number of adopted hidden nodes is better.

4. DISCUSSIONS AND FUTURE WORK

In this paper, a deterministic learning algorithm that guarantees an acceptable learning result is proposed. The proposed algorithm does not follow the ideas of genetic inheritance and natural selection. During the process of the proposed algorithm, the internal representation evolves in a deterministic way into an appropriate one. The key mechanisms of the proposed algorithm are the recruiting mechanism and the reasoning mechanism.
I provide theoretical justification to explain why the proposed algorithm guarantees an acceptable learning result. The experimental findings demonstrate that the proposed algorithm is also able to develop an internal representation similar to the one shown in [16]. This is not surprising since the proposed algorithm also adopts in its weight-tuning mechanism the generalized delta rule proposed in [4]. However, the experimental results indicate that the current arrangements of the recruiting and reasoning mechanisms in the proposed algorithm work well enough to let the internal representation evolve in a way superior to the representation developed by the generalized delta rule.

There is one caveat: when we apply the ANN to practical problems, a black box is obtained from the learning. Further study should attempt to explore the appropriate internal representation obtained from the learning in order to provide the knowledge behind the application, and thus remove this caveat.

The results of simulation show that the number of adopted hidden nodes is a function of the sample input sequence. As expected, some sample input sequences cause difficulties in the associated processes due to encountering a defective network structure or converging on an undesired attractor. Further investigation (theoretical and numerical) on the surface defined by $E_k(\mathbf{w})$ over

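the weight space is currently under way to identify the scenario (the sample input sequence and the value of $\lambda$) for which the weight-tuning mechanism will (or will not) achieve goal (4). Another study involves exploring a more efficient recruiting mechanism to recruit hidden nodes and a more efficient reasoning mechanism to prune potentially irrelevant hidden nodes.

APPENDIX

THEORETICAL SUPPORTS AND THEIR PROOFS

LEMMA 1. Let $\{{}_c\mathbf{x} \mid \forall c \in I(k)\}$ be given, and assume ${}_2\mathbf{w}_{p-1}^T = (\zeta - \lambda\boldsymbol{\alpha}^T {}_k\mathbf{x},\ \lambda\boldsymbol{\alpha}^T)$, ${}_2\mathbf{w}_{p}^T = (\zeta + \lambda\boldsymbol{\alpha}^T {}_k\mathbf{x},\ -\lambda\boldsymbol{\alpha}^T)$, where $\lambda$ is a given large number. Then, there exist a unit vector $\boldsymbol{\alpha}$ and a tiny positive number $\zeta$ that render $h({}_c\mathbf{x}, {}_2\mathbf{w}_{p-1}) + h({}_c\mathbf{x}, {}_2\mathbf{w}_p) \cong 0.0$, $\forall c \in I(k-1)$, and $h({}_k\mathbf{x}, {}_2\mathbf{w}_{p-1}) + h({}_k\mathbf{x}, {}_2\mathbf{w}_p) = 2\tanh(\zeta)$. (Hereafter, $F(z) \cong y$ means that, numerically, the value of $F(z)$ is y.)

PROOF. Because ${}_2\mathbf{w}_{p-1}^T = (\zeta - \lambda\boldsymbol{\alpha}^T {}_k\mathbf{x},\ \lambda\boldsymbol{\alpha}^T)$, ${}_2w_{p-1,0} + \sum_{j=1}^{m} {}_2w_{p-1,j}\, {}_cx_j = \zeta + \lambda(\boldsymbol{\alpha}^T {}_c\mathbf{x} - \boldsymbol{\alpha}^T {}_k\mathbf{x})$. Similarly, ${}_2w_{p0} + \sum_{j=1}^{m} {}_2w_{pj}\, {}_cx_j = \zeta + \lambda(\boldsymbol{\alpha}^T {}_k\mathbf{x} - \boldsymbol{\alpha}^T {}_c\mathbf{x})$. $\{{}_c\mathbf{x} \mid \forall c \in I(k)\}$ is given, so ${}_k\mathbf{x} - {}_c\mathbf{x}$ is known for every $c \in I(k-1)$. With a reasonable assumption that the number of samples is finite, i.e., $I(k)$ is a finite set, there exists a vector $\boldsymbol{\alpha}$ such that the length of $\boldsymbol{\alpha}$ is one and $\boldsymbol{\alpha}^T({}_k\mathbf{x} - {}_c\mathbf{x}) \ne 0$, $\forall c \in I(k-1)$. Then, $\zeta$ is assigned as $10^{-6} \min_{c \in I(k-1)} |\boldsymbol{\alpha}^T({}_k\mathbf{x} - {}_c\mathbf{x})|$. Because $\lambda$ is a large value and $\zeta = 10^{-6} \min_{c \in I(k-1)} |\boldsymbol{\alpha}^T({}_k\mathbf{x} - {}_c\mathbf{x})|$, $h({}_c\mathbf{x}, {}_2\mathbf{w}_{p-1}) = \tanh(\zeta + \lambda(\boldsymbol{\alpha}^T {}_c\mathbf{x} - \boldsymbol{\alpha}^T {}_k\mathbf{x})) \cong \tanh(\lambda(\boldsymbol{\alpha}^T {}_c\mathbf{x} - \boldsymbol{\alpha}^T {}_k\mathbf{x}))$ and $h({}_c\mathbf{x}, {}_2\mathbf{w}_p) = \tanh(\zeta + \lambda(\boldsymbol{\alpha}^T {}_k\mathbf{x} - \boldsymbol{\alpha}^T {}_c\mathbf{x})) \cong -\tanh(\lambda(\boldsymbol{\alpha}^T {}_c\mathbf{x} - \boldsymbol{\alpha}^T {}_k\mathbf{x}))$. Thus, $h({}_c\mathbf{x}, {}_2\mathbf{w}_{p-1}) + h({}_c\mathbf{x}, {}_2\mathbf{w}_p) \cong 0.0$, $\forall c \in I(k-1)$. Also, $h({}_k\mathbf{x}, {}_2\mathbf{w}_{p-1}) = h({}_k\mathbf{x}, {}_2\mathbf{w}_p) = \tanh(\zeta)$, so $h({}_k\mathbf{x}, {}_2\mathbf{w}_{p-1}) + h({}_k\mathbf{x}, {}_2\mathbf{w}_p) = 2\tanh(\zeta)$. ∎

LEMMA 2. Assume $O({}_c\mathbf{x}, \mathbf{w}) = \tanh\big({}_3w_0 + \sum_{i=1}^{p-2} {}_3w_i\, h({}_c\mathbf{x}, {}_2\mathbf{w}_i)\big)$, $|O({}_c\mathbf{x}, \mathbf{w}) - {}_ct| < \varepsilon$, $\forall c \in I(k-1)$, and $|O({}_k\mathbf{x}, \mathbf{w}) - {}_kt| \ge \varepsilon$. When the following two hidden nodes are recruited, $h({}_c\mathbf{x}, {}_2\mathbf{w}_{p-1}) \equiv \tanh\big({}_2w_{p-1,0} + \sum_{j=1}^{m} {}_2w_{p-1,j}\, {}_cx_j\big)$ and $h({}_c\mathbf{x}, {}_2\mathbf{w}_p) \equiv \tanh\big({}_2w_{p0} + \sum_{j=1}^{m} {}_2w_{pj}\, {}_cx_j\big)$, the new value of $O({}_c\mathbf{x}, \mathbf{w})$ equals $\tanh\big({}_3w_0 + \sum_{i=1}^{p-2} {}_3w_i\, h({}_c\mathbf{x}, {}_2\mathbf{w}_i) + {}_3w_{p-1}\, h({}_c\mathbf{x}, {}_2\mathbf{w}_{p-1}) + {}_3w_p\, h({}_c\mathbf{x}, {}_2\mathbf{w}_p)\big)$. Then, there exist ${}_2\mathbf{w}_{p-1}$, ${}_2\mathbf{w}_p$, ${}_3w_{p-1}$ and ${}_3w_p$ that render the new value of $|O({}_c\mathbf{x}, \mathbf{w}) - {}_ct| < \varepsilon$, $\forall c \in I(k)$.

PROOF. Let ${}_cy'$ and ${}_cy$ be the values of $O({}_c\mathbf{x}, \mathbf{w})$ before and after the introduction of two hidden nodes, respectively. Also let ${}_c\mathrm{net}$ be the value of ${}_3w_0 + \sum_{i=1}^{p-2} {}_3w_i\, h({}_c\mathbf{x}, {}_2\mathbf{w}_i)$ before the introduction of two hidden nodes. Thus, ${}_cy' = \tanh({}_c\mathrm{net})$ and ${}_cy = \tanh\big({}_c\mathrm{net} + {}_3w_{p-1}\, h({}_c\mathbf{x}, {}_2\mathbf{w}_{p-1}) + {}_3w_p\, h({}_c\mathbf{x}, {}_2\mathbf{w}_p)\big)$. Because $|O({}_c\mathbf{x}, \mathbf{w}) - {}_ct| < \varepsilon$, $\forall c \in I(k-1)$, and $|O({}_k\mathbf{x}, \mathbf{w}) - {}_kt| \ge \varepsilon$ before introducing two hidden nodes, $|{}_cy' - {}_ct| < \varepsilon$, $\forall c \in I(k-1)$, and $|{}_ky' - {}_kt| \ge \varepsilon$. Let $\lambda$, ${}_2\mathbf{w}_{p-1}$ and ${}_2\mathbf{w}_p$ be assigned as in Lemma 1. Thus, from Lemma 1, $h({}_c\mathbf{x}, {}_2\mathbf{w}_{p-1}) + h({}_c\mathbf{x}, {}_2\mathbf{w}_p) \cong 0.0$, $\forall c \in I(k-1)$, and $h({}_k\mathbf{x}, {}_2\mathbf{w}_{p-1}) + h({}_k\mathbf{x}, {}_2\mathbf{w}_p) = 2\tanh(\zeta)$. Let ${}_3w_{p-1} = {}_3w_p = \gamma$, where $\gamma = (\tanh^{-1}({}_kt) - {}_k\mathrm{net})/(2\tanh(\zeta))$. Since ${}_3w_{p-1} = {}_3w_p$ and $h({}_c\mathbf{x}, {}_2\mathbf{w}_{p-1}) + h({}_c\mathbf{x}, {}_2\mathbf{w}_p) \cong 0.0$, $\forall c \in I(k-1)$, ${}_cy \cong {}_cy'$, $\forall c \in I(k-1)$. Therefore, $|{}_cy - {}_ct| < \varepsilon$, $\forall c \in I(k-1)$, since $|{}_cy' - {}_ct| < \varepsilon$, $\forall c \in I(k-1)$. Moreover, $h({}_k\mathbf{x}, {}_2\mathbf{w}_{p-1}) + h({}_k\mathbf{x}, {}_2\mathbf{w}_p) = 2\tanh(\zeta)$, so ${}_ky = \tanh({}_k\mathrm{net} + \gamma \cdot 2\tanh(\zeta)) = {}_kt$. Thus, $|{}_ky - {}_kt| < \varepsilon$. ∎

REFERENCES

1. S. Blank, Chaos in futures market? A nonlinear dynamical analysis, J. Futures Markets 11, 711-728, (1991).
2. G. DeCoster, W. Labys and D. Mitchell, Evidence of chaos in commodity futures prices, J. Futures Markets 12, 291-305, (1992).
3. G. Grudnitski and L. Osburn, Forecasting S & P and gold futures prices: An application of neural networks, J. Futures Markets 13, 631-643, (1993).
4. D. Rumelhart, G. Hinton and R. Williams, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1, (Edited by D. Rumelhart and J. McClelland), pp. 318-362, MIT Press, Cambridge, MA, (1986).
5. J. Hutchinson, A. Lo and T. Poggio, A nonparametric approach to pricing and hedging derivative securities via learning networks, J. Finance 49 (3), 851-889, (1994).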

6. R. Tsaih, Y. Hsu and C. Lai, Forecasting S&P 500 stock index futures with the Hybrid AI system, Decision Support Systems 23 (2), 161-174, (1998).
7. R. Tsaih, An explanation of reasoning neural networks, Mathl. Comput. Modelling 28 (2), 37-44, (1998).
8. X. Yao, A review of evolutionary artificial neural networks, International Journal of Intelligent Systems 8 (4), 539-567, (1993).
9. M. Mézard and J. Nadal, Learning in feedforward layered networks: The tiling algorithm, Journal of Physics A 22, 2191-2204, (1989).
10. S. Fahlman and C. Lebiere, The cascade-correlation learning architecture, In Advances in Neural Information Processing Systems II, (Edited by D. Touretzky), Morgan Kaufmann, San Mateo, CA, (1990).
11. M. Frean, The upstart algorithm: A method for constructing and training feedforward neural networks, Neural Computation 2, 198-209, (1990).
12. E. Watanabe and H. Shimizu, Algorithm for pruning hidden nodes in multi-layered neural network for binary pattern classification problem, Proc. International Joint Conference on Neural Networks I, pp. 327-330, (1993).
13. Y. Chen, D. Thomas and M. Nixon, Generating-shrinking algorithm for learning arbitrary classification, Neural Networks 7, 1477-1489, (1994).
14. R. Tsaih, The softening learning procedure, Mathl. Comput. Modelling 18 (8), 61-64, (1993).
15. R. Tsaih, The reasoning neural networks, In Mathematics of Neural Networks: Models, Algorithms and Applications, (Edited by S. Ellacott, J. Mason and I. Anderson), pp. 366-371, Kluwer Academic, London, (1997).
16. D. Rumelhart, G. Hinton and R. Williams, Learning internal representations by error propagation, In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1, (Edited by D. Rumelhart and J. McClelland), pp. 335-337, MIT Press, Cambridge, MA, (1986).
17. D. Ackley, G. Hinton and T. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Science 9, 147-169, (1985).
