
3. The proposed dual approach

3.2 Modeling phase

Table 4 presents the modeling phase, in which the classification rule is set up.

Although FT and NFT result from fraud and non-fraud samples respectively, the spatial relationship hypothesis and the identical setting of the (τ1, τ2) parameters may mean that each leaf node of NFT has one or more counterpart leaf nodes in FT, and vice versa. Thus, one purpose of the modeling phase is to match each leaf node of FT to its counterpart leaf nodes in NFT, and vice versa.

Table 4. The modeling phase.

step 1: For each leaf node of FT,

i. calculate and store its Avgx value that is the average of Euclidean distances between the weight vector and the grouped fraud training samples;

ii. calculate and store its Stdx value that is the standard deviation of the Euclidean distances between the weight vector and the grouped fraud training samples.

step 2: For each leaf node of NFT,

i. calculate and store its Avgy value that is the average of Euclidean distances between the weight vector and the grouped non-fraud training samples;

ii. calculate and store its Stdy value that is the standard deviation of Euclidean distances between the weight vector and the grouped non-fraud training samples.

step 3: For each training sample,

i. identify and store the winning leaf node of FT and the winning leaf node of NFT, respectively;

ii. store its Avg values of the winning leaf nodes of FT and NFT, respectively;

iii. store its Std values of the winning leaf nodes of FT and NFT, respectively;

iv. calculate and store its Dft, the Euclidean distance between the training sample and the weight vector of the winning leaf node of FT;

v. calculate and store its Dnft, the Euclidean distance between the training sample and the weight vector of the winning leaf node of NFT.

step 4: Create the spatial correspondence tables regarding the matching from NFT to FT and from FT to NFT, respectively.

step 5: Use the fraud-central rule defined in Equation (7) and the optimization problem (8) to determine the parameter β1p that minimizes the corresponding sum of (type I and type II) classification errors.

step 6: Use the non-fraud-central rule defined in Equation (11) and the optimization problem (12) to determine the parameter β2p that minimizes the corresponding sum of (type I and type II) classification errors.

step 7: Select the dominant classification rule by comparing the classification errors obtained in step 5 and step 6.

step 8: For each leaf node of FT, apply PCA to select features by extracting factors (i.e., principal components).

step 9: For each leaf node of FT, analyze the common fraud features from exogenous information based on the associated domain categories.

3.2.1 Statistic-gathering module

The statistic-gathering module is executed via steps 1, 2, and 3 of Table 4. After NFT and FT are constructed, a non-fraud-central rule and a fraud-central rule are tuned, respectively, by inputting all samples to determine the adjustable discrimination boundary within each leaf node of NFT and FT. The optimization makes the rules for detecting fraud samples adjustable and effective, and the decision maker can set his/her own weightings of type I and type II errors. The rule associated with the tree that dominates the other is adopted as the classification rule to classify samples as fraud or non-fraud.

In step 1, the Avgx value (i.e., the average of Euclidean distances between the weight vector and the grouped fraud training samples) and the Stdx value (i.e., the standard deviation of Euclidean distances between the weight vector and the grouped fraud training samples) of each leaf node of FT are calculated and stored. Similarly, in step 2, the Avgy value (i.e., the average of Euclidean distances between the weight vector and the grouped non-fraud training samples) and the Stdy value (i.e., the standard deviation of Euclidean distances between the weight vector and the grouped non-fraud training samples) of each leaf node of NFT are calculated and stored.
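As a concrete illustration of steps 1 and 2, the per-leaf statistics can be computed as below. This is a minimal sketch, assuming each tree is available as two dictionaries keyed by leaf-node id; the names leaf_weights and leaf_samples are illustrative, not from the paper:

    import numpy as np

    def leaf_statistics(leaf_weights, leaf_samples):
        """For each leaf node, compute the Avg and Std of the Euclidean
        distances between its weight vector and its grouped samples."""
        stats = {}
        for leaf_id, w in leaf_weights.items():
            X = np.asarray(leaf_samples[leaf_id])   # samples grouped in this leaf
            d = np.linalg.norm(X - w, axis=1)       # distances to the weight vector
            stats[leaf_id] = {"Avg": d.mean(), "Std": d.std()}
        return stats

    # stats_ft  = leaf_statistics(ft_weights,  ft_samples)   # step 1: Avgx, Stdx
    # stats_nft = leaf_statistics(nft_weights, nft_samples)  # step 2: Avgy, Stdy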

Hereafter, we use #x to denote the xth leaf node of FT and *y the yth leaf node of NFT.

In step 3, we collect and store the following information regarding each training sample: the winning leaf node of FT, the winning leaf node of NFT, the corresponding Avgx and Stdx values of the winning leaf node of FT, the corresponding Avgy and Stdy values of the winning leaf node of NFT, the Dft value (i.e., the Euclidean distance between the training sample and the weight vector of the winning leaf node of FT), and the Dnft value (i.e., the Euclidean distance between the training sample and the weight vector of the winning leaf node of NFT). Following the GHSOM classification rule, we identify the winning leaf nodes of FT and NFT, respectively.
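Step 3 can be sketched as follows, assuming the GHSOM classification rule selects the leaf whose weight vector is nearest (in Euclidean distance) to the sample; all names are illustrative:

    import numpy as np

    def winning_leaf(sample, leaf_weights):
        """Return the leaf id with the nearest weight vector and the distance to it."""
        best_id, best_d = None, np.inf
        for leaf_id, w in leaf_weights.items():
            d = np.linalg.norm(sample - w)
            if d < best_d:
                best_id, best_d = leaf_id, d
        return best_id, best_d

    # For a training sample s:
    #   x, D_ft  = winning_leaf(s, ft_weights)    # winning leaf of FT and Dft
    #   y, D_nft = winning_leaf(s, nft_weights)   # winning leaf of NFT and Dnft
    #   record = (x, y, stats_ft[x], stats_nft[y], D_ft, D_nft)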

3.2.2 Rule-forming module

The rule-forming module is executed via steps 4 to 7 of Table 4. In step 4, two spatial correspondence tables are created based on the classification results of all (fraud and non-fraud) training samples. That is, from the NFT perspective, if the leaf node #x in FT hosts the majority of all training samples classified into the leaf node *y in NFT, then we match the leaf node #x in FT to the leaf node *y in NFT and claim that the leaf node #x in FT is the counterpart of the leaf node *y in NFT. The matching of #x to *y states the spatial relationship among the fraud and non-fraud samples classified into the leaf nodes #x and *y: if a sample is classified into the leaf node *y when using NFT, it is more likely to be classified into the leaf node #x when using FT. Similarly, from the FT perspective, if the leaf node *y in NFT hosts the majority of all training samples classified into the leaf node #x in FT, then we match the leaf node *y in NFT to the leaf node #x in FT and claim that the leaf node *y in NFT is the counterpart of the leaf node #x in FT.
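A minimal sketch of how one of the two correspondence tables in step 4 could be built, assuming the per-sample winning-leaf pairs from step 3 are available as (ft_leaf, nft_leaf) tuples; the symmetric table is obtained by swapping the roles of the two trees:

    from collections import Counter, defaultdict

    def correspondence_table(assignments):
        """Match each NFT leaf *y to the FT leaf #x that hosts the majority of
        the training samples classified into *y.
        assignments: iterable of (ft_leaf, nft_leaf) pairs, one per sample."""
        counts = defaultdict(Counter)
        for ft_leaf, nft_leaf in assignments:
            counts[nft_leaf][ft_leaf] += 1
        return {nft_leaf: c.most_common(1)[0][0] for nft_leaf, c in counts.items()}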

The fraud-central rule defined in Equation (7), in which β1p is a parameter for a pair p of leaf nodes (a leaf node #x of FT matched to a leaf node *y of NFT), states that some non-fraud samples cluster around a subset of fraud samples. That is, for the (fraud or non-fraud) sample c that is classified into the leaf node #x of FT, if D_ft^c is smaller than the value of Avg_x^c + β1p × Std_y^c, the sample c will be classified as the fraud one; otherwise, the non-fraud one. Because the discrimination boundary (i.e., Avg_x^c + β1p × Std_y^c) is data-dependent, the parameter β1p needs to be tuned to find the optimal discrimination boundary.

Therefore, in step 5, we use the optimization problem (8) to determine the parameter β1p. In the optimization problem (8), the sets SF and SNF are given. For each c, the values of D_ft^c, Avg_x^c, and Std_y^c are also given. In the objective function, there are coefficients w1 (the weighting of type I error) and w2 (the weighting of type II error) that are constants subjectively determined by the decision makers in terms of their preference of the classification performance. In general, there are three kinds of settings for (w1, w2), namely (1, 1), (0.01, 1), and (1, 0.01), regarding the minimizations focusing on the average sum of type I and type II errors, mainly the type II error, and mainly the type I error, respectively.

The fraud-central rule: If (D_ft^c < Avg_x^c + β1p × Std_y^c), the sample is classified as the fraud one; otherwise, the non-fraud one. (7)
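Expressed as code, the rule is a single predicate per sample; a sketch with argument names mirroring the quantities stored in step 3:

    def fraud_central_rule(d_ft, avg_x, std_y, beta1):
        """Equation (7): classify sample c as fraud iff
        D_ft^c < Avg_x^c + beta1 * Std_y^c."""
        return d_ft < avg_x + beta1 * std_y   # True -> fraud, False -> non-fraud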


From the definition of i_c, (i_c)^2 equals 1. Thus, the objective function can be refined as Equation (9) and, effectively, we only need to minimize the weighted sum of misclassified samples. The following enumeration scheme can be used to determine the optimal values of β1p. Note that all Std_y^c are strictly positive; thus, the classification of each sample c changes only when β1p crosses the critical value (D_ft^c − Avg_x^c) / Std_y^c, and we can enumerate these critical values to find the range of β1p that minimizes the value of the objective function.
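A sketch of such an enumeration, under the assumption that the type I error counts fraud samples classified as non-fraud and the type II error counts non-fraud samples classified as fraud (the exact error definitions come from optimization problem (8), which is not reproduced here). Because the weighted error is piecewise constant in β1p, it suffices to probe just below and above each critical value; the code returns one representative value from the optimal range:

    import numpy as np

    def tune_beta(d, avg, std, is_fraud, w1=1.0, w2=1.0):
        """Enumerate candidate beta values and return one minimizing
        w1 * (#fraud classified as non-fraud) + w2 * (#non-fraud classified as fraud).
        d, avg, std: per-sample D, Avg, Std arrays; is_fraud: boolean array."""
        crit = (d - avg) / std                   # classification of sample c flips here
        candidates = np.unique(crit)
        probes = np.concatenate([candidates - 1e-9, candidates + 1e-9])
        best_beta, best_err = None, np.inf
        for b in probes:
            pred_fraud = d < avg + b * std       # the fraud-central rule
            err = (w1 * np.sum(is_fraud & ~pred_fraud)
                   + w2 * np.sum(~is_fraud & pred_fraud))
            if err < best_err:
                best_beta, best_err = b, err
        return best_beta, best_err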

The non-fraud-central rule defined in Equation (11), in which β2p is a parameter for a pair p of leaf nodes (a leaf node *y of NFT matched to a leaf node #x of FT), states that some fraud samples cluster around a subset of non-fraud samples. That is, for the sample c that is classified into the leaf node *y of NFT, if D_nft^c is smaller than the value of Avg_y^c + β2p × Std_x^c, the sample c will be classified as the non-fraud one; otherwise, the fraud one. The parameter β2p also needs to be tuned to find the optimal discrimination boundary (i.e., Avg_y^c + β2p × Std_x^c). Therefore, in step 6, we use the optimization problem (12) to determine the parameter β2p through the minimization of the sum of (type I and type II) classification errors. In the optimization problem (12), the sets SF and SNF are given. For each c, the values of D_nft^c, Avg_y^c, and Std_x^c are also given. The constants w1 and w2 in the objective function are set to the same values as in optimization problem (8).

The approach for solving the optimization problem (8) is also applied to solve the optimization problem (12), obtaining the optimal range of β2p that minimizes the value of the corresponding objective function.
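Reusing the illustrative tuner sketched above, the non-fraud-central case only swaps which class lies inside the boundary and, accordingly, which error weight applies to which misclassification:

    # Non-fraud-central rule: a sample inside the boundary is classified as
    # non-fraud, so pass the complement of the fraud labels and swap the error
    # weights so each misclassification keeps its original weighting.
    # beta2, err2 = tune_beta(d_nft, avg_y, std_x, ~is_fraud, w1=w2, w2=w1)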

In step 7 of Table 4, the picked classification rule is the fraud-central rule if the sum of classification errors obtained in step 5 is smaller than the one obtained in step 6; otherwise, it is the non-fraud-central rule. The dominance of the non-fraud-central rule implies a spatial relationship among fraud and non-fraud samples in which most fraud samples cluster around their non-fraud counterparts. The dominance of the fraud-central rule implies a spatial relationship in which most non-fraud samples cluster around their fraud counterparts.

3.2.3 Feature-extracting module

The feature-extracting module is executed via step 8 of Table 4. For each clustered group based upon fraud samples, the feature-extracting module applies PCA to select features or to extract factors (i.e., principal components) that link to fraud-related features from exogenous information. It further represents the inherent variable features to reveal each group's heterogeneity; the purpose of feature selection is to exclude variables irrelevant to the modeling problem for a particular group.

Here we use PCA to do feature selection by selecting a set of variables that best represents the composite features of an investigated leaf node of the GHSOM clustering result based upon fraud samples.

The main objective of the PCA is to determine the important dimensions (characteristics) that can explain the input variable features of the analyzed samples and can explore underlying patterns of relationships between the input variables. The input variables are the same as the GHSOM input variables, and the fraud/non-fraud dichotomous variable is set as the dependent variable. Only those factors that account for variances greater than 1 (eigenvalue > 1) are included in the model. This criterion, also called the K1 method, was proposed by Kaiser (1960) and is probably the one most widely used. According to this rule, only the factors that have eigenvalues greater than one are retained for interpretation. Factors with variance less than one are no better than a single ratio, since each ratio has a variance of 1.

The other objective of the PCA is to calculate factor scores for each sample according to the determined factors. Then, to enhance the interpretability of the factors, the varimax factor rotation method is used in PCA. This method minimizes the number of variables that have high loadings on a factor, and all factor loadings are presented. Here, variables with large loadings on the same factors are grouped, and small factor loadings are omitted. Each estimated factor represents a specific characteristic of the firms under consideration (Canbas et al., 2005).
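A numpy-only sketch of this procedure for one leaf node's samples: standardize the inputs, eigendecompose the correlation matrix, keep factors with eigenvalue > 1 (the K1 criterion), and rotate the loadings with varimax. The rotation routine below is a standard textbook implementation, not taken from the paper:

    import numpy as np

    def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
        """Standard varimax rotation of a p-by-k loading matrix."""
        p, k = loadings.shape
        R = np.eye(k)
        d = 0.0
        for _ in range(max_iter):
            L = loadings @ R
            u, s, vt = np.linalg.svd(
                loadings.T @ (L**3 - (gamma / p) * L @ np.diag((L**2).sum(axis=0))))
            R = u @ vt
            d_new = s.sum()
            if d_new < d * (1 + tol):
                break
            d = d_new
        return loadings @ R

    def leaf_factor_loadings(X):
        """Kaiser-criterion PCA on one leaf node's samples X (rows = samples)."""
        Z = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize the input variables
        eigval, eigvec = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
        order = np.argsort(eigval)[::-1]            # sort factors by variance explained
        eigval, eigvec = eigval[order], eigvec[:, order]
        keep = eigval > 1.0                         # K1 method: eigenvalue > 1
        loadings = eigvec[:, keep] * np.sqrt(eigval[keep])
        return varimax(loadings)                    # large rotated loadings flag representative variables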

The outcomes of the feature-extracting module are several representative variables that serve as the 'variable pattern' for each clustered group. Hence, by comparing the similarity of the features selected by PCA, we can efficiently examine a single group or compare different groups. Besides, after determining the basic financial factors from the training samples, an early warning model can be estimated from the obtained factors using methods such as discriminant analysis, logit, probit, or neural networks.

3.2.4 Pattern-extracting module

The pattern-extracting module is executed via step 9 of Table 4. The exogenous information on fraud behaviors beyond the financial numbers is used in this module.

Extracting the fraud categories of a certain investigated sample can help reveal more domain information. We can use any qualitative method to analyze the category of fraud from any available structured, semi-structured, or unstructured resource, such as news, reports, or other fraud-related content. First, the categories of fraud should be determined from an authentic reference. Then, for a leaf node of FT, any qualitative approach can be used to classify the fraud categories of each sample belonging to the leaf node. If the source of the fraud categories is structured, we only have to encode the class data as another extracted feature.
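If the fraud-category source is already structured, the encoding step can be as simple as one-hot encoding the category labels of a leaf node's samples; a small illustrative sketch:

    import numpy as np

    def encode_categories(labels, categories):
        """One-hot encode per-sample fraud-category labels as extra features."""
        index = {cat: i for i, cat in enumerate(categories)}
        out = np.zeros((len(labels), len(categories)))
        for row, lab in enumerate(labels):
            out[row, index[lab]] = 1.0
        return out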
