
This section presents an extension of the lazy discretization method that allows a naive Bayesian classifier to classify a query vector containing set or interval values.

Table 2. Performance comparison of lazy and fast-lazy discretization methods in terms of the classification accuracy.

7.1. Training and classification for set and interval data

In previous work on naive Bayesian classifiers, variables usually have point values, in the sense that their values are atomic. In fact, a variable may have a set value if its value is a set, or an interval value if its value is an interval. If the value of a set-valued variable contains interval members, we call the variable a multi-interval variable. For example, variable x1 = {red, yellow, green} has a set value, x2 = [20.5, 38.5] has an interval value, and x3 = {[20.5, 38.5], [49.6, 60.4]} has a multi-interval value.
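For concreteness, the following minimal Python sketch shows one possible way to represent the three value types in the example above; the helper contains() is illustrative and not part of the paper's method.

```python
# Point value: an atomic symbol or number.
x0 = 24.3

# Set value: a set of atomic members.
x1 = {"red", "yellow", "green"}

# Interval value: a (low, high) pair.
x2 = (20.5, 38.5)

# Multi-interval value: a set of intervals, stored as a frozenset of pairs.
x3 = frozenset({(20.5, 38.5), (49.6, 60.4)})

def contains(value, point):
    """Check whether a numeric point falls inside an interval or multi-interval value."""
    if isinstance(value, frozenset):          # multi-interval value
        return any(lo <= point <= hi for lo, hi in value)
    lo, hi = value                            # single interval value
    return lo <= point <= hi
```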

In many situations, it is more appropriate to model a variable as set-valued or interval-valued than as point-valued. One such situation is when the variable represents a time frame within a certain period of time in temporal reasoning. Another is when each query vector represents the observation of a collection instead of an individual. Interval values are also more appropriate when the data are obtained from a sensor that returns an interval instead of a point value; the interval can be formed by the maximum and minimum values detected by the sensor.

Table 3. Accuracies of the naive Bayesian classifier using three well-known discretization methods and using parameter estimation assuming a normal distribution.

Data set      Ten bins       Entropy        Bin-log l      Normal
Australian    85.22 ± 2.73   85.94 ± 0.98   84.78 ± 1.94   77.10 ± 2.27
Hypothyroid   97.12 ± 0.56   98.29 ± 0.52   97.00 ± 0.43   97.78 ± 0.36
Iris          94.67 ± 3.40   95.33 ± 1.63   95.33 ± 3.40   96.00 ± 3.89

Table 4. Required differences of Bonferroni multiple comparisons between the lazy methods and the well-known methods.

To classify a query vector that contains an interval-valued variable, no discretization is necessary, because we can simply estimate the probability of the interval from the training data set, as if the interval were a discretized interval in lazy discretization. Set-valued and multi-interval-valued data can be classified in a similar manner. Let x be a variable that takes such a value Im in the query vector. The class-conditional density of x given class c is p(x ∈ Im | c). By perfect aggregation (Theorem 1) and Lemma 2, we can estimate p̂(x ∈ Im | c) by

$$\hat{p}(x \in I_m \mid c) = \frac{\alpha_m + y_{cm}}{\alpha + n_c} \qquad (9)$$

where ycm is the number of class c examples in the training data set whose x ∈ Im, and αm = α1 + α2 + · · · + αk. Equation (9) is a generalization of Eq. (3) and can be applied to query vectors that contain discrete or set-valued data. Note that as long as we can compute ycm, we can apply this equation regardless of the data type of x in the training data set. That is, in the training data set, x can be point-valued, set-valued, or interval-valued.
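As an illustration of Eq. (9), the following hedged Python sketch estimates p̂(x ∈ Im | c) from a training set whose x values may themselves be points, intervals, or multi-intervals. The helper names (overlaps, estimate_interval_prob) and the decision to count an interval-valued training example when it intersects Im are our assumptions, not details taken from the paper.

```python
def overlaps(value, interval):
    """True if a numeric training value lies in (or intersects) the query interval."""
    lo, hi = interval
    if isinstance(value, (int, float)):               # point value
        return lo <= value <= hi
    if isinstance(value, tuple):                      # single interval (lo, hi)
        return not (value[1] < lo or hi < value[0])
    return any(overlaps(v, interval) for v in value)  # multi-interval value

def estimate_interval_prob(train_x, train_y, c, interval, alpha_m, alpha):
    """Eq. (9): p_hat(x in I_m | c) = (alpha_m + y_cm) / (alpha + n_c)."""
    n_c = sum(1 for y in train_y if y == c)
    y_cm = sum(1 for xv, y in zip(train_x, train_y)
               if y == c and overlaps(xv, interval))
    return (alpha_m + y_cm) / (alpha + n_c)
```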

7.2. Experimental results

Since there are no set-valued or interval-valued data in the UCI ML repository (Blake & Merz, 1998), we developed a technique for generating interval-valued data from point-valued data in order to observe the effectiveness of this new method. The idea is to form intervals by selecting pairs of point-valued continuous data with the same class label such that the difference of the values in each pair is within a certain range. By selecting an appropriate range for each data set, classifying interval data can yield much better accuracies than classifying point-valued data.

We used the ten “pure” continuous data sets from the UCI ML repository to evaluate our approach empirically and compare its performance with the well-known discretization methods. Suppose we have two query vectors with the same unknown class label and x is a continuous variable in these vectors. Suppose the values of x for these vectors are X1 and X2, respectively, and X1 ≤ X2. Then the x value of the merged vector is [X1, X2]. The idea is that if we have two different samples of the same unknown class, then it is very likely that the values of this feature in this class fall within the interval bounded by the values from the two samples. For example, suppose we pick up two leaves dropped from a tree of an unknown genus and we want to use the features of the leaves, such as their length, to recognize the genus. If the lengths of the two leaves are 16.57 and 17.32, respectively, then we can assume that the length of the leaves of this genus is within the interval [16.57, 17.32].

However, the width of the intervals created by merging should be constrained. If the two values are too close and the merged interval becomes too narrow, the number of training examples falling in the interval will be too small to allow accurate estimation of p̂(x ∈ I | c). Therefore, we set a minimum threshold for the interval width: if the width of the interval created by merging is smaller than this threshold, we extend both ends of the interval until it reaches the minimum width.

On the other hand, if the difference between the two values is too large and the merged interval becomes too wide, the interval may contain decision boundaries and result in misclassification. Consider a simple classification task that involves a single variable x and two classes c1 and c2. Suppose we have two data points D1 and D2 that have the same class label c1, and their x values are marked in figure 13, which also shows the curves of the “true” densities of p(x | c). If we classify D1 and D2 individually based on these densities, both D1 and D2 will be correctly classified to class c1. If we merge D1 and D2, we create an interval I as shown in figure 13. The merged data will be incorrectly classified to c2 because p(x ∈ I | c1) < p(x ∈ I | c2) (illustrated as the shaded areas in figure 13). Therefore, we must set a maximum threshold for the width of the merged intervals. If the width is larger than this threshold, we partition the interval into multiple intervals with a pre-determined width SI. In the case of figure 13, we form two intervals I1 and I2 that contain D1 and D2, respectively. The merged query vector then has a multi-interval value {I1, I2}.

Figure 13. Merging two widely separated samples may result in misclassification.

In the experiment, the thresholds are determined empirically as follows. The minimum-width threshold for each variable is equal to the width of a “fifty-bin” interval for large data sets (size > 350), a “twenty-bin” interval for medium-sized data sets (size between 150 and 350), and a “fifteen-bin” interval for small data sets. The maximum-width threshold is equal to the “four-bin” width, because the experiments in Section 5.2 show that when the number of bins is greater than four, the accuracies reach a plateau for most data sets. The value of SI is equal to the minimum-width threshold of the variable.
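The merging procedure just described can be sketched as follows. This is a minimal illustration under our own assumptions: the parameter names (min_width, max_width, s_i) are illustrative, and centering an interval of width SI on each value is one way to realize the partitioning step; the paper does not prescribe an exact placement.

```python
def merge_pair(x1, x2, min_width, max_width, s_i):
    """Merge two point values of the same variable into an interval or
    multi-interval value, subject to the width thresholds described above."""
    lo, hi = min(x1, x2), max(x1, x2)
    width = hi - lo
    if width > max_width:
        # Too wide: form one interval of width s_i around each value instead,
        # giving a multi-interval value {I1, I2}.
        return [(x - s_i / 2.0, x + s_i / 2.0) for x in (x1, x2)]
    if width < min_width:
        # Too narrow: extend both ends so the interval reaches the minimum width.
        pad = (min_width - width) / 2.0
        lo, hi = lo - pad, hi + pad
    return [(lo, hi)]
```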

For each data set selected for the experiment, we randomly chose twenty percent of the data from each class for testing and used the remainder for training. We then randomly picked pairs of data from the test set and merged them into interval data by applying the method described above. The merged data are classified by the “lazy interval” approach described in Section 7.1 and the resulting accuracy is reported. This procedure is repeated ten times to obtain the average and the standard deviation of the accuracies for each data set. Table 5 gives the results. We also extract the best results for each data set from Tables 2 and 3 for comparison.
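A hedged sketch of this evaluation protocol is given below. It assumes two hypothetical helpers: classify_interval(), standing in for the lazy interval classifier of Section 7.1, and merge_pair(), here assumed to merge two feature vectors of the same class element-wise into interval values as described above; both names are illustrative.

```python
import random
import statistics

def evaluate(data, labels, classify_interval, merge_pair, n_runs=10):
    accuracies = []
    for _ in range(n_runs):
        # Per-class 80/20 split into training and test sets.
        by_class = {}
        for x, y in zip(data, labels):
            by_class.setdefault(y, []).append(x)
        train, test_by_class = [], {}
        for y, xs in by_class.items():
            random.shuffle(xs)
            cut = len(xs) // 5                      # 20% of each class for testing
            test_by_class[y] = xs[:cut]
            train += [(x, y) for x in xs[cut:]]
        # Merge random pairs of test vectors (within a class) into interval data.
        correct = total = 0
        for y, xs in test_by_class.items():
            for x1, x2 in zip(xs[0::2], xs[1::2]):
                merged = merge_pair(x1, x2)         # interval-valued query vector
                total += 1
                correct += (classify_interval(train, merged) == y)
        accuracies.append(100.0 * correct / total)
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```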

For all data sets, “lazy interval” outperforms the best result of discretization significantly. This result reveals that the “lazy interval” approach is useful for classifying interval data.

The improvement in classification accuracy achieved by the lazy interval approach can be attributed to the pre-processing step that merges the point-valued data into multi-interval data. The merging reduces the probability of “Bayes error.” Bayes error occurs when a query vector is classified to a wrong class when the Bayes decision rule is applied. The region of Bayes error for a class C and a variable x is the region in the domain of x where Bayes error occurs.

Given a pair of query vectors to be merged, their values can be either inside or outside of the region of Bayes error. Therefore, we have three possible combinations. The first combination is that both query vectors are outside of the region of Bayes error; then the merged interval data will be correctly classified. In the second combination, both query vectors are in the region of Bayes error, and the classification result will be incorrect. Both of the above combinations yield the same result as classifying the data without merging.

Table 5. Comparison of the accuracies achieved by the naive Bayesian classifier that classifies merged interval data and the best accuracies achieved by the naive Bayesian classifier that classifies point data.

Data set     Lazy interval    Best of Tables 2 and 3
Breast-W     98.60 ± 1.31     96.00 ± 3.08
German       78.40 ± 2.76     75.30 ± 1.42
Glass        67.39 ± 6.53     67.12 ± 3.60
Iris         100.00 ± 0.00    97.00 ± 4.33
Liver        68.29 ± 5.49     65.22 ± 2.25
PIMA         81.71 ± 4.22     76.18 ± 4.84
Vehicle      64.22 ± 5.36     62.29 ± 2.51
Waveform     89.45 ± 2.52     82.17 ± 2.87
Wine         99.44 ± 1.67     97.78 ± 1.67
Yeast        64.66 ± 3.04     58.72 ± 2.85
Average      81.22            77.78

In the third combination, one of the query vectors is in the region while the other is not. In this case, without merging, one query vector will be classified correctly while the other will be misclassified. But with merging, it is more likely that the merged query vector will be correctly classified, and as a result, the classification accuracy will be boosted.

We can show that this is the case. Consider again a simple classification task that involves a single variable x and two classes c1 and c2, with their p.d.f. curves given in figure 14. D1 and D2 are two query vectors of class c1. D2 is in the region of Bayes error but D1 is not.

Figure 14. The area under the p.d.f. curve in the region of Bayes error is usually small and therefore merging can improve the classification accuracy.

If we classify each of them based on their conditional densities, D2 will be misclassified.

However, if we merge D2 with D1, the classification will be based on p(x ∈ [D1, D2] | c1) and p(x ∈ [D1, D2] | c2), the areas over the interval [D1, D2] under the curves of p(x | c1) and p(x | c2), respectively. For an x value in the region of Bayes error, such as D2, since D2 is supposed to be of class c1, it is likely that both p(x | c1) and p(x | c2) are low and their difference is small. On the other hand, if x is outside of the region, such as D1, then it is quite likely that p(x | c1) ≫ p(x | c2). Therefore, chances are that we have p(x ∈ [D1, D2] | c1) > p(x ∈ [D1, D2] | c2), as illustrated in figure 14, and hence the merged query vector will be classified correctly. A counterexample is the pair D2 and D3. Merging this pair will yield an incorrect result. But it is unlikely that we would select such a pair of query vectors, because p(x = D3 | c1) must be relatively low.
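The following small Python example illustrates this argument numerically, using two hypothetical Gaussian class-conditional densities (means 0 and 3, unit variance) standing in for p(x | c1) and p(x | c2); the values of D1 and D2 are made up for illustration, not taken from the paper.

```python
from math import erf, exp, pi, sqrt

def norm_cdf(x, mu, sigma=1.0):
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def norm_pdf(x, mu, sigma=1.0):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

mu1, mu2 = 0.0, 3.0          # p(x | c1), p(x | c2); decision boundary at x = 1.5
D1, D2 = 0.5, 1.8            # D2 lies in the region of Bayes error for class c1

# Classified individually, D2 is misclassified: p(D2 | c2) > p(D2 | c1).
print(norm_pdf(D2, mu1), norm_pdf(D2, mu2))          # ~0.079 vs. ~0.194

# Classified after merging, the interval [D1, D2] favors the correct class c1.
p1 = norm_cdf(D2, mu1) - norm_cdf(D1, mu1)           # ~0.27
p2 = norm_cdf(D2, mu2) - norm_cdf(D1, mu2)           # ~0.11
print(p1 > p2)                                       # True: merged vector -> c1
```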

8. Conclusions

This paper reviews the properties of Dirichlet distributions and describes their implications for and applications to learning naive Bayesian classifiers. The results can be summarized as follows:

– Perfect aggregation of Dirichlets ensures that a naive Bayesian classifier with discretization can effectively approximate the distribution of a continuous variable.

– A discretization method should partition the domain of a continuous variable into intervals such that the cut points are close to decision boundaries and the intervals are sufficiently wide. It turns out that this requirement is not difficult to achieve for many data sets, and as a result, a wide variety of discretization methods can have similar performance regardless of their complexity.

– We presented a new lazy discretization method, which is derived directly from the properties of Dirichlet distributions, and showed that it performs as well as well-known discretization methods. These results justify the selection of Dirichlet priors and verify our analysis.

– We also extended the method to allow a naive Bayesian classifier to classify set and interval data.

As future work, we plan to investigate whether our analysis can provide new insight into handling continuous variables in general Bayesian networks.

Acknowledgments

The research reported here was supported in part by the National Science Council in Taiwan under Grant No. NSC 89-2213-E-001-031.

Notes

1. Though the documentation of the heart data set states that all of its attributes are continuous, attributes 2, 3, 6, 7, 9, 11, 12, and 13 are actually binary or discrete.

2. The accuracies were obtained from five-fold cross-validation tests where continuous variables are handled by the lazy ten-bin discretization method described in Section 6.

References

Almond, R. (1995). Graphical Belief Modelling. New York: Chapman and Hall.

Azaiez, M. N. (1993). Perfect aggregation in reliability models with Bayesian updating. Ph.D. thesis. Department of Industrial Engineering, University of Wisconsin-Madison, Madison, Wisconsin.

Blake, C., & Merz, C. (1998). UCI repository of machine learning databases.

Cestnik, B., & Bratko, I. (1991). On estimating probabilities in tree pruning. In Machine Learning—EWSL-91, European Working Session on Learning (pp. 138–150). Berlin, Germany: Springer-Verlag.

Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103–130.

Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. In Machine Learning: Proceedings of the 12th International Conference (ML ’95). San Francisco, CA: Morgan Kaufmann.

Duda, R. O., & Hart, P. E. (1973). Pattern Classification and Scene Analysis. New York: Wiley and Sons.

Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous valued attributes for classification learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI ’93) (pp. 1022–1027).

Heckerman, D. (1998). A tutorial on learning with Bayesian networks. In M. I. Jordan (ed.), Learning in Graphical Models (pp. 301–354). Boston: Kluwer Academic Publishers.

Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11, 63–90.

Iwasa, Y., Levin, S., & Andreasen, V. (1987). Aggregation in model ecosystem: Perfect aggregation. Ecological Modeling, 37, 287–302.

John, G., & Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI ’95) (pp. 338–345).

Kohavi, R., & Sahami, M. (1996). Error-based and entropy-based discretization of continuous features. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD ’96) (pp. 114–119). Portland, OR.

Langley, P., & Thompson, K. (1992). An analysis of Bayesian classifiers. In Proceedings of the 10th National Conference on Artificial Intelligence (pp. 223–228). AAAI Press.

McClave, J. T., & Dietrich, F. H. (1991). Statistics. San Francisco: Dellen Publishing Company.

Monti, S., & Cooper, G. F. (1999). A Bayesian network classifier that combines a finite mixture model and a naive Bayes model. In The Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI ’99) (pp. 447–456).

Spector, P. (1990). An Introduction to S and S-PLUS. Duxbury Press.

Wilks, S. S. (1962). Mathematical Statistics. New York: Wiley and Sons.

Wong, T.-T. (1998). Perfect aggregation in dependent Bernoulli systems with Bayesian updating. Ph.D. thesis, Department of Industrial Engineering, University of Wisconsin-Madison, Madison, Wisconsin.

Received November 9, 2000 Revised May 17, 2002 Accepted May 17, 2002 Final manuscript May 17, 2002
