Bayesian Classification for Set and Interval Data

Hung-Ju Huang
Department of Computer and Information Science
National Chiao-Tung University
Hsinchu City 300, Taiwan
gis85563@cis.nctu.edu.tw

Chun-Nan Hsu
Institute of Information Science
Academia Sinica, Taipei City, Taiwan
chunnan@iis.sinica.edu.tw

Abstract: Learning naive Bayesian classifiers is an important approach to probabilistic induction. However, no study has been done on naive Bayesian classifiers when a query vector includes interval-valued data, and little is known about how a set of query vectors from the same unknown class can be accurately classified. In this paper, we present a new training approach to the problems above. This approach is based on the "perfect aggregation" property of the Dirichlet distribution, which is usually assumed to be the prior of the variables in a Bayesian classifier. The experimental results show that when we merge an appropriate number of query vectors with the same unknown class and form interval-valued data, the accuracy of a trained naive Bayesian classifier can be improved significantly. This paper also reports a successful application of our approach to speaker recognition.

Key Words: Naive Bayesian Classifier, Interval Query, Machine Learning, Data Mining.

I. Introduction

Learning naive Bayesian classifiers is an important approach to probabilistic induction. In spite of its simplicity, naive Bayes consistently outperforms competing algorithms in the experiments reported in the literature. Remarkably, in KDD-CUP-97, two of the top three contestants were based on naive Bayes [1]. Meanwhile, Domingos and Pazzani [2] reported an experiment that compared naive Bayes with several classical learning algorithms, such as C4.5, on a large ensemble of real data sets. The results also showed that naive Bayes can significantly outperform those algorithms. Hence, the naive Bayesian classifier is becoming a popular tool for classification.

Naive Bayes can handle discrete variables and continuous variables by assuming that their priors are the Dirichlet distribution and the normal distribution, respectively [3]. It has been shown that when a continuous variable is not normal, the performance will be inferior to discretization. Several researchers have studied how to handle continuous variables for Bayesian classifiers [3-7]. However, no study has been done on naive Bayesian classifiers when a query vector includes interval-valued data [8], and little is known about how a set of query vectors from the same unknown class can be accurately classified. The relation between these two problems can be illustrated by the following example. Suppose we pick up two leaves dropped from a tree and would like to use their features, such as their length, to classify what kind of tree they came from. In general, the lengths of the two leaves will not be equal. Conventionally, we can classify each individual leaf by a classifier. But if the classification results are not the same, we have to come up with a method to resolve the difference. Alternatively, we can merge each feature of the two leaves to form new interval-valued data, and then classify the merged data.

Consider a data set that comes from three different classes, and suppose we want to classify two query vectors. Assume that we have no prior knowledge about the class distribution, so we randomly assign a class label to each vector. The expected accuracy will be 1 × (1/3)(1/3) + 0.5 × 2 × (1/3)(2/3) + 0 × (2/3)(2/3) = 1/3. On the other hand, suppose we know that the two vectors belong to the same unknown class and randomly assign their common class. The expected accuracy will still be 1 × 1 × (1/3) + 0 × 1 × (2/3) = 1/3, regardless of our knowledge that they come from the same class. This shows that a classifier must deliberately take advantage of that knowledge, or the knowledge will not improve the expected accuracy.

In this paper, we present a method that allows a naive Bayesian classifier to process interval-valued data. This method is then extended to classify merged query vectors when we know that these vectors have the same unknown class label. The key of our approach is based on our study of the Dirichlet distribution and its properties. A discrete variable, as well as a discretized continuous variable, in a naive Bayesian classifier is usually assumed to have a Dirichlet prior. Perfect aggregation of Dirichlets implies that we can estimate the class-conditional probabilities of discretized intervals regardless of how the other regions of the domain of the continuous variable are discretized. This is why we can process multiple interval-valued data. The experimental results show that when we use interval data formed by merging an appropriate number of query vectors with the same unknown class label, the accuracy of a naive Bayesian classifier will be improved significantly.

II. Preliminary

A. Dirichlet Distribution and Perfect Aggregation

A random vector θ = (θ1, θ2, ..., θk) has a k-variate Dirichlet distribution with parameters αj > 0 for j = 1, 2, ..., k+1 if it has density

    f(\theta) = \frac{\Gamma(\sum_{j=1}^{k+1} \alpha_j)}{\prod_{j=1}^{k+1} \Gamma(\alpha_j)} \prod_{j=1}^{k} \theta_j^{\alpha_j - 1} \, (1 - \theta_1 - \cdots - \theta_k)^{\alpha_{k+1} - 1}

for θ1 + θ2 + ... + θk ≤ 1 and θj ≥ 0, j = 1, 2, ..., k.

This distribution will be denoted Dk(α1, α2, ..., αk; αk+1). A Beta distribution is a univariate Dirichlet distribution and is usually denoted Beta(α1, α2).

Several properties of the Dirichlet distribution are critical to our approach. These properties greatly simplify the computation of the moments of the Dirichlet distribution in Bayesian analysis. Suppose random vector θ = (θ1, θ2, ..., θk) has a Dirichlet distribution Dk(α1, α2, ..., αk; αk+1). Then by [9], any subvector (θn1, θn2, ..., θnm) of θ has an m-variate Dirichlet distribution Dm(αn1, αn2, ..., αnm; α − Σ_{j=1}^{m} αnj), where α = α1 + α2 + ... + αk+1. We call this the subvector lemma. Also by [9], the sum of any subset Θ ⊆ {θ1, θ2, ..., θk} has a Beta distribution with parameters Σ_{θj∈Θ} αj and α − Σ_{θj∈Θ} αj. This is called the sum-of-subset lemma.

Another important property of Dirichlets is that the Dirichlet distribution is conjugate to multinomial sampling [10]. This property basically states that the posterior distribution of a Dirichlet given our observations is also a Dirichlet. Formally, let D = {y1, y2, ..., yk+1} be a data set for the outcomes in n trials, where yj denotes the number of trials turning out to be outcome j. When the prior distribution p(θ) is a Dirichlet distribution with parameters αj for j = 1, 2, ..., k+1, and the likelihood function L(D|θ) follows a multinomial distribution, then the posterior distribution p(θ|D) is also a Dirichlet distribution with parameters α'j = αj + yj for j = 1, 2, ..., k+1. Similarly, the Beta distribution is conjugate to binomial sampling. The expected value of θj given D is E[θj|D] = (αj + yj)/(α + n) for j = 1, 2, ..., k+1. This expression can be rewritten as follows:

    E[\theta_j \mid D] = \frac{\alpha_j + y_j}{\alpha + n} = \frac{\alpha}{\alpha + n}\cdot\frac{\alpha_j}{\alpha} + \frac{n}{\alpha + n}\cdot\frac{y_j}{n} = w\,E[\theta_j] + (1 - w)\,\frac{y_j}{n},

where w = α/(α + n). Note that E[θj] and yj/n are the prior and sample means of θj, respectively. Hence, w and 1 − w can be thought of as the weights of the prior and sample means, respectively, and the weights for all θj are identical. This reveals the advantage that a Bayesian analysis considers both prior information and training data.

Let Θ be a subset of {θ1, θ2, ..., θk}, and let the probability of interest q be the sum of the variables in Θ, i.e., q = Σ_{θj∈Θ} θj. Suppose the prior distribution of θ = (θ1, θ2, ..., θk) is a Dirichlet distribution Dk(α1, α2, ..., αk; αk+1). A straightforward application of the Bayesian approach is to use the training data D to update the prior distribution to obtain the posterior distribution f(θ|D), which is a Dirichlet distribution Dk(α1 + y1, α2 + y2, ..., αk + yk; αk+1 + yk+1). Then the posterior distribution f(q|D) is derived from f(θ|D). By the subvector lemma and the sum-of-subset lemma, f(q|D) is a Beta distribution:

    Beta\Big(\sum_{\theta_j \in \Theta} (\alpha_j + y_j),\; \alpha + n - \sum_{\theta_j \in \Theta} (\alpha_j + y_j)\Big)    (1)

However, by the properties of the Dirichlet distribution, there exists an alternative and simpler way to compute f(q|D). We can first derive the prior distribution of the probability of interest, f(q), from the prior distribution f(θ). Then we can convert the training data D into a new set of training data D' in terms of q by computing the sum of the observations of interest in D. Now, we can use D' to update the prior distribution of f(q) to obtain the posterior distribution f(q|D'). We can show that in general it is always the case that

    f(q \mid D) = f(q \mid D').

Since the multinomial likelihood function L(D|θ) implies that the trials for obtaining data set D are all independent, the likelihood function L(D'|q) will follow a binomial distribution. By the sum-of-subset lemma, the prior distribution of f(q) is a Beta distribution Beta(Σ_{θj∈Θ} αj, α − Σ_{θj∈Θ} αj) when θ has a Dirichlet distribution. Since the Beta distribution is conjugate to binomial sampling, the posterior distribution f(q|D') will have a Beta distribution with parameters Σ_{θj∈Θ}(αj + yj) and α + n − Σ_{θj∈Θ}(αj + yj). This is exactly the same as Equation (1). This property is always true for the Dirichlet distribution; it was first derived by [11] and is called "perfect aggregation" [12,13].

For example, suppose that we are interested in the probability of showing an odd number when throwing a die. Let θj be the probability that the die shows number j in a trial, and let yj be the number of trials in which the die shows j among n trials. Then the probability of interest can be represented as q = θ1 + θ3 + θ5. In the straightforward approach, we derive the distribution f(q|D) from the data {y1, y2, ..., y6}, while in the alternative approach, we can use D' = {n, y1 + y3 + y5} instead and will obtain the same result.
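To make the die example concrete, the following Python sketch (ours, not from the paper; the prior parameters and the counts are invented for illustration) checks that the straightforward update followed by aggregation and the aggregated Beta-binomial update yield the same Beta posterior:

```python
# Minimal sketch of "perfect aggregation" on the die example.
alpha = [1, 1, 1, 1, 1, 1]        # Dirichlet prior over the six faces (illustrative)
y     = [7, 3, 9, 4, 6, 1]        # observed counts in n = 30 throws (illustrative)
n     = sum(y)
odd   = [0, 2, 4]                 # indices of faces 1, 3, 5

# Straightforward approach: update the full Dirichlet, then marginalize.
# By the sum-of-subset lemma the posterior of q = theta1+theta3+theta5 is
# Beta(a_full, b_full) with:
a_full = sum(alpha[j] + y[j] for j in odd)
b_full = sum(alpha) + n - a_full

# Alternative approach: aggregate first (D' = {n, y1+y3+y5}), then do a
# single Beta-binomial update.
a_prior = sum(alpha[j] for j in odd)
b_prior = sum(alpha) - a_prior
k = sum(y[j] for j in odd)                 # "successes": odd outcomes
a_agg, b_agg = a_prior + k, b_prior + (n - k)

assert (a_full, b_full) == (a_agg, b_agg)  # identical Beta posteriors
print("posterior mean of q:", a_agg / (a_agg + b_agg))
```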

B. Naive Bayesian Classifier

A naive Bayesian network classifies a feature vector x by selecting the class c that maximizes the posterior probability

    p(c \mid \mathbf{x}) \propto p(c) \prod_{x \in \mathbf{x}} p(x \mid c),    (2)

where x is a variable in x and p(x|c) is the class-conditional density of x given class c. Let θ denote the vector whose elements are the parameters of the density p(x|c). In a Bayesian learning framework, we assume that θ is an uncertain variable [10] that can be learned from a training data set. This estimation is at the heart of training a naive Bayes.

Suppose x is a discrete variable with k+1 possible values. In principle, the class label c of the data vector x dictates the probability of the value of x. Thus the appropriate p.d.f. is a multinomial distribution, and its parameters are a set of probabilities {θ1, θ2, ..., θk+1} such that for each possible value Xj, p(x = Xj|c) = θj and Σj θj = 1. Now, let θ ≡ (θ1, θ2, ..., θk). We choose a Dirichlet distribution with parameters α1, ..., αk+1 as the prior for θ. Given a training data set, we can update p(x = Xj|c) by its expected value:

    \hat{p}(x = X_j \mid c) = \frac{\alpha_j + y_{cj}}{\alpha + n_c},    (3)

where nc is the number of training examples belonging to class c and ycj is the number of class c examples whose x = Xj. Since a Dirichlet distribution is conjugate to multinomial sampling, after training the posterior distribution of θ is still a Dirichlet, but with the updated parameters αj + ycj for all j. This property allows us to incrementally train a naive Bayes. In practice, we usually choose the Jaynes prior [14] αj = α = 0 for all j and have p̂(x|c) = ycj/nc. However, when the training data set is too small, this often yields p̂(x|c) = 0 and impedes classification. To avoid this problem, another popular choice is αj = 1 for all j. This is known as smoothing or Laplace's estimate [15].

If x is a continuous variable, discretization is often used. Generally, discretization involves partitioning the domain of x into k+1 intervals as a pre-processing step. Then we can treat x as a discrete variable with k+1 possible values and conduct the training and classification. More precisely, let Ij be the j-th discretized interval. Training and classifying in a naive Bayes with discretization uses p̂(x ∈ Ij|c) as an estimate of p̂(x|c) in Equation (3) for each continuous variable. This is equivalent to assuming that after discretization, the class-conditional density of x has a Dirichlet prior. We call this assumption the "Dirichlet discretization assumption." Apparently, this assumption holds for all well-known discretization methods, including ten-bin, entropy-based, etc. See [4] for a comprehensive survey.

The Dirichlet discretization assumption is reasonable because of another implicit assumption described below. Let f(x|c) be the "true" probability density function of p(x|c), and assume that f(x|c) is integrable everywhere. Then for any discretized interval Ij, the "true" probability is p(x ∈ Ij|c) = ∫_{Ij} f(x|c) dx. By choosing an equivalent sample size α, the Dirichlet parameter corresponding to the random variable "x ∈ Ij|c" is α ∫_{Ij} f(x|c) dx. We call this assumption the "partition independence assumption." By the partition independence assumption, any discretization of x can have a Dirichlet prior. The partition independence assumption implies that for any interval, the Dirichlet parameter corresponding to this interval depends only on the area below the curve of the p.d.f. f(x|c), and is independent of the shape of the curve within the interval. In Figure 1, the shapes of the p.d.f. curves in [a1, a2] are different, yet the Dirichlet parameters corresponding to this interval for the two p.d.f.s are identical.

Fig. 1. Partition independence assumption.

C. Implications of Perfect Aggregation

An implication of perfect aggregation is that to estimate the posterior probability of a union of disjoint events, there is no need to estimate the probabilities of the individual events. Another implication is that when perfect aggregation holds, identifying the exact outcome of an observation in D is not necessary. In this case, we are only concerned with whether the result of an observation is an outcome included in the event corresponding to the probability of interest. Thus, when a probability of interest is known before training, the perfect aggregation property can simplify the training effort spent identifying the outcome of an observation in D. These implications allow us to derive a lazy discretization method and a multi-interval classifier for naive Bayes.

In our previous work [16], we proposed a lazy discretization method for continuous variables. This method waits until one or more test data are given to determine the cut points for each continuous variable. It produces only a pair of cut points surrounding the value of a test datum for each variable. That is, it creates one interval (denoted as I) and leaves the other regions untouched. From the training data, we can estimate p̂(x ∈ I|c) by the expression given in Section II-B and use this estimate to classify the test datum. This method can invoke different instantiations to determine the cut points. For example, we can select a pair of cut points such that the value of x is in the middle and the distance between the cut points is the same as the width of the intervals created by ten-bin (a discretization method that divides the domain of a continuous variable into ten equal-width bins). We will call this "lazy ten-bin." Similarly, we can have "lazy entropy," "lazy bin-log," etc.

This discretization method is derived from perfect aggregation and other properties of the Dirichlet distribution. Suppose the partition independence assumption holds. Then by the sum-of-subset lemma of the Dirichlet distribution, "x ∈ I | c" will have a Beta prior with parameters α ∫_I p(x|c) dx and α (1 − ∫_I p(x|c) dx), where c is a class and α is an equivalent sample size. By perfect aggregation, we can estimate p̂(x ∈ I|c) by counting how many class c examples have x ∈ I and how many do not. In other words, there is no need to check the exact value of x for those examples whose value of x is not in I. This may simplify the training effort.

To show that lazy ten-bin can perform as well as well-known discretization methods, we empirically compared lazy ten-bin, ten-bin, entropy-based, and a Gaussian version of naive Bayes on ten real data sets from the UCI machine learning repository [17]. Table I gives the average results on the ten real data sets with the different discretization methods. The details can be found in [16].
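As an illustration of the lazy-discretization estimate (our own sketch, not the authors' code; the uniform prior over the domain, the equivalent sample size, and the toy data are assumptions), p̂(x ∈ I|c) reduces to counting the class-c training examples that fall in the interval around the query value:

```python
# Minimal sketch of the lazy-discretization estimate p(x in I | c) for one
# continuous feature, with a "lazy ten-bin" style interval around the query.

def p_hat_interval(xs_of_class_c, low, high, domain, alpha=1.0):
    """Beta-posterior mean of p(x in [low, high] | c).

    The prior mass of the interval is its share of the domain (a uniform
    prior assumption); by perfect aggregation we only count how many class-c
    examples fall inside the interval, not their exact values.
    """
    lo_d, hi_d = domain
    prior_mass = (high - low) / (hi_d - lo_d)
    n_c = len(xs_of_class_c)
    y_c = sum(1 for x in xs_of_class_c if low <= x <= high)
    return (alpha * prior_mass + y_c) / (alpha + n_c)

# Lazy ten-bin: the query value gets an interval one tenth of the domain wide,
# centred on it, and each class is scored with the estimate above.
train = {'A': [1.0, 1.2, 1.1], 'B': [3.0, 3.3, 2.9]}
all_values = sum(train.values(), [])
domain = (min(all_values), max(all_values))
query, width = 1.15, (domain[1] - domain[0]) / 10
interval = (query - width / 2, query + width / 2)
scores = {c: (len(xs) / len(all_values)) * p_hat_interval(xs, *interval, domain)
          for c, xs in train.items()}
print(max(scores, key=scores.get))   # -> 'A'
```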

TABLE I. Average accuracies of naive Bayes with different discretization methods

Dataset     Lazy Ten-bin   Ten Bins   Entropy   Gaussian
Average        75.99         75.27     75.76     69.93
Win:Loss         -            8:2       5:5       9:1

III. Classifying Set and Interval Data

A. Set and Interval Data

We begin with some necessary definitions.

Definition 1: A variable is said to have a set value if its value is a set. For example, a variable X with its value equal to {a, b, c} is said to have a set value, where a, b, c are some possible states of X.

Definition 2: A variable is said to have an interval value if its value can be an interval; a vector is a piece of interval data if one of its element variables has an interval value. For example, if V1 includes a variable A, and A = [20.5, 38.5], then V1 is interval data.

Definition 3: When an element of a vector has a set value whose members are intervals, this vector is multi-interval data. For example, if V2 includes a variable B, and B = {[20.5, 38.5], [50.5, 60.5]}, then V2 is multi-interval data.

B. Training and Classification for Set and Multi-Interval Data

In Section II-C, we described the lazy discretization approach. If a query vector contains interval data, we can simply let the discretized interval I be the given interval, and no further discretization is necessary. This is a direct extension of lazy discretization to interval data. Furthermore, when we are interested in several different segments of a variable simultaneously, the class-conditional probability of the segments of interest given class c can be computed as p(x ∈ Im|c), where Im = {I1, I2, ..., Ik} and k is the number of segments of interest. By perfect aggregation and the sum-of-subset lemma of Dirichlets, we can estimate p̂(x ∈ Im|c) by Equation (4):

    \hat{p}(x \in I_m \mid c) = \frac{\alpha_m + y_{cm}}{\alpha + n_c}.    (4)

This is an extension of Equation (3), where ycm is the number of class c examples whose x ∈ Im, and αm = α1 + α2 + ... + αk is the sum of the Dirichlet parameters corresponding to the intervals in Im. So we can handle multi-interval data by Equation (4). When the variable is discrete and has a set value, Equation (4) can also be used. Note that query examples are still assumed to contain ordinary discrete and continuous values as usual.

To speed up the estimation of p̂(x ∈ I|c) in our implementation, we can divide the domain of each continuous variable into a large number of equal-width bins. Then we count the number of training examples falling in each bin for a given class c and save the counts in a table. After that, we can calculate an approximate value of p̂(x ∈ I|c) by examining the table. The larger the number of bins we divide in advance, the closer the estimate will be to the real value. In our experiments, we divided the domain of each continuous variable into one thousand equal-width bins.
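To illustrate how Equation (4) can be approximated from the pre-binned counts, here is a minimal Python sketch (ours, not the authors' implementation; the uniform prior mass, the data, and the bin count are illustrative assumptions):

```python
# Table-based approximation of Equation (4).  Bin counts are precomputed per
# class; a multi-interval query is answered by summing the counts of the bins
# that its intervals cover.

N_BINS = 1000

def build_table(values, lo, hi, n_bins=N_BINS):
    """Count class-c training values per equal-width bin."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def p_hat_multi_interval(counts, lo, hi, intervals, alpha=1.0):
    """Approximate (alpha_m + y_cm) / (alpha + n_c) from the bin table.

    alpha_m is taken as alpha times the fraction of the domain covered by
    the intervals (a uniform-prior assumption made for this sketch).
    """
    width = (hi - lo) / len(counts)
    n_c = sum(counts)
    y_cm, covered = 0, 0.0
    for (a, b) in intervals:
        first = max(0, int((a - lo) / width))
        last = min(len(counts) - 1, int((b - lo) / width))
        y_cm += sum(counts[first:last + 1])
        covered += (b - a)
    alpha_m = alpha * covered / (hi - lo)
    return (alpha_m + y_cm) / (alpha + n_c)

values_c = [1.0, 1.2, 1.1, 2.8, 3.0, 3.3]       # class-c training values
lo, hi = 0.0, 4.0
table = build_table(values_c, lo, hi)
print(p_hat_multi_interval(table, lo, hi, [(0.9, 1.3), (2.7, 3.4)]))
```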

C. Merging Point Data into Set and Interval Data

Now we describe how more than one query vector can be merged and classified as a single query vector when we know these vectors have the same unknown class label, so that we can form interval-valued data. Consider the tree classification problem in Section I. If the difference between the two leaves' lengths is not very large, it may be reasonable to assume that this kind of tree has leaves with lengths within the interval formed by the lengths of the two leaves we have. Then we can use the multi-interval query method discussed in Section III-B to classify the merged query vector. So, when we observe a set of query vectors with the same unknown class label, we form, for each continuous feature, an interval bounded by the minimum and maximum values of that feature in the set.

However, we cannot use the interval for the query without further consideration, because there are situations in which it does not improve the performance. Hsu and Huang [16] concluded that, to avoid performance degradation, a discretization method should partition the domain of a continuous variable into intervals whose cut points are close to decision boundaries, to minimize the distortion due to discretization, and whose widths are sufficiently large to cover sufficiently many training examples. Similar reasoning applies here. If the interval is too narrow, the number of training examples in that interval will be too small to allow an accurate estimate of p̂(x ∈ I|c). To avoid this, we set a minimal interval threshold. If the interval is smaller than that threshold, we extend both ends of the interval to reach the minimal threshold.

If the interval is too wide, it may contain decision boundaries and degrade the performance. Consider the two conditional distributions shown in Figure 2, and assume data D1 and D2 have the same class label C1. Based on our previous discussion, we form an interval I, but the interval I will include decision boundaries. If we classify the data D1 and D2 individually, the results will be correct (both D1 and D2 will be classified to class C1). But if we use the interval I to classify them, the result will be wrong (both D1 and D2 will be classified to class C2), because the area under the p.d.f. given C1 is smaller than the area under the p.d.f. given C2. To avoid this, we must set a maximal interval threshold. If an interval is larger than that threshold, we divide it into multiple intervals based on a suitable width SI. In the case of Figure 2, we form two intervals I1 and I2 such that each of them includes data D1 or D2, and their widths are set to the width of SI.

Fig. 2. Two conditional distributions.

Fig. 3. Two conditional distributions.

When we use the intervals I1 and I2 for the query, the result will be correct in this case. However, consider the conditional distributions shown in Figure 3, and assume that the interval I is too wide. If there are other query examples whose values fall in the region of interval I, and we only use the intervals (I1 and I4) created by the boundaries (D1 and D5), then we will lose the information from the other query examples (D2, D3, D4). To avoid this, we also create other intervals (their widths are also set to SI) in order to include the information of the other data. For example, in order to include all five data in Figure 3, we discretize the interval I into four intervals (I1, I2, I3, and I4), so that we can use the information from D2 to D4.

In all our experiments, we set the minimal interval threshold of each feature equal to the width of "fifty-bin", derived by dividing the domain of each continuous variable in the training set into fifty equal-width bins, for all data sets except "Iris". We set the minimal interval threshold for the "Iris" data set equal to the width of "fifteen-bin", because this data set is too small. We set the maximal interval threshold equal to "four-bin" (we investigated the experiments in [16] and found that when the number of discretized equal-width bins is larger than four, the accuracies reach a plateau for most data sets). We set the width of SI equal to "fifty-bin", taking the maximum number of bins used in the experiments of [16]. In summary, to improve performance, when an interval is too narrow, we extend it to be as wide as SI; if it is too wide, we divide it into multiple intervals, and the width of each subinterval is also set to SI. Then we can use our multi-interval classifier.

IV. Experimental Results and Discussion

A. Experiment

To observe the effectiveness of our approach on real-world problems, we selected five data sets from the UCI machine learning repository [17] for our experiments. We first partitioned each data set into several subsets (the number of subsets equals the number of classes in that data set) according to the data's class labels. From each subset, we randomly selected twenty percent of the data for testing and used the remainder for training. In the test set, we merged two test vectors from the same class according to the approach described in Section III-C. We also used the same training and test sets to obtain the results of the "lazy ten-bin" method proposed by [16]; in this case, only one query vector is classified at a time. We repeated the experiment ten times and report the average and the standard deviation of the accuracies. For comparison, we also list the experimental results of "Ten Bins", "Entropy", and "Gaussian" on the same data sets; however, those accuracies were obtained by running five-fold cross validations. Table II gives the results, which reveal that our method is significantly better than "lazy ten-bin" on all the data sets.

TABLE II. Accuracies of naive Bayes with different approaches

Dataset    Merge          Lazy Ten-bin   Ten Bins      Entropy       Gaussian
breast     98.60±1.31     94.39±2.46     94.37±2.16    94.37±1.93    92.79±2.61
iris       100.00±0.00    97.00±4.33     94.67±3.40    95.33±1.63    96.00±3.89
pima       81.71±4.22     76.12±3.51     75.01±4.64    74.47±2.92    74.34±2.57
sonar      85.71±6.68     78.57±6.12     75.55±3.8     72.69±5.88    69.21±5.88
vehicle    64.22±5.36     61.08±2.71     62.06±1.39    62.29±2.15    43.26±3.82
Average    86.05          81.43          80.33         79.83         75.37
Win:Loss   -              5:0            5:0           5:0           5:0

B. Discussion

We will show that a classifier must deliberately take advantage of the knowledge that query vectors share a class, or the knowledge will not improve the expected accuracy, and that this is the case in general. Consider a "random" classifier which randomly guesses the class of query vectors. Suppose that two query vectors are classified individually with this "random" classifier into one of n classes. The expected value of the accuracy can be derived as follows:

    E_1 = \binom{2}{2}\frac{2}{2}\left(\frac{1}{n}\right)^2 + \binom{2}{1}\frac{1}{2}\cdot\frac{1}{n}\cdot\frac{n-1}{n} + \binom{2}{0}\frac{0}{2}\left(\frac{n-1}{n}\right)^2 = \frac{1}{n^2} + \frac{n-1}{n^2} = \frac{n}{n^2} = \frac{1}{n}.

Now, suppose we know that the two vectors actually belong to the same class and classify them together using the "random" classifier. The expected value of the accuracy is:

    E_2 = \binom{1}{1}\frac{2}{2}\cdot\frac{1}{n} + \binom{1}{0}\frac{0}{2}\cdot\frac{n-1}{n} = \frac{1}{n}.
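As a quick sanity check on the derivation (our own sketch; the number of classes and the trial count are arbitrary), a simulation of the "random" classifier gives both expected accuracies approximately equal to 1/n:

```python
import random

# E1: label two query vectors independently at random.
# E2: assign one shared random label to both vectors.
random.seed(0)
n, trials = 3, 200_000
true_class = 0   # any fixed true class gives the same expectation

acc1 = acc2 = 0.0
for _ in range(trials):
    g1, g2 = random.randrange(n), random.randrange(n)
    acc1 += ((g1 == true_class) + (g2 == true_class)) / 2
    g = random.randrange(n)
    acc2 += (g == true_class)          # both right or both wrong together
print(acc1 / trials, acc2 / trials, 1 / n)   # all close to 1/3
```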

The expected values for the two cases are equal. This result implies that the information that the two vectors come from the same class does not seem helpful for improving the accuracy. Why, then, were the accuracies improved significantly on all the data sets in our experiments? Consider the two class-conditional density functions of a variable x given classes C1 and C2 shown in Figure 4. Assume that the data point D2 actually belongs to class C1 and falls in the region of Bayes error of class C1. (Bayes error is the probability that a sample is assigned to the wrong class when the Bayes decision rule is applied; the region of Bayes error of a class C for a variable x is the region in the domain of x where x will be misclassified if x is of class C.) If we classify D2 individually based on the Bayes decision rule, D2 will be classified as C2, which is incorrect. But suppose we classify D2 together with another data point D1 that also actually belongs to class C1 and is far away from the region of Bayes error of C1. If we classify D2 and D1 together based on the areas under C1 and C2 within the interval I, we obtain the correct result, because the area under the p.d.f. given C1 is larger than the area under the p.d.f. given C2. Hence, if a data point falls in the region of Bayes error of its actual class and can be classified together with another data point of the same class, it is more likely that the data point will be classified correctly. We call this situation Case 1. But if D2 is classified with unsuitable data, so that the area under the p.d.f. given C1 is smaller than the area under the p.d.f. given C2, the classification result for D2 is still wrong. We call this situation Case 2. For example, if D2 is classified with a data point D3 that also actually belongs to class C1 but is near the region of Bayes error of C1, then Case 2 occurs.

Fig. 4. Two conditional distributions.

In general, if a class can be differentiated, the region of Bayes error of that class usually appears in a region of lower probability. Hence, in our experiments, if one data point falls in the region of Bayes error of its actual class, the probability that another data point with the same class label is near or falls in the same region is also low. Therefore, Case 1 occurs more often than Case 2. That is why the accuracies were improved significantly on all the data sets in our experiments.

In the experiment of Section IV-A, we only merged two test data with the same unknown class label for classification. We also studied the effect of our method when more than two data with the same unknown class label are merged. We selected a larger data set, "waveform", from UCI and repeated the experiment of Section IV-A with different numbers of merged test data (from 2 to 50). Figure 5 shows the results.

Fig. 5. Accuracies for different numbers of merged test data on the "waveform" data set (x-axis: number of merged test data; y-axis: accuracy).

The first value in Figure 5 was generated by considering only one test datum, with lazy ten-bin applied. The results show that as the number of merged test data increases, the curve rises and reaches a peak before it drops gradually. We will try to explain this phenomenon. In the first phase, the curve rises because, as the number of merged test data increases, Case 1 occurs more often than Case 2: with more merged data, the probability that both end points of the interval I fall in the same region of Bayes error of a class decreases. However, why does the accuracy drop gradually? Recall that if the query interval I is too wide, it may degrade performance, as mentioned in Section III-C; hence we discretize it into multiple intervals. This is to avoid the situation shown in Figure 2. But when the interval I in Figure 2 includes too many other test data, the combination of the multiple intervals that we create may approach the original interval I. Hence, the discretization will not be helpful when there are too many test data. So, if we merge too many test data, such a situation may occur. That explains why the curve drops gradually. The peak of the curve indicates the optimal range for the number of test data to merge. Different data sets have different distributions, and the optimal number of merged data is also different; the optimal number is case dependent.
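To illustrate Case 1 numerically (our own sketch; the two Gaussian class-conditional densities and the query points are invented and are not the paper's Figure 4), a point in the Bayes error region of C1 is misclassified on its own but classified correctly when merged with another C1 point into an interval:

```python
import math

# Two invented class-conditional densities: C1 ~ N(0, 1), C2 ~ N(2.5, 1),
# with equal class priors.  d2 lies inside C1's Bayes error region.
def pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def cdf(x, mu, sigma=1.0):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu1, mu2 = 0.0, 2.5
d1, d2 = -0.5, 1.5          # both truly from C1; the decision boundary is 1.25

# Individual classification: compare the densities at the point.
print("d2 alone ->", "C1" if pdf(d2, mu1) > pdf(d2, mu2) else "C2")   # C2 (wrong)

# Merged classification: compare the probability mass over the interval [d1, d2].
mass1 = cdf(d2, mu1) - cdf(d1, mu1)
mass2 = cdf(d2, mu2) - cdf(d1, mu2)
print("merged   ->", "C1" if mass1 > mass2 else "C2")                 # C1 (correct)
```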

V. Application to Speaker Recognition

The proposed method was applied to a text-independent, closed-set speaker recognition task. This task is particularly relevant to our approach because in this task we usually know which set of feature vectors comes from the same unknown speaker. Since a large number of feature vectors can be extracted from a short speech sentence, and those vectors obviously come from the same speaker, a speaker recognition system should take advantage of this information.

The database for the experiments reported in this paper is a subset of MAT2000Edu, a speech database of Mandarin Chinese collected from many colleges in Taiwan. We used the speech data that were recorded at National Chiao-Tung University. All speech signals were digitally recorded in a laboratory using a personal computer with a 16-bit Sound Blaster card and a head-set microphone. The sampling rate was 16 kHz. A 30-ms Hamming window was applied to the speech every 10 ms. For each speech frame, a 12th-order linear predictive analysis and a log energy analysis were performed. A query feature vector consisted of the twelve linear predictive parameters and the log energy parameter of a frame. More than ten sentences were recorded from each speaker, and more than four thousand feature vectors could be extracted from each sentence. In our experiments, all the original feature vectors were used and no silence-removal algorithm was applied.

We randomly selected 2 to 6 speakers from the subset. The evaluation procedure was as follows. We set the number of merged test data to forty, which was determined empirically, and the other parameters (the minimal interval threshold, the maximal interval threshold, and SI) were set the same as in the experiments of Section IV-A. We randomly selected five sentences for each speaker, one for testing and the others for training, and we randomly selected one thousand feature vectors from each sentence. Hence, for each speaker there were four thousand feature vectors for training and one thousand feature vectors for testing. We ran this procedure ten times for each experiment with a different number of speakers and report the average and the standard deviation of the accuracies. We also show the results of the "lazy ten-bin" method. Table III gives the results, which show that our method outperformed "lazy ten-bin" significantly in these experiments, especially as the number of speakers increases.

TABLE III. Accuracies of the application in speaker recognition

Speakers   Merge          Lazy Ten-bin
2          99.60±1.13     80.07±1.15
3          99.20±1.07     69.74±2.12
4          95.50±2.01     57.87±0.93
5          95.36±2.17     52.60±0.87
6          92.83±2.72     49.63±0.67
Average    96.50          61.98
Win:Loss   -              5:0

VI. Conclusions and Future Work

A discrete variable, as well as a discretized continuous variable, in a naive Bayesian classifier is usually assumed to have a Dirichlet prior. Perfect aggregation of Dirichlets implies that we can estimate the class-conditional probabilities of discretized intervals regardless of how the other regions of the domain of the continuous variable are discretized. Because of perfect aggregation of Dirichlets, we have presented a new approach that can process multiple interval queries in naive Bayes classifiers. In order to form interval data, we merge more than one query vector from the same unknown class into one. Experimental results on standard data sets from the UCI repository show that when we merge two query vectors with the same unknown class label, our approach outperforms the traditional approach that classifies only one query vector at a time. The approach can also be applied successfully to the task of speaker recognition. We showed that by merging an appropriate number of query vectors with the same unknown class, the accuracies of naive Bayesian classifiers can be improved significantly. Hence, if query vectors include interval data, or we know which data come from the same unknown class, our approach can be suitably applied.

Although our approach improves the accuracy of a naive Bayes classifier when we merge more than one query vector, some parameters need to be set to obtain optimal results, and the setting of these parameters is case dependent. In our experiments, those parameters were determined empirically. Hence, our future work includes developing an approach to set those parameters automatically, and we also plan to investigate whether our approach can be applied to general Bayesian classifiers.

Acknowledgements

The research reported here was supported in part by the National Science Council of ROC under Grant No. NSC 89-2213-E-001-031. The speech data set was the courtesy of the Speech Processing Laboratory, Department of Communication Engineering, National Chiao Tung University, Taiwan.

References

[1] Ismail Parsa. KDD-CUP 1997 presentation, 1997.
[2] Pedro Domingos and Michael Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103-130, 1997.
[3] George John and Pat Langley. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI '95), pages 338-345, 1995.
[4] James Dougherty, Ron Kohavi, and Mehran Sahami. Supervised and unsupervised discretization of continuous features. In Machine Learning: Proceedings of the 12th International Conference (ML '95), San Francisco, CA, 1995. Morgan Kaufmann.
[5] Usama M. Fayyad and Keki B. Irani. Multi-interval discretization of continuous valued attributes for classification learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI '93), pages 1022-1027, 1993.
[6] Ron Kohavi and Mehran Sahami. Error-based and entropy-based discretization of continuous features. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD '96), pages 114-119, Portland, OR, 1996.
[7] Nir Friedman, Moises Goldszmidt, and Thomas J. Lee. Bayesian network classification with continuous attributes: Getting the best of both discretization and parametric fitting. In Machine Learning: Proceedings of the 15th International Conference (ML '98), San Francisco, CA, 1998.
[8] Shyi-Ming Chen and Jeng-Yih Wang. Document retrieval using knowledge-based fuzzy information retrieval techniques. IEEE Transactions on Systems, Man and Cybernetics, 25:793-803, 1995.
[9] Samuel S. Wilks. Mathematical Statistics. Wiley and Sons, New York, 1962.
[10] David Heckerman. A tutorial on learning with Bayesian networks. In Michael I. Jordan, editor, Learning in Graphical Models, pages 301-354. Kluwer Academic Publishers, Boston, 1998.
[11] M. N. Azaiez. Perfect Aggregation in Reliability Models with Bayesian Updating. PhD thesis, Department of Industrial Engineering, University of Wisconsin-Madison, Madison, Wisconsin, 1993.
[12] Y. Iwasa, S. Levin, and V. Andreasen. Aggregation in model ecosystems: Perfect aggregation. Ecological Modelling, 37:287-302, 1987.
[13] Tzu-Tsung Wong. Perfect Aggregation in Dependent Bernoulli Systems with Bayesian Updating. PhD thesis, Department of Industrial Engineering, University of Wisconsin-Madison, Madison, Wisconsin, 1998.
[14] Russell Almond. Graphical Belief Modelling. Chapman and Hall, New York, 1995.
[15] Bojan Cestnik and Ivan Bratko. On estimating probabilities in tree pruning. In Machine Learning: EWSL-91, European Working Session on Learning, pages 138-150. Springer-Verlag, Berlin, Germany, 1991.
[16] Chun-Nan Hsu, Hung-Ju Huang, and Tzu-Tsung Wong. Why discretization works for naive Bayesian classifiers. In Machine Learning: Proceedings of the 17th International Conference (ML 2000), San Francisco, CA, 2000.
[17] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.

