A Novel Approach to Evaluate the Top K Interesting Ranks from Web Search Engines

Chien-I Lee, Cheng-Jung Tsai, Yu-Chiang Li and Cheng-Tao Wu
Institute of Information Education, National Tainan Teachers College, Tainan, Taiwan, R.O.C.
Email: leeci@ipx.ntntc.edu.tw

Abstract

In recent years, several search engines have been developed to help people find interesting information among the rapidly increasing number of web pages on the Internet. To obtain useful and reasonable search results, users may submit queries with more than one query term combined by a Boolean expression, which all existing search engines support. However, these search engines place the same emphasis on every query term combined by the Boolean expression. In other words, today's search engines do not consider that users may want to emphasize one term more than the others. In this paper we propose a novel approach, called the Extreme Score Analysis method (ESA method), to solve this problem. Our ESA method can efficiently find the top K (K > 0) interesting web pages when the user assigns different weights to the query terms.

1. Introduction

With the rapid progress of computer and network technologies, computers connected to the Internet will soon become indispensable appliances of our daily life. Through the Internet, people can obtain their desired information much more easily and quickly. Meanwhile, however, the rapidly growing amount of data carried on the Internet makes it very difficult for people to search for and filter the appropriate information they want. In recent years, several search engines [3, 14, 15, 17, 22, 24] have been developed to reduce this overload. In general, to support full-text information retrieval, a search engine extracts the terms in the contents of web pages to build a corresponding inverted index [23]. When users submit a query with some query terms as keywords (these query terms can be combined by a Boolean expression), the search engine looks the terms up in the inverted index to find the corresponding web pages, and then ranks these web pages by its own predetermined term weighting function and vector similarity function. Finally, users can view the ranked web pages and find the more relevant information near the front of the list.

Usually, users submit queries with more than one query term combined by a Boolean expression to enlarge or narrow their search interests. Although today's search engines all support Boolean expressions, they place the same emphasis on all the query terms combined by the expression. For example, when a user submits a Boolean expression with two query terms, "music and download", a typical search engine returns the web pages that contain both terms "music" and "download", and ranks these web pages according to their scores, which could be the sum of the individual contributed values of the two query terms in their own context. (Note that the contributed value of a term denotes the value returned by its term weighting function.) However, users may place more emphasis on one term than on the other. Continuing with the foregoing example, a user may want to look for web pages that contain "music" for "download", but emphasize "music" rather than "download". To quantify such a case, users can be allowed to assign a different weight to each query term in the Boolean expression, e.g.
"(0.8) music and (0.2) download", which means the degrees of importance of the two query terms are in a ratio of 4:1. To fulfill such a requirement, the final score of each web page must be re-calculated according to the newly given weights; after this re-calculation, users obtain the true ranking results. However, a typical search engine merely returns matched web pages with final scores, ranks, etc., but without the individual contributed value of each query term. Therefore, two problems must be solved before the final scores can be re-calculated. First, we need to know the scoring function of the search engine. Because it is usually a commercial secret, we generally have no idea of the search engine's scoring function; basically, we can solve this problem by adopting another scoring function defined by ourselves. Even so, a second problem remains: we have to scan every returned web page's content to obtain each query term's contributed value, which is extremely time-consuming. Moreover, in a real situation, for lack of time or patience, most users only view the top K web pages among the HN returned web pages [4] (K << HN, where HN is the total number of returned web pages). To avoid the overhead of re-calculating all HN returned web pages, in this paper we propose a novel approach, called the Extreme Score Analysis method (ESA method), which finds the top K interesting web pages without re-scanning the HN returned web pages to obtain the individual query terms' contributed values. Instead, our method informs users that the top K interesting web pages they request will be among the top R returned web pages (K ≤ R << HN). From the performance study, we find that the value of R is close to that of K. For example, if a user submits a query with two terms and wants the top K web pages according to his or her term weights (0.6 and 0.4, respectively), the value of R will be approximately equal to 1.5 × K.

The rest of the paper is organized as follows. In Section 2, we survey the basic components of a typical search engine and some related term weighting functions and vector similarity functions. Section 3 describes the basic idea of our Extreme Score Analysis method. Section 4 presents an experimental evaluation of our method. Finally, Section 5 concludes the paper.

2. The Components of Search Engines

In general, a typical search engine is composed of three components:

(1) Indexer robot. The indexer robot [5] is an autonomous WWW browser which communicates with WWW servers using HTTP (Hypertext Transfer Protocol). It visits a given WWW site, traverses hyperlinks in a breadth-first manner, retrieves WWW pages, extracts keywords and hyperlink data from the pages, and inserts the terms and hyperlink data into an index. A list of target sites is given to the indexer robot for creating the index initially.

(2) Indexer database. The indexer robot reads web pages and sends them to the indexer database to create indexing records. This indexer database is also called an inverted index. Each web page in the inverted index is represented as a vector d = (d1, …, dm), where di is the weight of the ith term ti in representing the web page, 1 ≤ i ≤ m. If the original web page is changed, the inverted index should be updated; the update is not made until the indexer robot has revisited the changed web page.

(3) Ranking. In order to rank all related web pages, the search engine assigns relevance scores to them [18]. The scores indicate the similarities between the web pages and the given query terms, which can similarly be represented as a vector q = (q1, …, qm), where qi is the weight of the ith term ti, 1 ≤ i ≤ m. However, each search engine may employ its own characteristic term weighting function, which assigns a weight to each term, and its own characteristic vector similarity function, which assigns relevance scores to web pages.

A widely used term weighting function is tf×idf [11], which assumes that term importance is proportional to the occurrence frequency of each term k in each web page Hi (that is, FREQik) and inversely proportional to the number of web pages in the collection to which the term is assigned (that is, HOPFREQk). A general tf×idf term weighting function is then

WEIGHTik = FREQik × [log2(n) − log2(HOPFREQk) + 1],

where n is the total number of web pages in the collection. Many other term weighting functions have been proposed, such as the Signal-Noise Ratio [6] and the Term Discrimination Value [21]. Although every term weighting function has its own properties, all of them share the same hypothesis as tf×idf, namely that term importance is proportional to the occurrence frequency of the term in each web page. Nevertheless, tf×idf is superior to the others in terms of efficiency and cost [12, 18]. Furthermore, because web pages are written in the hypertext markup language (HTML), each search engine may take related factors (e.g., hyperlinks, the number of times the keywords occur in the document title, etc.) into account to provide more suitable ranking results [1].
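As a concrete illustration of the tf×idf weighting above, here is a minimal Python sketch; the toy tokenized pages and the function name tfidf_weights are illustrative assumptions of ours, not taken from any particular search engine.

import math
from collections import Counter

def tfidf_weights(pages):
    """WEIGHTik = FREQik * [log2(n) - log2(HOPFREQk) + 1] for every term k in every page Hi."""
    n = len(pages)                                     # total number of pages in the collection
    hopfreq = Counter()                                # number of pages to which each term is assigned
    for tokens in pages:
        hopfreq.update(set(tokens))
    weights = []
    for tokens in pages:
        freq = Counter(tokens)                         # FREQik: occurrences of term k in this page
        weights.append({k: f * (math.log2(n) - math.log2(hopfreq[k]) + 1)
                        for k, f in freq.items()})
    return weights

# A toy, already-tokenized collection of three pages (purely illustrative).
pages = [["music", "download", "free"],
         ["music", "music", "lyrics"],
         ["download", "software"]]
print(tfidf_weights(pages))

Terms that appear in fewer pages receive a larger logarithmic factor, which matches the inverse-document-frequency intuition described above.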
Consequently, most of the term weighting functions applied by existing search engines are derived from tf×idf. Similarly, although there are many kinds of vector similarity functions, they all exhibit one common property, namely that the similarity value increases when the weight of the common properties of two vectors increases. The Jaccard and cosine coefficients, two such vector similarity functions, have been widely used for the evaluation of retrieval functions [19, 20]; the values of both similarity functions increase when the dot product of the two vectors increases.

3. Extreme Score Analysis Method

As discussed in Section 2, we can assume that when a user submits a query with n query terms, the score KSi (here called the original score) that a search engine assigns to each web page Hi (i denotes the original rank of this web page among the returned matched web pages) will be

KSi = ∑_{j=1}^{n} Wj × (b((FREQij + a) / HOPFREQj) + c),

where Wj is the weight of term j given in the user's query and b((FREQij + a) / HOPFREQj) + c is the contributed value of term j in KSi. The variables a, b and c are used by each search engine to adjust tf×idf according to the factors it considers. Nevertheless, our research is concerned only with the ranking of the returned web pages, and whatever the values of a, b and c are, the ranking result never changes. Furthermore, as mentioned in Section 1, all current search engines put the same emphasis on every query term (that is, Wj = 1 / n). Therefore, the original score can be simplified as

KSi = ∑_{j=1}^{n} (1 / n) × ConVij,

where ConVij = b((FREQij + a) / HOPFREQj) + c. However, users may assign the term weights unequally. Obviously, alterations in the term weights change the original scores. The score re-computed from the original score is called the target score KTi and is defined as

KTi = ∑_{j=1}^{n} Wj × ConVij, where there exists a j such that Wj ≠ 1 / n.

For example, suppose a user submits a query with two query terms, term p and term q, combined by a Boolean expression, and the user wants to assign weights 0.7 and 0.3 to the two terms, respectively. Then the original score of the ith matched web page is KSi = 0.5ConVpi + 0.5ConVqi, and its target score is KTi = 0.7ConVpi + 0.3ConVqi.

For convenience of explanation, we first consider the case of two query terms, term p and term q. Without loss of generality, we divide ConVp and ConVq by their respective maxima to limit their values between 0 and 1, i.e., 0 ≤ ConVp, ConVq ≤ 1. The theorems applied in our method are stated and proved in the following.

Theorem 1. Let the x-axis and y-axis denote ConVp and ConVq, respectively, in the coordinate plane. Then the linear equation that passes through the origin and is orthogonal to the straight line SC = Wp × ConVp + Wq × ConVq is ConVq = (Wq / Wp) × ConVp, where 0 ≤ Wp, Wq ≤ 1.

Proof. By the slope theorem, the product of the slopes of two straight lines is −1 if they are orthogonal to each other. Therefore, the slope of the straight line orthogonal to the straight line SC is Wq / Wp. Moreover, because this straight line passes through the origin, its equation is ConVq = (Wq / Wp) × ConVp. □

Theorem 2. Let SC1 = Wp × ConVp + Wq × ConVq and SC2 = Wp × ConVp + Wq × ConVq represent two parallel straight lines in the coordinate plane that intersect the straight line ConVq = (Wq / Wp) × ConVp in point A and point B, respectively. If the length of line segment OA is greater than the length of line segment OB, then SC1 > SC2.

Proof. First, the coordinates of A are ((SC1 × Wp) / (Wp² + Wq²), (SC1 × Wq) / (Wp² + Wq²)), and those of B are ((SC2 × Wp) / (Wp² + Wq²), (SC2 × Wq) / (Wp² + Wq²)). Then the length of line segment OA is SC1 / (Wp² + Wq²)^(1/2) and that of OB is SC2 / (Wp² + Wq²)^(1/2). Clearly, because OA > OB, we can infer that SC1 > SC2. □

From Theorems 1 and 2, we obtain Figure 1 and Figure 2. In both figures, the x-axis and y-axis represent the contributed values of query terms p and q, respectively. The point (ConVpi, ConVqi) in the coordinate plane denotes the web page Hi returned by the search engine, and every dashed straight line that intersects the point (ConVpi, ConVqi) denotes the score of web page Hi under term weights Wp and Wq. In particular, Figure 1 shows the distribution of the original scores of the returned web pages, and Figure 2 shows the distribution of the target scores. Comparing Figure 1 and Figure 2, we can observe that the dashed line denoting the score of web page Hi changes when the weights of the two query terms vary. From Theorem 1 we know the reason: the original score treats the weights of both terms as equivalent (that is, Wq / Wp = 1), but the target score does not (that is, Wq / Wp ≠ 1). Likewise, from Theorem 2 we know that the farther a dashed straight line is from the origin, the higher the score it denotes.

Figure 1. The distribution of original scores in the coordinate plane.

Figure 2. The distribution of target scores in the coordinate plane.
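Before turning to the score bounds, the following small Python sketch illustrates why re-weighting can reorder pages; the two pages and their contributed values are hypothetical, chosen only for illustration.

# Hypothetical contributed values (ConVp, ConVq) for two returned pages A and B.
pages = {"A": (0.4, 0.9), "B": (0.8, 0.4)}
Wp, Wq = 0.7, 0.3                                  # user-assigned term weights

for name, (cp, cq) in pages.items():
    ks = 0.5 * cp + 0.5 * cq                       # original score: equal emphasis on p and q
    kt = Wp * cp + Wq * cq                         # target score: the user's emphasis
    print(f"{name}: KS = {ks:.2f}, KT = {kt:.2f}")
# A outranks B by the original score (0.65 > 0.60), but B outranks A by the
# target score (0.68 > 0.55), so the ranking must be re-computed.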

Theorem 3. As shown in Figure 3, let KSi = 0.5ConVp + 0.5ConVq denote a straight line in the coordinate plane, where 0 ≤ ConVp, ConVq ≤ 1 and 0 < Wq < Wp < 1. Then there are an infinite number of straight lines KTi = Wp × ConVp + Wq × ConVq that intersect KSi and are orthogonal to the straight line ConVq = (Wq / Wp) × ConVp. Furthermore, for each KSi we can obtain KTmaxi and KTmini that satisfy KTmini ≤ KTi ≤ KTmaxi.

Proof. Solve the following set of equations simultaneously:

KSi = 0.5ConVp + 0.5ConVq, (1)
KTi = Wp × ConVp + Wq × ConVq. (2)

By (1) × 2Wp − (2), we obtain ConVq × (Wp − Wq) = 2Wp × KSi − KTi, i.e.,

ConVq = (2Wp × KSi − KTi) / (Wp − Wq). (3)

By (1) × 2Wq − (2), we obtain ConVp × (Wq − Wp) = 2Wq × KSi − KTi, i.e.,

ConVp = (2Wq × KSi − KTi) / (Wq − Wp). (4)

In addition,

0 ≤ ConVp, ConVq ≤ 1, (5)
0 < Wq < Wp < 1. (6)

By solving (3), (5) and (6) simultaneously, we obtain
0 ≤ (2Wp × KSi − KTi) / (Wp − Wq) ≤ 1,
0 ≤ 2Wp × KSi − KTi ≤ Wp − Wq,
Wq + Wp(2KSi − 1) ≤ KTi ≤ 2Wp × KSi, that is,
Wq + Wp(2KSi − 1) ≤ KTi ≤ Wp + Wp(2KSi − 1). (7)

Similarly, by solving (4), (5) and (6) simultaneously, we obtain
Wq + Wq(2KSi − 1) ≤ KTi ≤ Wp + Wq(2KSi − 1). (8)

Finally, by combining (7) and (8), we obtain the maximum of KTi, KTmaxi = Min(Wp + Wq(2KSi − 1), Wp + Wp(2KSi − 1)), and the minimum of KTi, KTmini = Max(Wq + Wq(2KSi − 1), Wq + Wp(2KSi − 1)), which satisfy KTmini ≤ KTi ≤ KTmaxi for each KSi. □

Figure 3. The relation between original scores and target scores.

Theorem 4. Continuing from Theorem 3, let KSi = 0.5ConVp + 0.5ConVq and KSj = 0.5ConVp + 0.5ConVq represent two straight lines in the coordinate plane. If KSi > KSj, then KTmini > KTminj and KTmaxi > KTmaxj.

Proof. From Equations (7) and (8) in Theorem 3, if the term weights Wp and Wq are fixed, then the larger KSi is, the larger KTmaxi and KTmini are. Therefore, if KSi > KSj, then KTmini > KTminj and KTmaxi > KTmaxj. □

Theorem 5. For every i (1 ≤ i ≤ HN), suppose KTmini ≤ KTi ≤ KTmaxi, and suppose KTmax1 ≥ KTmax2 ≥ … ≥ KTmaxi ≥ … ≥ KTmaxHN and KTmin1 ≥ KTmin2 ≥ … ≥ KTmini ≥ … ≥ KTminHN. If there is a KTmaxj that satisfies KTmaxj ≥ KTminK, then KTj may be one of the top K of all KTi.

Proof. Suppose, to the contrary, that some KTj (j > K) satisfies KTmaxj ≥ KTminK but cannot be one of the top K of all KTi. Because KTmin1 ≥ KTmin2 ≥ … ≥ KTminK−1 ≥ KTminK, the top K of all KTi will be KT1, KT2, KT3, …, KTK in turn when KTi = KTmini for 1 ≤ i ≤ K. Nevertheless, because KTmaxj ≥ KTminK, KTj may be larger than KTK and take the Kth rank of all KTi when KTj = KTmaxj. This contradicts the assumption. Consequently, for each i, if KTmaxi ≥ KTminK, then KTi may be one of the top K of all KTi. □

As stated before, since the search engine does not return the individual contributed values ConVpi and ConVqi for each returned web page Hi, the coordinates of web page Hi, (ConVpi, ConVqi), may be located anywhere on the straight line KSi = 0.5ConVpi + 0.5ConVqi. As a result, we cannot obtain the actual target score KTi of Hi. From Theorem 3, we know that KTi lies between KTmini and KTmaxi, as shown in Figure 3. That is, for the term weights Wp and Wq assigned by the user, we can derive an interval of possible target scores from each original score; each interval has a maximum target score KTmaxi and a minimum target score KTmini. Furthermore, from Theorem 4 we can observe that KTmaxi and KTmini decrease gradually as the original score decreases. Figure 4 shows such a case.
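As a concrete check of the interval derived in Theorem 3, the following minimal Python sketch evaluates the two bounds; the function name kt_bounds is ours, and the sample input reuses the first original score (0.934) and the weights (0.7, 0.3) that appear later in Table 1.

def kt_bounds(ks, wp, wq):
    """Target-score interval [KTmin, KTmax] of a page with original score ks,
    for two query-term weights wp >= wq (Theorem 3, with 0 <= ConVp, ConVq <= 1)."""
    kt_min = max(wq + wq * (2 * ks - 1), wq + wp * (2 * ks - 1))
    kt_max = min(wp + wq * (2 * ks - 1), wp + wp * (2 * ks - 1))
    return kt_min, kt_max

# A page with original score 0.934 under user weights (0.7, 0.3):
print(kt_bounds(0.934, 0.7, 0.3))    # approximately (0.908, 0.960), as in the first row of Table 1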
Finally, from Theorem 5, to obtain the correct top K interesting web pages from the returned web pages, every web page whose KTmaxi ≥ KTminK might be one of the top K target ranks. That is, the user only has to view those web pages to get the real top K target ranks, instead of viewing all HN returned web pages.

Figure 4. The distribution of KTmaxi and KTmini.

In general, suppose a user submits a query with n query terms and n different term weights to a search engine, which returns HN matched web pages, and the user wants the top K interesting web pages according to the submitted term weights. Without loss of generality, let W1 ≥ W2 ≥ … ≥ Wn−1 ≥ Wn. The analysis steps of our ESA method, which determine the top R (R ≥ K) web pages that the user needs to view in order to get the real top K target ranks, are shown in Figure 5. To simplify the expressions in Figure 5, we define two functions:

S1(j, n) = 0 if n < j + 1; S1(j, n) = ∑_{m=j+1}^{n} Wm if n ≥ j + 1.
S2(1, j) = 0 if j < 2; S2(1, j) = ∑_{m=1}^{j−1} Wm if j ≥ 2.

Step 1: /* Compute KTmaxi and KTmini for each returned web page i. */
For i = 1 to HN
  For j = 1 to n
    MinKT[j] = Wj × n × KSi − (n − j) × Wj + S1(j, n);
    MaxKT[j] = Wj × n × KSi − (j − 1) × Wj + S2(1, j);
  End for
  KTmini = Max(MinKT[j]);
  KTmaxi = Min(MaxKT[j]);
End for

Step 2: /* The top K interesting web pages will be among the top R ranks of the HN returned web pages. */
For i = 1 to HN
  If KTmaxi < KTminK Then
    R = i − 1;
    Go to Step 3;
  End if
End for

Step 3:
If the user would like to get the real top K target ranks then
  For i = 1 to R
    Retrieve Hi and compute the contributed value of each query term j;
    Score the web page Hi according to the newly computed contributed values;
  End for
  Rank the pages Hi according to their target scores and return the top K web pages to the user;
Else
  Return the top R original ranks of the returned web pages to the user;
End if

Figure 5. Extreme Score Analysis method (ESA method).

Table 1 shows an example that illustrates the steps of our ESA method. For simplicity, we consider the case with n = 2, where the search engine returns 20 related web pages. Suppose the user wants to assign 0.3 and 0.7 to the two term weights, respectively, and requests the top 3 interesting web pages. That is, n = 2, HN = 20, W1 = 0.3, W2 = 0.7 and K = 3. To understand and verify the result of our ESA method, we used a generator to produce ConVq and ConVp randomly for these 20 web pages and then computed each web page's original score KSi; the actual value of each KTi is also shown in Table 1. (Note that in a real situation, ConVq, ConVp, and KTi are unavailable.) In Step 1, the ESA method computes KTmaxi and KTmini for every returned web page. Then, in Step 2, the ESA method derives the total number R of web pages that the user has to view; in other words, it finds the web pages whose KTmaxi ≥ KTminK (= 0.854). In this example, KTmax7 (= 0.866) is greater than KTminK, whereas KTmax8 (= 0.836) is not. Finally, in Step 3, the user only has to view the top R = 7 returned web pages (i.e., H1, H2, H3, H4, H5, H6, H7) to get the real top 3 interesting web pages under the newly given term weights. Alternatively, the user can let the system re-scan only the top R (= 7) original ranks of web pages and return the real top 3 interesting web pages (i.e., H1, H3, H5), without re-scanning all HN = 20 returned web pages.
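As an illustrative companion to Figure 5 (a sketch under the paper's formulas, not the authors' own code), the following Python function implements Steps 1 and 2 and reproduces the example above from the 20 original scores KSi listed in Table 1; the function and variable names are ours.

def esa_top_r(ks_list, weights, k):
    """Steps 1 and 2 of the ESA method (Figure 5): given the original scores KSi
    and the user's term weights, return R, the number of top-ranked pages that
    must be considered to be sure of containing the real top K target ranks."""
    w = sorted(weights, reverse=True)              # W1 >= W2 >= ... >= Wn
    n = len(w)

    def bounds(ks):                                # Step 1: KTmin_i and KTmax_i for one page
        min_kts, max_kts = [], []
        for j in range(1, n + 1):                  # j is 1-based, as in Figure 5
            s1 = sum(w[j:])                        # S1(j, n) = W_{j+1} + ... + W_n
            s2 = sum(w[:j - 1])                    # S2(1, j) = W_1 + ... + W_{j-1}
            min_kts.append(w[j - 1] * n * ks - (n - j) * w[j - 1] + s1)
            max_kts.append(w[j - 1] * n * ks - (j - 1) * w[j - 1] + s2)
        return max(min_kts), min(max_kts)

    kt_min, kt_max = zip(*(bounds(ks) for ks in ks_list))
    for i, upper in enumerate(kt_max):             # Step 2: first page with KTmax < KTmin_K
        if upper < kt_min[k - 1]:
            return i                               # 1-based rank i+1 fails the test, so R = i
    return len(ks_list)

# Original scores KSi of the 20 returned pages from Table 1 (already ranked by KSi).
ks = [0.934, 0.906, 0.896, 0.86, 0.857, 0.811, 0.776, 0.726, 0.648, 0.622,
      0.579, 0.542, 0.532, 0.519, 0.429, 0.349, 0.279, 0.114, 0.06, 0.056]
print(esa_top_r(ks, weights=[0.3, 0.7], k=3))      # prints 7: only the top 7 pages need be viewed

Step 3 is omitted here because re-scoring the top R pages requires the per-term contributed values, which can only be obtained by re-scanning those pages.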

Table 1. An example of the ESA method (n = 2, HN = 20, W1 = 0.3, W2 = 0.7 and K = 3)

Serial number  Rank of KSi  KSi    KTmini  KTmaxi  ConVp  ConVq  KTi
H1             1            0.934  0.908   0.96    0.92   0.948  0.928
H2             2            0.906  0.868   0.944   0.87   0.942  0.892
H3             3            0.896  0.854   0.938   0.98   0.812  0.93
H4             4            0.86   0.804   0.916   0.78   0.94   0.828
H5             5            0.857  0.8     0.914   0.95   0.764  0.894
H6             6            0.811  0.735   0.887   0.77   0.852  0.795
H7             7            0.776  0.686   0.866   0.76   0.792  0.77
H8             8            0.726  0.616   0.836   0.99   0.462  0.832
H9             9            0.648  0.507   0.789   0.713  0.583  0.674
H10            10           0.622  0.471   0.773   0.406  0.838  0.536
H11            11           0.579  0.411   0.747   0.666  0.492  0.614
H12            12           0.542  0.359   0.725   0.997  0.087  0.724
H13            13           0.532  0.345   0.719   0.746  0.318  0.618
H14            14           0.519  0.327   0.711   0.538  0.5    0.527
H15            15           0.429  0.257   0.601   0.2    0.658  0.337
H16            16           0.349  0.209   0.489   0.09   0.608  0.245
H17            17           0.279  0.167   0.391   0.34   0.218  0.303
H18            18           0.114  0.068   0.16    0.22   0.008  0.156
H19            19           0.06   0.036   0.084   0.091  0.029  0.072
H20            20           0.056  0.034   0.078   0.01   0.102  0.038

4. Performance Evaluation

4.1 Experiment model

Section 3 presented the basic idea of the ESA method. In this section, because the distribution of original scores is not fixed, we establish a simulation model, which uses a generator to randomly generate the original scores of the returned web pages, to evaluate the efficiency of the ESA method. There are two performance measures in our evaluation: Per and Mul, where Per denotes the percentage of web pages that need to be viewed (that is, Per = 100 × (R / HN)%), and Mul denotes the ratio of the number of returned web pages users have to view to their top K interesting web pages (that is, Mul = R / K). Each Per, R, and Mul value is obtained by averaging the results of 100 simulation runs.

4.2 Results

The results of the simulation experiments are presented in Table 2. The average of all Mul values is 4.41, which means that on average users only have to view about 4.41 times K web pages to get their real top K interesting web pages. When users allow the system to re-calculate the real top K target ranks, the small value of Per in most cases again shows the efficiency of our method. The value of Per becomes much larger only when K is close to HN; however, search engines usually return far more web pages than the user's top K interesting web pages, that is, HN is usually much larger than K. In the worst case (Wp = 0.9, Wq = 0.1, HN = 5000, and K = 10), the number of web pages users have to view is about 10.8 (= Mul) times the number of their top K interesting web pages. However, in this case the total number of returned web pages is 5000, and the system only has to re-calculate 108 returned web pages to provide the real top K target ranks, which is still much more efficient than re-calculating all 5000 returned web pages.

Finally, we find that regardless of the term weights, the larger HN is, the smaller Per is; the smaller the difference between the query term weights is, the smaller Per is; and the smaller K is, the smaller Per is.

Table 2. The results of the experiment simulation

W                   HN    K    R    Per     Mul
Wp = 0.6, Wq = 0.4  5000  10   17   0.33%   1.7
                          20   33   0.66%   1.65
                          30   47   0.924%  1.57
                    1000  10   17   1.62%   1.7
                          20   32   3.10%   1.6
                          30   46   4.57%   1.53
                    500   10   16   3.01%   1.6
                          20   31   6.07%   1.55
                          30   46   9.01%   1.53
Wp = 0.7, Wq = 0.3  5000  10   26   0.51%   2.6
                          20   50   1.00%   2.5
                          30   74   1.47%   2.47
                    1000  10   25   2.45%   2.5
                          20   47   4.68%   2.35
                          30   72   7.19%   2.4
                    500   10   25   4.85%   2.5
                          20   48   9.49%   2.4
                          30   71   14.14%  2.37
Wp = 0.8, Wq = 0.2  5000  10   48   0.95%   4.8
                          20   93   1.86%   4.65
                          30   122  2.43%   4.01
                    1000  10   41   4.08%   4.1
                          20   81   8.01%   4.05
                          30   125  12.50%  4.17
                    500   10   42   8.37%   4.2
                          20   78   15.54%  3.9
                          30   122  24.37%  4.01
Wp = 0.9, Wq = 0.1  5000  10   108  2.16%   10.8
                          20   201  4.01%   10.1
                          30   280  5.59%   9.33
                    1000  10   93   9.21%   9.3
                          20   179  17.88%  8.95
                          30   279  27.84%  9.3
                    500   10   96   19.02%  9.6
                          20   177  35.22%  8.85
                          30   245  48.91%  8.17
Average                             8.97%   4.41
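As a quick sanity check of the two measures (a snippet of ours, not part of the paper), the following recomputes Per and Mul for the worst case reported above.

HN, K, R = 5000, 10, 108             # worst case in Table 2 (Wp = 0.9, Wq = 0.1)
Per = 100 * R / HN                   # percentage of the returned pages that must be viewed
Mul = R / K                          # how many times K the user actually has to view
print(f"Per = {Per:.2f}%, Mul = {Mul:.1f}")   # Per = 2.16%, Mul = 10.8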

5. Conclusion

Existing search engines all place the same emphasis on the submitted query terms combined by a Boolean expression. However, users may place a different emphasis on each query term; that is, users should be allowed to assign a different weight to each query term for their own search purposes. In such a case, the system must re-calculate the score of each returned web page according to the newly given weights. Because typical search engines do not return sufficient information for a system to re-calculate the new scores, this re-calculation is time-consuming. Moreover, in a real situation, most users only view the top K web pages among the HN returned web pages. Therefore, in this paper we have proposed the ESA method to solve this problem. With the ESA method, the system does not need to re-scan all the returned pages and can easily inform users that the top K interesting web pages they request will be among the top R web pages (K ≤ R ≤ HN). Moreover, if users want the actual ranking of the top K interesting web pages, the system only has to re-calculate the top R returned web pages, instead of all HN returned web pages. The evaluation results in Section 4 demonstrate the efficiency of our ESA method in the simulated environments. In future research, we will extend our technique to metasearch [7, 8, 9, 10, 13, 16] and consider personal factors [2] to improve the efficiency of our method.

6. References

[1] L. Allison, D.L. Dowe, G. Pringle, "What is a Tall Poppy Among Web Pages?" The Seventh International WWW Conference, Brisbane, Australia, April 14-18, 1998.
[2] W.P. Birmingham, E.J. Glover, S. Lawrence, C.G. Lee, "Architecture of a Metasearch Engine that Supports User Information Needs," Proceedings of the Eighth International Conference on Information and Knowledge Management, pp. 210-216, 1999.
[3] S. Brin, L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," The Seventh International WWW Conference (WWW 98), Brisbane, Australia, April 14-18, 1998.
[4] S. Chaudhuri, L. Gravano, "Evaluating Top-K Selection Queries," VLDB'99, pp. 397-410, 1999.
[5] J. Cho, H.G. Molina, L. Page, "Efficient Crawling Through URL Ordering," The Seventh International WWW Conference (WWW 98), Brisbane, Australia, April 14-18, 1998.
[6] S.F. Dennis, "The Design and Testing of a Fully Automatic Indexing-Searching System for Documents Consisting of Expository Text," in Information Retrieval: A Critical Review, G. Schecter, editor, Thompson Book Co., Washington, D.C., pp. 67-94, 1967.
[7] D. Dreilinger, A.E. Howe, "Experiences with Selecting Search Engines Using Metasearch," ACM Transactions on Information Systems, 15(3), pp. 195-222, 1997.
[8] O. Etzioni, E. Selberg, "Multi-Service Search and Comparison Using the MetaCrawler," The Fourth International WWW Conference (WWW 95), Boston, USA, December 11-14, 1995.
[9] S. Gauch, M. Gomez, G. Wang, "ProFusion: Intelligent Fusion from Multiple, Distributed Search Engines," Journal of Universal Computing, Springer-Verlag, 2(9), pp. 637-649, 1997.
[10] L. Gravano, Y. Papakonstantinou, "Mediating and Metasearching on the Internet," IEEE Bulletin of Data Engineering, 21(2), pp. 28-36, 1998.
[11] K.S. Jones, "A Statistical Interpretation of Term Specificity and Its Application in Retrieval," Journal of Documentation, Vol. 28, No. 1, March, pp. 11-20, 1972.
[12] F.W. Lancaster, A.J. Warner, "Information Retrieval Today," Arlington: Information Resources Press, 1993.
[13] S. Lawrence, C.G. Lee, "Accessibility of Information on the Web," Nature, 400 (July 8), pp. 107-109, 1999.
[14] D.L. Lee, B. Yuwono, "A World Wide Web Resource Database System," IEEE Transactions on Knowledge and Data Engineering, 8(4), pp. 548-554, 1996.
[15] D.L. Lee, B. Yuwono, "Search and Ranking Algorithms for Locating Resources on the World Wide Web," Proceedings of the 12th International Conference on Data Engineering, pp. 164-171, 1996.
[16] K.L. Liu, W. Meng, N. Rishe, W. Wu, C. Yu, "Estimating the Usefulness of Search Engines," Proceedings of the 15th International Conference on Data Engineering, pp. 146-153, 1999.
[17] M. Marchiori, "The Quest for Correct Information on the Web: Hyper Search Engines," The Sixth International WWW Conference (WWW 97), Santa Clara, USA, April 7-11, 1997.
[18] M.L. Mauldin, "Lycos: Design Choices in an Internet Search Service," IEEE Expert, (January-February), pp. 8-11, 1997.
[19] M. McGill, G. Salton, "Introduction to Modern Information Retrieval," New York: McGraw-Hill Book Company, 1983.
[20] W. Meng, C. Yu, "Principles of Database Query Processing for Advanced Applications," Morgan Kaufmann, San Francisco, 1998.
[21] G. Salton, C.S. Yang, "On the Specification of Term Values in Automatic Indexing," Journal of Documentation, Vol. 29, No. 4, December, pp. 351-372, 1973.
[22] C. Schwartz, "Web Search Engines," Journal of the American Society for Information Science, 49(12), pp. 973-982.
[23] M.E. Senko, "File Organization and Management Information Systems," Chapter 4, Annual Review of Information Science and Technology, C. Cuadra, editor, Vol. 4, Encyclopaedia Britannica, Chicago, Illinois, pp. 111-143, 1969.
[24] M.A. Sheldon, B. Vélez, R. Weiss, "HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering," Proceedings of the Seventh ACM Conference on Hypertext, pp. 180-193, 1996.
