The Matching Phase of Simple BWI Method - Simple Bit-wise Indexing Method

Chapter 3 Simple Bit-wise Indexing Method

3.3 The Matching Phase of Simple BWI Method

Calculating the similarities between a query and saved records is a time-consuming task. A two-phase matching approach, called the Similar-records-seeking algorithm, is thus proposed here to reduce the matching time.

It includes the relevant-records-retrieving phase and the similarity-computing phase. In the first phase, all irrelevant records are filtered out so that their similarities need not be calculated, which reduces the total time spent on computing similarities. In the second phase, the similarities between the query and the remaining saved records are computed. The algorithm is described as follows.

Algorithm 3.3 - Similar-records-seeking algorithm :

Input : A bit-wise index matrix TBWI and a new query RN.

Output : A set Rc of similar records, each with its similarity degree to RN.

Step 1: Use the bit-wise index creation algorithm (Algorithm 3.2) to get the index BWIN of the new query RN according to the condition part of the query.

Step 2: Initialize the counter j to 1 and Rc to an empty set.

Step 3: For each BWIj in TBWI, do the following sub-steps (1≤j≤|R|):

Step 3.1: Call the search-relevant-records algorithm (Algorithm 3.4) to compute the relevance degree rdij between BWIN and BWIj.

Step 3.2: If rdij=0, ignore the record Rj and go to Step 3.5.

Step 3.3: Call the similarity-computing algorithm (Algorithm 3.6) to compute the similarity simj between RN and Rj.

Step 3.4: Add record Rj with its similarity simj to Rc.

Step 3.5: Add 1 to j.

Step 4: Sort the results in Rc in descending order of their similarities.

Step 5: Output Rc.
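The control flow of Algorithm 3.3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: bit-wise indexes are assumed to be stored as Python integers, `seek_similar_records` and its parameters are hypothetical names, and the `similarity` callback stands in for Algorithm 3.6. The sample indexes of Records 2 and 3 are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple


def seek_similar_records(
    tbwi: Dict[int, int],
    bwi_n: int,
    similarity: Callable[[int], float],
) -> List[Tuple[int, float]]:
    """Filter records with bit-wise AND, then score only the survivors."""
    rc: List[Tuple[int, float]] = []
    for record_id, bwi_j in tbwi.items():
        rd = bwi_n & bwi_j  # Step 3.1: relevance degree as a bit string
        if rd == 0:  # Step 3.2: no shared attribute value, skip the record
            continue
        rc.append((record_id, similarity(rd)))  # Steps 3.3 and 3.4
    rc.sort(key=lambda pair: pair[1], reverse=True)  # Step 4
    return rc


# Stand-in scorer (assumption): three equally weighted attributes, so the
# similarity is the fraction of attributes whose bit survived the AND.
def equal_weight_similarity(rd: int) -> float:
    return bin(rd).count("1") / 3


query = 0b10000_10000_100  # <10000 10000 100>
index_matrix = {
    1: 0b10000_10000_100,  # matches all three attributes
    2: 0b10000_01000_010,  # hypothetical: matches the first attribute only
    3: 0b01000_00100_001,  # hypothetical: matches nothing
}
ranked = seek_similar_records(index_matrix, query, equal_weight_similarity)
```

In this sketch the third record never reaches the scoring callback, which is the point of the first phase: the cheap AND test decides whether the more expensive similarity step runs at all.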

A new query is first transformed into a bit-wise index by the bit-wise index creation algorithm. A saved record is relevant to the new query if the two have at least one attribute value in common; the saved record is then similar to the new query to a certain degree.

For each matched attribute, the bit in the corresponding position is set to "1" in both bit vectors. These positions can easily be found by applying the ‘AND’ bit-wise operation to the two bit vectors. The following Search-relevant-records algorithm is thus proposed to achieve this purpose.

Algorithm 3.4 - Search-relevant-records algorithm :

Input: The bit-wise indexing vector BWIN of a new query RN and the index BWIj of a saved record Rj in R.

Output: The relevance degree rdij between RN and Rj.

Step 1: Use the ‘AND’ bit-wise operation on BWIN and BWIj and store the result as rdij, which is also a bit string.

Step 2: Return rdij.

Since the ‘AND’ bit-wise operation is fast, the Search-relevant-records algorithm selects relevant saved records quickly. If rdij is zero, the saved record is considered irrelevant and is filtered out.
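As an illustration of the AND-based relevance test, the following Python sketch parses bit strings written in the `<10000 10000 100>` style and computes the relevance degree. The helper names `parse_bits` and `format_bits`, the 5+5+3 segment layout, and the second record are assumptions made for the example only.

```python
def parse_bits(text: str) -> int:
    """Turn a string such as '<10000 10000 100>' into an integer."""
    return int(text.strip("<>").replace(" ", ""), 2)


def search_relevant_records(bwi_n: int, bwi_j: int) -> int:
    """Algorithm 3.4: the relevance degree is the bit-wise AND."""
    return bwi_n & bwi_j


def format_bits(value: int, widths=(5, 5, 3)) -> str:
    """Render an integer back into per-attribute groups for display."""
    bits = format(value, f"0{sum(widths)}b")
    groups, start = [], 0
    for width in widths:
        groups.append(bits[start:start + width])
        start += width
    return "<" + " ".join(groups) + ">"


bwi_n = parse_bits("<10000 10000 100>")
bwi_same = parse_bits("<10000 10000 100>")
bwi_other = parse_bits("<01000 00100 010>")  # hypothetical record

rd_same = search_relevant_records(bwi_n, bwi_same)
rd_other = search_relevant_records(bwi_n, bwi_other)
```

Here `rd_other` is zero, so that record would be filtered out, while `rd_same` keeps one set bit per matched attribute.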

After all relevant saved records have been retrieved, their similarities to the query condition are computed. As mentioned above, a matching function based on a weighted sum of matched attributes is defined to calculate the similarity degrees, and each attribute has its own weight. Since a record has only one value for each attribute, at most one bit in the bit string rdij is set for each attribute after the Search-relevant-records algorithm is executed. Accordingly, a special bit-wise vector, called the Mask Vector, is proposed to help compute similarities. Let <1> be the string of length α with all bits being 1 and <0> be the string of length α with all bits being 0. The definition of the Mask Vector is shown below.

DEFINITION 3.8 - Mask Vector:

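A bit-wise indexing mask vector Mask is a set of Maskk, where 0 < k ≤ r and r is the number of attributes. Each Maskk, denoting the mask vector of attribute Ak, is a concatenation of r bit strings:

$$Mask_k = m_{k1}\, m_{k2} \cdots m_{kr}, \qquad m_{kj} = \begin{cases} \langle 1 \rangle & \text{if } j = k, \\ \langle 0 \rangle & \text{otherwise,} \end{cases}$$

where the length α of each string m_kj is the number of bits used for attribute Aj in the bit-wise index.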

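By applying the 'AND' operation on Maskk and the bit-wise vector rdij generated by the Search-relevant-records algorithm, it can easily be determined whether attribute Ak is matched. The similarity between a query and a saved record Rj is then the weighted sum of the matched attributes:

$$sim_j = \sum_{k=1}^{r} W_k \times match_k, \qquad match_k = \begin{cases} 1 & \text{if } (Mask_k \ \text{AND} \ rd_{ij}) \neq \langle 0 \rangle, \\ 0 & \text{otherwise.} \end{cases}$$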
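A sketch of how the mask vectors and the weighted sum of matched attributes could be computed is given below. It assumes that bit vectors are Python integers, that attribute A1 occupies the leftmost segment, and that Maskk has 1-bits exactly in the segment of attribute Ak. The names `build_masks` and `weighted_similarity` are hypothetical, and the 5+5+3 layout with weights of 0.33 is used only as sample input.

```python
from typing import List, Sequence


def build_masks(widths: Sequence[int]) -> List[int]:
    """Return one mask per attribute; widths[k] is the segment length."""
    total = sum(widths)
    masks, used = [], 0
    for width in widths:
        ones = (1 << width) - 1  # <1> of the segment length
        masks.append(ones << (total - used - width))  # pad with <0> strings
        used += width
    return masks


def weighted_similarity(
    rd: int, masks: Sequence[int], weights: Sequence[float]
) -> float:
    """Sum the weights of attributes whose segment of rd is not all zero."""
    return sum(w for m, w in zip(masks, weights) if rd & m)


masks = build_masks([5, 5, 3])
sim_all = weighted_similarity(0b10000_10000_100, masks, [0.33, 0.33, 0.33])
sim_one = weighted_similarity(0b10000_00000_000, masks, [0.33, 0.33, 0.33])
```

Because each mask isolates one attribute segment, a single AND per attribute is enough to tell whether that attribute contributed its weight.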

Several saved records may have the same similarity to a new query, as long as the same attributes are matched. This is especially common when the numbers of possible values for attributes are large. In this situation, the cost of calculating the similarities of relevant saved records can be reduced if all possible similarities are pre-computed and stored in the Similarity Mapping List. Each element in the Similarity Mapping List is the similarity value for one combination of matched attributes. Thus, once the matched attributes are known, the similarity of a saved record to a new query can be looked up in the list instead of being calculated by the above formula. The Similarity Mapping List is formally defined as follows.

DEFINITION 3.9 - Similarity Mapping List:

Let L be a Similarity Mapping List and Li be an element in L with an index value i, which is determined from the attributes matched, 1≤i≤2^r-1. Let i be represented as a binary code bi1bi2…bir, with bij=1 if the j-th attribute is matched and bij=0 otherwise, 1≤j≤r. The value of Li is thus

$$L_i = \sum_{j=1}^{r} W_j \times b_{ij},$$

where Wj is the weight of the j-th attribute.

Algorithm 3.5 - Similarity-mapping-list creation algorithm :

Input: Weights of attributes W1, W2, …, Wr of R.

Output: A similarity mapping list L.

Step 1: Initialize the counter i to 1 and the list L to be empty.

Step 2: For each i, 1≤i≤2^r-1, do the following sub-steps:

Step 2.1: Encode i into a binary string <bi1bi2…bir>.

Step 2.2: Calculate the similarity degree Li by the formula in Definition 3.9.

Step 2.3: Put Li into the list L with index i.

Step 3: Return L.
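Algorithm 3.5 can be sketched in Python as below. The sketch assumes that Li is the sum of the weights of the matched attributes and that the leftmost bit of the binary code corresponds to the first attribute. The function name and the sample weights are hypothetical.

```python
from typing import Dict, Sequence


def build_similarity_mapping_list(weights: Sequence[float]) -> Dict[int, float]:
    """Pre-compute the similarity for every non-empty set of matched attributes."""
    r = len(weights)
    mapping: Dict[int, float] = {}
    for i in range(1, 2 ** r):  # Step 2: 1 <= i <= 2^r - 1
        code = format(i, f"0{r}b")  # Step 2.1: binary string b_i1 ... b_ir
        # Step 2.2: add W_j for every position j whose bit is 1
        mapping[i] = sum(w for w, b in zip(weights, code) if b == "1")
    return mapping


# Hypothetical weights for three attributes.
sample_list = build_similarity_mapping_list([0.5, 0.3, 0.2])
```

The list has 2^r - 1 entries, so it stays small as long as the number of attributes r is modest; the number of possible values per attribute does not affect its size.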

After the Similarity Mapping List has been built, the similarity of each saved record and a new query can be quickly found by the following algorithm.

Algorithm 3.6 - Similarity-computing algorithm :

Input: The relevance degree rdij of record Rj with a new query, the Mask Vector, and the Similarity Mapping List L.

Output: The similarity of Rj with the new query.

Step 1: Initialize a zero binary string of length r.

Step 2: For each i, 1 ≤ i ≤ r, set the i-th position in the string to 1 if the result of using the ‘AND’ bit-wise operation on Maski and rdij is not all 0.

Step 3: Transform the binary string into an integer j.

Step 4: Get Lj from the Similarity Mapping List.

Step 5: Return Lj.

Since the Similarity Mapping List and the Mask Vector are constructed in the pre-processing step, and since only ‘AND’ bit-wise operations are executed on the Mask Vectors and the bit-wise vectors of relevant records in the Similarity-computing algorithm, the computational time for finding the similarities can be significantly reduced.
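The lookup performed by Algorithm 3.6 can be sketched as follows. The masks and the mapping list are written out by hand here, assuming a 5+5+3-bit layout and equal weights of 0.33; `compute_similarity`, `MASKS` and `MAPPING` are hypothetical names, and the list contents assume that each entry is the sum of the matched weights.

```python
from typing import Dict, Sequence


def compute_similarity(
    rd: int, masks: Sequence[int], mapping: Dict[int, float]
) -> float:
    """Build the matched-attribute code with AND tests, then look it up."""
    index = 0
    for mask in masks:  # masks are ordered A1 .. Ar
        matched = 1 if rd & mask else 0  # Step 2: is this segment non-zero?
        index = (index << 1) | matched  # Steps 1 and 3 folded together
    # rd is non-zero for relevant records, so index is at least 1 here.
    return mapping[index]  # Steps 4 and 5


# Pre-processing results for the assumed layout.
MASKS = [0b11111_00000_000, 0b00000_11111_000, 0b00000_00000_111]
MAPPING = {i: 0.33 * bin(i).count("1") for i in range(1, 8)}

sim_full = compute_similarity(0b10000_10000_100, MASKS, MAPPING)
sim_partial = compute_similarity(0b00000_10000_000, MASKS, MAPPING)
```

No weight arithmetic happens at query time in this sketch: each record costs r AND operations and one list lookup.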

EXAMPLE 3.5:

Continuing from Example 3.4, the BWIN of a new query RN, which is {Toolid=6210, Name=AWOX01, Location=FAB1}, is <10000 10000 100>. Also assume that the weights W1, W2 and W3 are all set to 0.33. Each BWIj in TBWI in Table 3.2 is processed as follows.

• For BWI1, the relevance degree rdi1 between BWI1 and BWIN is found as <10000 10000 100> by the Search-relevant-records algorithm. Since at least one bit in rdi1 is "1", Record 1 is a relevant record. Its similarity is found as 1.

• For BWI3, BWI4 and BWI5, the relevance degrees between these records and BWIN are all found as <00000 00000 000>. Since all the bits in these relevance degrees are "0", Records 3, 4 and 5 are filtered out.

After the relevant records are sorted in decreasing order of similarities, the results are shown in Table 3.3.

Table 3.3: Two relevant records and their similarities

Relevant Record    Record 1    Record 2
Similarity         1           0.333
