

Chapter 3 Simple Bit-wise Indexing Method

3.4 Analysis and Experiments of Simple BWI Method

As mentioned above, the proposed matching algorithm includes two phases to reduce the computational time. In the relevant-records-retrieving phase, irrelevant prior records are filtered out. Thus, only the similarities between the relevant prior records and the new query are computed in the similarity-computing phase. Assume that the number of records in the target table is N and the average filtering percentage is M. With filtering, time is needed to retrieve the relevant saved records and to calculate their similarities, which includes the seek time in the Similarity Mapping List. Let $t_{and}$ be the time of one bit-wise AND operation, r the number of attributes, and $t_c$ the seek time in the Similarity Mapping List. If no filtering is performed, the time needed to calculate the similarities in STEP 3 of Algorithm 3.5 is analyzed as:

$$\mathrm{Time}_{\text{without filtering}} = N \times t_{and} + N \times (r \times t_{and}) + N \times t_c = N \times ((1 + r) \times t_{and} + t_c).$$

The performance due to the filtering is then:

)

The proposed method can indeed improve query performance, although some extra storage space is required. This storage space is used for storing the bit-wise indexes and the Similarity Mapping List. The sizes of the extra storage space required by our method are analyzed as follows.

• The storage space required for the bit-wise indexes TBWI is determined by the number of bits used for each attribute Ai, the number of attributes r, and the number of records |C| in the warehouse. For example, assume that there are 100000 records in a warehouse and 16 attributes to describe each record. Also assume that each attribute has 4 possible values.

• The Mask Vector requires additional storage space.

• The storage space required for the Similarity Mapping List L is $f \times (2^{r} - 1)$, where f is the storage space required for storing a similarity value. Assume that f is a 4-byte real number. For the above example, the storage space required for the Similarity Mapping List L = $4 \times (2^{16} - 1)$ bytes = 262140 bytes ≅ 256 K bytes.

Note that the size of the extra storage space required for the Similarity Mapping List is exponential in r. Therefore, the Similarity Mapping List is not suitable for domains with large numbers of attributes.
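The storage estimates above can be sketched in a few lines of Python. This is a minimal sketch under two assumptions that the surviving text does not state as formulas: the index stores, for every record, one bit per possible value of every attribute, and the Similarity Mapping List holds $2^{r} - 1$ entries of f bytes each. The function names are hypothetical.

```python
def bwi_index_bytes(num_records: int, values_per_attribute: list[int]) -> int:
    """Bytes for the bit-wise index matrix, assuming one bit per possible value."""
    bits_per_record = sum(values_per_attribute)
    total_bits = num_records * bits_per_record
    return (total_bits + 7) // 8  # round up to whole bytes


def similarity_list_bytes(num_attributes: int, entry_bytes: int) -> int:
    """Bytes for the Similarity Mapping List, assuming 2^r - 1 entries."""
    return entry_bytes * (2 ** num_attributes - 1)


# Example from the text: 100000 records, 16 attributes, 4 values each, f = 4 bytes.
index_bytes = bwi_index_bytes(100000, [4] * 16)
list_bytes = similarity_list_bytes(16, 4)
print(index_bytes, list_bytes)
```

Doubling the number of attributes to 32 under the same assumption would push the list alone past 16 billion bytes, which illustrates why the list does not scale to many attributes.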

The result of comparing the Simple BWI indexing method with the Bitmap indexing method is shown in Figure 3.1.

Figure 3.1: Simple BWI indexing method vs. Bitmap indexing method (retrieval time versus number of records, for BWI-CBR and Bitmap-CBR)

The Simple BWI method is faster than the Bitmap indexing method for the following reasons:

• In the relevant-case retrieval phase, the Bitmap indexing technique is not well suited to retrieving similar cases. For example, when a new case arrives, the Bitmap indexing method needs to check all possible attribute combination vectors in order to retrieve relevant prior cases. The more attributes are checked, the more time is needed.

• In the similarity measurement phase, the Bitmap indexing method needs to check all corresponding positions in all possible attribute combination vectors. The time wasted becomes very long when the number of attributes in the query or the number of records in the table T is large. Therefore, the BWI indexing method is faster than the Bitmap indexing method when similarity computation is needed.

We also compare the Simple BWI indexing method on a single processor with the parallel Simple BWI indexing method on multiple processors to show the performance improvement. As Figures 3.2 and 3.3 show, the dual-CPU parallel Simple BWI indexing method improves performance by a factor of about 1.6, and the quad-CPU parallel Simple BWI indexing method by a factor of about 3.2. The Simple BWI indexing method is well suited to parallelization because the bit-wise indexing matrix of the proposed method can be separated into several independent sub-matrices of almost equal size. Therefore, when the Simple BWI indexing method is deployed on a multi-CPU machine, the workload can easily be distributed across the processors so that each processor carries an almost balanced share.
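The row-wise decomposition described above can be illustrated as follows. This is a minimal sketch, not the experimental setup of this chapter: it assumes each index row is a Python integer used as a bit vector, and the helper names are hypothetical. Threads are used only to keep the example self-contained; in CPython a process pool would be needed to obtain a real speed-up on CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, parts):
    """Split rows into contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(rows), parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks


def filter_chunk(chunk, query):
    """Keep (record_id, AND result) pairs whose AND with the query is non-zero."""
    return [(rid, bwi & query) for rid, bwi in chunk if bwi & query]


def parallel_filter(rows, query, workers):
    """Filter each sub-matrix independently and merge the hits in row order."""
    chunks = split_rows(rows, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(filter_chunk, chunks, [query] * len(chunks)))
    return [hit for part in parts for hit in part]


# Toy index: record i has the bit vector i; the query selects bit 0b0100.
rows = [(i, i) for i in range(1, 11)]
print(parallel_filter(rows, 0b0100, workers=4))
```

Because every sub-matrix is processed without reference to the others, the merged result equals the sequential result regardless of the number of workers.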

Figure 3.2: Speed-up of parallel BWI indexing on a two-processor machine (efficiency versus number of records).

Figure 3.3: Speed-up of parallel BWI indexing on a four-processor machine (efficiency versus number of records).

Chapter 4 Advanced Bit-wise Indexing Method

In this chapter, the advanced bit-wise indexing methods, namely the Encapsulated bit-wise indexing method and the Compacted bit-wise indexing method, are introduced. The definitions and algorithms of the indexing and matching phases of these two bit-wise indexing methods are presented in the following sections.

4.1 Encapsulated Bit-wise Indexing Method

4.1.1 General Assumptions and Notations for Encapsulated BWI Technology

As we can see, the bit length of the bit-wise indexing vector of an attribute depends on the number of its distinct values. When an attribute contains a large number of distinct values, the size of its bit-wise indexing vector becomes very large. When the required bit length is too large to handle, partitioning the bit length into several levels helps. A threshold (Th) determines whether the encapsulated bit-wise indexing technique is applied: when the total length of the bit vectors is larger than Th, the algorithm is applied. The following notations and definitions are given to describe the encapsulated bit-wise indexing method.

NOTATION 4.1 :

$el_i$ = the maximum encapsulated level of attribute $A_i$

$ei_i^j$ = the bit length of the j-th encapsulated level of attribute $A_i$

$ei_i$ = the total bit length of the BWI index for the given attribute $A_i$, $ei_i = \sum_{j=1}^{el_i} ei_i^j$

$ei$ = the total bit length of the BWI index for the given record $R_i$, $ei = \sum_{j=1}^{r} ei_j$

$Th$ = the threshold for separating the bit length into levels

We propose an Encapsulated bit-wise indexing method for data warehouses to save storage and accelerate user queries. This method includes two phases: an index-creating phase and a querying phase. The indexing phase transforms the contents of a table into a bit vector matrix (here called a matrix of bit-wise indexes), and the query phase retrieves records to answer query statements as quickly as possible.

4.1.2 The Indexing Phase of Encapsulated BWI Method

The indexing phase includes the Encapsulated level calculating Algorithm, the Encapsulated BWI bit-wise indexes creating Algorithm and the Encapsulated BWI Matrix of bit-wise indexes creating Algorithm. The Encapsulated level calculating Algorithm calculates the encapsulated level of each attribute for creating the corresponding bit-wise indexes. The Encapsulated BWI bit-wise indexes creating Algorithm creates the corresponding multi-level BWI index of a record. The Encapsulated BWI Matrix of bit-wise indexes creating Algorithm creates the bit vector matrix of the data warehouse. These algorithms and examples are shown as follows.

In the Encapsulated bit-wise indexing method, there are several ways to decide the partition size of an indexing vector. Here, we use the square root to calculate the compact size of the indexing vector. For instance, when n bits are required to represent a specific attribute in the simple bit-wise indexing method, two indexing vectors of $\lceil\sqrt{n}\rceil$ bits each, that is, $2\lceil\sqrt{n}\rceil$ bits in total, are required in the two-level encapsulated bit-wise indexing method. For example, assume that attribute A uses 10,000 bits as its indexing vector when the simple bit-wise indexing method is applied. Only $2 \times \lceil\sqrt{10000}\rceil = 200$ bits are required in the two-level encapsulated bit-wise indexing method, so the number of bits used is reduced to 1/50. When the encapsulated bit-wise indexing method is used in data warehousing, the bits used are much more compact than with the bitmap and simple bit-wise indexing methods. Therefore, we propose the following definitions and algorithms.

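Algorithm 4.1 - Encapsulated level calculating Algorithm – Square Root :

Input: Table T of data warehouse and threshold Th.

Output: The corresponding $el_i$ and $ei_i^j$ for all attributes in A.

Step 1: Let $el_i = 1$ and $ei_i^1 = \alpha(i)$ for $1 \le i \le r$, and let $ei = \sum_{j=1}^{r} \sum_{k=1}^{el_j} ei_j^k$.

Step 2: If $ei > Th$, do the following sub-steps; otherwise go to Step 3.

Step 2.1: If there exists no $ei_i^j$ with $ei_i^j > 2 \times \lceil\sqrt{ei_i^j}\rceil$, return false because the threshold Th cannot be met; otherwise select such an $ei_i^j$ with minimal $el_i$ and j.

Step 2.2: Let $el_i = el_i + 1$, $ei = ei - ei_i^j + 2 \times \lceil\sqrt{ei_i^j}\rceil$, $ei_i^j = \lceil\sqrt{ei_i^j}\rceil$ and $ei_i^{el_i} = \lceil\sqrt{ei_i^j}\rceil$, where every square root is taken of the value of $ei_i^j$ before the update, and go to Step 2.

Step 3: Return the corresponding $el_i$ and $ei_i^j$ for all attributes in A.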

EXAMPLE 4.1:

Figure 4.1 shows a flat target table T with the attribute set A = < LotID, StepID, ToolID, Yield >, that is, four attributes, and 23 records. The value domains of LotID, StepID, ToolID, and Yield are V1=< 0001, 0002, 0003, ….., 0022, 0023 >, V2=< PS_1, PS_2, PS_3, PS_4, PS_5 >, V3=< AWOX11, AWOX12, AWOX13, AWOX14, AWOX21, AWOX31, AWOX32, AWOX33, AWOX34, AWOX35, AWOX36, AWOX41, AWOX42, AWOX43, AWOX51>, and V4=< 92.1, 92.2, 92.3, 93.1, 93.2, 94.3, 94.4, 94.5, 94.6, 95.5, 95.6, 95.7, 95.7, 95.8, 96.1, 96.5, 99.1, 99.3>, respectively.

It can easily be seen that the numbers of distinct values of LotID, StepID, ToolID and Yield are $ei_1 = 23$, $ei_2 = 5$, $ei_3 = 15$ and $ei_4 = 18$, respectively.

No.  LotID  StepID  ToolID  Yield
1    0001   PS_1    AWOX11  92.1
2    0002   PS_1    AWOX11  92.3
3    0003   PS_1    AWOX12  92.2
4    0004   PS_1    AWOX13  99.1
5    0005   PS_1    AWOX14  99.3
6    0006   PS_2    AWOX21  93.1
7    0007   PS_2    AWOX21  94.5
8    0008   PS_2    AWOX21  95.6
9    0009   PS_2    AWOX21  95.7
10   0010   PS_3    AWOX31  96.1
11   0011   PS_3    AWOX32  92.2
12   0012   PS_3    AWOX33  92.3
13   0013   PS_3    AWOX34  93.1
14   0014   PS_3    AWOX35  94.4
15   0015   PS_3    AWOX36  95.8
16   0016   PS_4    AWOX41  93.2
17   0017   PS_4    AWOX41  94.5
18   0018   PS_4    AWOX41  95.6
19   0019   PS_4    AWOX42  94.6
20   0020   PS_4    AWOX42  94.3
21   0021   PS_4    AWOX43  95.7
22   0022   PS_5    AWOX51  95.5
23   0023   PS_5    AWOX51  96.5

Figure 4.1: An example of Flat target table T

$ei = \sum_{j=1}^{4} \sum_{k=1}^{1} ei_j^k = 23 + 5 + 15 + 18 = 61$ bits is the length of an encoded record.

The threshold Th is set to 35 initially. All attributes initially have level 1, i.e., $el_1 = 1$, $el_2 = 1$, $el_3 = 1$ and $el_4 = 1$, and the vector lengths of the attributes are thus $ei_1^1 = 23$, $ei_2^1 = 5$, $ei_3^1 = 15$ and $ei_4^1 = 18$.

Since $ei > Th$, the attribute LotID, which has the maximum indexing string length $ei_1^1 = 23$, is chosen for length reduction. The encapsulation level of LotID becomes $el_1 = 1 + 1 = 2$, with $ei_1^1 = ei_1^2 = \lceil\sqrt{23}\rceil = 5$ and $ei_1 = 10$, and the total vector length ei is reduced to 48 (61 - 23 + 10). However, this length is still larger than the threshold Th, so the attribute Yield is chosen next. The encapsulation level of Yield becomes $el_4 = 1 + 1 = 2$, with $ei_4^1 = ei_4^2 = \lceil\sqrt{18}\rceil = 5$ and $ei_4 = 10$, and the total vector length ei is reduced to 40 (48 - 18 + 10). Since the length is still larger than the threshold Th, the attribute ToolID is then chosen. The encapsulation level of ToolID becomes $el_3 = 1 + 1 = 2$, with $ei_3^1 = ei_3^2 = \lceil\sqrt{15}\rceil = 4$ and $ei_3 = 8$, and the total vector length ei is reduced to 33 (40 - 15 + 8).

Finally, since the total vector length of 33 no longer exceeds Th, the algorithm stops.
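The level calculation traced in Example 4.1 can be sketched in Python. This is a minimal sketch, not a definitive implementation: it assumes that the level to split is the longest one that still benefits from splitting, which is what Example 4.1 does, and the function names are hypothetical.

```python
import math


def ceil_sqrt(n: int) -> int:
    """Smallest integer s with s * s >= n, for n >= 1."""
    return math.isqrt(n - 1) + 1


def encapsulated_levels(initial_lengths, threshold):
    """Split bit vectors into square-root levels until the total fits the threshold.

    Returns (levels, total), where levels[i] lists the bit length of each level
    of attribute i, or None when no further split can reduce the total.
    """
    levels = [[n] for n in initial_lengths]
    total = sum(initial_lengths)
    while total > threshold:
        # A level is worth splitting only if two square-root levels are shorter.
        candidates = [
            (n, i, j)
            for i, attr in enumerate(levels)
            for j, n in enumerate(attr)
            if n > 2 * ceil_sqrt(n)
        ]
        if not candidates:
            return None
        # Assumption: pick the longest level first, as Example 4.1 does.
        n, i, j = max(candidates, key=lambda c: (c[0], -c[1], -c[2]))
        s = ceil_sqrt(n)
        levels[i][j] = s      # the split level shrinks to ceil(sqrt(n)) bits
        levels[i].append(s)   # and one new level of the same length is added
        total += 2 * s - n
    return levels, total


# Example 4.1: LotID, StepID, ToolID, Yield with Th = 35.
print(encapsulated_levels([23, 5, 15, 18], 35))
```

For the lengths 23, 5, 15 and 18, the sketch splits LotID, then Yield, then ToolID, and stops at a total of 33 bits.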

As mentioned in Definition 3.5, the user can provide a suitable transforming function for continuous-type attributes, including numeric and date-time types. In the encapsulated BWI method, the user can provide an $el_i$-level transforming function $f_i$ for the attribute $A_i$ so that the index stays closer to the physical meaning of the values than the encapsulated BWI encoding alone. Here $f_i = \langle f_i^1, f_i^2, \ldots, f_i^{el_i} \rangle$, where the number of values in the value domain of $f_i^k$ should equal $ei_i^k$. The definition is shown below:

DEFINITION 4.1 – Encapsulated BWI bit-wise indexing vector of an attribute where Type(Ai)≠S :

The bit-wise indexing vector $B_i$ of the i-th attribute for the record $R_j$ in T is a set of bit strings $B_i = \langle B_i^1, B_i^2, \ldots, B_i^{el_i} \rangle$, where $f_i = \langle f_i^1, f_i^2, \ldots, f_i^{el_i} \rangle$ is the $el_i$-level function given by the user. Each $B_i^k = b_{j1} b_{j2} \ldots b_{j\,ei_i^k}$, where $b_{jl} = 1$ if $f_i^k(V_i(k)) = l$ and $b_{jl} = 0$ otherwise.

EXAMPLE 4.2 :

Assume that the value domain of the second attribute Recipe_degree is <10, 12, 14, 16, 18, 20, 22, 24>. Also assume that, after the Encapsulated level calculating Algorithm – Square Root is executed, $ei_2^1$, $ei_2^2$ and $el_2$ are all set to 2. The attribute value of Recipe_degree in the second record is 16. The user gives the following two-level (fl=2) functions $f_2^1$ and $f_2^2$:

$f_2^1(V_i(k)) = \lfloor V_i(k)/10 \rfloor$

$f_2^2(V_i(k)) = \lfloor (V_i(k) - (f_2^1(V_i(k)) \times 10))/5 \rfloor + 1$

For the value 16, $f_2^1(16) = 1$ and $f_2^2(16) = \lfloor 6/5 \rfloor + 1 = 2$. According to Definition 4.1, the bit-wise indexing method uses 4 bits as the bit vector of the index, in which every bit represents a specific function value of the index attribute Recipe_degree.

$B_2^1$:   $f_2^1(V_i(k)) = 1$   $f_2^1(V_i(k)) = 2$
           1                     0

$B_2^2$:   $f_2^2(V_i(k)) = 1$   $f_2^2(V_i(k)) = 2$
           0                     1

Therefore, we get $B_2 = B_2^1 B_2^2$ = "1001".
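Example 4.2 can be reproduced with a short Python sketch. The two functions below transcribe the user-given functions of the example; the helper names `one_hot` and `encode_recipe_degree` are hypothetical, and level positions are assumed to be numbered from 1, as in Definition 4.1.

```python
def one_hot(position: int, length: int) -> str:
    """Bit string of the given length with a single '1' at the 1-based position."""
    return "".join("1" if p == position else "0" for p in range(1, length + 1))


def f1(value: int) -> int:
    """First-level function of Example 4.2: floor(value / 10)."""
    return value // 10


def f2(value: int) -> int:
    """Second-level function of Example 4.2: floor((value - f1 * 10) / 5) + 1."""
    return (value - f1(value) * 10) // 5 + 1


def encode_recipe_degree(value: int, level_lengths=(2, 2)) -> str:
    """Concatenate the one-hot vectors of both levels into the attribute index."""
    return one_hot(f1(value), level_lengths[0]) + one_hot(f2(value), level_lengths[1])


print(encode_recipe_degree(16))
```

For the value 16 the first level selects position 1 and the second level selects position 2, which gives the bit string of the example.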

Algorithm 4.2 - Encapsulated BWI bit-wise indexes creating Algorithm :

Input: A record Ri.

Output: A bit-wise index BWIi of Ri.

Step 1: Create a bit-wise vector BWIi of length 0.

Step 2: Repeat the following sub-steps for each attribute Aj until all attributes are processed.

Step 2.1: If Type(Aj) ≠ S and $f_i \ne \emptyset$, go to Step 2.2; otherwise let m = n, where $V_j(i) = V_{jn}$, create an empty bit-wise vector $B_j$, and repeat the following sub-steps for each encapsulated level k until all encapsulated levels are processed.

Step 2.1.1: Let $B' = b_1 b_2 \ldots b_{ei_j^k}$ be an all-zero string of length $ei_j^k$.

Step 2.1.2: If $k \ne el_j$, go to Step 2.1.3; otherwise, if m = 0, set $b_{ei_j^k} = 1$, else set $b_m = 1$, and go to Step 2.1.5.

Step 2.1.3: Let $o = \lfloor m / \prod_{p=k+1}^{el_j} ei_j^p \rfloor$. If $o = ei_j^k$, set $b_o = 1$; otherwise set $b_{o+1} = 1$.

Step 2.1.4: Set $m = m - (o \times \prod_{p=k+1}^{el_j} ei_j^p)$.

Step 2.1.5: Concatenate the bit strings $B_j$ and $B'$ into $B_j$.

Step 2.2: If Type(Aj) ≠ S and $f_i \ne \emptyset$, do the following sub-steps for each $B_j^k$:

Step 2.2.2: Set $b_{jl} = 1$ if $f_i^k(V_i(k)) = l$ and $b_{jl} = 0$ otherwise.

Step 2.2.3: Concatenate the bit strings $B_j$ and $B_j^k$ into $B_j$.

Step 3: Concatenate the bit strings $B_1$, $B_2$, …, and $B_r$ into BWIi.

Step 4: Return the vector BWIi.

Algorithm 4.3 - Encapsulated BWI Matrix of bit-wise indexes creating Algorithm :

Input: Table T of the data warehouse.

Output: The TBWI of the data warehouse.

Step 1: Create an empty bit-wise indexes matrix TBWI for table T.

Step 2: Call Encapsulated level calculating Algorithm – Square Root (Algorithm 4.1) to get the corresponding els and eis.

Step 3: Repeat the following sub-steps for each record Ri until all records are processed.

Step 3.1: Use the Encapsulated BWI bit-wise index creation algorithm (Algorithm 4.2) to get the index BWIi of Ri.

Step 3.2: Add BWIi into TBWI.

Step 4: Return TBWI.

After a bit-wise index matrix is built, bit-wise operations can easily be used to retrieve the desired records for incoming queries.

EXAMPLE 4.3:

Assume that a Target Table T containing 23 records is shown in Figure 4.2 and the user gives the following two-level (fl=2) functions $f_2^1$ and $f_2^2$ of attribute Yield:

$f_2^1(V_i(k)) = \lceil (V_i(k) - 90)/2 \rceil$

$f_2^2(V_i(k)) = \lceil (V_i(k) - (90 + (f_2^1(V_i(k)) - 1) \times 2))/0.4 \rceil$

The bit-wise indexes for the above records are shown in Table 4.1.

Table 4.1: The TBWI of 23 records in Figure 4.2

        LotID                StepID     ToolID               Yield
BWI     ei_1^1   ei_1^2      ei_2^1     ei_3^1   ei_3^2      ei_4^1   ei_4^2
BWI1    10000    10000       10000      1000     1000        01000    10000
BWI2    10000    01000       10000      1000     1000        01000    10000
BWI3    10000    00100       10000      1000     0100        01000    10000
BWI4    10000    00010       10000      1000     0010        00001    00100
BWI5    10000    00001       10000      1000     0001        00001    00010
BWI6    01000    10000       01000      0100     1000        01000    00100
BWI7    01000    01000       01000      0100     1000        00100    01000
BWI8    01000    00100       01000      0100     1000        00100    00010
BWI9    01000    00010       01000      0100     1000        00100    00001
BWI10   01000    00001       00100      0100     0100        00010    10000
BWI11   00100    10000       00100      0100     0010        01000    10000
BWI12   00100    01000       00100      0100     0001        01000    10000
BWI13   00100    00100       00100      0010     1000        01000    00100
BWI14   00100    00010       00100      0010     0100        00100    01000
BWI15   00100    00001       00100      0010     0010        00100    00001
BWI16   00010    10000       00010      0010     0001        01000    00010
BWI17   00010    01000       00010      0010     0001        00100    01000
BWI18   00010    00100       00010      0010     0001        00100    00010
BWI19   00010    00010       00010      0001     1000        00100    01000
BWI20   00010    00001       00010      0001     1000        00100    10000
BWI21   00001    10000       00010      0001     0100        00100    00001
BWI22   00001    01000       00001      0001     0010        00100    00010
BWI23   00001    00100       00001      0001     0010        00010    01000
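The LotID and ToolID columns of Table 4.1 follow a two-level positional pattern, which the Python sketch below reproduces. It is an illustration under stated assumptions, not a transcription of Algorithm 4.2: values are ranked from 0 in domain order, both levels have length $\lceil\sqrt{n}\rceil$, and the first level holds the quotient and the second level the remainder of the rank. The function names are hypothetical.

```python
import math


def ceil_sqrt(n: int) -> int:
    """Smallest integer s with s * s >= n, for n >= 1."""
    return math.isqrt(n - 1) + 1


def one_hot(position: int, length: int) -> str:
    """Bit string of the given length with a single '1' at the 0-based position."""
    return "".join("1" if p == position else "0" for p in range(length))


def two_level_index(rank: int, num_values: int) -> str:
    """Encode a 0-based rank among num_values distinct values on two levels."""
    s = ceil_sqrt(num_values)
    return one_hot(rank // s, s) + " " + one_hot(rank % s, s)


# LotID 0023 is the 23rd of 23 values; ToolID AWOX51 is the 15th of 15 values.
print(two_level_index(22, 23))
print(two_level_index(14, 15))
```

The Yield columns are not covered by this sketch because they come from the user-given functions of Example 4.3 rather than from the rank of the value.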

4.1.3 The Matching Phase of Encapsulated BWI Method

Calculating the similarities between a query and saved records is a time-consuming task. A two-phase matching approach, called the Encapsulated BWI Similar-records-seeking algorithm, is thus proposed here to reduce the matching time. It includes the Encapsulated BWI relevant-records-retrieving phase and the Encapsulated BWI similarity-computing phase. In the first phase, all irrelevant records are filtered out to avoid calculating their similarities, which reduces the time spent on similarity calculation. The similarities of the query with the remaining saved records are then computed efficiently in the similarity-computing phase. The algorithm is described as follows.

Algorithm 4.4 - Encapsulated BWI Similar-records-seeking algorithm :

Input : A bit-wise index matrix TBWI and a new query RN.

Output : A set of similar record Rc with their similarity degrees with RN.

Step 1: Use the Encapsulated BWI bit-wise index creation algorithm (Algorithm 4.2) to get the index BWIN of the new query RN according to the condition part of the query.

Step 2: Initialize the counter j to 1 and Rc to an empty set.

Step 3: For each BWIj in TBWI (1 ≤ j ≤ |R|), do the following sub-steps:

Step 3.1: Call the Encapsulated BWI search-relevant-records algorithm (Algorithm 4.5) to compute the relevance degree rdij between BWIN and BWIj.

Step 3.2: If rdij=0, ignore the record Rj and go to Step 3.5.

Step 3.3: Call the Encapsulated BWI similarity-computing algorithm (Algorithm 4.7) to compute the similarity simj between RN and Rj.

Step 3.4: Add record Rj with its similarity simj to Rc.

Step 3.5: Add 1 to j.

Step 4: Sort the results in Rc in descending order of their similarities.

Step 5: Output Rc.

Even though the encoding procedure of the BWI index in the Encapsulated BWI method differs from that of the Simple one, relevant records can still be found easily by using the ‘AND’ bit-wise operation to compare the two bit vectors. The following Encapsulated BWI Search-relevant-records algorithm is thus proposed to achieve this purpose.

Algorithm 4.5 - Encapsulated BWI Search-relevant-records algorithm :

Input: The bit-wise indexing vector BWIN of a new query RN and the index BWIj of a saved record Rj in R.

Output: The relevant degree rdij between RN and Rj.

Step 1: Use the ‘AND’ bit-wise operation on BWIN and BWIj and store the result as rdij, which is also a bit string.

Step 2: Return rdij.

Since the ‘AND’ bit-wise operation is fast, the Search-relevant-records algorithm selects relevant saved records quickly. If rdi is zero, the saved record is regarded as irrelevant and is filtered out. Because of the properties of the Encapsulated BWI encoding, however, the presence of some ‘1’ bits in rdi does not by itself mean that the saved record is relevant. As mentioned above, a matching function based on a weighted sum of matched attributes is defined to calculate the similarity degrees. As in the Simple BWI method, the Mask Vector and the Similarity Mapping List are used in the Encapsulated BWI method; they are defined in Definitions 4.2 and 4.3.

DEFINITION 4.2 - Encapsulated BWI Mask Vector :

An Encapsulated BWI bit-wise indexing mask vector eMask is a set of eMaskk, where 0 < k ≤

DEFINITION 4.3 - Encapsulated BWI Similarity Mapping List :

Let L be an Encapsulated BWI Similarity Mapping List and Li be an element in L

with an index value i, which is determined from the attributes matched, 1≤i≤ =

r

Algorithm 4.6 - Encapsulated BWI Similarity-mapping-list creation algorithm :

Input: Weights of attributes W1, W2, …, Wr of R.

Output: A similarity mapping list L.

Step 1: Initialize the counter i to 1 and the list L to be empty.

Step 2: For each i, $1 \le i \le 2^{\sum_{k=1}^{r} el_k} - 1$, do the following sub-steps:

Step 2.1: Encode i into a binary string <bi1bi2

Step 2.2: Calculate the similarity degree Li by the formula in Definition 4.3.

Step 2.3: Put Li into the list L with index i.

Step 3: Return L.
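The idea of a precomputed Similarity Mapping List can be sketched in Python as follows. This is a simplified sketch under assumptions that go beyond the surviving text: it uses one match bit per attribute rather than one per encapsulated level, it takes the similarity of an index to be the sum of the weights of the matched attributes, and it treats the most significant bit as the first attribute. The function name is hypothetical.

```python
def build_similarity_list(weights):
    """Map every non-zero match pattern to the weighted sum of matched attributes."""
    r = len(weights)
    table = {}
    for i in range(1, 2 ** r):
        bits = format(i, f"0{r}b")  # leftmost bit corresponds to the first weight
        table[i] = sum(w for w, b in zip(weights, bits) if b == "1")
    return table


# Weights of StepID, ToolID and Yield as used in Example 4.4.
similarity_list = build_similarity_list([0.4, 0.4, 0.2])
print(similarity_list[0b100], similarity_list[0b001])
```

Once the list is built, the similarity of a record is a single lookup by match pattern, at the cost of a table that grows exponentially with the number of bits in the pattern.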

After the Similarity Mapping List has been built, the similarity of each saved record and a new query can be quickly found by the following algorithm.

Algorithm 4.7 - Encapsulated BWI Similarity-computing algorithm :

Input: The relevant degree rdij of record Rj with a new query, the Mask Vector, and the Similarity Mapping List L.

Output: The similarity of Rj with a new record.

Step 1: Initialize a zero binary string of length r.

Step 2: For each i, $1 \le i \le \sum_{k=1}^{r} el_k$, set the i-th position in the string to 1 if AND(eMaski, rdij) = AND(eMaski, BWIN).

Step 3: Transform the binary string into an integer j.

Step 4: Get Lj from the Similarity Mapping List.

Step 5: Return Lj.
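The two matching phases (Algorithms 4.4, 4.5 and 4.7) can be sketched together in Python. This is a minimal sketch under stated assumptions: bit vectors are held as Python integers, one mask per queried attribute is used, the similarity is computed directly as the weighted sum of matched attributes instead of through the Similarity Mapping List, and the Yield mask is assumed by analogy with the other two. The function names are hypothetical; the data are rows 1, 4 and 10 of Table 4.1 and the query {StepID=PS_1, ToolID=AWOX13, Yield=99.1}.

```python
def to_int(bits: str) -> int:
    """Parse a space-separated bit string into an integer bit vector."""
    return int(bits.replace(" ", ""), 2)


def relevance(bwi_query: int, bwi_record: int) -> int:
    """Relevance degree: the bit-wise AND of the query and record indexes."""
    return bwi_query & bwi_record


def similarity(rd: int, bwi_query: int, masks, weights) -> float:
    """Add the weight of every attribute whose masked bits agree with the query."""
    total = 0.0
    for mask, weight in zip(masks, weights):
        if (rd & mask) == (bwi_query & mask):
            total += weight
    return total


def seek_similar(bwi_query: int, table, masks, weights):
    """Filter out records with a zero AND result, then rank the rest."""
    results = []
    for record_id, bwi_record in table.items():
        rd = relevance(bwi_query, bwi_record)
        if rd == 0:
            continue  # irrelevant record, skipped without computing a similarity
        results.append((record_id, similarity(rd, bwi_query, masks, weights)))
    results.sort(key=lambda item: item[1], reverse=True)
    return results


QUERY = to_int("00000 00000 10000 1000 0010 00001 00100")
MASKS = [
    to_int("00000 00000 11111 0000 0000 00000 00000"),  # StepID
    to_int("00000 00000 00000 1111 1111 00000 00000"),  # ToolID
    to_int("00000 00000 00000 0000 0000 11111 11111"),  # Yield (assumed)
]
WEIGHTS = [0.4, 0.4, 0.2]
TABLE = {
    1: to_int("10000 10000 10000 1000 1000 01000 10000"),
    4: to_int("10000 00010 10000 1000 0010 00001 00100"),
    10: to_int("01000 00001 00100 0100 0100 00010 10000"),
}
ranked = seek_similar(QUERY, TABLE, MASKS, WEIGHTS)
print(ranked)
```

Under these assumptions Record 10 is filtered out because its AND result is zero, Record 1 matches only StepID, and Record 4 matches all three queried attributes and is ranked first.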

EXAMPLE 4.4:

Continuing from Example 4.3, the BWIN of a new query RN, which is {StepID=PS_1, ToolID=AWOX13, Yield=99.1}, is < $ei_1^1$=00000 $ei_1^2$=00000 $ei_2^1$=10000 $ei_3^1$=1000 $ei_3^2$=0010 $ei_4^1$=00001 $ei_4^2$=00100 >. Also assume that the weights W2, W3 and W4 are set to 0.4, 0.4 and 0.2, respectively. Each BWIj in TBWI in Table 4.1 is processed as follows.

• For BWI1, BWI2 and BWI3: the relevant degrees rdi1, rdi2 and rdi3 between BWI1, BWI2, BWI3 and BWIN are all found as <00000 00000 10000 1000 0000 00000 00000> by the Encapsulated BWI Search-relevant-records algorithm. Since more than one bit in rdi1 is "1", Records 1, 2 and 3 are possible relevant records. According to Definition 4.2, eMask2 = <00000 00000 11111 0000 0000 00000 00000> and eMask3 = <00000 00000 00000 1111 1111 00000 00000>. The result of AND(eMask2, rdi1) = <00000 00000 10000 0000 0000 00000 00000> is equal to the result of AND(eMask2, BWIN) = <00000 00000 10000 0000 0000 00000 00000>, so StepID matches. The result of AND(eMask3, rdi1) = <00000 00000 00000 1000 0000 00000 00000> is not equal to the result of AND(eMask3, BWIN) = <00000 00000 00000 1000 0010 00000 00000>, so ToolID does not match. The similarities of Records 1, 2 and 3 are therefore found as 0.4 via Algorithm 4.7. Records 1, 2 and 3 are then the relevant records.

• For BWI4: The relevant degree rdi4 between BWI4 and BWIN is found as <00000 00000 10000 1000 0010 00001 00100> by the Encapsulated BWI Search-relevant-records algorithm. Since more than one bit in rdi4 is "1", Record 4 is a possible relevant record. According to Definition 4.2, eMask2 = <00000 00000 11111 0000 0000 00000 00000>, eMask3=<00000 00000 00000 1111 1111