SYMBOLIC GRAY CODE AS A PERFECT MULTIATTRIBUTE HASHING SCHEME FOR PARTIAL MATCH QUERIES

C. C. CHANG, R. C. T. LEE, MEMBER, IEEE, AND M. W. DU, MEMBER, IEEE

Abstract-In this paper, we shall show that the symbolic Gray code hashing mechanism is not only good for best matching, but also good for partial match queries. Essentially, we shall propose a new hashing scheme, called bucket-oriented symbolic Gray code, which can be used to produce any arbitrary Cartesian product file, which has been shown to be good for partial match queries. Many interesting properties of this new multiattribute hashing scheme, including the property that it is a perfect hashing scheme, have been discussed and proved.

Index Terms-Bucket-oriented symbolic Gray code, Cartesian product file, multiattribute file organization, partial match query, perfect hashing, symbolic Gray code.

I. THE PARTIAL MATCHING PROBLEM

IN this paper, we are concerned with partial match query systems [1], [3], [5], [6], [10], [18]-[21], [23]. We assume that we are dealing with a multiattribute file consisting of a set of multiattribute records. Each record is characterized by attributes $A_1, A_2, \ldots, A_N$. A partial match query is a query of the following form: retrieve all records where $A_{i_1} = a_{i_1}, \ldots, A_{i_k} = a_{i_k}$, where $0 < k < N$.

We shall assume that all of the records are divided into buckets and stored on disks. Each time a partial match query is processed, one or more disk accesses are performed. Since disk accessing is much more time-consuming than any other processing in the internal main memory, we shall measure the performance of our file system by the number of buckets necessary to be examined.

Let us consider Tables I and II. In both tables, a query $(A_1 = a, A_2 = *)$ denotes a partial match query which retrieves all of the records with $A_1$ equal to $a$, where $A_2$ can be any value, i.e., a don't care condition. It can be seen that the average number of buckets to be examined, over all possible partial match queries, is 2 for the file system in Table II and 4 for the file system in Table I.

Thus, our multiattribute file system design problem for partial match queries can be stated as follows: given a set of multiattribute records, the problem is to arrange the records in such a way that the average number of buckets to be examined, over all possible partial match queries, is minimized.

Unfortunately, a solution to the above stated problem is still at large. In other words, given an arbitrary set of multiattribute records, there is no efficient algorithm to find an optimal arrangement of records into buckets. However, if all records are present, under certain conditions, it is possible to have an optimal solution. We shall elaborate on this in the following section.

Manuscript received April 24, 1981; revised July 8, 1981.

C. C. Chang and M. W. Du are with the Institute of Computer Engineering, National Chiao-Tung University, Hsinchu, Taiwan.

R. C. T. Lee is with the Institute of Computer and Decision Sciences, National Tsing Hua University, Hsinchu, Taiwan.

TABLE I
(A) AN ARRANGEMENT OF 16 RECORDS INTO FOUR BUCKETS. (B) BUCKETS TO BE EXAMINED FOR ALL POSSIBLE QUERIES FOR (A)

(A)
Bucket 1    Bucket 2    Bucket 3    Bucket 4
(a,a)       (a,b)       (a,c)       (a,d)
(b,b)       (b,c)       (b,d)       (b,a)
(c,c)       (c,d)       (c,a)       (c,b)
(d,d)       (d,a)       (d,b)       (d,c)

(B)
Queries     Buckets to be examined
(a,*)       1,2,3,4
(b,*)       1,2,3,4
(c,*)       1,2,3,4
(d,*)       1,2,3,4
(*,a)       1,2,3,4
(*,b)       1,2,3,4
(*,c)       1,2,3,4
(*,d)       1,2,3,4

TABLE II
(A) ANOTHER ARRANGEMENT OF 16 RECORDS INTO FOUR BUCKETS. (B) BUCKETS TO BE EXAMINED FOR ALL POSSIBLE QUERIES FOR (A)

(A)
Bucket 1    Bucket 2    Bucket 3    Bucket 4
(a,a)       (a,c)       (c,a)       (c,c)
(a,b)       (a,d)       (c,b)       (c,d)
(b,a)       (b,c)       (d,a)       (d,c)
(b,b)       (b,d)       (d,b)       (d,d)

(B)
Queries     Buckets to be examined
(a,*)       1,2
(b,*)       1,2
(c,*)       3,4
(d,*)       3,4
(*,a)       1,3
(*,b)       1,3
(*,c)       2,4
(*,d)       2,4

II. THE CARTESIAN PRODUCT FILE CONCEPT

Before presenting the Cartesian product file concept, let us

assume that each record is characterized by N attributes $A_1, A_2, \ldots, A_N$. Each $A_i$ is associated with a set $D_i$, which is the domain of $A_i$. The size of domain $D_i$ is denoted as $q_i$. The domain of the file F, consisting of all possible records, is thus $D_1 \times D_2 \times \cdots \times D_N$. The total number of records, or the size of F, denoted as NR, is $q_1 q_2 \cdots q_N$. We shall assume that the entire file is divided into NB buckets: $B_1, B_2, \ldots, B_{NB}$.

Definition: A file system is called a Cartesian product file if all records in every bucket are in the form of $D_{1s_1} \times D_{2s_2} \times \cdots \times D_{Ns_N}$, where $D_{is_i}$ is a subset of $D_i$. This bucket is denoted as $[s_1, s_2, \ldots, s_N]$.

Example 2.1: Let $D_1 = D_2 = \{a, b, c, d\}$. Let $D_{11} = D_{21} = \{a, b\}$ and $D_{12} = D_{22} = \{c, d\}$. Then the following file system is a Cartesian product file:

Bucket [1, 1] = $D_{11} \times D_{21}$ = {(a, a), (a, b), (b, a), (b, b)}
Bucket [1, 2] = $D_{11} \times D_{22}$ = {(a, c), (a, d), (b, c), (b, d)}
Bucket [2, 2] = $D_{12} \times D_{22}$ = {(c, c), (c, d), (d, c), (d, d)}
Bucket [2, 1] = $D_{12} \times D_{21}$ = {(c, a), (c, b), (d, a), (d, b)}.
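The construction in Example 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `cartesian_product_file` and `buckets_examined`, and the use of `None` for a don't care attribute, are hypothetical and do not come from the paper.

```python
from itertools import product

def cartesian_product_file(subdomains):
    """Map each bucket index (s1, ..., sN), 1-based, to its set of records.

    subdomains[i] lists the subdomains D_{i1}, D_{i2}, ... of attribute A_{i+1}.
    """
    buckets = {}
    index_ranges = [range(1, len(parts) + 1) for parts in subdomains]
    for index in product(*index_ranges):
        chosen = [subdomains[i][s - 1] for i, s in enumerate(index)]
        buckets[index] = set(product(*chosen))
    return buckets

def buckets_examined(buckets, query):
    """Bucket indices holding at least one record that matches the query.

    A query is a tuple with one entry per attribute; None means don't care.
    """
    def matches(record):
        return all(q is None or q == r for q, r in zip(query, record))
    return sorted(idx for idx, recs in buckets.items()
                  if any(matches(r) for r in recs))

# Example 2.1: D11 = D21 = {a, b}, D12 = D22 = {c, d}.
subdomains = [[("a", "b"), ("c", "d")], [("a", "b"), ("c", "d")]]
example_2_1 = cartesian_product_file(subdomains)
print(buckets_examined(example_2_1, ("a", None)))  # [(1, 1), (1, 2)]
print(buckets_examined(example_2_1, (None, "b")))  # [(1, 1), (2, 1)]
```

Every single-attribute query touches exactly two of the four buckets, which agrees with the average of 2 quoted for Table II.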

The reader should note that the above file system is exactly the same as that shown in Table II. This is not accidental. As first pointed out by Lin et al. [19], many good file systems, such as those designed by Rivest [21], Rothnie and Lozano [23], as well as Liou and Yao [20], are all Cartesian product files. Aho and Ullman [1] explored the problem of designing optimal multiattribute file systems whose probabilities of an attribute being specified are not equal. This file system is also a Cartesian product file. In [6], more properties of Cartesian product files were discussed.

The physical meaning of the Cartesian product file discussed in Example 2.1 can be visualized by considering Fig. 1, where each dot represents a record. In Fig. 1, each bucket corresponds to a square. In this case, it is easy to see that this Cartesian product file divides records into natural clusters. If we want to retrieve all records with $A_1$ equal to $a$, only two buckets (Bucket [1, 1] and Bucket [1, 2]) have to be accessed. Similarly, if we want to retrieve all records with $A_2 = b$, again, only two buckets (Bucket [1, 1] and Bucket [2, 1]) have to be retrieved.

If we do not use the Cartesian product file concept and instead use the file system shown in Table I, the reader can verify that within each bucket, records are not similar to one another at all, as shown in Fig. 2. In [19], it was pointed out that good multiattribute file systems all exhibit some kind of clustering property. That is, within each bucket, records should be similar to one another. It just happens that Cartesian product files do cluster similar records together.

Example 2.2: Cartesian product files do not necessarily group records into squares, as shown in Fig. 1. For the case in Example 2.1, we may also have the following Cartesian product file:

$D_{11} = \{a\}$, $D_{12} = \{b\}$, $D_{13} = \{c\}$, $D_{14} = \{d\}$, $D_{21} = D_2 = \{a, b, c, d\}$.

Bucket [1, 1] = {(a, a), (a, b), (a, c), (a, d)}
Bucket [2, 1] = {(b, a), (b, b), (b, c), (b, d)}
Bucket [3, 1] = {(c, a), (c, b), (c, c), (c, d)}
Bucket [4, 1] = {(d, a), (d, b), (d, c), (d, d)}.

The above file system is shown in Fig. 3, where each long strip corresponds to a bucket. This file system performs very well if the user specifies $A_1$ and very poorly if the user specifies $A_2$.

Fig. 1. Simple geometry representation of Table II(A). [Axes: $A_1$ (horizontal) and $A_2$ (vertical), each with values a, b, c, d.]

Fig. 2. Geometry representation showing that records not similar to one another are within each bucket.

Fig. 3. Simple geometry representation which prefers some attribute.

Because of the clustering properties of Cartesian product files, they are also useful for nearest neighbor searching [2], [4], [9], [14], [17], [22], [24]. Du and Sobolewski [10] used Cartesian product files for parallel processing in multiple disk systems.

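Given two records $R_1 = (r_{11}, r_{12}, \ldots, r_{1N})$ and $R_2 = (r_{21}, r_{22}, \ldots, r_{2N})$, the Hamming distance between $R_1$ and $R_2$ is defined as follows:

$$d(R_1, R_2) = \sum_{i=1}^{N} \delta(r_{1i}, r_{2i})$$

where

$$\delta(r_{1i}, r_{2i}) = \begin{cases} 1 & \text{if } r_{1i} \neq r_{2i} \\ 0 & \text{if } r_{1i} = r_{2i}. \end{cases}$$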
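As a small illustration of this definition, the following Python sketch computes the Hamming distance between two records and checks the ordering of Example 2.1 listed in Table III. The function name `hamming_distance` and the list `table_iii` are names chosen here for illustration only.

```python
def hamming_distance(r1, r2):
    """Number of attribute positions in which two records differ."""
    if len(r1) != len(r2):
        raise ValueError("records must have the same number of attributes")
    return sum(1 for x, y in zip(r1, r2) if x != y)

# Records of Example 2.1 in the order listed in Table III.
table_iii = [
    ("a", "a"), ("a", "b"), ("b", "b"), ("b", "a"),
    ("b", "c"), ("b", "d"), ("a", "d"), ("a", "c"),
    ("c", "c"), ("c", "d"), ("d", "d"), ("d", "c"),
    ("d", "a"), ("d", "b"), ("c", "b"), ("c", "a"),
]

# Every pair of consecutive records differs in exactly one attribute.
consecutive = [hamming_distance(x, y) for x, y in zip(table_iii, table_iii[1:])]
print(all(d == 1 for d in consecutive))  # True
```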

One of the most important properties of Cartesian product files is that it is possible to arrange records in a Cartesian product file such that the Hamming distance [19] between two consecutive records in the file is equal to one. For example, consider the two files in Examples 2.1 and 2.2, respectively. They can be displayed as in Tables III and IV. The reader may verify that in both files, for any pair of consecutive records, the Hamming distance between them is equal to one. Thus Cartesian product files exhibit the consecutive retrieval property advocated by Ghosh [16].

TABLE III
USING INDEX PAIRS TO DENOTE EACH BUCKET NUMBER OF TABLE II(A)

Bucket [1,1]    (a,a)  (a,b)  (b,b)  (b,a)
Bucket [1,2]    (b,c)  (b,d)  (a,d)  (a,c)
Bucket [2,2]    (c,c)  (c,d)  (d,d)  (d,c)
Bucket [2,1]    (d,a)  (d,b)  (c,b)  (c,a)

TABLE IV
USING INDEX PAIRS TO DENOTE EACH BUCKET NUMBER

Bucket [1,1]    (a,a)  (a,b)  (a,c)  (a,d)
Bucket [2,1]    (b,d)  (b,c)  (b,b)  (b,a)
Bucket [3,1]    (c,a)  (c,b)  (c,c)  (c,d)
Bucket [4,1]    (d,d)  (d,c)  (d,b)  (d,a)

Note that the Cartesian product file concept is only a method to organize records physically. We still need an indexing scheme to locate the records. Since this indexing scheme occupies memory space, it will be desirable to eliminate it. This can only be achieved by using some kind of multiattribute hashing scheme [9], [23] which maps a record directly to its address space without the help of any indexing file. In this paper, we shall show that we have a multiattribute hashing method for Cartesian product files. That is, for every Cartesian product file, we can easily design a multiattribute hashing function which produces such a file. This hashing method has the property of being a minimum perfect hashing method [8], [11], [25] in the sense that no collision occurs and no memory space is wasted. Our hashing scheme is based upon symbolic Gray code [9], which will be discussed in the next section.

TABLE V
USING SYMBOLIC GRAY CODE TO HASH ALL OF THE POSSIBLE RECORDS OF EXAMPLE 3.1

Records    Location
(a,a)      1
(a,b)      2
(a,c)      3
(a,d)      4
(b,d)      5
(b,c)      6
(b,b)      7
(b,a)      8
(c,a)      9
(c,b)      10
(c,c)      11
(c,d)      12
(d,d)      13
(d,c)      14
(d,b)      15
(d,a)      16

III. SYMBOLIC GRAY CODE

The symbolic Gray code concept was proposed by Du and Lee [9]. While the exact algorithm of this hashing function is rather complicated, its meaning is easy to understand. Consider the following example.

Example 3.1: Let us assume that $D_1 = D_2 = \{a, b, c, d\}$. The symbolic Gray code will always hash all of the possible records as shown in Table V.

It was shown in [9] that the symbolic Gray code has the following interesting properties.

Property 1-Address to Key Transformation Property: All hashing functions provide a key to address transformation. But symbolic Gray code hashing also provides the address to key transformation. That is, given an address in the address space, we can calculate the record stored in that location. For instance, for location equal to 5, we can easily show that the record stored there is (b, d).

Property 2-The One-to-One Correspondence: Let us denote KAT and AKT as the key to address transformation and the address to key transformation, respectively. Property 2 means that if KAT(R) = i, then AKT(i) = R.

Property 3-No Collision in the Table: This is a consequence of Property 2.

Property 4-No Waste of Memory Space: This is a consequence of Properties 1 and 2.

Property 5-The Nearest Neighbor Property: If symbolic Gray code is used, the Hamming distance between every pair of consecutive records in the resulting file is always equal to one. This means that they are nearest neighbors to each other.

Property 6-The Multiattribute Tree Property: For a detailed discussion of this property, consult [9].

From the above properties, one can easily see that symbolic Gray code hashing is a minimal perfect hashing [25] (perfect means no collision and minimal means no waste of memory space).

In spite of the above desirable properties, the symbolic Gray code nevertheless suffers from one disadvantage: it is not good for partial match queries. Consider Table V. Let us assume that every four records are stored in one bucket. Then, if our query specifies the first attribute, only one bucket has to be examined. On the other hand, if our query specifies the second attribute, all buckets have to be examined. We may say that the consecutive retrieval properties among the attributes are not balanced.
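The imbalance can be checked with a short Python sketch. It assumes a simple recursive "reflected" ordering that reproduces the sequence of Table V; it is not the algorithm of [9] itself, and the names `symbolic_gray_sequence` and `buckets_touched` are hypothetical.

```python
def symbolic_gray_sequence(domains):
    """Reflected ordering of all records: each time an earlier attribute
    advances, the remaining attributes are traversed in reverse."""
    if not domains:
        return [()]
    tails = symbolic_gray_sequence(domains[1:])
    sequence = []
    for k, value in enumerate(domains[0]):
        block = tails if k % 2 == 0 else tails[::-1]
        sequence.extend((value,) + tail for tail in block)
    return sequence

def buckets_touched(sequence, query, bucket_size):
    """1-based numbers of the buckets (runs of bucket_size consecutive
    locations) holding a record that matches the query; None = don't care."""
    hit = set()
    for location, record in enumerate(sequence):
        if all(q is None or q == r for q, r in zip(query, record)):
            hit.add(location // bucket_size + 1)
    return sorted(hit)

table_v = symbolic_gray_sequence(["abcd", "abcd"])
print(buckets_touched(table_v, ("a", None), 4))  # [1]
print(buckets_touched(table_v, (None, "a"), 4))  # [1, 2, 3, 4]
```

Specifying the first attribute touches one bucket, while specifying the second touches all four, which is the imbalance described above.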

If there are three attributes, this imbalance is even more pronounced. A typical file produced by symbolic Gray code involving three attributes is shown in Table VI.

Note that the symbolic Gray code does produce Cartesian product files. The unfortunate thing is that it cannot be used to produce a Cartesian product file specified by us. For instance, it cannot produce the file shown in Table III.

In this paper, we shall propose a new symbolic Gray code, called bucket-oriented symbolic Gray code, as a multiattribute hashing scheme to produce any Cartesian product file that we want, in particular, a Cartesian product file suitable for partial match queries.

IV. BUCKET-ORIENTED SYMBOLIC GRAY CODE

We indicated before that symbolic Gray code always produces a special kind of Cartesian product file. This can be modified. Note that for a Cartesian product file, each bucket is associated with an index and the index itself can be considered as a record. For instance, in Example 2.1, the indexes associated with the four buckets are

(1, 1), (2, 2), (2, 1), (1, 2).

If we consider the above 2-tuples as multiattribute records, we can use symbolic Gray code to order them into the following sequence:

(1, 1), (1, 2), (2, 2), (2, 1).

For the first bucket, there are four records:

(a, a), (b, b), (a, b), (b, a).

We can again use symbolic Gray code to order them into the following sequence:

(a, a), (a, b), (b, b), (b, a).

TABLE VI
A THREE-ATTRIBUTE FILE PRODUCED BY SYMBOLIC GRAY CODE

Record $R_L$ $(A_1, A_2, A_3)$       Address L
($A_{11}$, $A_{21}$, $A_{31}$)       1
($A_{11}$, $A_{21}$, $A_{32}$)       2
($A_{11}$, $A_{21}$, $A_{33}$)       3
($A_{11}$, $A_{22}$, $A_{33}$)       4
($A_{11}$, $A_{22}$, $A_{32}$)       5
($A_{11}$, $A_{22}$, $A_{31}$)       6
($A_{12}$, $A_{22}$, $A_{31}$)       7
($A_{12}$, $A_{22}$, $A_{32}$)       8
($A_{12}$, $A_{22}$, $A_{33}$)       9
($A_{12}$, $A_{21}$, $A_{33}$)       10
($A_{12}$, $A_{21}$, $A_{32}$)       11
($A_{12}$, $A_{21}$, $A_{31}$)       12
($A_{13}$, $A_{21}$, $A_{31}$)       13
($A_{13}$, $A_{21}$, $A_{32}$)       14
($A_{13}$, $A_{21}$, $A_{33}$)       15
($A_{13}$, $A_{22}$, $A_{33}$)       16
($A_{13}$, $A_{22}$, $A_{32}$)       17
($A_{13}$, $A_{22}$, $A_{31}$)       18

Now consider the second bucket: Bucket [1, 2]. The four records are as follows:

(a, c), (a, d), (b, c), (b, d).

We can use symbolic Gray code to order them into the following sequence:

(a, c), (a, d), (b, d), (b, c).

For reasons that will become obvious later, we may reverse the above ordering without affecting the consecutive retrieval property. The reversed order will be

(b, c), (b, d), (a, d), (a, c).

The above process can be applied to each bucket and finally we shall obtain the file in Table VII.

The reader can now see that we can use the symbolic Gray code to obtain a Cartesian product file specified by us. Since it is not the symbolic Gray code as originally proposed by Du and Lee [9], we shall call the new code bucket-oriented symbolic Gray code.

TABLE VII
USING INDEX PAIRS TO DENOTE EACH BUCKET NUMBER OF TABLE II(A)

Bucket [1,1]    (a,a)  (a,b)  (b,b)  (b,a)
Bucket [1,2]    (b,c)  (b,d)  (a,d)  (a,c)
Bucket [2,2]    (c,c)  (c,d)  (d,d)  (d,c)
Bucket [2,1]    (d,a)  (d,b)  (c,b)  (c,a)

V. BUCKET-ORIENTED SYMBOLIC GRAY CODING AS A MULTIATTRIBUTE HASHING FUNCTION

In the previous section, we gave an outline of how symbolic Gray code can be used as a hashing function to create any arbitrary Cartesian product file. In this section, we show the exact algorithm.

We are given a set of records characterized by N attributes $A_1, A_2, \ldots, A_N$. The domain of $A_i$ is denoted as $D_i$. The domain size of $A_i$ is $q_i$. Each record is denoted as $R = (A_{1b_1}, A_{2b_2}, \ldots, A_{Nb_N})$ where $1 \le b_i \le q_i$. Each $D_i$ is divided into $t_i$ subdomains and the size of each subdomain is denoted as $z_i$. Each bucket is in the form of $D_{1s_1} \times D_{2s_2} \times \cdots \times D_{Ns_N}$, where $D_{is_i}$ is a subdomain of $D_i$. This bucket will be denoted as Bucket $[s_1, s_2, \ldots, s_N]$.

Our algorithm to hash a record $R = (A_{1b_1}, A_{2b_2}, \ldots, A_{Nb_N})$ to an address consists of two main steps. In the first step, we determine the order of the bucket which will contain this record. In the second step, the exact location of this record inside the bucket is determined.

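Given a record $R = (A_{1b_1}, A_{2b_2}, \ldots, A_{Nb_N})$, the bucket $[s_1, s_2, \ldots, s_N]$ which will contain this record is determined by the following formula:

$$s_i = \left\lceil \frac{b_i}{z_i} \right\rceil$$

for $1 \le i \le N$, where $z_i$ is the size of a subdomain of $D_i$ and $\lceil x \rceil$ is the smallest integer greater than or equal to $x$.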

The order of $[s_1, s_2, \ldots, s_N]$ is determined through Algorithm A, which is essentially the key to address transformation algorithm of the symbolic Gray code discussed in [9].

Algorithm A: The Algorithm which Determines the Order of a Bucket $[s_1, s_2, \ldots, s_N]$

Input: $[s_1, s_2, \ldots, s_N]$ and $[t_1, t_2, \ldots, t_N]$, where $t_i$ is the number of subdomains of the domain of attribute $A_i$.

Output: The order of the bucket $[s_1, s_2, \ldots, s_N]$.

Step 1: Determine an N-tuple $(a_1, a_2, \ldots, a_N)$ through the following rules.

a) For $i = 1$, let $a_i = s_i - 1$. That is, $a_1 = s_1 - 1$.

b) For $1 < i \le N$, let

$$L_2 = a_1 + 1$$
$$L_3 = a_1 t_2 + a_2 + 1$$
$$L_i = a_1 (t_2 t_3 \cdots t_{i-1}) + a_2 (t_3 t_4 \cdots t_{i-1}) + \cdots + a_{i-1} + 1$$
$$L_N = a_1 (t_2 t_3 \cdots t_{N-1}) + a_2 (t_3 t_4 \cdots t_{N-1}) + \cdots + a_{N-1} + 1;$$

if $L_i$ is odd, $a_i = s_i - 1$; if $L_i$ is even, $a_i = t_i - s_i$.

Step 2: The order of Bucket $[s_1, s_2, \ldots, s_N]$ is now calculated as follows:

$$P = a_1 (t_2 t_3 \cdots t_N) + a_2 (t_3 t_4 \cdots t_N) + \cdots + a_{N-1} t_N + a_N + 1.$$
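Algorithm A can be sketched in Python as follows. This is an illustrative reading of the steps above, not the authors' code; the function name `bucket_order` is hypothetical. The sketch keeps a running value `prefix` equal to $a_1(t_2 \cdots t_{i-1}) + \cdots + a_{i-1}$, so that $L_i$ is `prefix + 1`.

```python
def bucket_order(s, t):
    """Order P (1-based) of Bucket [s1, ..., sN], following Algorithm A.

    s: 1-based subdomain indices; t: number of subdomains per attribute.
    """
    prefix = 0  # a_1 (t_2 ... t_{i-1}) + ... + a_{i-1}, so L_i = prefix + 1
    for s_i, t_i in zip(s, t):
        if not 1 <= s_i <= t_i:
            raise ValueError("each s_i must satisfy 1 <= s_i <= t_i")
        l_i = prefix + 1  # for i = 1 this is 1 (odd), which gives a_1 = s_1 - 1
        a_i = s_i - 1 if l_i % 2 == 1 else t_i - s_i
        prefix = prefix * t_i + a_i
    return prefix + 1  # Step 2: P

# The four buckets of Example 2.1, with t1 = t2 = 2.
for index in [(1, 1), (1, 2), (2, 2), (2, 1)]:
    print(index, bucket_order(index, (2, 2)))
# (1, 1) 1
# (1, 2) 2
# (2, 2) 3
# (2, 1) 4
```

With $t = (3, 2, 3)$ the same function gives order 4 for the index (1, 2, 3), matching the fourth row of Table VI.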

Example 5.1: Consider Example 2.1 again. In this case we have four buckets: Bucket [1, 1], Bucket [1, 2], Bucket [2, 2], and Bucket [2, 1].

Let us now determine the order of Bucket [1, 1]. Since $s_1 = s_2 = 1$ and $t_1 = t_2 = 2$, we have

$a_1 = s_1 - 1 = 1 - 1 = 0$

$L_2 = a_1 + 1 = 0 + 1 = 1$ is odd. Therefore, $a_2 = s_2 - 1 = 1 - 1 = 0$

$(a_1, a_2) = (0, 0)$

$P = a_1 t_2 + a_2 + 1 = 0 \times 2 + 0 + 1 = 1$.

The order of Bucket [1, 1] is 1. For Bucket [2, 1], we have $s_1 = 2$ and $s_2 = 1$, so

$a_1 = s_1 - 1 = 2 - 1 = 1$

$L_2 = a_1 + 1 = 1 + 1 = 2$ is even. Therefore, $a_2 = t_2 - s_2 = 2 - 1 = 1$

$(a_1, a_2) = (1, 1)$

$P = a_1 t_2 + a_2 + 1 = 1 \times 2 + 1 + 1 = 4$.

The order of Bucket [2, 1] is 4.

Having determined the order of the bucket which will contain the record, we can determine the relative address of the record inside the bucket. Again, we apply the symbolic Gray code to all of the records inside the bucket. If the order of the bucket is even, the ordering is reversed. The following algorithm determines the relative address of a record inside the bucket containing it. After determining both the order of the bucket and the relative address, the absolute address of the record can be easily determined.

Algorithm B: The Algorithm which Calculates the Absolute Address of a Record

Input: Record $R = (A_{1b_1}, A_{2b_2}, \ldots, A_{Nb_N})$, $(z_1, z_2, \ldots, z_N)$, and $(t_1, t_2, \ldots, t_N)$, where $z_i$ is the size of each subdomain of $D_i$ and $t_i$ is the number of subdomains of $D_i$.

Output: The absolute address of R.

Step 1: For $i = 1$ to $N$,

$$s_i = \left\lceil \frac{b_i}{z_i} \right\rceil.$$

Step 2: For $i = 1$ to $N$,

$$b'_i = b_i - (s_i - 1) \times z_i.$$

Step 3:

a) For $i = 1$, let $a_i = b'_i - 1$. That is, $a_1 = b'_1 - 1$.

b) For $1 < i \le N$, let

$$L_2 = a_1 + 1$$
$$L_3 = a_1 z_2 + a_2 + 1$$
$$L_i = a_1 (z_2 z_3 \cdots z_{i-1}) + a_2 (z_3 z_4 \cdots z_{i-1}) + \cdots + a_{i-1} + 1$$
$$L_N = a_1 (z_2 z_3 \cdots z_{N-1}) + a_2 (z_3 z_4 \cdots z_{N-1}) + \cdots + a_{N-1} + 1;$$

if $L_i$ is odd, $a_i = b'_i - 1$; if $L_i$ is even, $a_i = z_i - b'_i$.

c) $m = a_1 (z_2 z_3 \cdots z_N) + a_2 (z_3 z_4 \cdots z_N) + \cdots + a_{N-1} z_N + a_N + 1$.

Step 4: Apply Algorithm A to $(s_1, s_2, \ldots, s_N)$ and $(t_1, t_2, \ldots, t_N)$ to determine the order P of Bucket $[s_1, s_2, \ldots, s_N]$.

Step 5: If P is even, $m' = z_1 z_2 \cdots z_N - m + 1$. If P is odd, $m' = m$. (Note that $m'$ is the relative address of R inside Bucket $[s_1, s_2, \ldots, s_N]$ which will contain R.)

Step 6: The absolute address L of R is determined by the following formula:

$$L = (z_1 z_2 \cdots z_N) \times (P - 1) + m'.$$

The above procedure is called the key to address transformation (KAT) of the bucket-oriented symbolic Gray code.
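The six steps of Algorithm B translate into a short Python sketch. The names `gray_rank` and `kat` are chosen here for illustration; `gray_rank` implements the rule shared by Algorithm A and Step 3 of Algorithm B, and records are given by their 1-based indices $(b_1, \ldots, b_N)$. This is a sketch under those assumptions, not the authors' implementation.

```python
from math import prod

def gray_rank(digits, radices):
    """1-based rank of 1-based digits under the reflected ordering
    (Algorithm A, also used for Step 3 of Algorithm B)."""
    prefix = 0
    for d, r in zip(digits, radices):
        a = d - 1 if (prefix + 1) % 2 == 1 else r - d  # L_i odd / even
        prefix = prefix * r + a
    return prefix + 1

def kat(b, z, t):
    """Key to address transformation: absolute address of record b."""
    s = [-(-b_i // z_i) for b_i, z_i in zip(b, z)]                  # Step 1: ceiling
    b_rel = [b_i - (s_i - 1) * z_i for b_i, s_i, z_i in zip(b, s, z)]  # Step 2
    m = gray_rank(b_rel, z)                                         # Step 3
    p = gray_rank(s, t)                                             # Step 4
    size = prod(z)
    m_rel = size - m + 1 if p % 2 == 0 else m                       # Step 5
    return size * (p - 1) + m_rel                                   # Step 6

# Example 2.1 with a, b, c, d numbered 1 to 4, z1 = z2 = 2 and t1 = t2 = 2.
print(kat((1, 3), (2, 2), (2, 2)))  # record (a, c) -> 8
print(kat((3, 4), (2, 2), (2, 2)))  # record (c, d) -> 10
```

Applying `kat` to all 16 records yields each address from 1 to 16 exactly once, in agreement with Table III.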

Example 5.2: Consider Example 5.1 again; in this case we have four buckets: Bucket [1, 1], Bucket [1, 2], Bucket [2, 2], and Bucket [2, 1]. Let us now determine the absolute address of some records in this file system.

Case 1: For the record $R = (a, c) = (A_{11}, A_{23})$, $(b_1, b_2) = (1, 3)$. After applying Algorithm B (or KAT), since $z_1 = z_2 = 2$ and $t_1 = t_2 = 2$, we have

$$s_1 = \left\lceil \frac{b_1}{z_1} \right\rceil = \left\lceil \frac{1}{2} \right\rceil = 1$$

and

$$s_2 = \left\lceil \frac{b_2}{z_2} \right\rceil = \left\lceil \frac{3}{2} \right\rceil = 2.$$

So we know that the record will be contained in Bucket [1, 2]. Next, we have

$b'_1 = b_1 - (s_1 - 1) \times z_1 = 1 - (1 - 1) \times 2 = 1$

and

$b'_2 = b_2 - (s_2 - 1) \times z_2 = 3 - (2 - 1) \times 2 = 3 - 2 = 1$.

From Step 3, we have

$a_1 = b'_1 - 1 = 1 - 1 = 0$

$L_2 = a_1 + 1 = 1$ is odd. Therefore, $a_2 = b'_2 - 1 = 1 - 1 = 0$

$m = a_1 z_2 + a_2 + 1 = 0 \times 2 + 0 + 1 = 1$.

By applying Step 4 (or Algorithm A) to $(s_1, s_2) = (1, 2)$ and $(t_1, t_2) = (2, 2)$, the order of Bucket [1, 2] is determined to be 2. That is, $P = 2$.

Since P is even, the relative address of R inside Bucket [1, 2] is

$m' = z_1 z_2 - m + 1 = 2 \times 2 - 1 + 1 = 4$.

Hence, the absolute address L of R is

$L = z_1 z_2 \times (P - 1) + m' = 2 \times 2 \times (2 - 1) + 4 = 4 + 4 = 8$.

That is, Record (a, c) is stored at the eighth address after applying this KAT technique.

Case 2: For record $R = (c, d) = (A_{13}, A_{24})$, $(b_1, b_2) = (3, 4)$. After applying Algorithm B (or KAT), since $z_1 = z_2 = 2$ and $t_1 = t_2 = 2$, we have

$$s_1 = \left\lceil \frac{b_1}{z_1} \right\rceil = \left\lceil \frac{3}{2} \right\rceil = 2$$

and

$$s_2 = \left\lceil \frac{b_2}{z_2} \right\rceil = \left\lceil \frac{4}{2} \right\rceil = 2.$$

So we know that the record will be contained in Bucket [2, 2]. Next we have

$b'_1 = b_1 - (s_1 - 1) \times z_1 = 3 - (2 - 1) \times 2 = 1$

and

$b'_2 = b_2 - (s_2 - 1) \times z_2 = 4 - (2 - 1) \times 2 = 2$.

For Step 3, we have

$a_1 = b'_1 - 1 = 1 - 1 = 0$

$L_2 = a_1 + 1 = 0 + 1 = 1$ is odd. Therefore, $a_2 = b'_2 - 1 = 2 - 1 = 1$

$m = a_1 z_2 + a_2 + 1 = 0 \times 2 + 1 + 1 = 2$.

For Bucket [2, 2], we have $(s_1, s_2) = (2, 2)$. By applying Step 4 (or Algorithm A) to $(s_1, s_2) = (2, 2)$ and $(t_1, t_2) = (2, 2)$, the order of Bucket [2, 2] is determined to be 3. That is, $P = 3$.

Since $P = 3$ is odd, the relative address of R inside Bucket [2, 2] is $m' = m = 2$.

Hence, the absolute address L of R is

$L = z_1 z_2 \times (P - 1) + m' = 2 \times 2 \times (3 - 1) + 2 = 10$.

That is, Record (c, d) is stored at the tenth address after applying this KAT technique.

If the reader consults Example 2.1 with Table III, he will discover that the addresses of (a, c) and (c, d) are just the same as those in Table III. In fact, if Algorithm B is used, all records in $D_1 \times D_2$ will be organized as shown in Table III.

Let us now conclude this section by stating the fact again that given an arbitrary Cartesian product file, we can apply the bucket-oriented symbolic Gray code to determine the address of every record in the file. In other words, the bucket-oriented symbolic Gray code can be used as a multiattribute hashing function to produce any arbitrary Cartesian product file.

VI. SOME PROPERTIES OF USING BUCKET-ORIENTED SYMBOLIC GRAY CODE TO ORGANIZE ANY CARTESIAN PRODUCT FILE

In this section, we shall present some interesting properties of using the bucket-oriented symbolic Gray code to produce Cartesian product file systems.

Property 1-The Address to Key Transformation: While most hashing functions provide "key to address transformation" only, our bucket-oriented symbolic Gray code also provides an "address to key transformation" (AKT) which maps an address to a unique record. Let us note that the total number of possible records of a file $D_1 \times D_2 \times \cdots \times D_N$ is $q_1 q_2 \cdots q_N$, where $q_i$ is the size of $D_i$. Suppose all of the records are stored in NB buckets. In the following, we shall show that we have an address to key transformation.

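Algorithm C: To Convert an Address to its Associated Record

Input: $(z_1, z_2, \ldots, z_N)$, $(q_1, q_2, \ldots, q_N)$, and $L$, $1 \le L \le q_1 q_2 \cdots q_N$, where $q_i$ and $z_i$ are the sizes of $D_i$ and each subdomain of $D_i$, respectively.

Output: $(A_{1b_1}, A_{2b_2}, \ldots, A_{Nb_N})$ where $R_L = (A_{1b_1}, A_{2b_2}, \ldots, A_{Nb_N})$ is the record which is associated with the address L.

Step 1: Calculate

$$m' = L - \left( \left\lceil \frac{L}{z_1 z_2 \cdots z_N} \right\rceil - 1 \right) \times z_1 z_2 \cdots z_N$$

where $m'$ is the relative address of the record $R_L$ inside the bucket containing it.

Step 2: Calculate

$$P = \left\lceil \frac{L}{z_1 z_2 \cdots z_N} \right\rceil;$$

if P is odd, $m = m'$; if P is even, $m = z_1 z_2 \cdots z_N - m' + 1$. P is the order of the bucket containing $R_L$.

Step 3:

a) Determine an N-tuple $(a_1, a_2, \ldots, a_N)$ through the following equation:

$$P = a_1 (t_2 t_3 \cdots t_N) + a_2 (t_3 t_4 \cdots t_N) + \cdots + a_{N-1} t_N + a_N + 1$$

where $t_i = q_i / z_i$ and P is obtained in Step 2.

b) For $i = 1$ to $N$, determine $s_i$ as follows:

$s_i = a_i + 1$, if $\left\lceil \dfrac{P}{t_i t_{i+1} \cdots t_N} \right\rceil$ is odd;

$s_i = t_i - a_i$, if $\left\lceil \dfrac{P}{t_i t_{i+1} \cdots t_N} \right\rceil$ is even.

$[s_1, s_2, \ldots, s_N]$ is the bucket containing $R_L$.

Step 4:

a) Determine an N-tuple $(a_1, a_2, \ldots, a_N)$ through the following equation:

$$m = a_1 (z_2 z_3 \cdots z_N) + a_2 (z_3 z_4 \cdots z_N) + \cdots + a_{N-1} z_N + a_N + 1$$

where the $a_i$'s are all integers and m is obtained in Step 2.

b) For $i = 1$ to $N$, determine $b'_i$ as follows:

$b'_i = a_i + 1$, if $\left\lceil \dfrac{m}{z_i z_{i+1} \cdots z_N} \right\rceil$ is odd;

$b'_i = z_i - a_i$, if $\left\lceil \dfrac{m}{z_i z_{i+1} \cdots z_N} \right\rceil$ is even.

Step 5: For $i = 1$ to $N$,

$$b_i = z_i \times (s_i - 1) + b'_i.$$

Step 6: $(A_{1b_1}, A_{2b_2}, \ldots, A_{Nb_N})$ is $R_L$.

The above procedure is called the address to key transformation (AKT) of the bucket-oriented symbolic Gray code.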
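A Python sketch of Algorithm C follows. The names `gray_unrank` and `akt` are hypothetical, and the record is returned as its 1-based indices $(b_1, \ldots, b_N)$. Instead of evaluating the ceilings in Steps 3b and 4b directly, the sketch tests the parity of `prefix + 1`, which equals the corresponding ceiling by Lemma 2 of Appendix A; this substitution is a choice made for the sketch, not part of the algorithm as stated.

```python
from math import prod

def gray_unrank(rank, radices):
    """Invert the reflected ordering: 1-based rank -> 1-based digits."""
    value, a_digits = rank - 1, []
    for r in reversed(radices):      # mixed-radix digits a_N, ..., a_1
        a_digits.append(value % r)
        value //= r
    a_digits.reverse()
    digits, prefix = [], 0
    for a, r in zip(a_digits, radices):
        # prefix + 1 equals ceil(rank / (r_i * ... * r_N)) by Lemma 2
        digits.append(a + 1 if (prefix + 1) % 2 == 1 else r - a)
        prefix = prefix * r + a
    return digits

def akt(address, z, q):
    """Address to key transformation: indices (b_1, ..., b_N) of R_L."""
    t = [q_i // z_i for q_i, z_i in zip(q, z)]
    size = prod(z)
    p = -(-address // size)                        # Step 2: bucket order P
    m_rel = address - (p - 1) * size               # Step 1: m'
    m = m_rel if p % 2 == 1 else size - m_rel + 1  # Step 2: m
    s = gray_unrank(p, t)                          # Step 3
    b_rel = gray_unrank(m, z)                      # Step 4
    return tuple(z_i * (s_i - 1) + b_i             # Step 5
                 for z_i, s_i, b_i in zip(z, s, b_rel))

print(akt(10, (2, 2), (4, 4)))  # (3, 4), i.e. record (c, d)
print(akt(8, (2, 2), (4, 4)))   # (1, 3), i.e. record (a, c)
```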

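Now, let us show how the AKT can be applied to the data in Table III.

Example 6.1: Consider the case where $L = 10$, $q_1 = q_2 = 4$ and $z_1 = z_2 = 2$.

Step 1:

$$m' = 10 - \left( \left\lceil \frac{10}{2 \times 2} \right\rceil - 1 \right) \times 2 \times 2 = 10 - 8 = 2.$$

Step 2:

$$P = \left\lceil \frac{10}{2 \times 2} \right\rceil = 3.$$

$m = m' = 2$ because P is odd.

Step 3:

a) $t_1 = q_1 / z_1 = 2$ and $t_2 = q_2 / z_2 = 2$; we have $3 = a_1 t_2 + a_2 + 1 = 2 a_1 + a_2 + 1$. This gives $(a_1, a_2) = (1, 0)$.

b) Because $\left\lceil \dfrac{P}{t_1 t_2} \right\rceil = \left\lceil \dfrac{3}{4} \right\rceil = 1$ is odd, $s_1 = a_1 + 1 = 1 + 1 = 2$. Because $\left\lceil \dfrac{P}{t_2} \right\rceil = \left\lceil \dfrac{3}{2} \right\rceil = 2$ is even, $s_2 = t_2 - a_2 = 2 - 0 = 2$. [2, 2] is the bucket in which $R_{10}$ is stored.

Step 4:

a) $m = a_1 z_2 + a_2 + 1$. We have $2 = a_1 \cdot 2 + a_2 + 1$. This gives $(a_1, a_2) = (0, 1)$.

b) Because $\left\lceil \dfrac{m}{z_1 z_2} \right\rceil = \left\lceil \dfrac{2}{4} \right\rceil = 1$ is odd, $b'_1 = a_1 + 1 = 0 + 1 = 1$. Because $\left\lceil \dfrac{m}{z_2} \right\rceil = \left\lceil \dfrac{2}{2} \right\rceil = 1$ is odd, $b'_2 = a_2 + 1 = 1 + 1 = 2$.

Step 5:

$b_1 = z_1 \cdot (s_1 - 1) + b'_1 = 2 \cdot (2 - 1) + 1 = 3$

$b_2 = z_2 \cdot (s_2 - 1) + b'_2 = 2 \cdot (2 - 1) + 2 = 4$.

Therefore, we have $R_{10} = (A_{13}, A_{24})$.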

Property 2-The One-to-One Correspondence Property: We have shown the key to address transformation (KAT) mechanism of the bucket-oriented symbolic Gray code hashing function. We have also shown the address to key transformation (AKT) in the hashing function. We now ask: What is the relationship between these two transformations?

The following theorem states that there is a one-to-one correspondence between the addresses and records.

Theorem 6.1: For every record $R \in D_1 \times D_2 \times \cdots \times D_N$, if KAT(R) = L, then AKT(L) = R. (The proof of this theorem can be found in Appendix A.)

The reader can verify this point by checking Table III. If he applies the AKT to any address $1 \le L \le 16$, he will obtain the record stored in that location, and if he applies the KAT to the record already stored there, he will obtain exactly the same address.

Property 3-No Collision in the Hash Table: For most hashing functions, there will be collisions in the hash table because two distinct records may be hashed into the same address. From Property 2, we know that if $R_1$ is different from $R_2$, the address of $R_1$ will be different from that of $R_2$. Thus there will be no collisions in the hash table. We may say that the bucket-oriented symbolic Gray code can be considered as a perfect hash function [8], [11], [25].

Property 4-No Waste of Memory Space: For most hashing functions, if we know that the total number of records to be stored is M and some kind of hashing function is used, we usually must reserve more than M locations. This is not the case when this hashing function is used. Because of Property 2 and Property 1, we only have to reserve exactly M locations. Thus, the bucket-oriented symbolic Gray code is a minimal perfect hashing function.

Property 5-The Nearest Neighbor Property: For $1 \le L < NR = q_1 q_2 \cdots q_N$, the Hamming distance between the record $R_L$ stored at location L and the record $R_{L+1}$ stored at location L + 1 is always 1. Because of this special property, every pair of consecutive records in the hash table are nearest neighbors to each other. This is a very desirable property for organizing records for a best match searching system [2], [4], [9], [12]-[14], [22].

Theorem 6.2: Let there be N sets: $D_1, D_2, \ldots, D_N$ where $D_i = \{A_{i1}, A_{i2}, \ldots, A_{iq_i}\}$ and $q_i > 1$. Let $R_L$ denote the record associated with L by applying Algorithm C (AKT) to L. Let NR denote the total number of records in $D_1 \times D_2 \times \cdots \times D_N$. Then the Hamming distance between $R_i$ and $R_{i+1}$ is 1, for $1 \le i < NR$. (The proof of this theorem can be found in Appendix B.)

Property 6-Appropriate for Partial Match Searching: It was shown in [6] that Cartesian product files are suitable for partial match searching. Since any Cartesian product file can be produced by using the bucket-oriented symbolic Gray code, we shall say that this hashing scheme is good for partial match searching.

Property 7-For any Partial Match Query, It is Easy to Determine All Buckets Necessary to Be Examined: Assume a partial match query is of the following form. Retrieve all records where $A_{i_1} = A_{i_1 b_{i_1}}, A_{i_2} = A_{i_2 b_{i_2}}, \ldots, A_{i_j} = A_{i_j b_{i_j}}$ and $i_1 \neq i_2 \neq \cdots \neq i_j$. Assume each $A_{i_k b_{i_k}}$ is in $D_{i_k s_{i_k}}$, $1 \le k \le j$. The buckets we have to examine are the $[s_1, s_2, \ldots, s_N]$'s, where $s_k$ is any value ranging from 1 to $t_k$ ($t_k$ is the number of subdomains of the domain of attribute $A_k$) if $k \neq i_p$, $1 \le p \le j$, and $s_k = s_{i_p}$ otherwise.

For instance, in Table III, consider the query $(A_1 = c, A_2 = *)$. That is, the query is $(A_{13}, *)$. Since $A_{13}$ is in $D_{12}$, $s_1 = 2$ and $s_2$ can be from 1 to 2. So there are two buckets, [2, 1] and [2, 2], to be examined. By applying Algorithm A to these two buckets, we have the orders of these buckets being 4 and 3, respectively. Hence we can conclude that bucket 3 and bucket 4 must be examined for the query $(A_1 = c, A_2 = *)$.
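Property 7 can be illustrated with the following Python sketch. The names `bucket_order` and `buckets_for_query` are hypothetical; a query is given as 1-based attribute indices $b_k$, with `None` standing for an unspecified attribute. `bucket_order` restates Algorithm A so that the block is self-contained.

```python
from itertools import product

def bucket_order(s, t):
    """Algorithm A: order P of Bucket [s1, ..., sN]."""
    prefix = 0
    for s_i, t_i in zip(s, t):
        a_i = s_i - 1 if (prefix + 1) % 2 == 1 else t_i - s_i
        prefix = prefix * t_i + a_i
    return prefix + 1

def buckets_for_query(query, z, t):
    """Orders of all buckets that a partial match query must examine."""
    choices = []
    for b_k, z_k, t_k in zip(query, z, t):
        if b_k is None:
            choices.append(range(1, t_k + 1))  # unspecified: every subdomain
        else:
            choices.append([-(-b_k // z_k)])   # specified: s_k = ceil(b_k / z_k)
    return sorted(bucket_order(s, t) for s in product(*choices))

# Query (A1 = c, A2 = *) on the file of Table III: c has index 3.
print(buckets_for_query((3, None), (2, 2), (2, 2)))  # [3, 4]
```

No record has to be read to find these bucket orders; only the subdomain sizes and counts are needed.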

Property 8-The Multiattribute Tree Property: Assume that we have a sequence of buckets, $BK_1, BK_2, \ldots$, and $BK_{NB}$ produced by the bucket-oriented symbolic Gray code. This sequence of buckets can be viewed as a tree whose structure is explained as follows.

1) The top node of the tree corresponds to all buckets.

2) For level 1 of the tree, there are $t_1$ nodes: $B_1, B_2, \ldots, B_{t_1}$. Each node corresponds to a set of $t_2 t_3 \cdots t_N$ buckets. Thus the first node on level 1 consists of buckets ordered from 1 to $t_2 t_3 \cdots t_N$. The second node consists of buckets ordered from $t_2 t_3 \cdots t_N + 1$ to $2(t_2 t_3 \cdots t_N)$, etc.

3) Each node on level 1 is split into $t_2$ nodes on level 2. Thus there are $t_1 t_2$ nodes on level 2. Each node corresponds to $t_3 t_4 \cdots t_N$ buckets. Thus the first node on level 2 consists of buckets ordered from 1 to $t_3 t_4 \cdots t_N$. The second node corresponds to buckets ordered from $t_3 t_4 \cdots t_N + 1$ to $2(t_3 t_4 \cdots t_N)$, etc.

4) In general, there are $t_1 t_2 \cdots t_i$ nodes on level i of the tree. Each node corresponds to $t_{i+1} t_{i+2} \cdots t_N$ buckets.

5) Within each node on level i, $s_1, s_2, \ldots, s_i$ assume the same value for all buckets in this node.

Fig. 4. A multiattribute tree.

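Example 6.2: Consider a three-key record set:

$D_1 = \{A_{11}, A_{12}, A_{13}, A_{14}, A_{15}, A_{16}, A_{17}, A_{18}\}$,

$D_2 = \{A_{21}, A_{22}, A_{23}, A_{24}\}$,

and

$D_3 = \{A_{31}, A_{32}, A_{33}, A_{34}, A_{35}, A_{36}\}$.

Let $D_{11} = \{A_{11}, A_{12}\}$, $D_{12} = \{A_{13}, A_{14}\}$, $D_{13} = \{A_{15}, A_{16}\}$, $D_{14} = \{A_{17}, A_{18}\}$, $D_{21} = \{A_{21}, A_{22}\}$, $D_{22} = \{A_{23}, A_{24}\}$, $D_{31} = \{A_{31}, A_{32}, A_{33}\}$, and $D_{32} = \{A_{34}, A_{35}, A_{36}\}$. Then

$NR = q_1 \cdot q_2 \cdot q_3 = 8 \times 4 \times 6 = 192$

$BZ = z_1 \cdot z_2 \cdot z_3 = 2 \times 2 \times 3 = 12$

$NB = t_1 \cdot t_2 \cdot t_3 = \dfrac{q_1}{z_1} \times \dfrac{q_2}{z_2} \times \dfrac{q_3}{z_3} = 4 \times 2 \times 2 = 16.$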

The tree corresponding to the buckets in which all the records in this case are stored is depicted in Fig. 4.

For level 1 of the tree, the first node $B_1$ corresponds to buckets ordered from 1 to 4. On the second level of the tree, the second node $B_{12}$ corresponds to buckets ordered 3 and 4. For records in $B_1$, the first key is in $D_{11} = \{A_{11}, A_{12}\}$ for all records. In $B_{12}$, the first key is in $D_{11}$ and the second key is in $D_{22} = \{A_{23}, A_{24}\}$ for all records.

This kind of structure is called a multiple-attribute tree [7], [15].

VII. CONCLUDING REMARKS

In this paper, we proposed the bucket-oriented symbolic Gray code as a multiattribute minimal perfect hash function. This hashing function can be used to produce any arbitrary Cartesian product file. Since Cartesian product file systems have been shown to be appropriate for partial match queries, our bucket-oriented symbolic Gray code is appropriate for partial match queries. We would like to emphasize here that hashing is good in this case because no index file is necessary.

Our next job is to investigate how our multiattribute hashing function can be used to organize files where some possible records are missing.

APPENDIX A

PROOF OF THEOREM 6.1

Definition: Let $1 \le L \le q_1 q_2 \cdots q_N$ and

$$L = a_1(q_2 q_3 \cdots q_N) + a_2(q_3 q_4 \cdots q_N) + \cdots + a_{N-2}(q_{N-1} q_N) + a_{N-1} q_N + a_N + 1$$

where $a_i$ is an integer, $0 \le a_i < q_i$. This $N$-tuple $(a_1, a_2, \ldots, a_N)$ is called the $q_i$ representation of $L$.

For example, consider the case where $q_1 = 3$, $q_2 = 2$, $q_3 = 3$, and $L = 6$. Then $q_2 q_3 = 2 \times 3$ and $q_3 = 3$. Therefore, $6 = 0(2 \times 3) + 1(3) + 2 + 1$. This means $(a_1, a_2, a_3) = (0, 1, 2)$. The $q_i$ representation of 6 is $(0, 1, 2)$.
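The $q_i$ representation is a mixed-radix expansion of $L - 1$. As a small sketch (the function name `qi_representation` is an assumption introduced here, not the paper's), it can be computed by repeated division, starting from the last radix:

```python
def qi_representation(L, q):
    """Return (a_1, ..., a_N) with
    L = a_1*(q_2...q_N) + ... + a_{N-1}*q_N + a_N + 1 and 0 <= a_i < q_i."""
    total = 1
    for qi in q:
        total *= qi
    if not 1 <= L <= total:
        raise ValueError("L must satisfy 1 <= L <= q_1 * ... * q_N")
    digits = []
    rest = L - 1  # the trailing "+ 1" makes addresses 1-based
    for qi in reversed(q):
        rest, a = divmod(rest, qi)  # peel off the least significant digit
        digits.append(a)
    return tuple(reversed(digits))

print(qi_representation(6, (3, 2, 3)))  # (0, 1, 2), matching the example above
```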

Lemma 1: Let $A = (a_1, a_2, \ldots, a_N)$ be the $q_i$ representation of $L$ and $B = (b_1, b_2, \ldots, b_N)$ be the $q_i$ representation of $L + 1$. Then the Hamming distance between $A$ and $B$ is $m \ge 1$ and

a) $a_i = b_i$, for $1 \le i \le N - k$, $k \ge m$

b) $a_{N-k+1} + 1 = b_{N-k+1}$

c) $a_i = q_i - 1$ and $b_i = 0$, for $N - k + 2 \le i \le N$.

Proof: For any integer $L$, $1 \le L \le q_1 q_2 \cdots q_N$, there is only one $N$-tuple $(a_1, a_2, \ldots, a_N)$ such that

$$L = a_1 q_2 q_3 \cdots q_N + a_2 q_3 \cdots q_N + \cdots + a_{N-1} q_N + a_N + 1, \quad \text{where } 0 \le a_i < q_i. \tag{1}$$

For integer $L + 1$, $1 \le L + 1 \le q_1 q_2 \cdots q_N$, there is only one $N$-tuple $(b_1, b_2, \ldots, b_N)$ such that

$$L + 1 = b_1 q_2 q_3 \cdots q_N + b_2 q_3 \cdots q_N + \cdots + b_{N-1} q_N + b_N + 1, \quad \text{where } 0 \le b_i < q_i.$$

So,

$$L = b_1 q_2 q_3 \cdots q_N + b_2 q_3 \cdots q_N + \cdots + b_{N-1} q_N + b_N. \tag{2}$$

Comparing (1) and (2), there are two possibilities.

Case 1: $0 \le a_N < q_N - 1$, so that $0 < a_N + 1 < q_N$. Then $b_N = a_N + 1$ and $b_i = a_i$, $i = 1, 2, \ldots, N - 1$. In this case, the Hamming distance between $(a_1, a_2, \ldots, a_N)$ and $(b_1, b_2, \ldots, b_N)$ is 1.

Case 2: $a_N = q_N - 1$, $a_{N-1} = q_{N-1} - 1$, $\ldots$, $a_{N-k+2} = q_{N-k+2} - 1$ and $a_{N-k+1} < q_{N-k+1} - 1$, for some $k$, $1 < k \le N$. In this case, $b_{N-k+1} = a_{N-k+1} + 1$, $b_{N-k+2} = b_{N-k+3} = \cdots = b_{N-1} = b_N = 0$ and $b_i = a_i$, $i = 1, 2, \ldots, N - k$. The Hamming distance between $(a_1, a_2, \ldots, a_N)$ and $(b_1, b_2, \ldots, b_N)$ is $m$, where $1 \le m \le k$.

In general, the Hamming distance between $(a_1, a_2, \ldots, a_N)$ and $(b_1, b_2, \ldots, b_N)$ is $m \ge 1$, $a_i = b_i$ for $1 \le i \le N - k$, and $a_{N-k+1} + 1 = b_{N-k+1}$. Q.E.D.
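Lemma 1 is the familiar carry rule of a mixed-radix counter. The sketch below checks conditions a) to c) exhaustively for one small choice of radices; the helper names are assumptions introduced for illustration, and the check is an experiment rather than a substitute for the proof.

```python
from math import prod

def qi_representation(L, q):
    """Mixed-radix digits of L - 1 with radices q_1, ..., q_N."""
    digits, rest = [], L - 1
    for qi in reversed(q):
        rest, a = divmod(rest, qi)
        digits.append(a)
    return tuple(reversed(digits))

def check_lemma1(q):
    """Verify a), b), c) of Lemma 1 for every L with 1 <= L < q_1...q_N."""
    N = len(q)
    for L in range(1, prod(q)):
        a = qi_representation(L, q)
        b = qi_representation(L + 1, q)
        k = 1  # k - 1 = number of trailing digits of a equal to q_i - 1
        while a[N - k] == q[N - k] - 1:
            k += 1
        pos = N - k  # 0-based index of digit N-k+1
        assert a[:pos] == b[:pos]                     # a)
        assert b[pos] == a[pos] + 1                   # b)
        assert all(a[i] == q[i] - 1 and b[i] == 0     # c)
                   for i in range(pos + 1, N))
        m = sum(x != y for x, y in zip(a, b))         # Hamming distance
        assert 1 <= m <= k
    return True

print(check_lemma1((3, 2, 3)))  # True
```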

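Lemma 2: Let $A = (a_1, a_2, \ldots, a_N)$ be the $q_i$ representation of $L$, where $1 \le L \le q_1 q_2 \cdots q_N$. Then

$$\left\lceil \frac{L}{q_i q_{i+1} \cdots q_N} \right\rceil$$

is equal to

$$a_1(q_2 q_3 \cdots q_{i-1}) + a_2(q_3 q_4 \cdots q_{i-1}) + \cdots + a_{i-1} + 1.$$

Proof: Since $A = (a_1, a_2, \ldots, a_N)$ is the $q_i$ representation of $L$, we have $L = a_1 q_2 q_3 \cdots q_N + a_2 q_3 \cdots q_N + \cdots + a_{N-1} q_N + a_N + 1$, and

$$\left\lceil \frac{L}{q_i q_{i+1} \cdots q_N} \right\rceil = \left\lceil \frac{a_1 q_2 q_3 \cdots q_N + a_2 q_3 \cdots q_N + \cdots + a_{N-1} q_N + a_N + 1}{q_i q_{i+1} \cdots q_N} \right\rceil$$

$$= a_1(q_2 q_3 \cdots q_{i-1}) + a_2(q_3 q_4 \cdots q_{i-1}) + \cdots + a_{i-1} + \left\lceil \frac{a_i q_{i+1} \cdots q_N + a_{i+1} q_{i+2} \cdots q_N + \cdots + a_{N-1} q_N + a_N + 1}{q_i q_{i+1} \cdots q_N} \right\rceil.$$

Since $a_i < q_i$, we have $a_i \le q_i - 1$ and

$$1 \le a_i q_{i+1} \cdots q_N + a_{i+1} q_{i+2} \cdots q_N + \cdots + a_{N-1} q_N + a_N + 1$$

$$\le (q_i - 1) q_{i+1} \cdots q_N + (q_{i+1} - 1) q_{i+2} \cdots q_N + \cdots + (q_{N-1} - 1) q_N + (q_N - 1) + 1 = q_i q_{i+1} \cdots q_N.$$

Hence

$$\left\lceil \frac{a_i q_{i+1} \cdots q_N + a_{i+1} q_{i+2} \cdots q_N + \cdots + a_{N-1} q_N + a_N + 1}{q_i q_{i+1} \cdots q_N} \right\rceil = 1.$$

That is, $\left\lceil L / (q_i q_{i+1} \cdots q_N) \right\rceil$ is $a_1(q_2 q_3 \cdots q_{i-1}) + a_2(q_3 q_4 \cdots q_{i-1}) + \cdots + a_{i-1} + 1$. We have the proof. Q.E.D.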
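Lemma 2 says that the ceiling $\lceil L / (q_i \cdots q_N) \rceil$ depends only on the leading digits $a_1, \ldots, a_{i-1}$. A brief numerical check, under the assumption that the hypothetical helper names below are acceptable stand-ins for the paper's notation:

```python
from math import prod

def qi_representation(L, q):
    """Mixed-radix digits of L - 1 with radices q_1, ..., q_N."""
    digits, rest = [], L - 1
    for qi in reversed(q):
        rest, a = divmod(rest, qi)
        digits.append(a)
    return tuple(reversed(digits))

def ceil_div(x, y):
    """Integer ceiling of x / y for positive y."""
    return -(-x // y)

def check_lemma2(q):
    """Compare both sides of Lemma 2 for every L and every i (1-based)."""
    N = len(q)
    for L in range(1, prod(q) + 1):
        a = qi_representation(L, q)
        for i in range(1, N + 1):
            lhs = ceil_div(L, prod(q[i - 1:]))
            # a_j multiplies q_{j+1} ... q_{i-1}; the product is 1 when empty
            rhs = 1 + sum(a[j] * prod(q[j + 1:i - 1]) for j in range(i - 1))
            assert lhs == rhs
    return True

print(check_lemma2((3, 2, 3)))  # True
```

For $L = 6$ and $i = 3$ in the earlier example, both sides equal 2.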

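Lemma 3: Let $A = (a_1, a_2, \ldots, a_N)$ be the $q_i$ representation of $u$ and $B = (b_1, b_2, \ldots, b_N)$ be the $q_i$ representation of $v$. Let $X = (x_1, x_2, \ldots, x_N)$ and $Y = (y_1, y_2, \ldots, y_N)$ be two $N$-tuples which are defined as follows:

$$x_i = \begin{cases} a_i + 1, & \text{if } \left\lceil \dfrac{u}{q_i q_{i+1} \cdots q_N} \right\rceil \text{ is odd,} \\[2mm] q_i - a_i, & \text{if } \left\lceil \dfrac{u}{q_i q_{i+1} \cdots q_N} \right\rceil \text{ is even,} \end{cases}$$

and

$$y_i = \begin{cases} b_i + 1, & \text{if } \left\lceil \dfrac{v}{q_i q_{i+1} \cdots q_N} \right\rceil \text{ is odd,} \\[2mm] q_i - b_i, & \text{if } \left\lceil \dfrac{v}{q_i q_{i+1} \cdots q_N} \right\rceil \text{ is even.} \end{cases}$$

If $|u - v| = 1$, the Hamming distance between $X$ and $Y$ is 1.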
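To see the reflection rule of Lemma 3 at work, the sketch below builds the tuple $X$ from $u$ and lists it for consecutive values of $u$. The function names are assumptions made for this illustration; the rule itself is the one stated in the lemma.

```python
from math import prod

def qi_representation(L, q):
    """Mixed-radix digits of L - 1 with radices q_1, ..., q_N."""
    digits, rest = [], L - 1
    for qi in reversed(q):
        rest, a = divmod(rest, qi)
        digits.append(a)
    return tuple(reversed(digits))

def reflected_tuple(u, q):
    """x_i = a_i + 1 if ceil(u / (q_i...q_N)) is odd, else q_i - a_i."""
    a = qi_representation(u, q)
    x = []
    for i in range(len(q)):
        c = -(-u // prod(q[i:]))  # ceiling division
        x.append(a[i] + 1 if c % 2 == 1 else q[i] - a[i])
    return tuple(x)

q = (2, 3)
print([reflected_tuple(u, q) for u in range(1, 7)])
# [(1, 1), (1, 2), (1, 3), (2, 3), (2, 2), (2, 1)]
```

Each tuple differs from its predecessor in exactly one position, which is the property the lemma asserts.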

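Proof: Because $|u - v| = 1$, we may suppose that $v = u + 1$. Let $A = (a_1, a_2, \ldots, a_N)$ be the $q_i$ representation of $u$ and $B = (b_1, b_2, \ldots, b_N)$ be the $q_i$ representation of $u + 1$. By Lemma 1, the Hamming distance between $A$ and $B$ is $m \ge 1$ and

$$a_i = b_i \quad \text{for } 1 \le i \le N - k,\ k \ge m, \tag{1}$$

$$a_{N-k+1} + 1 = b_{N-k+1}. \tag{2}$$

By Lemma 2, we have

$$\left\lceil \frac{u}{q_i q_{i+1} \cdots q_N} \right\rceil = a_1 q_2 \cdots q_{i-1} + a_2 q_3 \cdots q_{i-1} + \cdots + a_{i-1} + 1 = b_1 q_2 \cdots q_{i-1} + b_2 q_3 \cdots q_{i-1} + \cdots + b_{i-1} + 1 = \left\lceil \frac{v}{q_i q_{i+1} \cdots q_N} \right\rceil \quad \text{for } 1 \le i \le N - k + 1. \tag{3}$$

Equations (1)-(3) imply that

$$y_i = x_i \ \text{ for } 1 \le i \le N - k \quad \text{and} \quad y_{N-k+1} \ne x_{N-k+1}. \tag{4}$$

Consider $\left\lceil u / (q_i q_{i+1} \cdots q_N) \right\rceil$ and $\left\lceil v / (q_i q_{i+1} \cdots q_N) \right\rceil$ for $N - k + 2 \le i \le N$. In this case,

$$\left\lceil \frac{u}{q_i q_{i+1} \cdots q_N} \right\rceil = a_1 q_2 \cdots q_{i-1} + a_2 q_3 \cdots q_{i-1} + \cdots + a_{i-1} + 1$$

$$= (a_1 q_2 \cdots q_{i-1} + a_2 q_3 \cdots q_{i-1} + \cdots + a_{N-k} q_{N-k+1} \cdots q_{i-1} + a_{N-k+1} q_{N-k+2} \cdots q_{i-1}) + [(q_{N-k+2} - 1) q_{N-k+3} \cdots q_{i-1} + (q_{N-k+3} - 1) q_{N-k+4} \cdots q_{i-1} + \cdots + (q_{i-1} - 1)] + 1$$

$$= a_1 q_2 \cdots q_{i-1} + a_2 q_3 \cdots q_{i-1} + \cdots + a_{N-k+1} q_{N-k+2} \cdots q_{i-1} + q_{N-k+2} q_{N-k+3} \cdots q_{i-1}$$

and

$$\left\lceil \frac{v}{q_i q_{i+1} \cdots q_N} \right\rceil = b_1 q_2 \cdots q_{i-1} + b_2 q_3 \cdots q_{i-1} + \cdots + b_{i-1} + 1$$

$$= [a_1 q_2 \cdots q_{i-1} + a_2 q_3 \cdots q_{i-1} + \cdots + a_{N-k} q_{N-k+1} \cdots q_{i-1}] + (a_{N-k+1} + 1) q_{N-k+2} \cdots q_{i-1} + 1$$

$$= a_1 q_2 \cdots q_{i-1} + a_2 q_3 \cdots q_{i-1} + \cdots + a_{N-k} q_{N-k+1} \cdots q_{i-1} + a_{N-k+1} q_{N-k+2} \cdots q_{i-1} + q_{N-k+2} \cdots q_{i-1} + 1 = \left\lceil \frac{u}{q_i q_{i+1} \cdots q_N} \right\rceil + 1.$$

Case 1: $\left\lceil u / (q_i q_{i+1} \cdots q_N) \right\rceil$ is even. In this case, $x_i = q_i - a_i = q_i - (q_i - 1) = 1$, since $a_i = q_i - 1$. Then $\left\lceil v / (q_i q_{i+1} \cdots q_N) \right\rceil$ is odd. Therefore, $y_i = b_i + 1 = 0 + 1 = 1$, since $b_i = 0$. We have $x_i = y_i = 1$.

Case 2: $\left\lceil u / (q_i q_{i+1} \cdots q_N) \right\rceil$ is odd. In this case, $x_i = a_i + 1 = (q_i - 1) + 1 = q_i$, since $a_i = q_i - 1$. Then $\left\lceil v / (q_i q_{i+1} \cdots q_N) \right\rceil$ is even. Therefore, $y_i = q_i - b_i = q_i - 0 = q_i$. We have $x_i = y_i = q_i$ in this case.

Hence,

$$x_i = y_i \quad \text{for } N - k + 2 \le i \le N. \tag{5}$$

Combining (4) and (5), we conclude that the Hamming distance between $X$ and $Y$ is 1. Q.E.D.

Lemma 4: Let F be a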

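function, $F: (f_1, f_2, \ldots, f_N) \rightarrow K$, $1 \le f_i \le q_i$, defined by the following equation:

$$K = a_1(q_2 q_3 \cdots q_N) + a_2(q_3 q_4 \cdots q_N) + \cdots + a_{N-1} q_N + a_N + 1$$

where the $N$-tuple $(a_1, a_2, \ldots, a_N)$ is determined through the following rules:

a) $a_1 = f_1 - 1$.

b) For $1 < i \le N$, let

$$L_2 = a_1 + 1$$

$$L_3 = a_1(q_2) + a_2 + 1$$

$$\vdots$$

$$L_i = a_1(q_2 q_3 \cdots q_{i-1}) + a_2(q_3 q_4 \cdots q_{i-1}) + \cdots + a_{i-1} + 1$$

$$\vdots$$

$$L_N = a_1(q_2 q_3 \cdots q_{N-1}) + a_2(q_3 q_4 \cdots q_{N-1}) + \cdots + a_{N-1} + 1;$$

if $L_i$ is odd, $a_i = f_i - 1$; if $L_i$ is even, $a_i = q_i - f_i$.

Let $G$ be a function, $G: m \rightarrow (g_1, g_2, \ldots, g_N)$, $1 \le m \le q_1 q_2 \cdots q_N$, with $g_i = q_i - b_i$ if $\left\lceil m / (q_i q_{i+1} \cdots q_N) \right\rceil$ is even and $g_i = b_i + 1$ if $\left\lceil m / (q_i q_{i+1} \cdots q_N) \right\rceil$ is odd, where the $N$-tuple $(b_1, b_2, \ldots, b_N)$ is determined through the following equation:

$$m = b_1(q_2 q_3 \cdots q_N) + b_2(q_3 q_4 \cdots q_N) + \cdots + b_{N-1} q_N + b_N + 1.$$

Then $G = F^{-1}$.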
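The two maps of Lemma 4 can be written down directly from rules a) and b) and from the definition of $G$. The following is a sketch under the assumption that the names `gray_rank` (for $F$) and `gray_unrank` (for $G$) are acceptable labels; they do not appear in the paper. The recurrence $L_{i+1} = (L_i - 1) q_i + a_i + 1$ with $L_1 = 1$ is used to avoid recomputing each $L_i$ from scratch.

```python
from math import prod

def gray_rank(f, q):
    """F of Lemma 4: map (f_1..f_N), 1 <= f_i <= q_i, to K in 1..q_1...q_N."""
    L_i = 1  # L_1 is an empty sum plus 1, hence odd, so a_1 = f_1 - 1
    for fi, qi in zip(f, q):
        a_i = fi - 1 if L_i % 2 == 1 else qi - fi
        L_i = (L_i - 1) * qi + a_i + 1  # becomes L_{i+1}; the last value is K
    return L_i

def gray_unrank(m, q):
    """G of Lemma 4: map m back to (g_1..g_N) using the ceiling parities."""
    b, rest = [], m - 1
    for qi in reversed(q):  # q_i representation of m
        rest, digit = divmod(rest, qi)
        b.append(digit)
    b.reverse()
    g = []
    for i in range(len(q)):
        c = -(-m // prod(q[i:]))  # ceil(m / (q_i ... q_N))
        g.append(b[i] + 1 if c % 2 == 1 else q[i] - b[i])
    return tuple(g)

q = (2, 3)
print(gray_rank((2, 3), q))  # 4
print(gray_unrank(4, q))     # (2, 3)
```

Enumerating `gray_unrank(m, q)` for $m = 1, 2, \ldots$ yields the symbolic Gray code order over the index tuples.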

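Proof: For any $N$-tuple $(f_1, f_2, \ldots, f_N)$, $F: (f_1, f_2, \ldots, f_N) \rightarrow K$,

$$K = a_1(q_2 q_3 \cdots q_N) + a_2(q_3 q_4 \cdots q_N) + \cdots + a_{N-1} q_N + a_N + 1$$

where $a_1, a_2, \ldots, a_N$ are determined by

a) $a_1 = f_1 - 1$

b) for $1 < i \le N$,

if $a_1(q_2 q_3 \cdots q_{i-1}) + a_2(q_3 q_4 \cdots q_{i-1}) + \cdots + a_{i-1} + 1$ is odd,

$$a_i = f_i - 1; \tag{1}$$

if $a_1(q_2 q_3 \cdots q_{i-1}) + a_2(q_3 q_4 \cdots q_{i-1}) + \cdots + a_{i-1} + 1$ is even,

$$a_i = q_i - f_i. \tag{2}$$

We shall show that $G(F(f_1, f_2, \ldots, f_N)) = (f_1, f_2, \ldots, f_N)$, or $G(K) = (f_1, f_2, \ldots, f_N)$. Suppose $G(K) = (g_1, g_2, \ldots, g_N)$.

For $i = 1$, since $K \le q_1 q_2 \cdots q_N = q_i q_{i+1} \cdots q_N$, $\left\lceil K / (q_i q_{i+1} \cdots q_N) \right\rceil = 1$ is odd, and we have $g_1 = a_1 + 1$ or $a_1 = g_1 - 1$. By Lemma 2, since $(a_1, a_2, \ldots, a_N)$ is the $q_i$ representation of $K$, we have

$$\left\lceil \frac{K}{q_i q_{i+1} \cdots q_N} \right\rceil = a_1(q_2 q_3 \cdots q_{i-1}) + a_2(q_3 q_4 \cdots q_{i-1}) + \cdots + a_{i-1} + 1.$$

Because of the conditions of the $G$ function, if $\left\lceil K / (q_i q_{i+1} \cdots q_N) \right\rceil$ is odd, $g_i = a_i + 1$. If $\left\lceil K / (q_i q_{i+1} \cdots q_N) \right\rceil$ is even, $g_i = q_i - a_i$. That is,

if $a_1(q_2 q_3 \cdots q_{i-1}) + a_2(q_3 q_4 \cdots q_{i-1}) + \cdots + a_{i-1} + 1$ is odd,

$$g_i = a_i + 1 \ \text{ or } \ a_i = g_i - 1; \tag{3}$$

if $a_1(q_2 q_3 \cdots q_{i-1}) + a_2(q_3 q_4 \cdots q_{i-1}) + \cdots + a_{i-1} + 1$ is even,

$$g_i = q_i - a_i \ \text{ or } \ a_i = q_i - g_i. \tag{4}$$

Comparing (1) and (3), (2) and (4), we have $g_i = f_i$. So, $G(F(f_1, f_2, \ldots, f_N)) = G(K) = (f_1, f_2, \ldots, f_N)$. That is, $G = F^{-1}$.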

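Q.E.D.

Theorem 6.1: For every record $R \in D_1 \times D_2 \times \cdots \times D_N$, if KAT$(R) = L$, then AKT$(L) = R$.

Proof: Let $R = (A_{1b_1}, A_{2b_2}, \ldots, A_{Nb_N})$, where the address associated with $R$ is determined by the KAT (Algorithm B). Let KAT$(R) = L$. We shall now show that the record $(r_{L1}, r_{L2}, \ldots, r_{LN})$ associated with $L$, determined through the use of AKT (Algorithm C), is exactly $(A_{1b_1}, A_{2b_2}, \ldots, A_{Nb_N})$.

From KAT, we finally have $L = z_1 z_2 \cdots z_N (P - 1) + m'$. Let $z_1 z_2 \cdots z_N = C$. Then $L = C(P - 1) + m'$, where $m'$ is either $C - m + 1$ or $m$. By Step 3,

$$m = a_1(z_2 z_3 \cdots z_N) + a_2(z_3 z_4 \cdots z_N) + \cdots + a_{N-1} z_N + a_N + 1,$$

where $0 \le a_i < z_i$ and $a_i$, $z_i$ are integers. We have

$$0(z_2 z_3 \cdots z_N) + 0(z_3 z_4 \cdots z_N) + \cdots + 0 \cdot z_N + 0 + 1 \le m \le (z_1 - 1)(z_2 z_3 \cdots z_N) + (z_2 - 1)(z_3 z_4 \cdots z_N) + \cdots + (z_{N-1} - 1) z_N + (z_N - 1) + 1.$$

So $1 \le m \le z_1 z_2 \cdots z_N$, i.e., $1 \le m \le C$. Hence, $1 \le m' \le C$.

Consider $L = C(P - 1) + m'$ and $1 \le m' \le C$. We have $L - C(P - 1) = m'$, so $1 \le L - C(P - 1) \le C$. Therefore,

$$\frac{L}{C} + 1 > P \ge \frac{L}{C}.$$

Since $P$ is an integer, $P \ge \left\lceil L/C \right\rceil$. Because $\left\lceil L/C \right\rceil \ge L/C$, we have

$$\left\lceil \frac{L}{C} \right\rceil + 1 > P \ge \left\lceil \frac{L}{C} \right\rceil.$$

So $P = \left\lceil L/C \right\rceil$ if $L = C(P - 1) + m'$, and $m' = L - (\left\lceil L/C \right\rceil - 1) \cdot C$. In this case,

$$m = z_1 z_2 \cdots z_N - m' + 1 = C - L + \left(\left\lceil \frac{L}{C} \right\rceil - 1\right) \cdot C + 1 \quad \text{if } \left\lceil \frac{L}{C} \right\rceil \text{ is even,}$$

$$m = m' = L - \left(\left\lceil \frac{L}{C} \right\rceil - 1\right) \cdot C \quad \text{if } \left\lceil \frac{L}{C} \right\rceil \text{ is odd.}$$

Consider the AKT procedures. For an address $L$, we have

$$P_L = \left\lceil \frac{L}{z_1 z_2 \cdots z_N} \right\rceil = \left\lceil \frac{L}{C} \right\rceil, \qquad m'_L = L - \left(\left\lceil \frac{L}{z_1 z_2 \cdots z_N} \right\rceil - 1\right) z_1 z_2 \cdots z_N = L - \left(\left\lceil \frac{L}{C} \right\rceil - 1\right) \cdot C.$$

If $\left\lceil L/C \right\rceil$ is odd, $m_L = m'_L = L - (\left\lceil L/C \right\rceil - 1) \cdot C$. If $\left\lceil L/C \right\rceil$ is even, $m_L = z_1 z_2 \cdots z_N - m'_L + 1 = \left\lceil L/C \right\rceil \cdot C - L + 1$. So, $m_L = m$ and $P_L = P$.

Let $F: (b'_1, b'_2, \ldots, b'_N) \rightarrow m$ by using the procedures of Step 3 in Algorithm B and $G: m_L \rightarrow (b'_{L1}, b'_{L2}, \ldots, b'_{LN})$ by using the procedures of Step 4 in Algorithm C. By Lemma 4, we have $G = F^{-1}$. Since $m_L = m$, we have $(b'_{L1}, b'_{L2}, \ldots, b'_{LN}) = (b'_1, b'_2, \ldots, b'_N)$.

In a similar manner, let $H: (s_1, s_2, \ldots, s_N) \rightarrow P$ by using the procedures of Step 4 in Algorithm B and $T: P_L \rightarrow (s_{L1}, s_{L2}, \ldots, s_{LN})$ by using the procedures of Step 3 in Algorithm C. By Lemma 4, we have $T = H^{-1}$. Since $P_L = P$, we have $(s_{L1}, s_{L2}, \ldots, s_{LN}) = (s_1, s_2, \ldots, s_N)$.

For Step 1 and Step 2 in Algorithm B, we have $s_i = \left\lceil b_i / z_i \right\rceil$ and $b'_i = b_i - (s_i - 1) z_i$, respectively.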

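That is,

$$b_i = b'_i + (s_i - 1) \cdot z_i. \tag{1}$$

For Step 5 in Algorithm C, we have

$$b_{Li} = z_i(s_{Li} - 1) + b'_{Li}. \tag{2}$$

Since $b'_i = b'_{Li}$ and $s_i = s_{Li}$, comparing (1) with (2), we have $b_{Li} = b_i$. In other words,

$$(r_{L1}, r_{L2}, \ldots, r_{LN}) = (A_{1b_{L1}}, A_{2b_{L2}}, \ldots, A_{Nb_{LN}}) = (A_{1b_1}, A_{2b_2}, \ldots, A_{Nb_N}).$$

Q.E.D.

APPENDIX B

PROOF OF THEOREM 6.2

Theorem 6.2: Let there be $N$ sets $D_1, D_2, \ldots, D_N$, where $D_i = \{A_{i1}, A_{i2}, \ldots, A_{iq_i}\}$ and $q_i \ge 1$. Let $R_L$ denote the record associated with $L$. Let $NR$ denote the total number of records in $D_1 \times D_2 \times \cdots \times D_N$. Then the Hamming distance between $R_i$ and $R_{i+1}$ is 1, for $1 \le i < NR$.

Proof: $(z_1, z_2, \ldots, z_N)$ and $(q_1, q_2, \ldots, q_N)$ are given. Let $C = z_1 z_2 \cdots z_N$. Let $R_i = (r_{i1}, r_{i2}, \ldots, r_{iN})$ and $R_{i+1} = (r_{(i+1)1}, r_{(i+1)2}, \ldots, r_{(i+1)N})$. We want to show that

$$d(R_i, R_{i+1}) = \sum_{j=1}^{N} \delta(r_{ij}, r_{(i+1)j}) = 1.$$

By applying AKT (or Algorithm C), we have

$$m'_i = i - \left(\left\lceil \frac{i}{C} \right\rceil - 1\right) \cdot C, \quad m'_{i+1} = (i + 1) - \left(\left\lceil \frac{i+1}{C} \right\rceil - 1\right) \cdot C, \quad P_i = \left\lceil \frac{i}{C} \right\rceil \quad \text{and} \quad P_{i+1} = \left\lceil \frac{i+1}{C} \right\rceil.$$

For $\left\lceil i/C \right\rceil$ and $\left\lceil (i+1)/C \right\rceil$, we have two cases to consider.

Case 1: $\left\lceil i/C \right\rceil = \left\lceil (i+1)/C \right\rceil = e$, an integer. In this case $m'_{i+1} = m'_i + 1$ and $P_{i+1} = P_i$. If $P_i = P_{i+1}$ is odd, $m_i = m'_i$ and $m_{i+1} = m'_{i+1}$, and we have $m_{i+1} = m_i + 1$. If $P_i = P_{i+1}$ is even, $m_i = C - m'_i + 1$ and $m_{i+1} = C - m'_{i+1} + 1$, and we have $m_i = m_{i+1} + 1$. So we can conclude that $|m_{i+1} - m_i| = 1$.

If $P_{i+1} = P_i$, we have $(s_{i1}, s_{i2}, \ldots, s_{iN}) = (s_{(i+1)1}, s_{(i+1)2}, \ldots, s_{(i+1)N})$, where the $s_{ij}$'s and $s_{(i+1)j}$'s are defined as follows:

$$s_{ij} = \begin{cases} a_{ij} + 1, & \text{if } \left\lceil \dfrac{P_i}{t_j t_{j+1} \cdots t_N} \right\rceil \text{ is odd} \\[2mm] t_j - a_{ij}, & \text{if } \left\lceil \dfrac{P_i}{t_j t_{j+1} \cdots t_N} \right\rceil \text{ is even} \end{cases}$$

and

$$s_{(i+1)j} = \begin{cases} a_{(i+1)j} + 1, & \text{if } \left\lceil \dfrac{P_{i+1}}{t_j t_{j+1} \cdots t_N} \right\rceil \text{ is odd} \\[2mm] t_j - a_{(i+1)j}, & \text{if } \left\lceil \dfrac{P_{i+1}}{t_j t_{j+1} \cdots t_N} \right\rceil \text{ is even} \end{cases}$$

where $(a_{i1}, a_{i2}, \ldots, a_{iN})$ is the $t_j$ representation of $P_i$ and $(a_{(i+1)1}, a_{(i+1)2}, \ldots, a_{(i+1)N})$ is the $t_j$ representation of $P_{i+1}$.

Since $|m_{i+1} - m_i| = 1$, by Lemma 3, we have $(b'_{i1}, b'_{i2}, \ldots, b'_{iN})$ and $(b'_{(i+1)1}, b'_{(i+1)2}, \ldots, b'_{(i+1)N})$ as two $N$-tuples which are defined as below, and the Hamming distance between them is 1:

$$b'_{ij} = \begin{cases} a_{ij} + 1, & \text{if } \left\lceil \dfrac{m_i}{z_j z_{j+1} \cdots z_N} \right\rceil \text{ is odd} \\[2mm] z_j - a_{ij}, & \text{if } \left\lceil \dfrac{m_i}{z_j z_{j+1} \cdots z_N} \right\rceil \text{ is even} \end{cases}$$

and

$$b'_{(i+1)j} = \begin{cases} a_{(i+1)j} + 1, & \text{if } \left\lceil \dfrac{m_{i+1}}{z_j z_{j+1} \cdots z_N} \right\rceil \text{ is odd} \\[2mm] z_j - a_{(i+1)j}, & \text{if } \left\lceil \dfrac{m_{i+1}}{z_j z_{j+1} \cdots z_N} \right\rceil \text{ is even} \end{cases}$$

where $(a_{i1}, a_{i2}, \ldots, a_{iN})$ is the $z_j$ representation of $m_i$ and $(a_{(i+1)1}, a_{(i+1)2}, \ldots, a_{(i+1)N})$ is the $z_j$ representation of $m_{i+1}$.

For $b_{ij} = z_j(s_{ij} - 1) + b'_{ij}$ and $b_{(i+1)j} = z_j(s_{(i+1)j} - 1) + b'_{(i+1)j}$, $j = 1, 2, \ldots, N$, since the Hamming distance between $(b'_{i1}, b'_{i2}, \ldots, b'_{iN})$ and $(b'_{(i+1)1}, b'_{(i+1)2}, \ldots, b'_{(i+1)N})$ is 1, there exists a $k$ such that $b'_{ik} \ne b'_{(i+1)k}$ and $b'_{ij} = b'_{(i+1)j}$ for all $j = 1, 2, \ldots, N$, $j \ne k$. Since $s_{ij} = s_{(i+1)j}$ for all $j = 1, 2, \ldots, N$, we have only one $b_{ik} \ne b_{(i+1)k}$, and $b_{ij} = b_{(i+1)j}$ for all $j = 1, 2, \ldots, N$, $j \ne k$. Because $r_{ij} = A_{jb_{ij}}$ and $r_{(i+1)j} = A_{jb_{(i+1)j}}$, we have $r_{ij} = r_{(i+1)j}$ for all $j = 1, 2, \ldots, N$, $j \ne k$, and $r_{ik} \ne r_{(i+1)k}$. Therefore,

$$d(R_i, R_{i+1}) = \sum_{j=1}^{N} \delta(r_{ij}, r_{(i+1)j}) = 1.$$

Case 2: $\left\lceil i/C \right\rceil = e$ and $\left\lceil (i+1)/C \right\rceil = e + 1$. In this situation, for $\left\lceil i/C \right\rceil = e$, we have

$$e - 1 < \frac{i}{C} \le e.$$

If $i/C < e$, or $i < Ce$, we have $i \le Ce - 1$, $i + 1 \le Ce$, and $(i + 1)/C \le e$. So $\left\lceil (i+1)/C \right\rceil \le e$, which contradicts $\left\lceil (i+1)/C \right\rceil = e + 1$. Hence we have $i/C = e$, that is, $i = Ce$.

In this case, we have two possibilities.

1) If $P_i = e$ is odd, then $P_{i+1} = e + 1$ is even, and we have

$$m_i = m'_i = i - (e - 1) \cdot C = i - eC + C = C,$$

$$m_{i+1} = C - m'_{i+1} + 1 = C - [(i + 1) - eC] + 1 = C - i + eC = C.$$

That is, $m_i = m_{i+1} = C$.

2) If $P_i = e$ is even, then $P_{i+1} = e + 1$ is odd, and we have

$$m_i = C - [i - (e - 1) \cdot C] + 1 = C - i + (e - 1) \cdot C + 1 = 1$$

and

$$m_{i+1} = m'_{i+1} = i + 1 - eC = 1.$$

That is, $m_i = m_{i+1} = 1$.

So we have concluded that if $\left\lceil i/C \right\rceil = e$ and $\left\lceil (i+1)/C \right\rceil = e + 1$, then $m_i = m_{i+1}$ and $P_{i+1} = P_i + 1$.

For $P_{i+1} = P_i + 1$, by Lemma 3, we have $(s_{i1}, s_{i2}, \ldots, s_{iN})$ and $(s_{(i+1)1}, s_{(i+1)2}, \ldots, s_{(i+1)N})$ as two $N$-tuples which are defined as below, and the Hamming distance between them is 1:

$$s_{ij} = \begin{cases} a_{ij} + 1, & \text{if } \left\lceil \dfrac{P_i}{t_j t_{j+1} \cdots t_N} \right\rceil \text{ is odd} \\[2mm] t_j - a_{ij}, & \text{if } \left\lceil \dfrac{P_i}{t_j t_{j+1} \cdots t_N} \right\rceil \text{ is even} \end{cases}$$

and

$$s_{(i+1)j} = \begin{cases} a_{(i+1)j} + 1, & \text{if } \left\lceil \dfrac{P_{i+1}}{t_j t_{j+1} \cdots t_N} \right\rceil \text{ is odd} \\[2mm] t_j - a_{(i+1)j}, & \text{if } \left\lceil \dfrac{P_{i+1}}{t_j t_{j+1} \cdots t_N} \right\rceil \text{ is even} \end{cases}$$

where $(a_{i1}, a_{i2}, \ldots, a_{iN})$ is the $t_j$ representation of $P_i$ and $(a_{(i+1)1}, a_{(i+1)2}, \ldots, a_{(i+1)N})$ is the $t_j$ representation of $P_{i+1}$.

For $m_{i+1} = m_i$, we have $(b'_{i1}, b'_{i2}, \ldots, b'_{iN}) = (b'_{(i+1)1}, b'_{(i+1)2}, \ldots, b'_{(i+1)N})$, where the $b'_{ij}$'s and $b'_{(i+1)j}$'s are defined as follows:

$$b'_{ij} = \begin{cases} a_{ij} + 1, & \text{if } \left\lceil \dfrac{m_i}{z_j z_{j+1} \cdots z_N} \right\rceil \text{ is odd} \\[2mm] z_j - a_{ij}, & \text{if } \left\lceil \dfrac{m_i}{z_j z_{j+1} \cdots z_N} \right\rceil \text{ is even} \end{cases}$$

and

$$b'_{(i+1)j} = \begin{cases} a_{(i+1)j} + 1, & \text{if } \left\lceil \dfrac{m_{i+1}}{z_j z_{j+1} \cdots z_N} \right\rceil \text{ is odd} \\[2mm] z_j - a_{(i+1)j}, & \text{if } \left\lceil \dfrac{m_{i+1}}{z_j z_{j+1} \cdots z_N} \right\rceil \text{ is even} \end{cases}$$

where $(a_{i1}, a_{i2}, \ldots, a_{iN})$ is the $z_j$ representation of $m_i$ and $(a_{(i+1)1}, a_{(i+1)2}, \ldots, a_{(i+1)N})$ is the $z_j$ representation of $m_{i+1}$.

For $b_{ij} = z_j(s_{ij} - 1) + b'_{ij}$ and $b_{(i+1)j} = z_j(s_{(i+1)j} - 1) + b'_{(i+1)j}$, since the Hamming distance between $(s_{i1}, s_{i2}, \ldots, s_{iN})$ and $(s_{(i+1)1}, s_{(i+1)2}, \ldots, s_{(i+1)N})$ is 1, there exists a $w$ such that $s_{iw} \ne s_{(i+1)w}$ and $s_{ij} = s_{(i+1)j}$ for all $j = 1, 2, \ldots, N$, $j \ne w$. Since $b'_{ij} = b'_{(i+1)j}$ for all $j = 1, 2, \ldots, N$, we have $b_{iw} \ne b_{(i+1)w}$ and $b_{ij} = b_{(i+1)j}$ for all $j = 1, 2, \ldots, N$, $j \ne w$. Since $r_{ij} = A_{jb_{ij}}$ and $r_{(i+1)j} = A_{jb_{(i+1)j}}$, we have $r_{iw} \ne r_{(i+1)w}$ and $r_{ij} = r_{(i+1)j}$ for all $j = 1, 2, \ldots, N$, $j \ne w$. Hence,

$$d(R_i, R_{i+1}) = \sum_{j=1}^{N} \delta(r_{ij}, r_{(i+1)j}) = 1.$$

Q.E.D.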
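Theorems 6.1 and 6.2 can be exercised numerically on Example 6.2. Algorithms B (KAT) and C (AKT) are not reproduced in this part of the paper, so the sketch below is reconstructed only from the steps quoted in the two proofs; the function names `kat`, `akt`, `gray_rank`, and `gray_unrank`, and the assumption that each $z_i$ divides $q_i$, are introduced here for illustration.

```python
from math import prod

def gray_rank(f, q):
    """F of Lemma 4 (assumed to be the rule behind Steps 3 and 4 of KAT)."""
    L_i = 1
    for fi, qi in zip(f, q):
        a_i = fi - 1 if L_i % 2 == 1 else qi - fi
        L_i = (L_i - 1) * qi + a_i + 1
    return L_i

def gray_unrank(m, q):
    """G of Lemma 4 (assumed to be the rule behind Steps 3 and 4 of AKT)."""
    b, rest = [], m - 1
    for qi in reversed(q):
        rest, digit = divmod(rest, qi)
        b.append(digit)
    b.reverse()
    return tuple(b[i] + 1 if (-(-m // prod(q[i:]))) % 2 == 1 else q[i] - b[i]
                 for i in range(len(q)))

def kat(b, q, z):
    """Key-to-address: b_i in 1..q_i is the index of the i-th key value."""
    t = [qi // zi for qi, zi in zip(q, z)]          # assumes z_i divides q_i
    C = prod(z)
    s = [-(-bi // zi) for bi, zi in zip(b, z)]      # Step 1: s_i = ceil(b_i/z_i)
    bp = [bi - (si - 1) * zi for bi, si, zi in zip(b, s, z)]  # Step 2: b'_i
    m = gray_rank(bp, z)                            # Step 3
    P = gray_rank(s, t)                             # Step 4: bucket number
    m_prime = m if P % 2 == 1 else C - m + 1        # reverse even buckets
    return C * (P - 1) + m_prime

def akt(L, q, z):
    """Address-to-key: inverse of kat, following the AKT steps in the proof."""
    t = [qi // zi for qi, zi in zip(q, z)]
    C = prod(z)
    P = -(-L // C)
    m_prime = L - (P - 1) * C
    m = m_prime if P % 2 == 1 else C - m_prime + 1
    s = gray_unrank(P, t)
    bp = gray_unrank(m, z)
    return tuple(zi * (si - 1) + bi for zi, si, bi in zip(z, s, bp))

q, z = (8, 4, 6), (2, 2, 3)   # Example 6.2: NR = 192, BZ = 12, NB = 16
print(kat((2, 1, 4), q, z))   # 13
print(akt(13, q, z))          # (2, 1, 4)
```

Under these assumptions, addresses 12 and 13 hold the index tuples $(2, 1, 1)$ and $(2, 1, 4)$, which differ only in the third attribute even though they lie in different buckets.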

REFERENCES

[1] A. V. Aho and J. D. Ullman, "Optimal partial-match retrieval when fields are independently specified," ACM Trans. Database Syst., vol. 4, pp. 168-179, June 1979.

[2] J. L. Bentley and J. H. Friedman, "Data structures for range searching," Comput. Surveys, vol. 11, pp. 397-409, Dec. 1979.

[3] W. A. Burkhard, "Partial-match hash coding: Benefits of redundancy," ACM Trans. Database Syst., vol. 4, pp. 228-239, June 1979.

[4] W. A. Burkhard and R. M. Keller, "Some approaches to best-match file searching," Commun. Ass. Comput. Mach., vol. 16, pp. 230-236, Apr. 1973.

[5] C. C. Chang and R.C.T. Lee, "Optimal Cartesian product files for partial match queries and partial match patterns," in Proc. NCS 1979 Conf., Taipei, Taiwan, Dec. 1979, pp. 5-27-5-37.

[6] C. C. Chang, R.C.T. Lee, and H. C. Du, "Some properties of Cartesian product files," in Proc. ACM-SIGMOD 1980 Conf., Santa Monica, CA, May 1980, pp. 157-168.

[7] J. M. Chang and K. S. Fu, "On the retrieval time and the storage space of doubly-chained multiple-attribute tree database organization," Policy Anal. Inform. Syst., vol. 1, pp. 22-48, Jan. 1978.

[8] R. J. Cichelli, "Minimal perfect hash functions made simple," Commun. Ass. Comput. Mach., vol. 23, pp. 17-19, Jan. 1980.

[9] H. C. Du and R.C.T. Lee, "Symbolic gray code as a multikey hashing function," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-2, pp. 83-90, Jan. 1980.

[10] H. C. Du and J. S. Sobolewski, Disk Allocation for Cartesian Product Files on Multiple Disk Systems. Seattle: Univ. Washington, 1980.

[11] M. W. Du, K. F. Jea, and D. W. Shieh, "The study of a new perfect hash scheme," in Proc. COMPSAC 1980, pp. 341-347.

[12] J. H. Friedman, F. Baskett, and L. J. Shustek, "An algorithm for finding nearest neighbors," IEEE Trans. Comput., vol. C-24, pp. 1000-1006, Oct. 1975.

[13] J. H. Friedman, J. L. Bentley, and R. A. Finkel, "An algorithm for finding best matches in logarithmic expected time," ACM Trans. Math. Software, vol. 3, pp. 209-226, Sept. 1977.

[14] K. Fukunaga and P. M. Narendra, "A branch and bound algorithm for computing k-nearest neighbors," IEEE Trans. Comput., vol. C-24, pp. 750-753, July 1975.

[15] R. L. Kashyap, S.K.C. Subas, and S. B. Yao, "Analysis of the multiple-attribute-tree data-base organization," IEEE Trans. Software Eng., vol. SE-3, pp. 451-567, Nov. 1977.

[16] S. P. Ghosh, Data Base Organization for Data Management. New York: Academic, 1977.

[17] R.C.T. Lee, Y. H. Chin, and S. C. Chang, "Application of principal component analysis to multikey searching," IEEE Trans. Software Eng., vol. SE-2, pp. 185-193, Sept. 1976.

[18] R.C.T. Lee and S. H. Tseng, "Multi-key sorting," Policy Anal. Inform. Syst., vol. 3, pp. 1-20, Dec. 1979.

[19] W. C. Lin, R.C.T. Lee, and H. C. Du, "Common properties of some multi-attribute file systems," IEEE Trans. Software Eng., vol. SE-5, pp. 160-174, Mar. 1979.

[20] J. H. Liou and S. B. Yao, "Multi-dimensional clustering for data base organizations," Inform. Syst., vol. 2, pp. 187-198, 1977.

[21] R. L. Rivest, "Analysis of associative retrieval algorithms," Ph.D. dissertation, Dep. Comput. Sci., Stanford Univ., Stanford, CA, 1974.

[22] R. L. Rivest, "Partial-match retrieval algorithms," SIAM J. Comput., vol. 15, no. 1, pp. 19-50, Mar. 1976.

[23] J. B. Rothnie and T. Lozano, "Attribute based file organization in a paged memory environment," Commun. Ass. Comput. Mach., vol. 17, pp. 63-69, Feb. 1974.

[24] C. W. Shen and R.C.T. Lee, "A nearest neighbor search technique with short zero-in-time," IEEE Trans. Software Eng., to be published.

[25] R. Sprugnoli, "Perfect hashing functions: A single probe retrieving method for static sets," Commun. Ass. Comput. Mach., vol. 20, pp. 841-850, Nov. 1977.

C. C. Chang was born in Taiwan in 1954. He received the B.S. degree in applied mathematics in 1977 and the M.S. degree in computer and decision sciences in 1979, both from the National Tsing Hua University.

He is presently an instructor as well as a Ph.D. student of the Institute of Computer Engineering in National Chiao-Tung University, Hsinchu, Taiwan. His research interests are in database design, algorithm analysis, and statistics.

R.C.T. Lee (A'74-M'75) received the Ph.D. degree from the University of California, Berkeley, in 1967.

He is currently the Director of the Institute of Computer and Decision Sciences, National Tsing Hua University, Hsinchu, Taiwan. He previously worked for NCR (California), National Institutes of Health (Bethesda, MD), and the Naval Research Laboratory (Washington, DC) before joining the National Tsing Hua University in 1975. He is the coauthor of Symbolic Logic and Mechanical Theorem Proving (New York: Academic), which has been translated into both Japanese and Russian. His research in clustering analysis will appear as a chapter entitled "Clustering Analysis and its Applications" in Advances in Information Systems Science (New York: Plenum). He is the author of more than 50 papers on mechanical theorem proving, database design, and pattern recognition.

M. W. Du (S'70-M'72) was born in Chung-King, China, in 1944. He received the B.S.E.E. degree from the National Taiwan University in 1966 and the Ph.D. degree from The Johns Hopkins University, Baltimore, MD, in 1972.

He is now the Director of the Institute of Computer Engineering, National Chiao-Tung University, Hsinchu, Taiwan. His research interests include fault diagnosis, automata theory, algorithm design and analysis, database design, and Chinese I/O design.