Data Structure and Algorithm Homework #3 Solution

TA email: dsa1@csie.ntu.edu.tw

Problem 1. Hash (25%)

1. (8%)

Open addressing with linear probing

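Each row shows the table after inserting the key on the left; columns are slots 0 to 10, and a dash marks an empty slot.

insert |  0   1   2   3   4   5   6   7   8   9  10
-------+--------------------------------------------
18     |  -   -   -   -   -   -   -  18   -   -   -
34     |  -  34   -   -   -   -   -  18   -   -   -
9      |  -  34   -   -   -   -   -  18   -   9   -
37     |  -  34   -   -  37   -   -  18   -   9   -
40     |  -  34   -   -  37   -   -  18  40   9   -
32     |  -  34   -   -  37   -   -  18  40   9  32
89     |  -  34  89   -  37   -   -  18  40   9  32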

Open addressing with double hashing

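Each row shows the table after inserting the key on the left; columns are slots 0 to 10, and a dash marks an empty slot.

insert |  0   1   2   3   4   5   6   7   8   9  10
-------+--------------------------------------------
18     |  -   -   -   -   -   -   -  18   -   -   -
34     |  -  34   -   -   -   -   -  18   -   -   -
9      |  -  34   -   -   -   -   -  18   -   9   -
37     |  -  34   -   -  37   -   -  18   -   9   -
40     |  -  34   -   -  37   -   -  18  40   9   -
32     |  -  34   -   -  37   -   -  18  40   9  32
89     | 89  34   -   -  37   -   -  18  40   9  32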
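The two tables above can be reproduced with a short simulation. The sketch below is only an illustration: the hash functions are not stated in this solution, so it assumes h1(k) = k mod 11 and, for double hashing, h2(k) = 1 + (k mod 10), which are choices that happen to be consistent with both tables. The function names are hypothetical.

```c
#include <assert.h>

#define M 11        /* table size: slots 0..10 */
#define EMPTY (-1)

/* Assumed primary hash (not given in the solution): k mod 11. */
static int h1(int k) { return k % M; }

/* Assumed secondary hash for double hashing: 1 + (k mod 10), never 0. */
static int h2(int k) { return 1 + k % (M - 1); }

/* Mark every slot as empty. */
void table_init(int table[M]) {
    for (int i = 0; i < M; i++) table[i] = EMPTY;
}

/* Probe slots (h1(k) + i * step) mod M for i = 0, 1, ... and store k in
   the first empty one. Returns the slot used, or -1 if none was found. */
static int insert_with_step(int table[M], int k, int step) {
    assert(k >= 0);
    for (int i = 0; i < M; i++) {
        int slot = (h1(k) + i * step) % M;
        if (table[slot] == EMPTY) {
            table[slot] = k;
            return slot;
        }
    }
    return -1;
}

/* Linear probing: the step is always 1. */
int insert_linear(int table[M], int k) {
    return insert_with_step(table, k, 1);
}

/* Double hashing: the step depends on the key. */
int insert_double(int table[M], int k) {
    return insert_with_step(table, k, h2(k));
}
```

Under these assumptions, inserting 18, 34, 9, 37, 40, 32, 89 in order places 89 in slot 2 with linear probing and in slot 0 with double hashing, while every other key lands in the same slot in both schemes.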

2. (6%)

(a) (3%) [Figure omitted]

(b) (3%) [Figures omitted: states after insert 30, insert 49, and insert 4]

3. (5%) At the start, the two TAs must decide which of them sends a message first (assume it is TA1). At the end of the game, a verification step checks whether TA1 lied. If TA2 finds that TA1 lied, TA1 loses the game.

FLOW:

(1) TA1 chooses what to throw (e.g. rock) and a random string (e.g. 94879487), concatenates them into one string (rock94879487), and feeds that string to SHA256.

(2) TA1 sends the SHA256 output to TA2.

(3) After receiving the message from TA1, TA2 decides what to throw (e.g. paper) and sends the choice to TA1.

(4) TA1 sends the plaintext (rock94879487) to TA2. TA2 hashes the plaintext and compares the result with the hash received in step (2) to verify that TA1 did not lie.

Now both of them know who wins the game. (In the example above, paper beats rock, so TA2 wins.)

A hash function like SHA256 has two useful properties. First, it is collision resistant: no practical method is known for finding two different inputs with the same output, so in practice two different strings give two different outputs. Second, it is a one-way function: given only the output, you cannot feasibly recover the input. (The random string in the flow enlarges the input space so that TA2 cannot recover TA1's raw input by brute-force search.) Now consider the flow. Because of the one-way property, TA2 cannot know what TA1 throws before step (3), so all TA2 can do is choose without that knowledge. TA1 cannot change the earlier choice after receiving TA2's message, because TA1 is not able to produce another input string whose hash matches the message sent in step (2). If TA1 lies, TA2 will detect it in the verification step. This is a simple example of how hashing can be used in the real world.
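The commit-and-reveal flow can be sketched in a few lines of C. To keep the example self-contained, the sketch below swaps SHA256 for 64-bit FNV-1a, which is NOT a cryptographic hash and offers none of the security properties discussed above; a real implementation would call a SHA256 library instead. The function names commit, verify_reveal, and beats are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for SHA256: 64-bit FNV-1a. Used only so that the sketch has
   no external dependencies; it is not collision resistant or one-way. */
uint64_t toy_hash(const char *s) {
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

/* Steps (1)-(2): TA1 hashes "throw + random string" and sends the hash. */
uint64_t commit(const char *throw_and_nonce) {
    return toy_hash(throw_and_nonce);
}

/* Step (4): TA2 re-hashes the revealed plaintext and compares it with
   the commitment received earlier. Returns 1 if they match. */
int verify_reveal(uint64_t commitment, const char *plaintext) {
    return toy_hash(plaintext) == commitment;
}

/* Returns 1 if throw a beats throw b, 0 otherwise (including ties). */
int beats(const char *a, const char *b) {
    return (strcmp(a, "rock") == 0 && strcmp(b, "scissors") == 0) ||
           (strcmp(a, "scissors") == 0 && strcmp(b, "paper") == 0) ||
           (strcmp(a, "paper") == 0 && strcmp(b, "rock") == 0);
}
```

With the values from the flow, TA1 would send commit("rock94879487"), TA2 would answer "paper", and TA2 would later accept the reveal only if verify_reveal returns 1 for the plaintext TA1 discloses.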

4. (6%)

(a) $h_1(k) = k \bmod 7$, $h_2(k) = \lfloor k/7 \rfloor \bmod 7$

 k  | h1(k) | h2(k)
----+-------+------
  6 |   6   |   0
 31 |   3   |   4
  2 |   2   |   0
 41 |   6   |   5
 30 |   2   |   4
 45 |   3   |   6
 44 |   2   |   6

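The following are the 2 hash tables over time. Each column is labeled with the element just inserted, each row is a slot, and a dash marks an empty slot.

Table 1: table for h1(k)

slot |  6  31   2  41  30  45  44
-----+----------------------------
  0  |  -   -   -   -   -   -   -
  1  |  -   -   -   -   -   -   -
  2  |  -   -   2   2  30  30  44
  3  |  -  31  31  31  31  45  31
  4  |  -   -   -   -   -   -   -
  5  |  -   -   -   -   -   -   -
  6  |  6   6   6  41   6   6   6

Table 2: table for h2(k)

slot |  6  31   2  41  30  45  44
-----+----------------------------
  0  |  -   -   -   6   2   2   2
  1  |  -   -   -   -   -   -   -
  2  |  -   -   -   -   -   -   -
  3  |  -   -   -   -   -   -   -
  4  |  -   -   -   -   -  31  30
  5  |  -   -   -   -  41  41  41
  6  |  -   -   -   -   -   -  45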

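(b) Any element whose insertion leads to an infinite sequence of displacements is a valid answer. For instance, inserting element 3 works: h1(3) = 3 and h2(3) = 0, and the elements caught in the endless sequence of displacements are [3, 31, 30, 44, 45, 2]. These six elements hash to only five distinct slots (slots 2 and 3 of Table 1 and slots 0, 4, and 6 of Table 2), so they can never all be placed.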
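The insertions in (a) and the endless displacement in (b) can be checked with a small simulation. This is a minimal sketch, assuming the usual cuckoo rule in which a new key always goes into Table 1 first and an evicted key moves to its slot in the other table. The cutoff MAX_KICKS and the type and function names are hypothetical; the cutoff exists only so that the sketch terminates.

```c
#include <assert.h>

#define SIZE 7
#define EMPTY (-1)
#define MAX_KICKS 32   /* hypothetical cutoff: give up after this many rounds */

static int h1(int k) { return k % SIZE; }
static int h2(int k) { return (k / SIZE) % SIZE; }

typedef struct {
    int t1[SIZE];   /* indexed by h1 */
    int t2[SIZE];   /* indexed by h2 */
} Cuckoo;

void cuckoo_init(Cuckoo *c) {
    for (int i = 0; i < SIZE; i++) {
        c->t1[i] = EMPTY;
        c->t2[i] = EMPTY;
    }
}

/* Insert key, displacing occupants back and forth between the tables.
   Returns 1 on success and 0 if the displacements did not stop within
   MAX_KICKS rounds (the tables are then left mid-displacement). */
int cuckoo_insert(Cuckoo *c, int key) {
    assert(key >= 0);
    int cur = key;
    for (int round = 0; round < MAX_KICKS; round++) {
        /* Put cur into Table 1 and pick up whatever was there. */
        int evicted = c->t1[h1(cur)];
        c->t1[h1(cur)] = cur;
        if (evicted == EMPTY) return 1;
        cur = evicted;

        /* Put the evicted key into Table 2 and pick up its occupant. */
        evicted = c->t2[h2(cur)];
        c->t2[h2(cur)] = cur;
        if (evicted == EMPTY) return 1;
        cur = evicted;
    }
    return 0;
}
```

Inserting 6, 31, 2, 41, 30, 45, 44 in order ends with 44, 31, 6 in slots 2, 3, 6 of Table 1 and 2, 30, 41, 45 in slots 0, 4, 5, 6 of Table 2; a following insert of 3 never settles and hits the cutoff.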


Problem 2. Heap (15%)

1. (3%)

in-order number sequence: 3, 6, 4, 9, 5, 8, 10, 2, 7, 1

2. (3%)

in-order number sequence: 10, 5, 6, 4, 9, 7, 1, 8, 2, 3

3. (3%)

    /* Print every value <= q in the heap rooted at head. */
    void AtMost(int q, Node *head) {
        if (head != NULL && head->value <= q) {
            printf("%d\n", head->value);
            AtMost(q, head->right);
            AtMost(q, head->left);
        }
    }

The algorithm visits the two children of a node only when that node itself qualifies. If k values are printed, the function is called at most 2k + 1 times (once on the root and twice for each printed node), so the time complexity is O(k).
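As a runnable variation of the idea above, the sketch below collects the qualifying values into an array instead of printing them, which makes the result easy to check. It assumes a pointer-based min-heap (no child is smaller than its parent); the Node layout and the name at_most are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node {
    int value;
    struct Node *left;
    struct Node *right;
} Node;

/* Append every value <= q in the min-heap rooted at head to out,
   starting at index count. Returns the new number of stored values.
   A subtree is skipped as soon as its root exceeds q, because in a
   min-heap everything below that root is at least as large. */
int at_most(int q, const Node *head, int *out, int count) {
    if (head != NULL && head->value <= q) {
        out[count++] = head->value;
        count = at_most(q, head->right, out, count);
        count = at_most(q, head->left, out, count);
    }
    return count;
}
```

The caller passes an output array with room for every node in the heap and an initial count of 0.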

4. (6%) For each non-leaf node stored in array[i], its left child and right child are stored in array[2i] and array[2i + 1], respectively. For example, the root is stored in array[1], and its left child and right child are stored in array[2] and array[3], respectively.

For a heap of depth H, it takes 0 actions to fix a leaf, 1 action to fix a node at depth H − 1, 2 actions for a node at depth H − 2, and so on, up to at most H actions to fix the root at depth 0. Since depth d holds at most 2^d nodes, the total time is

$$S = H + 2(H-1) + 4(H-2) + \cdots + 2^{H-1} \cdot 1$$

$$2S = 2H + 4(H-1) + 8(H-2) + \cdots + 2^{H} \cdot 1$$

$$S = 2S - S = -H + 2 + 4 + \cdots + 2^{H-1} + 2^{H} = 2^{H+1} - 2 - H$$

With $H = \log_2(N+1) - 1$:

$$S = 2^{\log_2(N+1)} - 1 - \log_2(N+1) = N - \log_2(N+1) < N \Rightarrow O(N)$$

You can also refer to the lecture slides for the proof.
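The bound can be checked empirically by counting swaps during bottom-up heap construction. The sketch below assumes a max-heap stored in a 1-indexed array and counts one "action" per swap; both the heap orientation and the function names are assumptions made for illustration.

```c
#include <assert.h>

/* Sift a[i] down inside the 1-indexed max-heap a[1..n].
   Returns the number of swaps performed. */
static int sift_down(int a[], int n, int i) {
    int swaps = 0;
    while (2 * i <= n) {
        int child = 2 * i;                       /* left child */
        if (child + 1 <= n && a[child + 1] > a[child])
            child++;                             /* right child is larger */
        if (a[i] >= a[child]) break;             /* heap order holds */
        int tmp = a[i];
        a[i] = a[child];
        a[child] = tmp;
        i = child;
        swaps++;
    }
    return swaps;
}

/* Build a max-heap bottom-up; returns the total number of swaps. */
int build_heap(int a[], int n) {
    int total = 0;
    for (int i = n / 2; i >= 1; i--)
        total += sift_down(a, n, i);
    return total;
}

/* Returns 1 if a[1..n] satisfies the max-heap property. */
int is_max_heap(const int a[], int n) {
    for (int i = 2; i <= n; i++)
        if (a[i / 2] < a[i]) return 0;
    return 1;
}
```

For N = 15 with the ascending input 1, 2, ..., 15, every node sinks all the way to a leaf, which is the worst case; the swap count then equals N − log2(N + 1) = 15 − 4 = 11.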


Problem 3. Trie Trie See (20%)

1. (5%)

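root
  A
    P
      P *
        L
          E *
  B
    A
      N
        A
          N
            A *
      S
        E *
        K
          E
            T *

(Each level of indentation is one edge. A * marks the end of a word: app, apple, banana, base, basket.)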

2. (5%)

    /* Assumes struct Node has children[26] and is_word, ROOT points to the
       root, and new_node() returns a node with no children and is_word = 0.
       Words contain lowercase letters only. */
    void insert(char word[], int N) {
        struct Node *cur = ROOT;
        for (int i = 0; i < N; i++) {
            int no = word[i] - 'a';
            if (cur->children[no] == NULL) {
                cur->children[no] = new_node();
            }
            cur = cur->children[no];
        }
        cur->is_word = 1;
    }

    void delete(char word[], int N) {
        struct Node *cur = ROOT;
        for (int i = 0; i < N; i++) {
            int no = word[i] - 'a';
            if (cur->children[no] == NULL) {
                return;  /* the word is not in the trie */
            }
            cur = cur->children[no];
        }
        cur->is_word = 0;
    }

    int query(char word[], int N) {
        struct Node *cur = ROOT;
        for (int i = 0; i < N; i++) {
            int no = word[i] - 'a';
            if (cur->children[no] == NULL) {
                return 0;  /* the path for the word does not exist */
            }
            cur = cur->children[no];
        }
        return cur->is_word;
    }

Each function contains a single for-loop that runs at most N times, so it visits at most N + 1 nodes. The insert function allocates at most N new nodes, and each allocation takes constant time. Hence all three operations run in O(N) time.

3. (5%) First construct a trie T1 from the N words, and then another trie T2 for all the N reversed words. For example, if the words are as in Problem 4.1, then the second trie is constructed from elppa, ananab, ppa, esab and teksab.

While building the two tries, one additional step is to accumulate a count value (we can simply use the tag variable) whenever a node is traversed. In other words, the tag on each node indicates how many end-of-word nodes are in the subtree rooted at that node. For example, if the words are as in Problem 4.1, then the trie T1 would be as follows (where the tag is added to each node):

root
  A, 2
    P, 2
      P, 2
        L, 1
          E, 1
  B, 3
    A, 3
      N, 1
        A, 1
          N, 1
            A, 1
      S, 2
        E, 1
        K, 1
          E, 1
            T, 1

Obviously the above can be done in $O(\sum_{i=1}^{N} |W_i|)$ time. The rest is just to output the tag on the node of S in the trie according to the type of query, which can be done in O(|S|) time. Pseudo code is as follows:

    construct tries T1, T2 for original words and reversed words, respectively,
        while maintaining tag as stated above
    for each query q = (S, type) do
        if type is prefix:
            let T = T1
        else:
            let T = T2 and reverse S
        if the path spelled by S does not exist in T: output 0 and continue
        locate the node n on T for S
        output n.tag
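The pseudo code above can be turned into a short C sketch. It assumes lowercase words that are pairwise distinct, so that counting the insertions passing through a node gives the same tag as counting end-of-word nodes in its subtree. A flag selects whether a string is walked front to back (for T1) or back to front (for T2), which avoids building reversed copies. The Trie type and function names are hypothetical, and nodes are never freed in this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Trie {
    struct Trie *children[26];
    int tag;   /* number of inserted words whose path passes through here */
} Trie;

Trie *trie_new(void) {
    Trie *t = malloc(sizeof *t);
    assert(t != NULL);
    for (int i = 0; i < 26; i++) t->children[i] = NULL;
    t->tag = 0;
    return t;
}

/* Insert word, reading it back to front when reversed is nonzero.
   Every node on the path, including the root, gets its tag bumped. */
void trie_insert(Trie *root, const char *word, int reversed) {
    int n = (int)strlen(word);
    Trie *cur = root;
    cur->tag++;
    for (int i = 0; i < n; i++) {
        int no = word[reversed ? n - 1 - i : i] - 'a';
        assert(no >= 0 && no < 26);
        if (cur->children[no] == NULL) cur->children[no] = trie_new();
        cur = cur->children[no];
        cur->tag++;
    }
}

/* Number of inserted words that have s as a prefix (reversed == 0, on T1)
   or as a suffix (reversed != 0, on T2). Runs in O(|s|) time. */
int trie_count(const Trie *root, const char *s, int reversed) {
    int n = (int)strlen(s);
    const Trie *cur = root;
    for (int i = 0; i < n; i++) {
        int no = s[reversed ? n - 1 - i : i] - 'a';
        if (no < 0 || no >= 26 || cur->children[no] == NULL) return 0;
        cur = cur->children[no];
    }
    return cur->tag;
}
```

With the words apple, banana, app, base, and basket, a prefix query for "ba" on T1 returns 3 and a suffix query for "e" on T2 returns 2.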

4. (5%)

    construct a trie T from the N words
    for each leaf node L of T: mark L as lose
    do
        for each non-marked node L of T whose children are marked:
            if any child of L is marked lose: mark L as win
            else: mark L as lose
    until all nodes are marked

    if T.root is marked win:
        output the first player has a winning strategy
    else:
        output the second player has a winning strategy

Note that this algorithm, particularly the do-until part, can be implemented with recursion.

The mark operation can be done with the tag variable.

Analysis: In each step, the current word has to be the prefix of at least one of the N words.

This means a state of the game can be viewed as a node on the trie, and the string from the root to the node is the current word. In this way, each move of a player is just moving down the trie by one node.

Now we want to find out whether a player has a winning strategy on each node if it’s his/her turn. Clearly on the leaf nodes the player is going to lose, since he can’t move down anymore.

So we mark each leaf node as lose. If any of the children of a node is marked lose, the player can simply go down to that node, so that the other player will definitely lose. In this case we mark the node to be win. Otherwise, we mark the node as lose. Finally, the mark on the root node indicates whether the first player has a winning strategy.

With a recursive approach, each node is visited at most once, so the time complexity is linear in the total length of the N words. (You don't need to show this.)
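The recursive version of the marking can be sketched as follows. Instead of storing win/lose marks, the function below simply returns the mark of a node, so a leaf returns 0 (lose) and any node with a losing child returns 1 (win). Lowercase words are assumed; the Trie type and function names are hypothetical, and nodes are never freed in this sketch.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Trie {
    struct Trie *children[26];
} Trie;

Trie *trie_new(void) {
    Trie *t = malloc(sizeof *t);
    assert(t != NULL);
    for (int i = 0; i < 26; i++) t->children[i] = NULL;
    return t;
}

void trie_insert(Trie *root, const char *word) {
    Trie *cur = root;
    for (; *word != '\0'; word++) {
        int no = *word - 'a';
        assert(no >= 0 && no < 26);
        if (cur->children[no] == NULL) cur->children[no] = trie_new();
        cur = cur->children[no];
    }
}

/* Returns 1 if the player to move at this node has a winning strategy.
   A leaf has no move and is a loss; otherwise the player wins by moving
   to any child from which the opponent loses. */
int is_win(const Trie *node) {
    for (int i = 0; i < 26; i++) {
        if (node->children[i] != NULL && !is_win(node->children[i]))
            return 1;
    }
    return 0;
}
```

Calling is_win on the root answers the question for the first player. For the single word "ab" the first player loses, because the only move leads to a position from which the second player can finish the word.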


Problem 4. Machine Manager (Programming problem) (20%)

First of all, this problem has a relatively large number of requests, about 10⁸.

Therefore, even just storing all the requests without doing anything would use a lot of memory. Since each request consists of (L + 1) characters, the total memory usage would be (L + 1) × Q bytes (a char takes 1 byte). Even for the 40% test cases this is about 300 MB, which already exceeds the memory limit (256 MB). Therefore, we should handle the requests sequentially without storing them.

To deal with a request, we need to check or activate a given machine. Since we already know that we cannot afford to store all the requests, we should at least store the state of each machine.

Then, how many states should we keep?

From the code that generates the requests, we can see that each character of a machine's name is some value ANDed with 63. In C, & is bitwise AND, and 63 is 111111 in binary. Thus only the 6 least significant bits of a character can be 1, i.e., there are only 2⁶ choices per character, and at most (2⁶)^L different machine names. If we assign some memory to each machine name, we can tag that memory to record whether the machine is active.

The easiest way to give each item its own memory with O(1) random access is an array. In C, an array index must be a non-negative integer, but a machine name is a string, which cannot directly index an array. Therefore, we map (hash) each string to an integer. If the hash function is poorly chosen, however, collisions may occur.

Theoretically, a general hash table needs ((2⁶)^L)² slots to bring the expected number of collisions below 1. However, if we choose the hash function carefully for this specific problem, a table with exactly (2⁶)^L slots suffices, without any collision. Since each character takes only 6 bits, we can concatenate the L characters into a single 6 × L-bit integer. The code is as follows:

    int hash_value(char *s, int L) {
        int ret = 0;
        for (int i = 0; i < L; i++)
            ret = (ret << 6) | s[i];   /* append the 6 bits of s[i] */
        return ret;
    }

For the 40% test cases, we have L ≤ 3, so we need 2¹⁸ slots. An int array takes 2¹⁸ × 4 bytes = 2²⁰ bytes = 1 MB, which easily fits in the memory limit.

For the 100% test cases, an int array would take 2³⁰ × 4 bytes = 2³² bytes = 4 GB, far above the memory limit. Even a char array would take 2³⁰ bytes = 1 GB, which still does not fit. Then how much space can one slot take? We can calculate it:

$$\frac{256\,\text{MB}}{2^{30}} = \frac{2^{28}\,\text{B}}{2^{30}} = \frac{2^{31}\,\text{bits}}{2^{30}} = 2\,\text{bits}$$

Thus each slot can use at most 2 bits. That is actually enough: we only need to set a machine as active and check whether it is active or inactive, which is a 0/1 state, so 1 bit per machine suffices. Mapping each integer to a bit can be achieved with bit operations. The code is as follows:

    #include <stdbool.h>

    #define SLOTS (1 << 30)
    #define SIZE 32

    /* 2^25 words of 32 bits each = 128 MB; unsigned so that bit 31 is safe */
    unsigned int hash_table[SLOTS / SIZE];

    void activate(int id) {
        hash_table[id / SIZE] |= 1u << (id % SIZE);
    }

    bool check(int id) {
        return (hash_table[id / SIZE] >> (id % SIZE)) & 1;
    }
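To see the name-to-integer mapping and the bit table work together, here is a self-contained sketch. It differs from the code above in a few labeled ways: the table is allocated on the heap with a size that depends on L (so small cases can be tried cheaply), unsigned 32-bit types are used throughout, and each character is masked with 63 again as a precaution. The BitTable type and bittable_init are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BITS_PER_CHAR 6
#define WORD_BITS 32

/* Concatenate the low 6 bits of each of the L characters. */
uint32_t hash_value(const char *s, int L) {
    uint32_t ret = 0;
    for (int i = 0; i < L; i++)
        ret = (ret << BITS_PER_CHAR) | ((uint32_t)s[i] & 63u);
    return ret;
}

typedef struct {
    uint32_t *words;   /* one bit per machine name */
    uint32_t slots;    /* (2^6)^L */
} BitTable;

/* Allocate a zeroed table for names of length L (1 <= L <= 5, so that
   6 * L <= 30 bits). Returns 1 on success, 0 if allocation failed. */
int bittable_init(BitTable *t, int L) {
    assert(L >= 1 && L <= 5);
    t->slots = 1u << (BITS_PER_CHAR * L);
    t->words = calloc(t->slots / WORD_BITS, sizeof *t->words);
    return t->words != NULL;
}

void activate(BitTable *t, uint32_t id) {
    assert(id < t->slots);
    t->words[id / WORD_BITS] |= 1u << (id % WORD_BITS);
}

int check(const BitTable *t, uint32_t id) {
    assert(id < t->slots);
    return (t->words[id / WORD_BITS] >> (id % WORD_BITS)) & 1u;
}
```

For L = 3 the table needs 2¹⁸ bits, i.e. 32 KB; for L = 5 it needs 2³⁰ bits, i.e. 128 MB, which is within the 256 MB limit.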


Problem 5. Yea, I’m a router. (Programming problem) (20%)

Here we provide a solution with a trie. The basic idea is to (1) build a trie from all the given masks, and then (2) traverse the trie from the root for each of the given IP addresses. If the traversal reaches an accept node, output "TRUE" for the corresponding IP address; otherwise, output "FALSE".

Pseudocode

The following two sections give pseudocode for the two steps mentioned above.

Building Trie with Masks

The following build() function takes a trie and a mask, and augments the trie to accept this mask. The root parameter is the root node of the trie, and a node supports these data members: left, right, and accept. The accept field stores a boolean indicating that the trie, i.e., a compound subnet mask, accepts any IPs beyond this point.

    function build(root, mask_ip, mask_bits):
        current = root
        for i in 0 to mask_bits-1:
            if mask_ip[i] == '0':
                # check whether we can go left
                if !current.left:
                    current.left = create_node()
                # go left
                current = current.left
            else:
                # check whether we can go right
                if !current.right:
                    current.right = create_node()
                # go right
                current = current.right
        current.accept = True

Time Complexity: build() runs in O(1) time, since its only loop runs at most 32 times, given that an IP address is 32 bits long. Therefore, the total time complexity for building the trie from M masks is O(M).

Traversing Trie with IP addresses

The function verify() below checks if an IP is accepted by the trie. It returns True if the IP address is accepted, False otherwise. Note that the parameter ip is a bit string.


    function verify(root, ip):
        current = root
        for i in 0 to 31:
            if ip[i] == '0':
                # check whether we can go left
                if !current.left:
                    return False
                current = current.left
            else:
                # check whether we can go right
                if !current.right:
                    return False
                current = current.right
            if current.accept:
                return True
        return False

Time Complexity: As with build(), each call to verify() takes O(1) time, so verifying N IP addresses takes O(N) time.
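Both functions can be sketched in C as follows. This is one possible realization, not the reference implementation: it assumes IP addresses and masks are passed as 32-bit integers rather than bit strings, reads bits from the most significant end, and adds a check of the root's accept flag so that a mask with 0 bits accepts everything. The helper ipv4 and the Node layout are hypothetical, and nodes are never freed in this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct Node {
    struct Node *left;    /* next bit is 0 */
    struct Node *right;   /* next bit is 1 */
    int accept;           /* nonzero: every IP reaching this node is accepted */
} Node;

Node *create_node(void) {
    Node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->left = NULL;
    n->right = NULL;
    n->accept = 0;
    return n;
}

/* Hypothetical helper: pack a.b.c.d into one 32-bit value. */
uint32_t ipv4(unsigned a, unsigned b, unsigned c, unsigned d) {
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) |
           ((uint32_t)c << 8) | (uint32_t)d;
}

/* Add the mask made of the first mask_bits bits of mask_ip to the trie. */
void build(Node *root, uint32_t mask_ip, int mask_bits) {
    assert(mask_bits >= 0 && mask_bits <= 32);
    Node *cur = root;
    for (int i = 0; i < mask_bits; i++) {
        int bit = (mask_ip >> (31 - i)) & 1;
        Node **next = bit ? &cur->right : &cur->left;
        if (*next == NULL) *next = create_node();
        cur = *next;
    }
    cur->accept = 1;
}

/* Returns 1 if some stored mask is a prefix of ip, 0 otherwise. */
int verify(const Node *root, uint32_t ip) {
    const Node *cur = root;
    if (cur->accept) return 1;   /* a 0-bit mask matches every address */
    for (int i = 0; i < 32; i++) {
        int bit = (ip >> (31 - i)) & 1;
        cur = bit ? cur->right : cur->left;
        if (cur == NULL) return 0;
        if (cur->accept) return 1;
    }
    return 0;
}
```

For example, after adding the masks 192.168.0.0 with 16 bits and 10.0.0.0 with 8 bits, verify accepts 192.168.5.7 and 10.1.2.3 but rejects 192.169.0.1.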

Summary

With both build() and verify(), we can do the firewall checking in O(M + N) time.

However, please note that this is NOT the only solution to the problem.