STRING MATCHING

(1)

Prof. Michael Tsai

2017/03/21

(2)

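Problem: String Matching

• The text is an array T[1..n] holding a string of length n

• The pattern is an array P[1..m] holding a string of length m ($m \le n$)

• We want to find whether P occurs in T

• The characters of P and T are drawn from a finite alphabet $\Sigma$

• For example: $\Sigma = \{0, 1\}$ or $\Sigma = \{a, b, \ldots, z\}$

• Pattern P occurs with shift s in text T (equivalently, P occurs beginning at position s+1 in text T) if $0 \le s \le n-m$ and $T[s+j] = P[j]$ for $1 \le j \le m$.

• If P occurs with shift s in T, we call s a valid shift. Otherwise, we call s an invalid shift.

• The string-matching problem is to find every position in T where P occurs (every valid shift)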

(3)

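Some Definitions

• $\Sigma^*$: the set of all finite-length strings formed from characters in $\Sigma$ (including the zero-length empty string $\varepsilon$)

• $|x|$: the length of string x

• xy: the concatenation of strings x and y

• $w \sqsubset x$: string w is a prefix of string x (that is, x = wy for some $y \in \Sigma^*$; this implies $|w| \le |x|$)

• $w \sqsupset x$: string w is a suffix of string x (that is, x = yw for some $y \in \Sigma^*$; this implies $|w| \le |x|$)

• For example: $ab \sqsubset abcca$ and $cca \sqsupset abcca$

• The empty string is both a prefix and a suffix of every string

• For any strings x, y and any character a, $x \sqsupset y$ iff $xa \sqsupset ya$

• $\sqsubset$ and $\sqsupset$ are transitive relations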

(4)

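Method 1: The Naive Brute-Force Approach

Naive-String-Matcher(T, P)
  n = T.length
  m = P.length
  for s = 0 to n-m
    if P[1..m] == T[s+1..s+m]
      print "Pattern occurs with shift" s

• Each comparison P[1..m] == T[s+1..s+m] takes $O(m)$ time

• The for loop runs n-m+1 times

• Total: $O((n-m+1)\,m)$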
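The brute-force matcher above can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: it uses 0-based indexing (so the returned shifts are the same s values as in the pseudocode), and the function name `naive_string_matcher` is chosen here for the example.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # s = 0 .. n-m, i.e. n-m+1 iterations (none at all if m > n).
    for s in range(n - m + 1):
        # The slice comparison costs O(m) in the worst case.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


if __name__ == "__main__":
    print(naive_string_matcher("aaabaaab", "aaab"))
```

Every shift is tested from scratch, which is where the $O((n-m+1)\,m)$ bound comes from.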

(5)

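Why Is This Bad?

• Each iteration of the for loop performs a comparison, and if it fails, everything learned in that round is thrown away.

• Example: P = aaab

• Suppose we find that s = 0 is a valid shift (so T begins with aaab).

• Then that result already tells us that shifts 1, 2, and 3 can be skipped without comparing them one by one: T[4] = b, but each of those shifts would need an a at that position.

Figure: T = a a a b …, with P = a a a b aligned at shift 0 and then at the next candidate shift.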

(6)

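Method 2: The Rabin-Karp Algorithm

• Assume $\Sigma = \{0, 1, \ldots, 9\}$

• Then every string of length k can be viewed as a k-digit decimal number

• For example, the string "31415" can be viewed as the decimal number 31415

• Let $t_s$ be the decimal number represented by T[s+1..s+m]

• Let p be the decimal number represented by P

• Then $t_s = p$ iff T[s+1..s+m] = P[1..m], that is, iff s is a valid shift

• How do we compute p from P?

• By Horner's rule: $p = P[m] + 10\,(P[m-1] + 10\,(P[m-2] + \cdots + 10\,(P[2] + 10\,P[1]) \cdots))$

Figure: the digits P[1], P[2], P[3], …, P[m-1], P[m] laid out in a row.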
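As a small illustration of evaluating a digit string most-significant digit first (Horner's rule), the sketch below computes p for the "31415" example. The helper name `horner_value` and the `radix` parameter are assumptions made for this example, and the input is assumed to contain only decimal digits.

```python
def horner_value(digits: str, radix: int = 10) -> int:
    """Evaluate a digit string as a number, most significant digit first."""
    value = 0
    for ch in digits:
        # One multiplication and one addition per digit: O(m) steps overall.
        value = radix * value + int(ch)
    return value


if __name__ == "__main__":
    print(horner_value("31415"))
```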

(7)

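Method 2: The Rabin-Karp Algorithm

• How do we compute $t_{s+1}$ from $t_s$?

• $t_{s+1} = 10\,(t_s - 10^{m-1}\,T[s+1]) + T[s+m+1]$

Figure: the digits T[1], T[2], T[3], …, T[m-1], T[m], T[m+1], T[m+2], … with the length-m window drawn at s = 0 and at s = 1. Remove the leftmost digit, shift the whole window one digit to the right, then add the new rightmost digit.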
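The window update (drop the leftmost digit, shift, append the new rightmost digit) can be sketched like this. It is an illustrative example only: `window_values` is a name invented here, indexing is 0-based, and the text is assumed to consist of decimal digits with 1 ≤ m ≤ n.

```python
def window_values(text: str, m: int) -> list[int]:
    """Return t_0, t_1, ..., t_{n-m} for a string of decimal digits."""
    n = len(text)
    high = 10 ** (m - 1)      # place value of the leftmost digit in the window
    t = int(text[:m])         # t_0, computed directly
    values = [t]
    for s in range(n - m):
        # Remove the leftmost digit, shift by one place, add the new digit.
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


if __name__ == "__main__":
    print(window_values("31415", 3))
```

Each update uses a constant number of arithmetic operations, regardless of m.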

(8)

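Method 2: The Rabin-Karp Algorithm

• How much time does this method take? (simplified analysis, treating each arithmetic operation on p and $t_s$ as constant time)

• First compute p and $t_0$ with Horner's rule: $O(m)$ time

• Then use $t_s$ to compute $t_{s+1}$ for each shift in turn (constant time each, n-m times in total)

• So in total: preprocessing time $O(m)$, matching time $O(n)$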

(9)

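Method 2: The Rabin-Karp Algorithm

• Two problems with the approach so far:

1. What if the character set is a general one (no longer {0, 1, …, 9})?

How do we fix this?

Assume the alphabet is $\Sigma$ with $|\Sigma| = d$. The earlier formula then becomes $t_{s+1} = d\,(t_s - d^{m-1}\,T[s+1]) + T[s+m+1]$

(a) Treat the whole string as a number in radix d. (b) Use the index of each character in $\Sigma$ as the value that character represents.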
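To make points (a) and (b) concrete, here is a short sketch that interprets a string as a radix-d number, using each character's index in the alphabet as its digit value. The function name `radix_value` and the way the alphabet is passed in (as a string of distinct characters) are assumptions for illustration.

```python
def radix_value(s: str, alphabet: str) -> int:
    """Interpret s as a number in radix d = len(alphabet)."""
    d = len(alphabet)
    # (b) the index of a character in the alphabet is its digit value
    digit = {ch: i for i, ch in enumerate(alphabet)}
    value = 0
    for ch in s:
        # (a) Horner's rule in radix d
        value = d * value + digit[ch]
    return value


if __name__ == "__main__":
    print(radix_value("cab", "abc"))
```

With the alphabet "0123456789" this reduces to the decimal case from the earlier slides.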

(10)

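Method 2: The Rabin-Karp Algorithm

2. When m is large, p and $t_s$ become hard for a computer to handle directly (they do not fit even in a long long)

After enough additions and multiplications, the value will eventually overflow

• How do we fix this? Use modular arithmetic.

Photos: Michael Rabin, Richard Karp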

(11)

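Modular Arithmetic

• Let a and b be integers

• $a \equiv b \pmod{n}$: a and b leave the same remainder when divided by n

• For example: $17 \equiv 5 \pmod{12}$

• Even better properties:

• If $a \equiv a' \pmod{n}$ and $b \equiv b' \pmod{n}$, then

• $a + b \equiv a' + b' \pmod{n}$ and $ab \equiv a'b' \pmod{n}$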

(12)

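Modular Arithmetic

• If $a \equiv a' \pmod{n}$ and $b \equiv b' \pmod{n}$, then $a + b \equiv a' + b' \pmod{n}$

• Proof: $a \equiv a' \pmod{n}$ means $a = a' + kn$ for some integer k

• $b \equiv b' \pmod{n}$ means $b = b' + ln$ for some integer l

• So $a + b = a' + b' + (k + l)\,n$

• The two sides leave the same remainder!

• This completes the proof.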

(13)

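Modular Arithmetic

• If $a \equiv a' \pmod{n}$ and $b \equiv b' \pmod{n}$, then $ab \equiv a'b' \pmod{n}$

• Proof: $a \equiv a' \pmod{n}$ means $a = a' + kn$ for some integer k

• $b \equiv b' \pmod{n}$ means $b = b' + ln$ for some integer l

• So $ab = (a' + kn)(b' + ln)$

• $= a'b' + (a'l + b'k + kln)\,n$

• The two sides leave the same remainder! This completes the proof.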
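The congruence properties can be spot-checked numerically. The sketch below is only a sanity check on a few values, not a proof; the function name `mod_properties_hold` is made up for this example. It relies on Python's `%` operator returning a non-negative remainder for a positive modulus.

```python
def mod_properties_hold(a: int, b: int, n: int) -> bool:
    """Check that reducing a and b mod n first does not change the result mod n."""
    a_r, b_r = a % n, b % n   # a_r is congruent to a, b_r is congruent to b
    return (
        (a + b) % n == (a_r + b_r) % n
        and (a - b) % n == (a_r - b_r) % n
        and (a * b) % n == (a_r * b_r) % n
    )


if __name__ == "__main__":
    print(mod_properties_hold(31415, 67399, 13))
```

This is what allows intermediate values to be kept small: every partial result can be reduced modulo n before the next operation.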

(14)

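The Rabin-Karp Algorithm, Revised

• Choose q so that dq fits in a single computer word (32-bit or 64-bit)

• Since addition, subtraction, and multiplication preserve congruence modulo q, we can replace each of these operations with its mod-q version:

• $p = \bigl(P[m] + d\,(P[m-1] + \cdots + d\,(P[2] + d\,P[1]) \cdots)\bigr) \bmod q$

• $t_{s+1} = \bigl(d\,(t_s - T[s+1]\,h) + T[s+m+1]\bigr) \bmod q$, where $h \equiv d^{m-1} \pmod{q}$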

(15)

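The Rabin-Karp Algorithm, Revised

• The new mod-q version of the algorithm introduces a problem:

• Although $t_s = p$ implies $t_s \equiv p \pmod{q}$,

• $t_s \equiv p \pmod{q}$ does not imply $t_s = p$

• For example, with q = 13: $31415 \equiv 67399 \equiv 7 \pmod{13}$,

• but $31415 \ne 67399$

• So the algorithm becomes:

1. If $t_s \not\equiv p \pmod{q}$, then the current s is an invalid shift

2. If $t_s \equiv p \pmod{q}$, then we must additionally check whether T[s+1..s+m] = P[1..m]

(comparing the strings in that range directly is expensive)

• When $t_s \equiv p \pmod{q}$ but T[s+1..s+m] ≠ P[1..m], we call it a spurious hit

• When q is large enough, we hope spurious hits will be rare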

(16)

Example: Rabin-Karp Algorithm

(17)

Pseudo Code: Rabin-Karp

Rabin-Karp-Matcher(T, P, d, q)
  n = T.length
  m = P.length
  h = d^(m-1) mod q
  p = 0
  t = 0
  for i = 1 to m                    // pre-processing
    p = (d·p + P[i]) mod q
    t = (d·t + T[i]) mod q
  for s = 0 to n-m                  // matching
    if p == t
      if P[1..m] == T[s+1..s+m]
        print "Pattern occurs with shift" s
    if s < n-m
      t = (d(t - T[s+1]·h) + T[s+m+1]) mod q

• T: string to be searched
• P: pattern to be matched
• d: size of the character set
• q: the modulus (the largest value such that dq still fits in a computer word)

Pre-processing: $O(m)$

Comparison on a hit: $O(m)$

The matching loop runs n-m+1 times
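A runnable sketch of the matcher might look like the following. Several choices here are assumptions for illustration rather than part of the lecture: characters are mapped to values with `ord`, the defaults `d = 256` and `q = 1_000_003` are arbitrary, indexing is 0-based, and an empty pattern is treated as having no matches.

```python
def rabin_karp_matcher(text: str, pattern: str,
                       d: int = 256, q: int = 1_000_003) -> list[int]:
    """Return all valid shifts of pattern in text using a rolling hash mod q."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                 # d^(m-1) mod q
    p = t = 0
    for i in range(m):                   # pre-processing: O(m)
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):           # matching: n-m+1 iterations
        # Equal hashes are only a candidate; verify to rule out spurious hits.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Python's % yields a non-negative result, so no extra +q is needed.
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


if __name__ == "__main__":
    print(rabin_karp_matcher("2359023141526739921", "31415", d=10, q=13))
```

Passing a small modulus such as q = 13 makes hash collisions likely, which exercises the explicit verification step; the reported shifts stay the same either way.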

(18)

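Worst-case Running Time

• Pre-processing (the loop for i = 1 to m): $O(m)$

• The matching loop runs n-m+1 times

• Whenever p == t, the check P[1..m] == T[s+1..s+m] costs $O(m)$

• Worst case: T = $a^n$ (n copies of a) and P = $a^m$ (m copies of a). Every shift is a hit, so the comparisons take O(m(n-m+1)) time.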

(19)

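Average Running Time

• Pre-processing (the loop for i = 1 to m): $O(m)$

• The matching loop runs n-m+1 times

• In typical inputs, valid shifts are rare (say there are c of them)

• The check p == t will not hit at every shift

• Assume all strings are equally likely; then the probability of a spurious hit at a given shift can be taken to be $1/q$

• Time spent on comparisons:

• The expected number of spurious hits is O((n-m+1)/q) = O(n/q)

• The total matching time is O((n-m+1) + m(c + n/q))

• If c = O(1) and q ≥ m, this is O(n+m) = O(n)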

(20)

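Method 3: The Knuth-Morris-Pratt Algorithm

They are the same! (A prefix of P equals a suffix of the part of T that has already been matched.) That means the next comparison can resume from position 4 of P, and T never has to back up. The time complexity goes down!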

(21)

Knuth-Morris-Pratt

Don Knuth James Morris Vaughan Pratt

(22)

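A More Formal Statement

• Suppose P[1..q] and T[s+1..s+q] have already been matched

• We want the smallest shift s' > s such that, for some k < q, P[1..k] = T[s'+1..s'+k], where s'+k = s+q

• Equivalently, given that $P_q$ is a suffix of $T_{s+q}$: find the longest proper prefix $P_k$ of $P_q$ that is also a suffix of $P_q$

• Then s' = s + (q-k)

• Notation: $P_q$ denotes P[1..q]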

(23)

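Best Case: No Repeated Pattern

Figure: T = … a b c a b f and P = a b c d. At shift s the first q = 3 characters of P match. No proper prefix of abc is also a suffix of abc, so k = 0 and the next shift is s' = s + q.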

(24)

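Preprocess P to Capture the "Repeated Pattern" Information

• Define the prefix function (failure function) $\pi$:

• Input: {1, 2, …, m}

• Output: {0, 1, …, m-1}

• $\pi[q] = \max\{k : k < q \text{ and } P_k \sqsupset P_q\}$

(this is the k value from the earlier example; it can be thought of as the length of the longest repeated pattern)

• That is, find the longest prefix $P_k$ of $P_q$ that is also a suffix of $P_q$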

(25)

Prefix function example

i      1 2 3 4 5 6 7
P[i]   A B A B A C A
π[i]   0 0 1 2 3 0 1

i      1 2 3 4 5 6 7 8 9 10 11 12
P[i]   A B A B A C A B A B  A  B

$\pi[q] = \max\{k : k < q \text{ and } P_k \text{ is a suffix of } P_q\}$
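The definition of π can be evaluated directly, without any cleverness, which is handy for checking small tables by hand. The sketch below does exactly that; it is deliberately inefficient, the name `prefix_function_by_definition` is invented for this example, and the returned list is 0-based (entry q-1 holds π[q]).

```python
def prefix_function_by_definition(pattern: str) -> list[int]:
    """Compute pi[q] = max{k : k < q and P_k is a suffix of P_q} by brute force."""
    m = len(pattern)
    pi = []
    for q in range(1, m + 1):
        p_q = pattern[:q]
        # k = 0 always qualifies, because the empty string is a suffix of everything.
        pi.append(max(k for k in range(q) if p_q.endswith(pattern[:k])))
    return pi


if __name__ == "__main__":
    print(prefix_function_by_definition("ABABACA"))
```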

(26)

Pseudo-code: Prefix function

Compute-Prefix-Function(P)
  m = P.length
  let π[1..m] be a new array
  π[1] = 0
  k = 0
  for q = 2 to m
    while k > 0 and P[k+1] != P[q]
      k = π[k]
    if P[k+1] == P[q]
      k = k+1
    π[q] = k
  return π
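A direct Python transcription of the prefix-function computation could look like this. It is a sketch under the assumption of 0-based indexing: `pi[q]` here corresponds to π[q+1] in the 1-based pseudocode, so the fallback step reads `pi[k - 1]` instead of π[k].

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """Return pi, where pi[q] is the length of the longest proper prefix of
    pattern[:q+1] that is also a suffix of it."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                   # length of the current matched prefix
    for q in range(1, m):
        # Fall back to shorter prefixes until the next character matches.
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


if __name__ == "__main__":
    print(compute_prefix_function("ABABACA"))
```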

(27)

Example: Matching

• Ex. 1: T = BACBABABAABCBAB

• Ex. 2: T = BABABABACA

• The actual matching pseudocode is very similar to the computation of the prefix function

• See Cormen p. 1005, KMP-Matcher

i      1 2 3 4 5 6 7
P[i]   A B A B A C A
π[i]   0 0 1 2 3 0 1
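The two example texts can be run through a small KMP sketch. This is an illustrative 0-based version, not a copy of the textbook's KMP-Matcher: the function names are chosen here, the prefix function is recomputed inside the file so the block is self-contained, and an empty pattern is treated as having no matches.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """0-based prefix function: pi[q] covers pattern[:q+1]."""
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text: str, pattern: str) -> list[int]:
    """Return all valid shifts of pattern in text; the text index never backs up."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0                                   # number of pattern characters matched
    for i in range(n):
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]                   # reuse what is already known to match
        if pattern[q] == text[i]:
            q += 1
        if q == m:                          # full match ending at text[i]
            shifts.append(i - m + 1)
            q = pi[q - 1]                   # keep going to find later matches
    return shifts


if __name__ == "__main__":
    print(kmp_matcher("BACBABABAABCBAB", "ABABACA"))
    print(kmp_matcher("BABABABACA", "ABABACA"))
```

The matching loop has the same shape as the prefix-function loop, with the text playing the role that the pattern played against itself.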

(28)

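How much time does computing the prefix function take?

Compute-Prefix-Function(P)
  m = P.length
  let π[1..m] be a new array
  π[1] = 0
  k = 0
  for q = 2 to m
    while k > 0 and P[k+1] != P[q]
      k = π[k]
    if P[k+1] == P[q]
      k = k+1
    π[q] = k
  return π

• The for loop runs $O(m)$ times

• The tricky part is the line k = π[k] inside the while loop: how many times does k jump back in total?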

(29)

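How much time does computing the prefix function take?

• In Compute-Prefix-Function, k increases only at the line k = k+1, so it increases at most m-1 times in total (once per iteration of the for loop)

• On entering the for loop k < q; q increases on every iteration while k only sometimes increases, so k < q always holds

• Since π[k] < k, every execution of the while-loop body k = π[k] decreases k

• k is never negative, because π[k] ≥ 0

• Finally: k can only decrease as much as it has increased, so the while loop executes at most O(m) times in total

Total: O(m)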
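The amortized argument can be observed by instrumenting the computation and counting how often the while-loop body runs. The sketch below is a hypothetical instrumented variant written for this illustration (0-based indexing, function name `count_while_iterations` invented here); the count it returns should never exceed m-1.

```python
def count_while_iterations(pattern: str) -> int:
    """Run the prefix-function computation and count executions of k = pi[k-1]."""
    m = len(pattern)
    pi = [0] * m
    k = 0
    jumps = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]       # each jump strictly decreases k
            jumps += 1
        if pattern[k] == pattern[q]:
            k += 1              # k grows by at most 1 per for-loop iteration
        pi[q] = k
    return jumps


if __name__ == "__main__":
    for p in ["ABABACA", "aaaab", "abcabcabd"]:
        print(p, count_while_iterations(p), "<=", len(p) - 1)
```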

(30)

KMP Running Time

• A similar argument shows that the matching phase runs in O(n) time

• So overall:

• Preprocessing time: O(m)

• Matching time: O(n)

(31)

Reading Assignment

• Textbook (Cormen)

ch. 32, 32.1, 32.2, 32.4 (the correctness proof is somewhat involved)
