Chapter 3 Coding, I/O for POJ, and Text Processing

3.3 Search Problem with POJ Text

3.3.2 Two-Stage Search Method: String Matching Then Filtering

If a user wants to use SQL to search data with numbered POJ, like

“select * from dict_table where poj_field like '%input_str%'”,

some problems may be encountered:

(a) When a user searches for “a,” they do not want all of the words that contain “a” (such as “a2,” “ah,” and “khoan2”). Similarly, when a user searches for “o,” they do not want “ou,” “oh,” etc. to be displayed in the result list.

(b) When a user searches for “tan” “single, 單,” they do not want the word “taN” “carry with a shoulder pole, 擔.” This is a case-sensitive situation.

The problem presented in case (a) can be resolved via exact matching at the syllable level, rather than the word level. However, for case (b), string matching in most SQL implementations is case-insensitive by default, since SQL was designed for English, not POJ text.
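Cases (a) and (b) can be reproduced with a small experiment. The following is a minimal sketch using Python's built-in sqlite3 module, whose LIKE operator is case-insensitive for ASCII letters by default. The table contents and the helper name like_search are assumptions made for illustration; only the table and field names come from the query quoted above.

```python
import sqlite3

# Hypothetical in-memory dictionary table holding a few numbered POJ entries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dict_table (poj_field TEXT)")
words = ["a", "a2", "ah", "khoan2", "tan", "taN"]
conn.executemany("INSERT INTO dict_table VALUES (?)", [(w,) for w in words])


def like_search(input_str):
    """Substring match with LIKE, in the form of the query quoted above."""
    rows = conn.execute(
        "SELECT poj_field FROM dict_table WHERE poj_field LIKE ? ORDER BY rowid",
        (f"%{input_str}%",),
    )
    return [row[0] for row in rows]


case_a = like_search("a")    # case (a): every stored word contains "a"
case_b = like_search("tan")  # case (b): LIKE ignores ASCII case, so "taN" comes back too
print(case_a)
print(case_b)
```

Both result lists contain unwanted entries, which is why the SQL output is treated only as a preliminary result.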

The other problems we need to deal with are:

(c) The capitalization problem, where the first letter of a syllable may be written in uppercase. This is the same as in English; please compare with (b).

(d) The hyphen problem, where the definition of a word may leave some ambiguities. For example, some people write “chiah8-png7” “eat rice, 食飯,” while others write “chiah8 png7.” In addition, some people use the symbol “--” to mark the neutral sandhi, while others use “-”.

Therefore, we need to use a two-stage strategy when using numbered POJ to search POJ text. We utilized SQL commands for string matching and then wrote some functions to filter these preliminary results to get the final results.

The following is a POJ text match algorithm. We decompose the word into individual syllables to solve problem (d).

1. execute SQL commands and get preliminary results
2. put the user input string to a string array A, one syllable per cell
3. if the first character of a cell is uppercase, replace it with lowercase
4. m ← length of the string array
5. for each result record
6.     put the POJ field data to a string array B, one syllable per cell
7.     if the first character of a cell is uppercase, replace it with lowercase
8.     n ← length of the string array
9.     for i = 0 to (n-m)
10.        if (case sensitive match A[0], B[i] for length m) then true
11.        else false
12.        end if
13.    next i
14. next
15. display the records which match is true

Fig 3 - 1 POJ Text Match Algorithm (Target: Word)

In line 1, we execute SQL commands to get the preliminary results. In line 2, we decompose the input string (a word) into syllable(s), and put them in array A in order. In line 3, we replace an uppercase first character with its lowercase form in every cell; please refer to case (c). In line 4, we count the length of the input string; in other words, we count the number of syllable(s) of the input string. Let the number be m. Lines 5 to 14 show a for loop that is used to examine the results one by one.

In line 6, we decompose the result (a word) into syllable(s), and put them into array B in order. In lines 7 and 8, we replace an uppercase first character with its lowercase form in every cell and count the number of syllable(s) of the result. Let the length be n. The for loop from line 9 to line 13 performs a case-sensitive match. Note that the input string may have fewer syllables than the result, so the input is compared with each run of m consecutive syllables in B. If every syllable in a run is the same, we mark the record as true. In line 15, we finally print the records that are marked true.
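The filtering stage of Fig 3 - 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the original implementation: the function names are hypothetical, syllables are assumed to be separated by “-”, “--”, or spaces, and a record is accepted as soon as one window of m syllables matches (one reading of lines 9 to 13). The preliminary SQL results are represented by a plain list.

```python
import re


def split_syllables(word):
    """One syllable per cell; "-", "--" and whitespace all act as separators."""
    cells = [s for s in re.split(r"[-\s]+", word.strip()) if s]
    # Lines 3 and 7: lowercase only the first character of each cell,
    # so "Tan" becomes "tan" while "taN" keeps its final uppercase N.
    return [s[0].lower() + s[1:] for s in cells]


def word_matches(user_input, poj_field):
    """Lines 2-13: is the input syllable sequence found inside the field?"""
    a = split_syllables(user_input)
    b = split_syllables(poj_field)
    m, n = len(a), len(b)
    if m == 0:
        return False
    # Case-sensitive comparison of m consecutive syllables starting at B[i].
    return any(b[i:i + m] == a for i in range(n - m + 1))


def filter_results(user_input, preliminary):
    """Line 15: keep only the records marked true."""
    return [rec for rec in preliminary if word_matches(user_input, rec)]


print(filter_results("a", ["a", "a2", "ah", "khoan2"]))
print(filter_results("tan", ["tan", "taN", "Tan"]))
print(filter_results("chiah8-png7", ["chiah8 png7", "chiah8--png7"]))
```

Because matching is done per syllable, “a” no longer matches “a2” or “khoan2”, and the hyphen variants in problem (d) are treated as the same word.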

3.3.3 Query Expansions: Toneless, Glottal Stop, Checked Syllable, and Vowel

Some people may not be familiar with the phonemes of the Taiwanese language, either because of the influence of Mandarin or from a lack of exposure. On the other hand, others want to investigate phonetic phenomena in detail and therefore need more powerful search tools. Query expansion may satisfy the needs of both groups.

For people who are not familiar with the phonemes of the Taiwanese language, some specific phonemes will be confused with other phonemes, including:

(a) Tone: there are 4 tones in Mandarin but 7 tones in Taiwanese. Some people can hardly distinguish the differences.

(b) Glottal stop: the glottal stop (-h) is a weakened form of the checked syllable, and has disappeared from some people’s pronunciation. Some people cannot distinguish between “a” and “ah,” which is a good example of the contrast between a non-glottal stop and a glottal stop.

(c) Checked syllable: there are no checked syllables in Mandarin, but there are four kinds of checked syllables (-p | -t | -k | -h) in Taiwanese. It is difficult for some people to distinguish between them.

A toneless search includes a glottal stop search. For a toneless search, we need to search for a syllable that has the same beginning part and ends with “∅ | -2 | -3 | -h | -5 | -7 | -h8” (∅ means empty).

The following is the POJ toneless search algorithm:

1. if the last part of user input is [1|2|3|h4|h|4|5|7|8|h8] then eliminate it
2. execute SQL command
3. for each result record
4.     if the modified user input exactly match the left part of POJ field then
5.         x ← the right part of POJ field (POJ field - modified user input)
6.         if x = [∅|2|3|h|4|5|7|8|h8] then true
7.     end if
8. next
9. display the records which match is true

Fig 3 - 2 POJ Toneless Search Algorithm (Target: Syllable)

In line 1, we eliminate the last part of the input if that is [1|2|3|h4|h|4|5|7|8|h8]. In line 2, we execute the SQL command to match the modified input. The for loop from line 3 to line 8 examines each result. In line 4, we eliminate the result if the left part does not exactly match. We then examine the right part of the result in lines 5 and 6. If the right part is [∅|2|3|h|4|5|7|8|h8] then we mark it as true. Finally, we print the results marked as true in line 9.
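The filtering part of Fig 3 - 2 can be sketched in Python as follows. The two ending sets are copied from lines 1 and 6 of the figure. The function names are hypothetical, the preliminary SQL results are assumed to be a list of syllables, and trying two-character endings before one-character endings is an implementation choice of this sketch.

```python
# Endings removed from the user input (line 1), two-character endings first.
INPUT_ENDINGS = ("h4", "h8", "1", "2", "3", "h", "4", "5", "7", "8")
# Right parts accepted in a result (line 6); "" stands for the empty ending.
ALLOWED_RIGHT = {"", "2", "3", "h", "4", "5", "7", "8", "h8"}


def strip_ending(syllable):
    """Line 1: drop a trailing tone or glottal-stop marker, if present."""
    for ending in INPUT_ENDINGS:
        if syllable.endswith(ending) and len(syllable) > len(ending):
            return syllable[:-len(ending)]
    return syllable


def toneless_filter(user_input, preliminary):
    """Lines 3-9: keep results whose remainder is an allowed ending."""
    base = strip_ending(user_input)
    hits = []
    for field in preliminary:
        if field.startswith(base):      # line 4: left part matches exactly
            x = field[len(base):]       # line 5: right part of the field
            if x in ALLOWED_RIGHT:      # line 6
                hits.append(field)
    return hits


candidates = ["a", "a2", "ah", "ah8", "an", "ang5", "pa2"]
print(toneless_filter("a2", candidates))
```

In this example the input “a2” is reduced to “a”, so “a,” “a2,” “ah,” and “ah8” are kept, while “an,” “ang5,” and “pa2” are rejected.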

For a checked syllable search, we need to search for a syllable that has the same beginning part and ends with “ -p | -t | -k | -h | -p8 | -t8 | -k8 | -h8.” This is also a toneless search, but non-checked syllables are excluded.

The following is the POJ checked syllable search algorithm:

1. if the last part of user input is [p|t|k|h|p8|t8|k8|h8] then eliminate it
2. execute SQL command
3. for each result record
4.     if the modified user input exactly match the left part of POJ field then
5.         x ← the right part of POJ field (POJ field - modified user input)
6.         if x = [p|t|k|h|p8|t8|k8|h8] then true
7.     end if
8. next
9. display the records which match is true

Fig 3 - 3 POJ Checked Syllable Search Algorithm (Target: Syllable)

This algorithm is the same as the POJ toneless search algorithm. We just replace the [1|2|3|h4|h|4|5|7|8|h8] with [p|t|k|h|p8|t8|k8|h8].
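A sketch of the checked syllable filter of Fig 3 - 3 follows, under the same assumptions as before: the function name is hypothetical, the preliminary SQL results are a plain list of syllables, and the ending set is taken from the figure. The only substantive difference from a toneless filter is that the empty ending and the plain tone digits are not accepted.

```python
# Checked-syllable endings from Fig 3 - 3, two-character endings first.
CHECKED_ENDINGS = ("p8", "t8", "k8", "h8", "p", "t", "k", "h")


def checked_filter(user_input, preliminary):
    """Keep results that share the input's base and end in a checked ending."""
    base = user_input
    for ending in CHECKED_ENDINGS:          # line 1: strip the checked ending
        if base.endswith(ending) and len(base) > len(ending):
            base = base[:-len(ending)]
            break
    return [
        field
        for field in preliminary            # lines 3-8
        if field.startswith(base) and field[len(base):] in CHECKED_ENDINGS
    ]


candidates = ["a", "a2", "ap", "at8", "ak", "ah", "an", "pa"]
print(checked_filter("ap", candidates))
```

Here “a” and “a2” are excluded because they are not checked syllables, even though a toneless search for the same base would accept them.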

A vowel search is another idea. Someone may want to write a poem in which the vowels of the last syllable of each sentence are the same (rhyme); others may want to observe a phenomenon or rule among the Han characters that share the same vowel.

A vowel search is more complicated. If a user wants to search for a nasal vowel, e.g. “aN”, the search system needs to search not only for “-aN-”, but also for the “ma-”, “na-”, and “nga-” patterns. We also need to take notice of “m”, “n”, and “ng”, as these can act as both consonants and vowels. We need to search in reverse order if we want to find the vowels “m”, “n”, or “ng”.

A vowel search is also a toneless search. The following is the POJ vowel search algorithm:

1. decompose user input to tone, vowel and consonant parts from right to left
2. execute SQL command, only match the vowel part
3. for each result record
4.     l ← the leftside of vowel in POJ field
5.     r ← the rightside of vowel in POJ field
6.     if r=[2|3|5|7|8] and l=[∅|p|ph|m|b|t|th|n|l|k|kh|g|ng|h|ch|chh|s|j] then true
7. next
8. display the records which match is true

Fig 3 - 4 POJ Vowel Search Algorithm (Data: Syllable)

In line 1, we decompose the input string into tone, vowel, and consonant parts, of which we need only the vowel part. The method will be described in Section 3.5.1. We then execute an SQL command to match the vowel part only in line 2. The for loop from line 3 to line 7 is used to filter the results one by one.

From each result, we extract the part to the left of the vowel in line 4 and the part to the right of the vowel in line 5. If the left part is [∅|p|ph|m|b|t|th|n|l|k|kh|g|ng|h|ch|chh|s|j] and the right part is [2|3|5|7|8], then this result is marked as true. Finally, we print all of the results that are marked as true.

For example, when a user searches for the “a” vowel, the SQL search will find “pa2,” “kia7,” and “khang,” among others. The “pa2” result is correct (“p” is a consonant and “2” is a tone), but “kia7” and “khang” are incorrect (the left part “ki” in “kia7” is not a consonant, and the right part “ng” in “khang” is not a tone).
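The example above can be checked with a short Python sketch of lines 3 to 8 of Fig 3 - 4. The consonant and tone sets are copied from line 6 of the figure. The function name is hypothetical, the vowel is assumed to have been extracted from the user input already, and the field is split at the first occurrence of the vowel. The reverse-order handling needed for the vowels “m”, “n”, and “ng” is not covered.

```python
# Left parts accepted in line 6; "" stands for the empty initial.
INITIALS = {"", "p", "ph", "m", "b", "t", "th", "n", "l", "k", "kh",
            "g", "ng", "h", "ch", "chh", "s", "j"}
# Right parts accepted in line 6, exactly as listed in the figure.
TONES = {"2", "3", "5", "7", "8"}


def vowel_filter(vowel, preliminary):
    """Keep results of the form consonant + vowel + tone."""
    hits = []
    for field in preliminary:
        # Lines 4-5: split at the first occurrence of the vowel (assumption).
        left, found, right = field.partition(vowel)
        if found and left in INITIALS and right in TONES:   # line 6
            hits.append(field)
    return hits


print(vowel_filter("a", ["pa2", "kia7", "khang"]))
```

Only “pa2” is kept: “kia7” is rejected because “ki” is not in the consonant set, and “khang” is rejected because “ng” is not in the tone set.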
