
4.1 Retrieving Related Articles and Related Sentences


National Chengchi University


Figure 4.2 Workflow for retrieving related articles and related sentences

4.1.1 Retrieving Related Articles

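The main goal of this workflow is to select, from Wikipedia and through several filtering mechanisms, the related articles and related sentences of each English statement sentence. The English statement sentences are the English corpus introduced in Section 3.1; Figure 4.3 shows an example. Because we query Wikipedia for related articles, we must first pick effective key terms from the statement sentence to use as Wikipedia search keywords. We divide this step into three parts: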


<pair label="Y" id="210">

<t2>United Nations member countries must accept and execute the decisions of the Security Council in accordance with the Charter of the United Nations.</t2>

</pair>

Figure 4.3 Example of a statement sentence

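The first step obtains synonyms of the noun compounds in the statement sentence to use as Wikipedia search keywords. Synonyms have become an indispensable consideration in some natural language processing applications. For example, "I love United States" and "I love America" describe the same thing, because United States and America are synonyms that both denote the same country; taking synonyms into account therefore lets us also retrieve much related information from Wikipedia. We first use the Stanford parser [20] to label the typed dependencies of the statement sentence, as in Figure 4.4. From the dependency relations between words, we extract the noun compounds marked "nn", use WordNet to find the synonym set of each extracted noun compound, as in Figure 4.5, and finally use the synonym sets as our Wikipedia search keywords.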

Example of statement sentence:

United Nations member countries must accept and execute the decisions of the Security Council in accordance with the Charter of the United Nations.

Typed dependencies:

nn(countries-4, United-1), nn(countries-4, Nations-2),
nn(countries-4, member-3), nsubj(accept-6, countries-4),
aux(accept-6, must-5), root(ROOT-0, accept-6),
cc(accept-6, and-7), conj(accept-6, execute-8),
det(decisions-10, the-9), dobj(accept-6, decisions-10),
prep(decisions-10, of-11), det(Council-14, the-12),
nn(Council-14, Security-13), pobj(of-11, Council-14),
prep(Council-14, in-15), pobj(in-15, accordance-16),
prep(accept-6, with-17), det(Charter-19, the-18),
pobj(with-17, Charter-19), prep(Charter-19, of-20),
det(Nations-23, the-21), nn(Nations-23, United-22),
pobj(of-20, Nations-23)

Figure 4.4 Typed dependency annotation

Example of noun phrase:

Member Country, United Nation, Security Council

Member Country Synonyms: none

United Nation Synonyms: none

Security Council Synonyms: SC

Figure 4.5 Example of synonyms
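The extraction of "nn" noun compounds from the parser output can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes the dependencies are available as a text string in the format of Figure 4.4, and the names `DEP_PATTERN` and `noun_compounds` are hypothetical. It simply groups all "nn" modifiers under their head word; how the thesis normalizes the compounds into the forms shown in Figure 4.5 (for example "Member Country") is not specified, so that step and the WordNet lookup are left out.

```python
import re
from collections import defaultdict
from typing import List

# Matches one typed dependency such as "nn(countries-4, United-1)".
DEP_PATTERN = re.compile(r"(\w+)\((\S+)-(\d+),\s*(\S+)-(\d+)\)")


def noun_compounds(dependencies: str) -> List[str]:
    """Group 'nn' modifiers under their head word, in sentence order."""
    modifiers = defaultdict(list)  # (head index, head word) -> [(index, word)]
    for rel, head, head_idx, dep, dep_idx in DEP_PATTERN.findall(dependencies):
        if rel == "nn":
            modifiers[(int(head_idx), head)].append((int(dep_idx), dep))

    compounds = []
    for (_, head), mods in sorted(modifiers.items()):
        # Modifiers precede the head; sort them by token position.
        words = [word for _, word in sorted(mods)] + [head]
        compounds.append(" ".join(words))
    return compounds


# A subset of the dependencies from Figure 4.4.
deps = (
    "nn(countries-4, United-1), nn(countries-4, Nations-2), "
    "nn(countries-4, member-3), nsubj(accept-6, countries-4), "
    "nn(Council-14, Security-13), pobj(of-11, Council-14), "
    "nn(Nations-23, United-22), pobj(of-20, Nations-23)"
)
compounds = noun_compounds(deps)
print(compounds)
```

Each resulting compound would then be looked up in WordNet to collect its synonym set.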


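The second step obtains the nouns in the statement sentence to use as Wikipedia search keywords. The method is similar to the first step: we run part-of-speech tagging with StanfordCoreNLP [18], as shown in Figure 4.6, extract the words tagged as nouns, use WordNet to find the synonym set of each extracted noun, and finally add these synonym sets to our Wikipedia search keywords.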

Part of Speech Tagging

United Nations member countries must accept and execute the decisions of the Security Council in accordance with the Charter of the United Nations.

United/NNP, Nations/NNP, member/NN, countries/NNS

must/MD, accept/VB, and/CC, execute/VB

the/DT, decisions/NNS, of/IN, the/DT

Security/NNP, Council/NNP, in/IN, accordance/NN

with/IN, the/DT, Charter/NNP, of/IN

the/DT, United/NNP, Nations/NNPS, ./.

Figure 4.6 Part-of-speech tagging
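Selecting the nouns from tagged output like Figure 4.6 can be sketched as below. The sketch assumes the tagger output is already available as a list of `word/TAG` strings; the function name `extract_nouns` and the removal of duplicate nouns are assumptions made for illustration. The only fact relied on is that the Penn Treebank noun tags (NN, NNS, NNP, NNPS) all begin with "NN".

```python
from typing import List


def extract_nouns(tagged_tokens: List[str]) -> List[str]:
    """Return the distinct words whose tag is a noun tag, in order of first use."""
    nouns = []
    for token in tagged_tokens:
        # Split on the last slash so a word containing "/" stays intact.
        word, tag = token.rsplit("/", 1)
        if tag.startswith("NN") and word not in nouns:
            nouns.append(word)
    return nouns


tagged = [
    "United/NNP", "Nations/NNP", "member/NN", "countries/NNS",
    "must/MD", "accept/VB", "and/CC", "execute/VB",
    "the/DT", "decisions/NNS", "of/IN", "the/DT",
    "Security/NNP", "Council/NNP", "in/IN", "accordance/NN",
    "with/IN", "the/DT", "Charter/NNP", "of/IN",
    "the/DT", "United/NNP", "Nations/NNPS", "./.",
]
nouns = extract_nouns(tagged)
print(nouns)
```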

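The third step extracts the bigrams, trigrams, and 4-grams of the sentence and also uses them as Wikipedia search keywords. After reviewing the collected keywords, we found that the Stanford tools did not treat some personal names or historical events as noun phrases. To avoid missing important articles, we therefore include this step in the keyword search, as shown in Figure 4.7.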

Unigram to 4-gram

United Nations member countries must accept and execute the decisions of the Security Council in accordance with the Charter of the United Nations.

Unigram:

United, Nations, member, countries, must, accept, and, execute, the, decisions, of, the, Security, Council, in, accordance, with, the, Charter, of, the, United, Nations

Bigram:

United Nations, Nations member, member countries, countries must, must accept, accept and, and execute, execute the, the decisions, decisions of, of the, the Security, Security Council, Council in, in accordance, accordance with, with the, the Charter, Charter of, of the, the United, United Nations

Trigram:

United Nations member, Nations member countries, member countries must, countries must accept, must accept and, accept and execute, and execute the, execute the decisions, the decisions of, decisions of the, of the Security, the Security Council, Security Council in, Council in accordance, in accordance with, accordance with the, with the Charter, the Charter of, Charter of the, of the United, the United Nations

4-gram:

United Nations member countries, Nations member countries must, member countries must accept, countries must accept and, must accept and execute, accept and execute the, and execute the decisions, execute the decisions of, the decisions of the, decisions of the Security, of the Security Council, the Security Council in, Security Council in accordance, Council in accordance with, in accordance with the, accordance with the Charter, with the Charter of, the Charter of the, Charter of the United, of the United Nations

Figure 4.7 Example of consecutive word sequences (n-grams)
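The lists in Figure 4.7 can be reproduced with a short sliding-window routine. This is a minimal sketch assuming simple whitespace tokenization with the final period stripped; the function name `ngrams` is illustrative.

```python
from typing import List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = (
    "United Nations member countries must accept and execute the decisions "
    "of the Security Council in accordance with the Charter of the United Nations."
)
tokens = sentence.rstrip(".").split()

# Unigrams through 4-grams, keyed by n.
keywords = {n: ngrams(tokens, n) for n in range(1, 5)}
for n, grams in keywords.items():
    print(n, len(grams), grams[:3])
```

A sentence of 23 tokens yields 23 unigrams, 22 bigrams, 21 trigrams, and 20 4-grams, which matches the counts in the figure.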


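We merge the terms extracted in the three steps and treat them together as our keywords for searching Wikipedia. English Wikipedia, with more than 4.5 million articles, is the largest edition by article count. As shown in Figure 4.8, we match each keyword against the article titles of English Wikipedia; if a keyword matches a title, we retrieve the article that the title belongs to and treat it as a related article of the statement sentence.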

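Treat all terms obtained above as keywords (key words) for searching Wikipedia.

Retrieve the Wikipedia article at the URL http://en.wikipedia.org/wiki/key words

Figure 4.8 Retrieving Wikipedia articles by keyword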
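Turning a keyword into the article URL of Figure 4.8 might look like the sketch below. It only builds the URL and does not download anything. The base URL is the one given in the figure; the function name `article_url` is hypothetical, and joining words with underscores follows the usual form of Wikipedia article URLs rather than anything stated in the thesis.

```python
from urllib.parse import quote

BASE_URL = "http://en.wikipedia.org/wiki/"


def article_url(keyword: str) -> str:
    """Build the candidate article URL for one keyword."""
    # Wikipedia article URLs use underscores in place of spaces.
    title = "_".join(keyword.split())
    # Percent-encode any remaining characters that are unsafe in a URL path.
    return BASE_URL + quote(title)


for keyword in ["Security Council", "United Nations"]:
    print(article_url(keyword))
```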

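A retrieved article falls into one of three cases:

1. A title matches and the page has normal article content.

2. A title matches, but the page only points to related articles and has no substantive content.

3. No title matches, and the article is empty.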

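Of these three cases we keep only the first. We therefore use the Total Commander [24] file manager to filter out the files with no substantive content and the empty articles; the articles that remain are the related articles. According to our statistics, an average of 5.66 articles per statement sentence remain after filtering.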
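The three-way filtering could also be expressed programmatically, as in the sketch below. This is not the procedure used in the thesis, which relied on a file manager. The sketch assumes each retrieved article is held as a plain-text string, and it detects pages that only point to other articles with a simple heuristic (the phrase "may refer to" near the start of the text), which is an assumption made for illustration. All function names are hypothetical.

```python
from typing import Dict


def classify_article(text: str) -> str:
    """Label an article as 'normal', 'navigation', or 'empty'."""
    stripped = text.strip()
    if not stripped:
        return "empty"        # case 3: no matching title
    if "may refer to" in stripped[:300]:
        return "navigation"   # case 2: heuristic, assumed marker phrase
    return "normal"           # case 1: regular article content


def keep_related(articles: Dict[str, str]) -> Dict[str, str]:
    """Keep only the articles with normal content."""
    return {
        title: text
        for title, text in articles.items()
        if classify_article(text) == "normal"
    }


retrieved = {
    "Security Council": "The Security Council is one of the principal organs ...",
    "Charter": "Charter may refer to: a document granting rights ...",
    "accept and": "",
}
related = keep_related(retrieved)
print(list(related))
```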

4.1.2 Retrieving Related Sentences

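After retrieving the related articles of each statement sentence as described in Section 4.1.1, we apply some basic preprocessing to the articles. We remove unnecessary XML tags and references from each article and use StanfordCoreNLP to segment it into sentences; that is, each paragraph is split at its punctuation marks into individual sentences, as shown in Figure 4.9, which we call article sentences. We then run StanfordCoreNLP part-of-speech tagging on every article sentence to obtain the part of speech of each word. To select related sentences, we compare every article sentence with its corresponding statement sentence, in which the matching words that are tagged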

In relation, the Sun is personified as a goddess in Germanic paganism, Sól/Sunna.

Scholars theorize that the Sun, as a Germanic goddess, may represent an extension of an earlier Proto-Indo-European sun deity due to Indo-European linguistic connections between Old Norse Sól, Sanskrit Surya, Gaulish Sulis, Lithuanian Saulė, and Slavic Solntse.

Sentence1:

In relation, the Sun is personified as a goddess in Germanic paganism, Sól/Sunna.

Sentence2:

Scholars theorize that the Sun, as a Germanic goddess, may represent an extension of an earlier Proto-Indo-European sun deity due to Indo-European linguistic connections between Old Norse Sól, Sanskrit Surya, Gaulish Sulis, Lithuanian Saulė, and Slavic Solntse.

Figure 4.9 Example of sentence segmentation of an article
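The preprocessing that produces article sentences can be approximated as follows. The thesis uses StanfordCoreNLP for sentence segmentation; the sketch below substitutes a simple regular-expression splitter to keep the example self-contained, so it will mis-split text containing abbreviations. It also assumes that references appear as bracketed numbers such as [3]. The function names are illustrative.

```python
import re
from typing import List


def clean_article(text: str) -> str:
    """Remove markup tags and bracketed numeric references, then tidy spaces."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop XML/HTML-style tags
    text = re.sub(r"\[\d+\]", "", text)    # drop reference markers like [12]
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


raw = (
    "<p>In relation, the Sun is personified as a goddess in Germanic "
    "paganism, Sól/Sunna.[3] Scholars theorize that the Sun may represent "
    "an earlier sun deity.</p>"
)
article_sentences = split_sentences(clean_article(raw))
for number, sentence in enumerate(article_sentences, start=1):
    print(f"Sentence{number}: {sentence}")
```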
