Chapter 3 Research Methods
Section 2 Preprocessing Procedure
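The preprocessing procedure consists of sentence segmentation, word segmentation and part-of-speech tagging, named entity recognition, sentiment analysis, and feature extraction; Figure 3.2.1 shows its flowchart. The raw corpus text first undergoes sentence segmentation, word segmentation, and part-of-speech tagging, followed by named entity recognition and sentiment analysis. A named-entity dictionary is introduced to improve recognition, and finally the feature values of the article author's opinions and of the opinion holders are extracted.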
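Figure 3.2.1 Flowchart of the preprocessing procedure (labels in the figure: raw corpus; word segmentation; part-of-speech tagging; named-entity dictionary; named entity recognition; sentiment analysis; feature extraction; training and test datasets; Stanford CoreNLP Toolkit)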
(1) Segmentation
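This study uses the Word Segmenter and English Tokenizer of the Stanford CoreNLP Toolkit [19], a software suite developed at Stanford University, to split articles into sentences and words. The whole article is first split into sentences at periods, and each sentence is then split into words. Abbreviated nouns, nouns carrying possessive markers, the different kinds of brackets, double quotation marks, and other punctuation can cause sentence- or word-segmentation errors; when they do, the results are corrected with the method defined in this study.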
Example 3.2.1: State-owned CPC Corp. Taiwan is likely to raise gasoline and diesel prices by NT$0.6 (US$0.02)-NT$0.7 per liter next week after a week of unchanged prices, the sources said.
Figure 3.2.2 Screenshot of the segmentation run for Example 3.2.1
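In Example 3.2.1, "Corp." is the English abbreviation of "corporation". The Stanford segmenter treats its period as an ordinary sentence-ending period, so the complete noun entity is lost and what should be a single sentence is wrongly split into two. To segment such sentences correctly, this study applies the following rules: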
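Rule (1): Noun entities that are already tagged together with their period are left unmodified.
Rule (2): If the token immediately preceding a period is a noun entity containing an uppercase letter, the period is treated as part of the same entity name.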
Example 3.2.2: Mr. Y.H. Chang is a Dr. in the Corp.
Figure 3.2.3 Illustration of the sentence-segmentation correction for Example 3.2.2
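When the rules are applied to Example 3.2.2, the entity "Corp" is joined with the first period to its right. The period of the second sentence is therefore treated as the end of the first sentence, which resolves the sentence-segmentation error.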
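The effect of Rule (2) can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the naive splitter stands in for the Stanford segmenter, and the pattern `ABBREV_LIKE` (a short token starting with an uppercase letter) is an assumption introduced here to approximate "a noun entity containing an uppercase letter".

```python
import re

# Assumption for illustration: a period belongs to the preceding token when
# that token starts with an uppercase letter and is at most five characters
# long (e.g. "Corp", "Mr", "Dr", "Y.H").
ABBREV_LIKE = re.compile(r"^[A-Z][A-Za-z.]{0,4}$")


def naive_split(text):
    """Split after every period that is followed by whitespace."""
    return [part for part in re.split(r"(?<=\.)\s+", text.strip()) if part]


def merge_fragments(fragments):
    """Re-join fragments whose predecessor ends in an abbreviation-like token."""
    merged = []
    for fragment in fragments:
        if merged:
            last_token = merged[-1].split()[-1].rstrip(".")
            if ABBREV_LIKE.match(last_token):
                # The period was part of the entity name, not a sentence end.
                merged[-1] = merged[-1] + " " + fragment
                continue
        merged.append(fragment)
    return merged


example = "Mr. Y.H. Chang is a Dr. in the Corp."
print(naive_split(example))                   # four fragments
print(merge_fragments(naive_split(example)))  # one sentence
```

A heuristic this simple also merges a genuine sentence that happens to end in a short capitalized word (for example a sentence ending in "Tom."), so it would need the entity tags mentioned in Rule (1) to be reliable.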
Example 3.2.3: Three little pigs’ courage defeats the big bad wolf’s scheme.
Figure 3.2.4 Illustration of the segmentation correction for Example 3.2.3
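For Example 3.2.3, the Stanford segmenter correctly separates both singular and plural possessive markers from their noun entities. From this we infer that the segmenter expects words and punctuation to occur in particular combinations: widely used combinations such as US$, NT$, and numbers with a decimal point are segmented correctly, each being treated as a single unit. The same behaviour, however, causes the poor segmentation of abbreviated nouns described above, which this study resolves with the rule-based method.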
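The behaviour described above can be mimicked with a toy regular-expression tokenizer. This is only a sketch written for illustration; it is not the Stanford tokenizer, and the pattern set (currency prefix, decimal number, possessive marker, word) is an assumption chosen to cover Examples 3.2.1 and 3.2.3.

```python
import re

# Alternatives are tried left to right at each position.
TOKEN = re.compile(
    r"[A-Z]{1,3}\$"                # currency prefix such as US$ or NT$
    r"|\d+(?:\.\d+)?"              # integer or decimal number kept whole
    r"|[’']s\b"                    # singular possessive marker 's
    r"|[’']"                       # bare apostrophe (plural possessive)
    r"|[A-Za-z]+(?:-[A-Za-z]+)*"   # word, possibly hyphenated
    r"|\S"                         # any other single non-space character
)


def tokenize(sentence):
    """Return the list of tokens matched by TOKEN, in order."""
    return TOKEN.findall(sentence)


print(tokenize("Three little pigs’ courage defeats the big bad wolf’s scheme."))
print(tokenize("NT$0.6 (US$0.02)-NT$0.7 per liter"))
```

Because the generic fallback splits every period off as its own token, "Corp." becomes "Corp" followed by ".", which is the abbreviation problem that the rules in this section address.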
(2) Part-of-Speech Tagging
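This study uses the POS Tagger of the Stanford CoreNLP Toolkit [19] to tag the parts of speech in opinion sentences. An example and its result are shown in Example 3.2.4 and Figure 3.2.5.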
Example 3.2.4: Russia's military action in Syria has also pushed up crude oil prices, the sources said.
Figure 3.2.5 Screenshot of the part-of-speech tagging run for Example 3.2.4
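Table 3.2.1 lists the part-of-speech tags of the Stanford POS Tagger. The tag set has fine-grained categories for special written forms and symbols; for example, "forty-seven" is not tagged as three separate units. Once text and symbols have been segmented, the tagger can correctly label the part of speech of each word.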
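As a small illustration of how tagged output can be consumed, the sketch below parses a line of `word_TAG` pairs and looks up each tag's description. The underscore separator is assumed here (it is configurable in the tagger), the sample line is hand-written rather than actual tagger output, and `PENN_TAGS` holds only a few of the tags from Table 3.2.1.

```python
# A handful of tags and descriptions copied from Table 3.2.1.
PENN_TAGS = {
    "CD": "numeral, cardinal",
    "JJ": "adjective or numeral, ordinal",
    "NN": "noun, common, singular or mass",
    "NNP": "noun, proper, singular",
    "NNS": "noun, common, plural",
    "POS": "genitive marker",
}


def parse_tagged(line, sep="_"):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits at the LAST separator, so a separator inside the
        # word itself does not break the pair.
        word, _, tag = item.rpartition(sep)
        pairs.append((word, tag))
    return pairs


def describe(pairs):
    """Attach the tag description to every (word, tag) pair."""
    return [(word, tag, PENN_TAGS.get(tag, "unknown tag")) for word, tag in pairs]


# Hand-written sample, for illustration only.
sample = "Russia_NNP 's_POS military_JJ action_NN forty-seven_CD"
for word, tag, meaning in describe(parse_tagged(sample)):
    print(f"{word}\t{tag}\t{meaning}")
```

The hyphenated token "forty-seven" stays a single pair with one tag, which matches the observation about Table 3.2.1.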
Table 3.2.1 Part-of-speech tags of the Stanford part-of-speech tagger
Tag | Description | Examples
$ | dollar | $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
`` | opening quotation mark | ` ``
'' | closing quotation mark | ' ''
( | opening parenthesis | ( [ {
) | closing parenthesis | ) ] }
, | comma | ,
-- | dash | --
. | sentence terminator | . ! ?
: | colon or ellipsis | : ; ...
CC | conjunction, coordinating | & 'n and both but either et for less minus neither nor or plus so therefore times v. versus vs. whether yet
CD | numeral, cardinal | mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025 fifteen 271,124 dozen quintillion DM2,000 ...
DT | determiner | all an another any both del each either every half la many much nary neither no some such that the them these this those
EX | existential there | there
FW | foreign word | gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte terram fiche oui corporis ...
IN | preposition or conjunction, subordinating | astride among uppon whether out inside pro despite on by throughout below within for towards near behind atop around if like until below next into if beside ...
JJ | adjective or numeral, ordinal | third ill-mannered pre-war regrettable oiled calamitous first separable ectoplasmic battery-powered participatory fourth still-to-be-named multilingual multi-disciplinary ...
JJR | adjective, comparative | bleaker braver breezier briefer brighter brisker broader bumper busier calmer cheaper choosier cleaner clearer closer colder commoner costlier cozier creamier crunchier cuter ...
JJS | adjective, superlative | calmest cheapest choicest classiest cleanest clearest closest commonest corniest costliest crassest creepiest crudest cutest darkest deadliest dearest deepest densest dinkiest ...
LS | list item marker | A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005 SP-44007 Second Third Three Two * a b c d first five four one six three two
MD | modal auxiliary | can cannot could couldn't dare may might must need ought shall should shouldn't will would
NN | noun, common, singular or mass | common-carrier cabbage knuckle-duster Casino afghan shed thermostat investment slide humour falloff slick wind hyena override subhumanity machinist ...
NNP | noun, proper, singular | Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA Shannon A.K.C. Meltex Liverpool ...
NNPS | noun, proper, plural | Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques Apache Apaches Apocrypha ...
NNS | noun, common, plural | undergraduates scotches bric-a-brac products bodyguards facets coasts divestitures storehouses designs clubs fragrances averages subjectivists apprehensions muses factory-jobs ...
PDT | pre-determiner | all both half many quite such sure this
POS | genitive marker | ' 's
PRP | pronoun, personal | hers herself him himself hisself it itself me myself one oneself ours ourselves ownself self she thee theirs them themselves they thou thy us
PRP$ | pronoun, possessive | her his mine my our ours their thy your
RB | adverb | occasionally unabatingly maddeningly adventurously professedly stirringly prominently technologically magisterially predominately swiftly fiscally pitilessly ...
RBR | adverb, comparative | further gloomier grander graver greater grimmer harder harsher healthier heavier higher however larger later leaner lengthier less-perfectly lesser lonelier longer louder lower more ...
RBS | adverb, superlative | best biggest bluntest earliest farthest first furthest hardest heartiest highest largest least less most nearest second tightest worst
RP | particle | aboard about across along apart around aside at away back before behind by crop down ever fast for forth from go high i.e. in into just later low more off on open out over per pie raising start teeth that through under unto up up-pp upon whole with you
SYM | symbol | % & ' '' ''. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R * ** ***
TO | "to" as preposition or infinitive marker | to
UH | interjection | Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly man baby diddle hush sonuvabitch ...
VB | verb, base form | ask assemble assess assign assume atone attention avoid bake balkanize bank begin behold believe bend benefit bevel beware bless boil bomb boost brace break bring broil brush build ...
VBD | verb, past tense | dipped pleaded swiped regummed soaked tidied convened halted registered cushioned exacted snubbed strode aimed adopted belied figgered speculated wore appreciated contemplated ...
VBG | verb, present participle or gerund | telegraphing stirring focusing angering judging stalling lactating hankerin' alleging veering capping approaching traveling besieging encrypting interrupting erasing wincing ...
VBN | verb, past participle | multihulled dilapidated aerosolized chaired languished panelized used experimented flourished imitated reunifed factored condensed sheared unsettled primed dubbed desired ...
VBP | verb, present tense, not 3rd person singular | predominate wrap resort sue twist spill cure lengthen brush terminate appear tend stray glisten obtain comprise detest tease attract emphasize mold postpone sever return wag ...
VBZ | verb, present tense, 3rd person singular | bases reconstructs marks mixes displeases seals carps weaves snatches slumps stretches authorizes smolders pictures emerges stockpiles seduces fizzes uses bolsters slaps speaks pleads ...
WDT | WH-determiner | that what whatever which whichever
WP | WH-pronoun | that what whatever whatsoever which who whom whosoever
WP$ | WH-pronoun, possessive | whose
WRB | Wh-adverb | how however whence whenever where whereby wherever wherein whereof why
(3) Named Entity Recognition
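This study uses the Named Entity Recognizer of the Stanford CoreNLP Toolkit [19] to extract the named entities of interest (PERSON, ORGANIZATION, LOCATION), which helps us obtain candidate opinion-holder terms efficiently, as shown in Figure 3.2.6. Combined with the named-entity dictionaries we collected, the system can handle English transliterations of Chinese names. Entities that are harder to recognize correctly are checked in the later correction steps to decide whether they are valid named entities, and new entity terms can be added in the future as needed.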
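To label named entities related to opinion holders more precisely, this study adds extra named-entity dictionaries covering Chinese-English name translations, job titles, and organization abbreviations and full names. Examples include the list of state-owned enterprises of the Republic of China, the Ministry of Foreign Affairs bilingual glossary, the Executive Yuan bilingual glossary (job titles), the Chinese-English table of Taiwanese place names, the table of English transliterations of Chinese personal names, and a list of abbreviations and full names of relevant international organizations. If a candidate opinion-holder entity still cannot be resolved, Wikipedia is used to add vocabulary, together with first-order predicate logic formulas, in order to classify terms that are not covered by the entity database.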
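One common way to apply such dictionaries is a greedy longest-match lookup over the token sequence. The sketch below is an illustration under that assumption, not the procedure used in this study; the function name `gazetteer_tag` and the dictionary entries are hypothetical and stand in for the collected dictionaries.

```python
def gazetteer_tag(tokens, gazetteer, max_len=4):
    """Greedy longest-match lookup.

    Returns (start, end, label) spans, where tokens[start:end] is an entry
    of the gazetteer. Longer phrases are tried before shorter ones.
    """
    spans = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in gazetteer:
                spans.append((i, i + n, gazetteer[phrase]))
                i += n  # continue after the matched phrase
                break
        else:
            i += 1  # no entry starts at this token
    return spans


# Hypothetical entries for illustration only.
GAZETTEER = {
    "Hon Hai": "ORGANIZATION",
    "Hengchun": "LOCATION",
    "Hengchun Airport": "LOCATION",
}

tokens = "Hon Hai chairman donates cash dividends".split()
print(gazetteer_tag(tokens, GAZETTEER))
```

Trying longer phrases first means that "Hengchun Airport" is matched as one entity rather than as "Hengchun" followed by an unlabeled "Airport".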
Example 3.2.5: President Ma Ying-jeou said he would respect Kuomintang’s Central Standing Committee should it move forward on the decision to replace current presidential candidate Hung Hsiu-chu with KMT Chairman Eric Chu, reports said Thursday.
Figure 3.2.6 Screenshot of the named entity recognition run for Example 3.2.5
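The Named Entity Recognizer identifies most named entities correctly, but for several reasons it sometimes errs. In Example 3.2.6, the entities "Hung" and "Ma" are not recognized as person names. When the input is changed from a single sentence to the whole document, however, "Hung" and "Ma" in the surrounding sentences of the same article are recognized correctly. The result is shown in Figure 3.2.7.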
Example 3.2.6: "We hope both sides [Chu and Hung] can hold further talks to work out a solution," Ma said.
Figure 3.2.7 Mislabeled named entities in Example 3.2.6: the case of unlabeled entities
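Besides entities that go unrecognized, some entities receive the wrong label. In Example 3.2.7, the entity "Hon Hai" is misrecognized as a person name; likewise, when the input is the whole document, "Hon Hai" in the surrounding sentences of the same article is correctly labeled as an organization, as shown in Figure 3.2.8. In Example 3.2.8, the entity "Hengchun" is a place name but is misrecognized as a person name, possibly because the system takes "include" to be an action performed by a person. When the surrounding sentences supply more context, "Hengchun Airport" is correctly labeled as a location, as shown in Figure 3.2.9.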
Example 3.2.7: Hon Hai chairman donates cash dividends to bio-medical research.
Figure 3.2.8 Mislabeled named entities in Example 3.2.7: the case of an organization name
Example 3.2.8: Not a single passenger arrived at or departed at the airport over the past year, even though Hengchun includes one of Taiwan’s most popular beach resorts and national parks.
Figure 3.2.9 Mislabeled named entities in Example 3.2.8: the case of a place name
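Based on the observed results, we infer that the named entity recognizer may err for the following reasons:
1. The term is not included in the system's entity database.
2. The grammatical structure is incorrect, or the syntactic structure does not match what the system expects.
3. The passage is too short to form representative named-entity relations.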
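To label important named entities correctly, this study applies the following rules:
Rule (3): If the same entity appears, correctly labeled, in another sentence of the same article, that label is adopted.
Rule (4): When the system cannot determine an entity, first-order predicate logic is used to derive a suitable entity class.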
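Rule (3) can be sketched as label propagation within one article. The sketch below makes an assumption that the text does not state: the "correct" label of an entity is taken to be the label it receives most often across the article, with "O" (unlabeled) mentions ignored. The function name and the sample labels are hypothetical and hand-written.

```python
from collections import Counter, defaultdict


def propagate_labels(sentences):
    """Give every mention of an entity its majority label within the article.

    `sentences` is a list of sentences, each a list of (mention, label)
    pairs; the label "O" marks a mention the recognizer left unlabeled.
    """
    votes = defaultdict(Counter)
    for sentence in sentences:
        for mention, label in sentence:
            if label != "O":
                votes[mention][label] += 1
    # Assumption: the most frequent label is the correct one.
    resolved = {m: counts.most_common(1)[0][0] for m, counts in votes.items()}
    return [
        [(mention, resolved.get(mention, label)) for mention, label in sentence]
        for sentence in sentences
    ]


# Hand-written labels for illustration only.
article = [
    [("Hon Hai", "PERSON")],
    [("Hon Hai", "ORGANIZATION"), ("Ma", "O")],
    [("Hon Hai", "ORGANIZATION")],
    [("Ma", "PERSON")],
]
print(propagate_labels(article))
```

Mentions that are never labeled anywhere in the article keep the label "O"; those are the cases left to Rule (4).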
(4) Sentiment Analysis
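Opinions in the broad sense usually contain objective statements together with subjective opinions or emotionally charged information, and they involve one or more opinion topics (Topic) and the associated claims (Claim). Researchers must obtain this information to produce meaningful analysis results. Background information about the opinion holder (Opinion Holder) is also important for evaluation.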
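Opinion analysis, however, is not easy. Language is highly flexible and hard to express or analyze with concise rules; the temporal, spatial, and situational context of an utterance affects its intended meaning; the boundary between objective statement and subjective opinion is not clear-cut; and language contains redundant material. Figure 3.2.10 shows the scope of opinion expression. Several methods are therefore needed for effective analysis. This study uses the OpinionFinder tool, drawing on the results of its two classifiers, to obtain the subjective sentences of implicit opinion holders.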
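Figure 3.2.10 Scope of opinion expression [20] (labels in the figure: sentiment opinions; opinion sentences; subjective opinions; objective statements; redundant words and spam)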
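MPQA (Multi-Perspective Question Answering) [18] is a corpus and opinion recognition system that comprises the following parts:
1. MPQA Opinion Corpus: news articles with manually annotated opinion sentences
2. OpinionFinder: automatically identifies sentiment words and extracts subjective sentences
3. Subjectivity Lexicon: a word list derived from the opinion corpus, including information such as subjectivity strength and positive or negative polarity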
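To make the role of the subjectivity lexicon concrete, the sketch below flags a sentence as subjective when it contains enough lexicon clues. It is a deliberately simplified stand-in and not OpinionFinder's classifiers: the key=value line layout is assumed, the two sample entries are hypothetical, and the thresholds `min_strong` and `min_weak` are arbitrary choices for illustration.

```python
import re

# Hypothetical lexicon entries in an assumed key=value layout.
SAMPLE_LINES = [
    "type=strongsubj len=1 word1=hope pos1=verb stemmed1=y priorpolarity=positive",
    "type=weaksubj len=1 word1=popular pos1=adj stemmed1=n priorpolarity=positive",
]


def parse_lexicon(lines):
    """Map each word to its (subjectivity strength, prior polarity)."""
    lexicon = {}
    for line in lines:
        fields = dict(item.split("=", 1) for item in line.split())
        lexicon[fields["word1"]] = (fields["type"], fields["priorpolarity"])
    return lexicon


def subjectivity_clues(sentence, lexicon):
    """Return (word, strength, polarity) for every lexicon word in the sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [(w, *lexicon[w]) for w in words if w in lexicon]


def is_subjective(sentence, lexicon, min_strong=1, min_weak=2):
    """Assumed rule: enough strong clues or enough weak clues."""
    clues = subjectivity_clues(sentence, lexicon)
    strong = sum(1 for _, kind, _ in clues if kind == "strongsubj")
    weak = sum(1 for _, kind, _ in clues if kind == "weaksubj")
    return strong >= min_strong or weak >= min_weak


lexicon = parse_lexicon(SAMPLE_LINES)
quote = '"We hope both sides [Chu and Hung] can hold further talks to work out a solution," Ma said.'
print(subjectivity_clues(quote, lexicon), is_subjective(quote, lexicon))
```

With these sample entries, Example 3.2.6 is flagged through the clue "hope", whereas Example 3.2.7 contains no clue and is treated as objective.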
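Its open software, OpinionFinder, processes documents and automatically identifies subjective and objective sentences, as well as the sentiment expressions that characterize them. This study uses OpinionFinder's two built-in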