Research on Automatic Summarization of Multilingual Compound Documents (III) (多語言複合式文件自動摘要之研究(III))


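4.1 Concept Space Construction

4.1.1 Chinese-English Bilingual Parallel Corpus

We collected Chinese-English parallel news stories from FTV English News (民視英語新聞) [25], published between May 2003 and March 2005, for a total of 6,821 aligned Chinese-English paragraph pairs. After word segmentation and tokenization, the Chinese side contains 16,506 distinct nouns and 10,950 distinct verbs; the English side contains 10,687 distinct nouns and 2,767 distinct verbs.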

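4.1.2 Clustering Results for Mixed Chinese and English Words

Table 8 shows an example word cluster from a General-level clustering. The Chinese-English pairings in the table were aligned by hand; the cluster itself carries no such correspondence. The table shows that at a higher level of the hierarchy, that is, when the words are partitioned into fewer clusters, each cluster covers a broader concept: the Chinese and English words in it are related, but their translation correspondences cannot be determined precisely.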

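Table 8: Example of a General-level word cluster

| Chinese word | English word |
|---|---|
| 仙妮雅唐恩 (Shania Twain) | shania |
| 仙妮雅唐恩 (Shania Twain) | twain |
| 席琳狄翁 (Celine Dion) | celine |
| 席琳狄翁 (Celine Dion) | dion |
| 密西艾略特 (Missy Elliott) | missy |
| 密西艾略特 (Missy Elliott) | elliot |
| 傑西辛普森 (Jessica Simpson) | – |
| 鄉村 (country) | – |
| 實力派 (accomplished performer) | – |
| 嘻哈教母 (godmother of hip-hop) | – |
| 戴安基頓 (Diane Keaton) | dian |
| 戴安基頓 (Diane Keaton) | keaton |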

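Table 9: Examples of Specific-level word clusters

| Cluster | Chinese word | English word |
|---|---|---|
| 0 | 克斯勒 | quastler |
| 1 | 腰身 (waistline) | curvacious |
| 2 | 槍聲 (gunshot) | bang |
| 3 | 鄭明修 | ming-hsiu |
| 4 | 薛樂儀 | le-yi |
| 5 | 陳玉鳳 | yu-feng |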

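Table 9 lists six word clusters from a Specific-level clustering. It shows that at a lower level of the concept hierarchy, that is, when the words are partitioned into more clusters, each cluster covers a narrower concept, and the Chinese and English words in it can already be regarded as translations of each other. A cluster may still contain a small amount of noise, however. Cluster 5, for example, contains three words: 『陳玉鳳』, 『餐飲店』 (restaurant), and 『yu-feng』. In the source text, 陳玉鳳 is the owner of that restaurant.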
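As a rough illustration of how the number of clusters controls granularity, the following sketch clusters a handful of words by the aligned paragraph pairs they occur in and cuts the hierarchy at two different cluster counts. The occurrence sets, the Jaccard similarity, and the average-link merging are all assumptions made for illustration; this section does not state which features or clustering algorithm were actually used.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap between two sets of paragraph-pair IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_words(occurrences, k):
    """Average-link agglomerative clustering, stopped at k clusters.

    occurrences maps each word (Chinese or English) to the set of
    aligned paragraph-pair IDs in which it appears (toy assumption).
    """
    clusters = [[w] for w in sorted(occurrences)]
    while len(clusters) > k:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            sims = [jaccard(occurrences[a], occurrences[b])
                    for a in clusters[i] for b in clusters[j]]
            score = sum(sims) / len(sims)
            if best is None or score > best[0]:
                best = (score, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]                          # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]

# Hypothetical occurrence data: paragraph-pair IDs are invented.
occurrences = {
    "席琳狄翁": {1, 2}, "celine": {1, 2},
    "仙妮雅唐恩": {2, 3}, "shania": {2, 3},
    "槍聲": {7}, "bang": {7},
}

specific = cluster_words(occurrences, 3)  # more clusters, narrower concepts
general = cluster_words(occurrences, 2)   # fewer clusters, broader concepts
```

With three clusters each Chinese word sits alone with its English counterpart; with two clusters the two singers are merged into one broader "singers" cluster, where the word-to-word correspondence is no longer explicit.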

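4.2 Chinese-English Paragraph Alignment Analysis

4.2.1 Test Set

We collected all Chinese and English news stories published by the Central News Agency (中央通訊社) [24] during the week from 2005/5/8 to 2005/5/13. We first clustered the Chinese documents and the English documents separately, and then manually matched the Chinese document clusters to the English document clusters to obtain Chinese and English news stories that discuss the same event. This yielded 10 document clusters. Table 10 lists the event covered by each of the 10 clusters and the number of Chinese and English documents it contains.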

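Table 10: Document clusters in the test set

| # | Event | Chinese documents | English documents |
|---|---|---|---|
| 1 | National Assembly election: DPP wins 127 seats, landslide for the pro-amendment camp | 6 | 4 |
| 2 | Premier Hsieh: government introduces favorable measures to build a sound economic environment | 6 | 3 |
| 3 | WHO participation: Ministry of Foreign Affairs will not accept arrangements that belittle Taiwan's status | 6 | 5 |
| 4 | National Assembly election: Su Tseng-chang urges a vote for the DPP to complete the halving of legislative seats | 8 | 5 |
| 5 | National Assembly election: DPP cautiously optimistic about its prospects | 8 | 4 |
| 6 | James Soong arrives in Beijing and reiterates that the two sides of the strait are brothers of one family | 8 | 7 |
| 7 | Cross-strait "one China": James Soong on "one China, respective interpretations" and "one China under the constitution" | 7 | 6 |
| 8 | Constitutional amendment debate, affirmative side: key amendments make the Constitution more rigorous | 7 | 5 |
| 9 | DPP caucus proposes an amendment to the Election and Recall Act to increase penalties for vote-buying | 4 | 5 |
| 10 | Scholars: consensus from the Soong-Hu meeting narrows the government's room for negotiation | 8 | 7 |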

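Table 11 gives the number of distinct words in the test set and the proportion of them covered by the Chinese-English bilingual parallel corpus. On average, about 74% of the Chinese words and 86% of the English words are covered by the parallel corpus.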

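Table 11: Number of distinct words in the test set and their coverage in the parallel corpus

| Language and part of speech | Distinct words in test set | Covered by parallel corpus | Percentage |
|---|---|---|---|
| Chinese nouns | 3,408 | 2,452 | 72.0% |
| Chinese verbs | 2,870 | 2,281 | 79.5% |
| English nouns | 1,993 | 1,614 | 81.0% |
| English verbs | 893 | 811 | 91.0% |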
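The coverage figures of this kind can be obtained by intersecting the distinct words of the test set with the vocabulary of the parallel corpus. The sketch below shows that calculation on invented toy word lists; the function name and the data are hypothetical and are not taken from the report.

```python
def coverage(test_words, corpus_vocab):
    """Return (distinct test words, covered count, covered percentage)."""
    distinct = set(test_words)                # count each word once
    covered = distinct & set(corpus_vocab)    # words also seen in the corpus
    pct = 100.0 * len(covered) / len(distinct) if distinct else 0.0
    return len(distinct), len(covered), pct

# Toy data (assumption): a tiny corpus vocabulary and a tiny test set.
corpus_nouns = {"election", "party", "president", "visit"}
test_nouns = ["election", "party", "party", "amendment", "visit", "delegate"]

result = coverage(test_nouns, corpus_nouns)
```

Words that fall outside the intersection have no cluster in the concept space, so a lower coverage leaves fewer words available when paragraphs are compared at the concept level.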

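4.2.2 Paragraph Alignment Results

For each event cluster, Table 12 lists the results of manually judging whether each of the 10 Chinese-English paragraph pairs with the highest similarity is relevant. The table shows that on average about 57% of the aligned Chinese-English paragraph pairs are relevant (that is, the average Precision is 57%)¹⁸. We also found that the more closely related the Chinese and English documents in an event cluster are, as in event clusters 5 and 6, the more accurate the alignment; the less related they are, as in event cluster 3, the lower the number of correct alignments.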

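Table 12: Number of correct alignments among the Top 10 most similar Chinese-English paragraph pairs

| Event cluster | Relevant Chinese-English paragraph pairs |
|---|---|
| 1 | 6 |
| 2 | 5 |
| 3 | 3 |
| 4 | 6 |
| 5 | 8 |
| 6 | 8 |
| 7 | 6 |
| 8 | 5 |
| 9 | 4 |
| 10 | 6 |
| Average | 5.7 |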
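The average Precision quoted in the text follows directly from the counts in Table 12: each event cluster contributes the fraction of its top 10 pairs judged relevant, and the fractions are then averaged. A minimal sketch of that arithmetic (the variable and function names are illustrative only):

```python
TOP_N = 10  # number of highest-similarity pairs judged per event cluster

# Relevant pairs per event cluster, copied from Table 12.
relevant_counts = {1: 6, 2: 5, 3: 3, 4: 6, 5: 8, 6: 8, 7: 6, 8: 5, 9: 4, 10: 6}

def precision_at_n(num_relevant, n=TOP_N):
    """Fraction of the top-n pairs that were judged relevant."""
    return num_relevant / n

per_cluster = {c: precision_at_n(r) for c, r in relevant_counts.items()}
mean_precision = sum(per_cluster.values()) / len(per_cluster)

print(f"mean Precision@{TOP_N}: {mean_precision:.2f}")
```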

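18 The initial experiments gave poor results because the test set was noisy. We reduced the test set so that the numbers of Chinese and English articles were similar, which gave better results.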

Table 13, Table 14, and Table 15 list three examples of correctly aligned Chinese-English paragraph pairs for reference.

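Table 13: Chinese-English alignment, example 1

Chinese: 他強調，堅持體驗一個中國的九二共識，堅持反對台獨，是兩岸對話、協商的政治基礎，也是兩岸關係和平穩定發展的政治基礎。
[Gloss: He stressed that upholding the 1992 consensus, which embodies one China, and opposing Taiwan independence are the political basis for cross-strait dialogue and consultation, and also the political basis for the peaceful and stable development of cross-strait relations.]

English: Hu stressed that insisting on the "one China" policy, the "1992 consensus," and opposing Taiwan independence would be the premise on which the resumption of cross-strait dialogue and negotiations would be based on.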

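Table 14: Chinese-English alignment, example 2

Chinese: 七個反對修憲案的政黨或聯盟為：台灣團結聯盟、親民黨、無黨團結聯盟、建國黨、新黨、王廷興等二十人聯盟、張亞中等一百五十人聯盟；反對陣營得票率百分之十六點八六，共獲五十一席。
[Gloss: The seven parties or alliances opposing the constitutional amendments are the Taiwan Solidarity Union, the People First Party, the Non-Partisan Solidarity Union, 建國黨, the New Party, the 20-member alliance of 王廷興 and others, and the 150-member alliance of 張亞中 and others; the opposing camp received 16.86 percent of the vote and won 51 seats in total.]

English: Seven other parties and groups that oppose the proposed amendments won 51 seats altogether and are not expected to be able to stop the Constitution-amending juggernaut pushed by the two major parties.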

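Table 15: Chinese-English alignment, example 3

Chinese: 陳總統昨天表示，中國利用在野黨，介入干涉台灣五一四選舉；向美方施壓指台灣憲改是法理台獨，要在野黨、美方阻擋台灣憲改。總統力陳五一四選舉重要性，憲改是台灣民主深化鞏固工程。
[Gloss: President Chen said yesterday that China is using the opposition parties to interfere in Taiwan's May 14 election, and is pressuring the United States by claiming that Taiwan's constitutional reform amounts to de jure independence, wanting the opposition parties and the United States to block the reform. The President stressed the importance of the May 14 election and said that constitutional reform is a project to deepen and consolidate Taiwan's democracy.]

English: According to the president, the Chinese official has requested that the United States and Taiwan's opposition parties try to stop the constitutional amendments from being adopted since Beijing considers that the constitutional reform process is aimed at achieving Taiwan independence.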

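4.3 Chinese-English Mixed Multi-Document Summarization Results

We provide two presentations for the Chinese-English mixed multi-document summary: one is Chinese-primary, with the English content appended after the Chinese summary; the other is the reverse. The mixed summaries are currently evaluated by questionnaire: each evaluator reads each event cluster and its summary and rates the quality of the summary along two dimensions: 1) how much of the information the summary covers; and 2) the readability of the summary.

Five experts took part in the experiment. They rated the summary generated for each event cluster on both dimensions, on a scale from 1 to 10, where 1 is the worst, 5 is average, and 10 is the best. Table 16 gives the results of the manual evaluation. On average, information coverage scored 7.06 and readability scored 6.04.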

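Table 16: Evaluation of summary information coverage and readability

| Event cluster | Information coverage | Readability |
|---|---|---|
| 1 | 7.2 | 5.5 |
| 2 | 6.9 | 6.2 |
| 3 | 5.5 | 5.4 |
| 4 | 6.5 | 5.8 |
| 5 | 8.2 | 6.8 |
| 6 | 7.7 | 7.0 |
| 7 | 7.3 | 6.5 |
| 8 | 7.0 | 5.8 |
| 9 | 7.5 | 5.6 |
| 10 | 6.8 | 5.8 |
| Average | 7.06 | 6.04 |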
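The two averages in Table 16 are plain means over the ten event clusters. The short check below recomputes them from the per-cluster scores in the table; it assumes, as the table suggests, that each per-cluster score is already the aggregate of the five experts' ratings.

```python
from statistics import mean

# (information coverage, readability) per event cluster, copied from Table 16.
scores = {
    1: (7.2, 5.5), 2: (6.9, 6.2), 3: (5.5, 5.4), 4: (6.5, 5.8), 5: (8.2, 6.8),
    6: (7.7, 7.0), 7: (7.3, 6.5), 8: (7.0, 5.8), 9: (7.5, 5.6), 10: (6.8, 5.8),
}

avg_coverage = round(mean([cov for cov, _ in scores.values()]), 2)
avg_readability = round(mean([read for _, read in scores.values()]), 2)

# Clusters ranked by information coverage, best first.
ranked = sorted(scores, key=lambda c: scores[c][0], reverse=True)
```

Ranking the clusters by coverage puts event clusters 5 and 6 first and event cluster 3 last, the same clusters singled out in the paragraph alignment results.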

Table 17 shows, for reference, only the system-generated summary for event cluster 6.

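Table 17: Example summary for event cluster 6

他說，所謂搭橋，是為兩岸搭起互信之橋、合作之橋與溝通之橋。宋楚瑜也不是任何人的信差。 (pno: 3; article: 6; pdate: 2005-05-25)
[Gloss: He said that "bridge-building" means building bridges of mutual trust, cooperation, and communication across the strait. Soong is also nobody's messenger.]

親民黨主席宋楚瑜今天抵達北京，他感謝中共中央總書記胡錦濤的邀請，讓親民黨打破五十多年來兩岸政治禁忌，和中國共產黨進行黨與黨對話。他說，親民黨相信只要兩岸兄弟一家親，一定可以找到方法，解決兩岸過去的誤解。 (pno: 1; article: 78; pdate: 2005-05-25)
[Gloss: People First Party (PFP) Chairman James Soong arrived in Beijing today. He thanked Hu Jintao, General Secretary of the CPC Central Committee, for the invitation, which allowed the PFP to break more than fifty years of cross-strait political taboo and hold party-to-party talks with the Communist Party of China. He said the PFP believes that as long as the two sides are brothers of one family, a way can be found to resolve past misunderstandings across the strait.]

宋楚瑜的專機約在下午四時三十分抵達北京，中國國台辦主任陳雲林等人到場迎接。 (pno: 2; article: 78; pdate: 2005-05-25)
[Gloss: Soong's chartered plane arrived in Beijing at about 4:30 p.m.; Chen Yunlin, director of China's Taiwan Affairs Office, and others were on hand to greet him.]

他說，以親民黨主席身分和親民黨和平工作團身分到北京，感謝中共黨中央與總書記胡錦濤的邀請，讓親民黨打破五十多年來兩岸政治禁忌，親民黨能和中國共產黨進行黨與黨的對話。 (pno: 4; article: 78; pdate: 2005-05-25)
[Gloss: He said he came to Beijing as PFP chairman and as a member of the PFP peace delegation, and thanked the CPC Central Committee and General Secretary Hu Jintao for the invitation, which allowed the PFP to break more than fifty years of cross-strait political taboo and hold party-to-party talks with the Communist Party of China.]

親民黨主席宋楚瑜今天與中國共產黨總書記胡錦濤會面時表示，親民黨三點基本立場堅定不移，堅定支持九二共識「一個中國」基本原則、從不認為台獨應是台灣選項、以及主張和平。 (pno: 1; article: 145; pdate: 2005-05-25)
[Gloss: Meeting CPC General Secretary Hu Jintao today, PFP Chairman James Soong said that the PFP's three basic positions are unwavering: firm support for the 1992 consensus and the basic principle of "one China", never regarding Taiwan independence as an option for Taiwan, and advocating peace.]

胡錦濤說，國親兩黨主席連戰、宋楚瑜來訪，大陸同胞和台灣同胞都給予支持與歡迎，表明兩岸同胞認為這些做法符合他們的心願。 (pno: 9; article: 146; pdate: 2005-05-25)
[Gloss: Hu Jintao said that the visits by KMT and PFP chairmen Lien Chan and James Soong were supported and welcomed by compatriots on both the mainland and Taiwan, which shows that compatriots on both sides believe these actions accord with their wishes.]

Soong, who arrived in Shanghai Friday on the third leg of his current nine-day visit to China, said when he meet with Chinese President Hu Jintao in Beijing May 12, he will urge China to adopt concrete measures to protect "taishang's" rights and interests. (pno: 2; article: 4; pdate: 2005-05-25)

Soong flew from Hunan Province to Beijing on the fifth and most important leg of his nine-day "bridge-building" visit to China, where he will hold talks with Chinese President Hu Jintao, who serves concurrently as general secretary of the Communist Party of China. (pno: 2; article: 46; pdate: 2005-05-25)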
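The example in Table 17 follows the Chinese-primary presentation: all selected Chinese paragraphs come first and the English paragraphs are appended after them. A minimal sketch of the two presentations is given below; the function name, the language tags, and the placeholder paragraphs are assumptions for illustration, not the system's actual interface.

```python
def arrange_summary(selected, primary="zh"):
    """Order selected paragraphs so the primary language comes first.

    selected: list of (lang, text) tuples in selection order,
              where lang is "zh" or "en" (hypothetical tags).
    """
    if primary not in ("zh", "en"):
        raise ValueError("primary must be 'zh' or 'en'")
    head = [text for lang, text in selected if lang == primary]
    tail = [text for lang, text in selected if lang != primary]
    return head + tail  # relative order within each language is preserved

# Placeholder paragraphs standing in for extracted summary content.
selected = [("zh", "ZH-1"), ("en", "EN-1"), ("zh", "ZH-2")]

chinese_primary = arrange_summary(selected, primary="zh")
english_primary = arrange_summary(selected, primary="en")
```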

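5. Conclusion

This project is the second-year project of the three-year study "Research on Automatic Summarization of Multilingual Compound Documents" (『多語言複合式文件自動摘要之研究』).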

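Using word clustering, we clustered the Chinese and English words in the Chinese-English bilingual parallel corpus and built a hierarchical concept space. For the test document set, we took the paragraph as the unit, converted each paragraph from its word-level representation to a concept-level representation, and combined the similarities obtained from concept spaces at different levels of the hierarchy to compute the concept similarity between any two paragraphs, whether Chinese-Chinese, Chinese-English, or English-English. From these similarities we finally obtained the multi-document summary.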
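The conversion from word-level to concept-level representation and the combination across levels can be sketched as follows. This is an illustration under stated assumptions: the word-to-cluster mappings are invented, cosine similarity is assumed as the per-level measure, and the levels are combined by a weighted sum with equal weights, none of which is specified in this section.

```python
import math
from collections import Counter

def to_concepts(words, word_to_cluster):
    """Replace each word by its cluster ID; words outside the concept space are dropped."""
    return Counter(word_to_cluster[w] for w in words if w in word_to_cluster)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def concept_similarity(words_a, words_b, levels, weights):
    """Weighted sum of per-level concept similarities (combination rule assumed)."""
    return sum(w * cosine(to_concepts(words_a, m), to_concepts(words_b, m))
               for m, w in zip(levels, weights))

# Hypothetical concept spaces: the general level has fewer, broader clusters.
general = {"宋楚瑜": 0, "soong": 0, "北京": 0, "beijing": 0, "選舉": 1, "election": 1}
specific = {"宋楚瑜": 0, "soong": 0, "北京": 1, "beijing": 1, "選舉": 2, "election": 2}

zh_par = ["宋楚瑜", "北京"]
en_related = ["soong", "beijing"]
en_unrelated = ["election"]

sim_related = concept_similarity(zh_par, en_related, [general, specific], [0.5, 0.5])
sim_unrelated = concept_similarity(zh_par, en_unrelated, [general, specific], [0.5, 0.5])
```

The Chinese and English paragraphs share no surface words, so a word-level comparison would score them as unrelated; once both are mapped to cluster IDs, the related pair becomes comparable across languages.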

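The experimental results show that at a higher level of the hierarchy, that is, when the words are partitioned into fewer clusters, each word cluster covers a broader concept: the Chinese and English words in it are related, but their translation correspondences cannot be determined precisely. At a lower level of the concept hierarchy, that is, when the words are partitioned into more clusters, each word cluster covers a narrower concept, and the Chinese and English words in it can already be regarded as translations of each other.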

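For Chinese-English paragraph alignment, the current experiments give an average Top-10 Precision of 57%. The results shown in Table 13, Table 14, and Table 15 also bear out the feasibility of our method. To evaluate the quality of the summaries, we used a questionnaire: each evaluator read each event cluster and its summary and rated the quality of the summary along two dimensions: 1) how much of the information the summary covers; and 2) the readability of the summary. Five experts took part in the experiment and rated the summary generated for each event cluster on both dimensions, on a scale from 1 to 10, where 1 is the worst, 5 is average, and 10 is the best. On average, information coverage scored 7.06 and readability scored 6.04.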

References

[1] Barzilay, R., & Elhadad, M. (1997). Using lexical chains for text summarization. In Proceedings of the ACL/EACL’97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain (pp. 10-17).

[2] Boros, E., Kantor, P. B., & Neu, D. J. (2001). A clustering based approach to create multi-document summaries. In Proceedings of the Document Understanding Conference (DUC-2001), New Orleans, LA, USA.

[3] H.-H. Chen, J.-J. Kuo, S.-J. Huang, C.-J. Lin and H.-C. Wung, “A Summarization System for Chinese News from Multiple Sources,” Journal of the American Society for Information Science and Technology, 54(13), 1224-1236, 2003.

[4] H.-H. Chen and C.-J. Lin, “A Multilingual News Summarizer,” Proceedings of the 18th International Conference on Computational Linguistics, pp. 159-165.

[5] H. P. Edmundson, “New Methods in Automatic Extracting,” Journal of ACM (JACM), 16(2), 264-285, 1969.

[6] D. Evans, J. L. Klavans, K. R. McKeown, “Columbia Newsblaster: Multilingual News Summarization on the Web,” Proceedings of Human Language Technology (HLT), Boston, MA, 2004.

[7] Goldstein, J., Mittal, V., Carbonell, J., & Kantrowitz, M. (2000). Multi-document summarization by sentence extraction. In Proceedings of the ANLP/NAACL Workshop on Automatic Summarization, Seattle, WA (pp. 40-48).

[8] Gong, Y., & Liu, X. (2001). Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’01), New Orleans, LA, USA (pp. 19-25).

[9] Hovy, E., & Lin, C. Y. (1999). Automated text summarization in SUMMARIST. In Mani, I., & Maybury, M. (eds.), Advances in automated text summarization. Cambridge, Mass.: MIT Press.

[10] G. Karypis, “CLUTO: Software Package for Clustering High-Dimensional Datasets,” http://www-users.cs.umn.edu/~karypis/cluto/index.html.

[11] Kupiec, J., Pedersen, J., & Chen, F. (1995). A trainable document summarizer. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’95), Seattle, WA, USA (pp. 68-73).

[12] Language Technology Group, “LT POS,” http://www.ltg.ed.ac.uk/software/pos/.

[13] H. P. Luhn, “The automatic creation of literature abstracts,” IBM Journal of Research and Development, 2(2), 159-165, 1958.

[14] Mani, I., & Bloedorn, E. (1999). Summarizing similarities and differences among related documents. Information Retrieval, 1(1-2), 35-67.

[15] I. Mani and M. Maybury (eds.), “Advances in automated text summarization,” MIT Press, Cambridge, Mass, 1999.

[16] McKeown, K. R., Klavans, J. L., Hatzivassiloglou, V., Barzilay, R., & Eskin, E. (1999). Towards multidocument summarization by reformulation: progress and prospects. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI’99), Orlando, FL, USA (pp. 453-460).

[17] Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990). Introduction to WordNet: an on-line lexical database. Lexicography, 3(4), 235-312.

[18] M. F. Porter, “An Algorithm for Suffix Stripping,” Program, 14(3), 130-137, 1980.

[19] G. Salton and M. J. McGill, “Introduction to Modern Information Retrieval,” McGraw-Hill, 1983.

[20] G. Salton, A. Singhal, M. Mitra, and C. Buckley, “Automatic Text Structuring and Summarization,” Information Processing & Management, 33(2), 193-207, 1997.

[21] M. Steinbach, G. Karypis, and V. Kumar, “A Comparison of Document
