References - An LDA-Based Summarization Method for English Course Transcripts


Appendix

Judging by the F1-Measure values in Table 12, no weight setting performs better than a weight of 0, and the value closest to 0 is [...]. In follow-up [...], the colloquial traits that are common in the transcripts could be used to draw up a cue-word list suited to transcripts.

Table 12 (weight column and automata F1 column missing)

automata               Natural Language Processing
precision  recall      precision  recall    F1
0.442012   0.448425    0.20413    0.213579  0.208163
0.437671   0.443113    0.202425   0.213056  0.207042
0.437825   0.442979    0.202285   0.213056  0.206963
0.437671   0.443113    0.202523   0.213197  0.207154
0.438937   0.443707    0.203098   0.214163  0.2079
0.4387     0.443586    0.201296   0.211538  0.20572
0.438189   0.442963    0.20172    0.212179  0.206211
0.43816    0.443691    0.200374   0.211866  0.205262

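Next come the relation words. The same experiment was run for them, with the cue-word weight again set to 0 so that it does not affect the results. Unlike in the cue-word experiment, the relation-word weight has to be set to smaller values, so this thesis sets it to 0.07, 0.06, 0.05, 0.01, 0.005, 0.001, and 0.0001 and runs the same experiment on the same test set. The results are shown in Table 13: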

Table 13: F1-Measure for relation words

          automata                        Natural Language Processing
Weight    precision  recall    F1         precision  recall    F1
0         0.442012   0.448425  0.445159   0.20413    0.213579  0.208163
0.07      0.429671   0.435076  0.432346   0.195058   0.206303  0.199926
0.06      0.429703   0.434983  0.432316   0.195034   0.206303  0.199913
0.05      0.429603   0.434774  0.432161   0.194944   0.206497  0.199927
0.01      0.430841   0.436115  0.433448   0.1962     0.207818  0.201251
0.005     0.430852   0.436308  0.433552   0.195533   0.206119  0.200085
0.001     0.430852   0.436308  0.433552   0.196411   0.206826  0.200894
0.0001    0.43816    0.443691  0.440894   0.202208   0.212081  0.206436

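As with the cue words, a weight of 0 gives better results than every other value. Relation words are therefore also unsuitable for being weighted and fine-tuned in the subsequent experiments.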
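The weight sweep in Table 13 can be checked mechanically. The following is a minimal Python sketch, not the thesis's own code: it copies the F1 values from Table 13 into a dictionary and selects the relation-word weight with the highest F1 for each course. The names `TABLE_13_F1`, `COURSES`, and `best_weight` are hypothetical and were introduced here only for illustration.

```python
# F1 values copied from Table 13.
# Key: relation-word weight. Value: (F1 for automata, F1 for Natural Language Processing).
TABLE_13_F1 = {
    0: (0.445159, 0.208163),
    0.07: (0.432346, 0.199926),
    0.06: (0.432316, 0.199913),
    0.05: (0.432161, 0.199927),
    0.01: (0.433448, 0.201251),
    0.005: (0.433552, 0.200085),
    0.001: (0.433552, 0.200894),
    0.0001: (0.440894, 0.206436),
}

# Column order of the tuples above (hypothetical helper, for readable output).
COURSES = ("automata", "Natural Language Processing")


def best_weight(table, course_index):
    """Return the weight whose F1 is highest for the given course column."""
    return max(table, key=lambda weight: table[weight][course_index])


for index, course in enumerate(COURSES):
    weight = best_weight(TABLE_13_F1, index)
    print(f"{course}: best weight = {weight}, F1 = {TABLE_13_F1[weight][index]}")
```

For both courses the selected weight is 0, which agrees with the conclusion drawn from Table 13.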

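Section 3.2 states that the relation words are taken from the transcript titles. This may not be appropriate, and what should serve as the relation words for each transcript needs further thought.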

Question 2. The Baseline formula is wrong.

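Question 3. The summaries are compared against the slides. Are the slides a valid basis for the correct answer? Do the figures change depending on the slide results?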

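As explained in Section 3.3, the slides are concise, manually prepared content that lists the key points of the course, so they are the most suitable data to compare against.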

Question 4. Add more courses and more course categories.

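Course categories have been added. The number of lectures in this thesis was reduced by four, because the lecture content and the corresponding slides were rechecked and revised into data suitable for producing summaries. New courses will be added in future research.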

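Question 5. The literature review on LDA-based summarization and the discussion of triplets need to be expanded.

This material has been added in Sections 2.4 and 2.5.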

Question 6. Running LDA directly on nouns and verbs together would be quite close to ordinary LDA.

Question 7. The document-count values in Section 4.3 look abnormal; please explain them.

The experiment on the number of documents has been rerun, and the data in Section 4.3 have been updated.
