Chinese Authorship Identification: A Case Study Based on Social Corpus (Facebook)


Master Thesis
Department of English, National Taiwan Normal University

Chinese Authorship Identification: A Case Study Based on Social Corpus (Facebook)

Advisor: Dr. Shu-Kai Hsieh (謝舒凱)
Student: Mei-Yu Chen (陳美瑜)

August 2013

Abstract

Individual differences in writing style (stylometry) have long been a popular research interest. In linguistics, researchers want to know whether individual differences can be quantified and measured by certain indexes or statistics (Tweedie & Baayen, 1998; Mosteller & Wallace, 1964; Burrows, 2002, 2003, 2007; Hoover, 2004). From the perspective of information technology, there is an increasing need for document forensics that can detect the authorship of anonymous documents, either to help investigate internet crimes or to serve various document classification purposes. This thesis introduces different ways of measuring individual writing differences from both the linguistic and the information technology disciplines. Two experiments are carried out on individuals' texts collected from the prevalent social media platform Facebook, to investigate to what extent Chinese characters and words can capture individual writing differences, and to what extent other textual attributes, such as structural, subjectivity, and emotion cues, can contribute to identifying the authors of this kind of short social text. The study also examines three feature weighting methods (tf-idf, frequency, and ratio) and compares their effectiveness in short-text classification. A recently released SVM classifier, LibLinear, is adopted. The design of this software package not only makes it well suited to document classification tasks, where the feature dimension is extremely high, but also provides a ranking score for each feature, which tells the researcher which features in the feature set best discriminate and represent a specific category.

In the first experiment, tf-idf weighting outperforms the ratio measure but does not outperform the frequency measure. The result shows that in this kind of short social text, keywords seldom repeat themselves, either locally or globally. This may be attributed to the short length of a single post, which limits the number of words it can contain, and to the nature of social platforms, where people change topics frequently. Therefore, the benefit of tf-idf, which lowers the weight of function words while raising that of locally frequent content words, does not yield extra discriminative power compared with the simpler frequency measure. Also, the assumption behind tf-idf that function words provide no information about an author's preferences might not be adequate.

Another common issue in Chinese authorship identification is the segmentation problem. Unlike alphabetic languages, Chinese does not mark word boundaries in its surface structure. Thus, much previous research chose to tackle this language with non-segmented approaches. The second experiment demonstrates the discriminating power of different lexical units (character-based and word-based unigrams, word-based bigrams, and a mix of characters and words) in Chinese authorship identification. The result shows that word-based features perform much better than character-based features. The second experiment also takes additional feature levels into account (the structural level, the subjectivity level, and the emotion level). The result shows the important role of subjectivity and emotion cues in the genre of social media texts.

Key words: authorship identification, stylometry, SVM, text mining, individual difference, emotion, naturalistic data, social media short text

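Abstract (Chinese)

Individual differences in writing style (stylometry) have long been a popular research topic. From a linguistic point of view, researchers have tried various quantitative methods and constructed various indexes in the hope of quantifying "individual difference" (Tweedie & Baayen, 1998; Mosteller & Wallace, 1964; Burrows, 2002, 2003, 2007; Hoover, 2004). From the perspective of information science, there is a growing demand for "linguistic forensics" or "document author classification", because in the digital era this technique is needed to help detect the growing number of anonymous internet crimes or to help classify digital documents by author. This thesis first introduces the methods the two disciplines use to study individual differences in writing style and then carries out two experiments. The experiments use personal texts from the popular social network Facebook to explore how much explanatory power Chinese characters and words offer for individual writing differences, and to examine how much other stylistic attributes, such as structural, subjectivity, and emotion features, help in identifying the authors of short social texts. The study also discusses which of the common feature weighting schemes (tf-idf, term frequency, and proportional distribution) yields better accuracy. The experiments adopt LibLinear, a recent support vector machine package, as the author classifier. Its special design makes it better suited to training on high-dimensional features, as in document classification tasks that require a very large number of words as features. Unlike ordinary classifiers, LibLinear can also provide the contribution score of each feature for each class, which helps researchers examine which features best represent a given author class.

The results of Experiment 1 show that tf-idf weighting performs slightly better than proportional distribution but not better than term frequency. This result indicates that in such short social texts keywords seldom recur, whether within a single post or across the whole experimental corpus. The reasons may be that short posts on social networks can contain only a few words and that people on such platforms tend to change topics constantly. Therefore tf-idf, which lowers the weight of function words and raises the weight of a document's keywords, does not show its strength on this kind of short text, whereas the simple term frequency measure performs better. Moreover, this result may indicate that, in comparing function-word and content-word features, the assumption built into tf-idf that function words are unimportant for authorship identification is not appropriate.

Experiment 2 demonstrates how well different levels of Chinese lexical units (for example characters, words, word bigrams, and a mix of characters and words) identify authors. Another common issue in Chinese authorship identification is word segmentation. Unlike languages with alphabetic writing systems, Chinese has no spaces in its surface structure to separate words. Many previous studies of Chinese authorship identification therefore chose non-segmented methods for classification. The second experiment in this thesis segments the Chinese texts with CKIP and uses both the unsegmented and the segmented results as features, in order to explore the discriminative power of different units (character-based and word-based unigrams, word-based bigrams, and a mix of characters and words). The results show that word-based features classify better than character-based features. The second experiment also adds feature sets beyond characters and words (structural, subjectivity, and emotion features). The results show the importance of subjectivity and emotion features in the genre of social media texts.

Key words: authorship identification, stylometry, SVM, text mining, individual difference, emotion, naturalistic data, social media short text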

Acknowledgments

I am grateful to my beloved family and friends. Had it not been for your company and assistance, I would not have had the chance to complete this thesis. When I look back on the four years in graduate school, it was a time full of wonders and unbelievable grace that I will never forget. Through this journey I have learnt how to deal with self-doubt when the challenge seemed so arduous, learnt how to be a persistent person of a kind I never thought I could be, learnt how to find ways when the path was dim and unclear, and, most importantly, learnt the power of always being positive even in the face of adversity.

In this amazing yet difficult journey, I would especially like to express my gratitude to my advisor, Dr. Shu-Kai Hsieh. You are not only an intelligent scholar but also a wise mentor. I am grateful for your guidance throughout these priceless four years, during which I was pushed to the limit of my abilities and finally discovered that I could do more than I had imagined. Your thoughts and opinions are always inspiring, precise, and forward-looking. It is my honor to have been your student.

In a great journey, companions also play a great role. First of all, I would like to thank all my LOPE lab members for your support and witty brainstorming, not to mention all the wonderful time we have shared. Julia, Hsin-Ni, Ajax, Yu-Yun Chang, Matt, Simon, Shu-Yan, Wallace, Chang-An, Mike, Sheng-Fu, Aji, you know I am talking about you! Your talent and passion have inspired me all along the way. Special thanks to Ming-Hen Tsai, my great helper and teacher in machine learning techniques. You have helped me a lot in clarifying the concepts and details, and I am deeply impressed by your intelligence. Thanks also to my dear classmates at NTNU: Abbie, Gina, Monica, Tammy, Bebe, Sheila, Max, Lina, Ann, Jessica, and all the others. Thank you all for the many precious breaks, lunches, and after-school dates. You are my source of relaxation and happiness. And thanks to the many friends of mine who have always been caring, thoughtful, and encouraging: Lynn, Venson, Stephanie Yu-Leng Lin, Alex, Carrie, Caca Chen, Jamie Lan, Holy, Tiff Ho, Sylvia, Karen, Doris, Vanessa, Wendy Chang, Ray Lin, Sherry Tsai, Guan-Yu Chen. Your presence is a gift from heaven. I love you.

Last but not least, I want to dedicate my sincerest gratitude to my family members, especially my parents. Mom, Dad, and Judy, I love you the best in the whole universe. You are the best parents and sister, and I am the luckiest person to have you as my family. Also my grandparents, my cousins, Tony, Andy, Daniel, Neko Chang, Vincenzo Huang, Yu Chin Huang, and Aunt Terry: you bring a lot of love and joy to my life. The last person is my boyfriend, Vicken Yeh. You have always been my best friend in life and the best teacher in coding. Your patience and ceaseless love gave me great encouragement and the bravery to pursue and fulfill my dreams. Thank you, and I love you.

If you did not see your name in the list and you are now reading this acknowledgement, please simply help me add your name to it. Due to space limitations I am not able to list all of my beloved names; however, you know you are in my heart, and I greatly appreciate your companionship. I dedicate my sincerest gratitude to all of you and wish you all the best. Thank you.

Contents

Abstract
Abstract (Chinese)
Acknowledgments
Chapter 1 Introduction
Chapter 2 Literature Review
2.1. Stylometry
2.1.1. Jaynes (1980) and Opas (1996)
2.1.2. Burrows (2002, 2003, 2007)
2.1.3. Discussion on Quantitative-Statistical Approach
2.2. Authorship Analysis
2.2.1. Authorship Identification
2.2.2. Characteristics of authorship identification
2.2.3. Feature Selection
2.2.4. Feature Weighting
2.2.5. Machine Learning Technique
Chapter 3 Methodology
3.1. Rationale: Background and Motivation
3.2. Corpus Construction
3.2.1. Design criteria
3.2.2. Data Collection
3.2.3. Pre-Processing: Data Cleansing and Segmentation
3.3. Machine Learning Process
3.3.1. Choosing Classifier: LibLinear
3.3.2. Machine Learning Process
3.4. Experiment 1
3.4.1. Purpose and Design
3.4.2. Feature Set
3.5. Experiment 2
3.5.1. Purpose and Design
3.5.2. Features Set
Chapter 4 Result and Discussion
4.1. Experiment 1
4.2. Experiment 2
4.3. In comparison with the quantitative-statistical approach
4.4. In comparison with the machine learning approach
4.5. Linguistic Discussion
Chapter 5 Conclusion
Bibliography
Appendix A: The topmost IG-3000 words
Appendix B: Error analysis
Author 1
Author 2
Author 3
Author 4
Author 5
Author 6


List of Tables

Table 1. The fifteen attributes used in Jaynes's stylometric study based on Jaynes's classification
Table 2. The measures of Delta, Zeta, and Iota by Burrows (2002, 2003, 2007)
Table 3. The definition and possible limitation of Delta, Zeta, and Iota
Table 4. Features extraction in Zheng et al. (2006)
Table 5. The accuracy rate in both languages tested by SVM
Table 7. The number of collected authors and posts
Table 8. Feature set and weighting in experiment 1
Table 9. Samples of punctuation and symbols in F2 (character-based)
Table 10. The accuracy rate of three different weighting methods, and different levels of feature set (F1 / F1+F2)
Table 11. F1 and F2 Feature set and weighting in experiment 2
Table 12. Samples of punctuation and symbols in F2 (word-based)
Table 13. The structural level features
Table 14. The semantics level features
Table 15. The emotion level features
Table 16. The accuracy rate of different F1 feature sets
Table 17. The Top IG-50 features in F1 and F1+F2 feature set
Table 18. The comparison of unigrams and bigrams as the component of F1
Table 19. The result of F1 + F2 + F3 (Structural)
Table 20. The result of F1 + F2 + F4 (Semantics)
Table 21. The result of F1 + F2 + F5 (Emotion)
Table 22. The top IG-200 features that were also in the F5 feature set
Table 23. The top IG-400 features that were also in the F5 feature set
Table 24. Features that cause error prediction of author 2 to author 1
Table 25. Features that cause error prediction of author 3 to author 1
Table 26. Features that cause error prediction of author 4 to author 1
Table 27. Features that cause error prediction of author 5 to author 1
Table 28. Features that cause error prediction of author 6 to author 1
Table 29. Features that cause error prediction of author 1 to author 2
Table 30. Features that cause error prediction of author 3 to author 2
Table 31. Features that cause error prediction of author 4 to author 2
Table 32. Features that cause error prediction of author 5 to author 2
Table 33. Features that cause error prediction of author 6 to author 2
Table 34. Features that cause error prediction of author 1 to author 3
Table 35. Features that cause error prediction of author 2 to author 3
Table 36. Features that cause error prediction of author 4 to author 3
Table 37. Features that cause error prediction of author 5 to author 3
Table 38. Features that cause error prediction of author 6 to author 3
Table 39. Features that cause error prediction of author 1 to author 4
Table 40. Features that cause error prediction of author 2 to author 4
Table 41. Features that cause error prediction of author 3 to author 4
Table 42. Features that cause error prediction of author 5 to author 4
Table 43. Features that cause error prediction of author 6 to author 4
Table 44. Features that cause error prediction of author 1 to author 5
Table 45. Features that cause error prediction of author 2 to author 5
Table 46. Features that cause error prediction of author 3 to author 5
Table 47. Features that cause error prediction of author 4 to author 5
Table 48. Features that cause error prediction of author 6 to author 5
Table 49. Features that cause error prediction of author 1 to author 6
Table 50. Features that cause error prediction of author 2 to author 6
Table 51. Features that cause error prediction of author 3 to author 6
Table 52. Features that cause error prediction of author 4 to author 6
Table 53. Features that cause error prediction of author 5 to author 6

List of Figures

Figure 1. The similarity of two documents represented in vectors
Figure 2. Representation of large-margin classifier
Figure 3. Large margin can avoid bias data
Figure 4. The snapshot of the application
Figure 5. The process of the experiment

Chapter 1 Introduction

With the advent of the Internet era, web activities have flourished: people express their opinions on the Internet, make friends in the virtual world, and buy daily necessities or luxuries on big portal websites. Some people exploit the popularity of Internet activities for all kinds of malicious behavior, including e-commerce fraud, stealing and selling users' private information, and luring immature Internet users into meeting in the real world with vicious sexual or financial intent. These illegal behaviors have created a new demand for authorship forensics, also known as authorship identification or authorship attribution in the Information Retrieval field, to identify the true author of malicious texts. Since cybercrimes take place online, the traces left on the Internet become evidence. Language use is one of the useful clues that can help Internet law enforcement units identify a criminal and prove guilt when other pivotal digital evidence, such as the IP address, MAC address, or log records, is unavailable (Chaikin, 2006). Authorship identification then serves as a forensic tool by collecting the existing linguistic fingerprint of the suspect and analyzing the writing style in the suspect's corpus, which is basically

the collection of texts one has written on the Internet. This growing demand is reflected in the newly emerging field of Digital Investigation. The technique is mainly applied in e-mail writeprint forensics (Iqbal, Binsalleeh, Fung & Debbabi, 2010; Iqbal, Khan, Fung & Debbabi, 2010; Hadjidj, Debbabi, Lounis, Iqbal, Szporer & Benredjem, 2009; Iqbal, Hadjidj, Fung & Debbabi, 2008; Ma, Li & Teng, 2008; de Vel, Anderson, Corney & Mohay, 2001) to help identify and filter malicious mails or to trace the possible source of a criminal community (Ma, Teng, Chang, Zhang & Xiao, 2011). From the linguistic perspective, an individual's writeprint, which can be interpreted as idiolect in written form, has long attracted broad research interest. From a macro perspective, sociolinguists have long been interested in how language use differs among subgroups of society, from groups of different gender, educational background, and age to groups of opposed political affiliation. Subcultural differences, determined by the environment we live in, the information we absorb, and the people we are in contact with, mold people's beliefs and thoughts and are eventually reflected in our language use. The variety of language use is manifested in one's vocabulary, choice of grammatical structures, strategies of discourse planning, and ways of hedging, for example in different degrees of adherence to politeness maxims. These distinguishable language differences are influenced mainly by ongoing subcultural, or environmental, interaction. Therefore, by observing and analyzing language performance, which can be directly measured through the distribution of words, we have the opportunity to reveal the underlying cultures, beliefs, or thoughts a person may hold. This is also one of the profound interests of cognitive linguists, who devote themselves to investigating the relationship between language manifestation and human cognition.
From the perspective of cognitive linguistics, different cultures have divergent

perceptions of time, space, colors, and so on, resulting in differences in language expression. Different perceptions or concepts of the world around us may be manifested in our language use and therefore generate the diversity of individual language expression. Koppel, Argamon & Shimoni (2002) conducted an authorship characterization study that predicted the gender of authors mainly by measuring function words and POS preferences in texts. The concepts underlying language structure are intriguing and worth investigating, and they therefore fascinate linguists. This is why we need a more efficient way to discover the underlying patterns, especially considering the complexity and seeming randomness of language itself. How can we make use of advanced computing technology to mine linguistic data and extract the units of research interest? Most of the associated linguistic studies, those in the stylometry discipline, have traditionally relied heavily on manual observation to locate the target construction or word usage, whether in a qualitative or a quantitative approach, and have thus missed much covert language evidence that is not easily detected even under close scrutiny. This thesis adopts a machine learning approach, SVM, to investigate authors' writing habits and explains why this approach is better than the traditional quantitative approach for understanding an author's stylistic variation. In terms of methodology, the traditional quantitative approach has mostly been based on the exploration of a 'constant', which measures the degree to which a specific author varies from the norm. Examples are the Delta, Zeta, and Iota measures proposed by Burrows (2002, 2003, 2007). These indexes are based on

either the z-score or the frequency distribution of words. Tweedie & Baayen (1998) also examined numerous 'constants' that were claimed to have discriminative power in identifying the authorship of literary works. Although the traditional quantitative approach has a solid foundation and has inspired abundant intriguing research, a more recently developed technique is now available to researchers whose purpose is to single out or compare the individual differences between one author and others. In addition to comparing this newer approach with the traditional quantitative method, the current study performs a Chinese authorship identification task, using SVM as the classifier and examining different linguistic feature sets. A comparison with current Chinese authorship identification studies in the digital forensics field (Ma et al., 2008, 2009) is provided. Few authorship identification studies have tackled Chinese online texts, and the optimal feature categories remain an open issue that requires a sufficient amount of empirical experimentation to settle. This study therefore also asks whether, when information about textual format is limited, the linguistic feature set alone can serve as an effective discriminative factor. As previously mentioned, if different perceptions of the world are indeed manifested in one's language, to what extent does language show its diversity among authors, and to what extent is it shared by every individual? To sum up, this thesis demonstrates how computational techniques can be used to effectively detect the uniqueness of each individual's language use, and it further implements an authorship identification experiment to predict the authorship of Chinese online texts. The thesis is divided into five chapters. Following the general introduction in Chapter 1, Chapter 2 introduces different

conventional methods in authorship identification from both the stylometry and the Information Retrieval disciplines. Chapter 3 describes the details of the proposed authorship identification experiments, implemented on a Taiwanese Mandarin corpus collected from Facebook. This chapter includes the rationale, methodology, corpus construction, and the results and discussion of these experiments. Chapter 4 discusses the comparison of computational issues, evaluation methodologies, and criteria for Chinese authorship identification studies, and lists open questions left for future work. Chapter 5 concludes the study.

Chapter 2 Literature Review

This chapter first introduces the traditional quantitative methods in stylometric studies (section 2.1), illustrating how previous researchers measured the individual difference or distance among well-known authors by combining linguistic observations with statistical approaches. Next, current authorship analysis approaches in the Information Retrieval field are summarized (section 2.2). Besides the different sources and genres of the texts in question, the computing methods also differ greatly from those of the traditional statistical approach. Why the new classification technique, which is widely adopted in the Information Retrieval field today, might be more suitable for comparing individuals, especially on modern text formats, is explained later in the thesis.

2.1 Stylometry

2.1.1. Jaynes (1980) and Opas (1996)

Authorship analysis is mainly concerned with defining an appropriate characterization of documents that captures the writing style of authors. It is closely associated with, and rooted in, the linguistic discipline of stylometry. The technique was originally applied to the verification of anonymous literary works (Hope, 1994). In English literature, there were occasional cases in which either the true author was forced to publish anonymously in order to convey publicly unacceptable ideas, or the authorship was disputed (Mosteller & Wallace, 1963). In stylometric studies, uncovering the true author resembles a forensic process in which a set of linguistic clues is the only supporting evidence to investigate.

The use of vocabulary, including vocabulary richness, the average length of words, rare collocations, and the use of function words (Mosteller & Wallace, 1963; Baayen, Van Halteren & Tweedie, 1996; Burrows, 1989), reveals a writer's habits in composing texts, whether they operate consciously or subconsciously. Other frequently used linguistic evidence comes from the lexical, sentential, structural, or paragraph levels. Jaynes (1980) chose 15 attributes to analyze the poetic style of the famous poet W.B. Yeats in different periods. Table 1 categorizes these attributes into different linguistic dimensions.

Table 1. The fifteen attributes used in Jaynes's stylometric study, based on Jaynes's classification

Word level
1. Frequencies of heads of noun phrases (nouns, pronouns)
2. Portion of heads which are nouns
3. Portion of heads which are main verbs
4. Portion of adverbs
5. Portion of conjunctions which are coordinators
6. Ratio of determiners to nouns
7. Ratio of adjectives to nouns
8. Ratio of prepositions to nouns
9. Ratio of auxiliaries to verbal phrase heads

Phrase level
10. Heads of verbal phrases (main verb followed by infinitive or followed by participle)
11. Preposition phrase words

Clause level
12. Conjunctions (coordinating and subordinating conjunctions, and relative and interrogative pronouns)
13. Predicate as part of the main clause elements (predicate, subject, complement)

Semantic level
14. Clause and phrase signal 'not', and interjections
15. Nominal to verbal generalizations

Although Jaynes's statistical investigation showed significant changes in nouns as heads, the preposition/noun ratio, main verbs as heads, auxiliary verb heads, and prepositional phrase words, it did not reveal linear patterns that would give a generalized account of Yeats's poems across the different periods, and thus

Jaynes was forced to conclude that Yeats's syntactic style remained stable. However, this contradicted the general critical view and also Yeats's own opinion. A second attempt, on thematic shift with features derived from diction, achieved better recognition. That investigation focused on the ratio of function words to content words, the mean length of words, and the theme vocabulary drawn from three different periods, for instance words related to Celtic mythology, to Classical Greek and Italian, and to modern Irish mythical figures. With the help of word-level information, a line was finally drawn between Yeats's early and middle stages. Another experiment on stylistic development was conducted by Opas (1996). To examine Beckett's plays along different dimensions, such as narrative vs. non-narrative and abstract vs. non-abstract, she adopted the model of Biber (1991), who proposed different dimensions to distinguish speech from writing. Although many linguistic features were examined and identified, Opas similarly could not find consistent trends in any dimension, and thus concluded that Beckett's styles were miscellaneous and innovative. These studies show the limitation of the traditional approach in stylometric studies. While the linguists chose a carefully designed set of linguistic details in an attempt to depict literary works quantitatively, the results did not support their expectations. Although the feature sets, such as those in Table 1, can indeed describe some linguistic details quantitatively, we need to consider why the evidence did not support the experts' opinions. If we look more closely at Jaynes's results in the two experiments, the first using the features in Table 1 and the second using Yeats's diction as the feature set, the reason becomes rather clear. The results plainly suggest that lexical features perform better than structural features.
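As a rough illustration of the diction-based measures mentioned above (the ratio of function words to content words and the mean word length), the following Python sketch computes both for an English sentence. The function-word list is a small hypothetical sample, not the inventory used in the study discussed here, and the whitespace tokenization is a simplifying assumption.

```python
# Hypothetical, deliberately tiny function-word list (an assumption for illustration).
FUNCTION_WORDS = {"the", "of", "and", "a", "in", "to", "that", "is", "it", "but"}


def diction_features(text):
    """Return two diction measures for a text: the ratio of function words
    to content words and the mean word length in characters."""
    # Naive tokenization: split on whitespace and strip surrounding punctuation.
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    n_function = sum(1 for t in tokens if t in FUNCTION_WORDS)
    n_content = len(tokens) - n_function
    ratio = n_function / n_content if n_content else float("inf")
    mean_length = sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
    return {"function_to_content_ratio": ratio, "mean_word_length": mean_length}


features = diction_features("The cat sat in the hat.")
print(features)
```

Measures of this kind are computed per text or per period and then compared, which is the sense in which they count as lexical rather than structural evidence.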
Though Jaynes did not define the features in Table 1 as structural features but

instead named them as features at the word, phrase, clause, and semantic levels, they are in fact structural features manifested at different linguistic levels. The word-level features here should be distinguished from lexical features. The features listed under the word level in Table 1 include, to name a few, the frequencies of heads of noun phrases (nouns, pronouns), the portion of heads which are nouns, the portion of heads which are main verbs, the portion of adverbs, and the portion of conjunctions which are coordinators. These features are associated with the ratio or frequency of abstract word categories, that is, POS information. POS is a more generalized category in language and is thus expected to behave in a more generalized way, so it can hardly perform well in mining subtle individual differences. Lexical features, on the other hand, could better demonstrate the variety and distinctiveness of every individual writer. One may wonder to what extent the words we choose can really represent our linguistic style, and this is also the main interest of the current study. In the next chapter we will conduct authorship analysis with a new classification technique, distinguish the structural and lexical feature sets, and show that while structural features can assist performance in authorship identification, lexical features alone reach a high level of accuracy in verifying authentic authorship.

2.1.2. Burrows (2002, 2003, 2007)

Before heading directly towards the new area, it is worth mentioning one scholar who was aware of the importance of lexical features and conducted a series of experiments with the statistical approach. Burrows (2002, 2003, 2007)

proposed a Delta rule, based mainly on the z-score and mean deviation, to investigate whether the authentic authorship of certain poems in question can be accurately predicted merely by examining the frequency ratio of a bag of words. The bag is composed of the most frequent function words (Burrows, 2002, 2003); the second frequency stratum, that is, content words that appear consistently in a certain author’s writing but not so frequently in the others’; and the rare words that appear only in that author’s texts (Burrows, 2007). The approach seems too simple to be powerful, but it has been tested and validated in Hoover’s study (Hoover, 2004). The intuition and hypothesis behind the experiment are simple. As Burrows (2007) mentioned, a writer’s vocabulary is a deliberate, non-random selection from the whole vocabulary of the given language, and the preference could reflect differences in the writer’s background, such as the level of education, the way he or she prefers to convey opinions to different groups of target audience, and his or her topics of interest. Subtle differences in how writers construct sentences could be detected from the behavior of function words. He proposed three indexes, Delta, Zeta, and Iota, as measures of the average occurrence rate of words from different frequency strata (the most common words, common words, and uncommon words) in the target text in question, to be compared with that of the other authors’ pieces. We start by introducing the definition and calculation of the Delta score:

$$ z_{i,j} = \frac{f_{i,j} - \mu_i}{\sigma_i} \tag{1} $$

where $f_{i,j}$ is the occurrence rate of word i in the text by author j, and $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the occurrence rate of word i in the reference corpus.

$$ \mathrm{Delta} = \frac{1}{n} \sum_{i=1}^{n} \left| z_{i,j} - z_{i,t} \right| \tag{2} $$

Burrows proposed a new measure termed ‘Delta’, defined as “the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text” (Burrows, 2002). In equation (2), n is the number of words in the word list, j indexes the text-group of a candidate author, and t denotes the target text. Given a word list of the n most common function words [1] compiled from the reference corpus, the mean and standard deviation of every word’s occurrence rate are calculated to serve as a norm for later predictions to be compared with. Therefore, given a piece of text written by author j, the z-score of every individual word i in the word list is calculated, and we obtain $z_{i,j}$. Under the presumption that function words are good indicators of authorial style, the Delta score measures the degree of variation in style.

[1] In Burrows’ studies (2002, 2003), he adopted the 150 most common function words calculated from the corpus he constructed. The list of most common words is generated from the frequency hierarchy of words in a reference corpus composed of 32 long poems collected from the Main Set of 25 poets of the English Restoration period, with another 16 poets outside the Main Set as the comparison group, and the list ends with words that occur less than once in every thousand. The long poems in the reference corpus range from two thousand words to almost twenty thousand words.

Because the Delta score is calculated from the difference between the two texts in comparison, the bigger the Delta score is, the larger the variation between the two pieces of text, and vice versa. To obtain more detailed information, Burrows also calculated Delta-z, a normalization of the Delta score. The Delta-z score is calculated as follows.

$$ \mathrm{Delta}_z = \frac{\mathrm{Delta} - \mathrm{Delta}_\mu}{\mathrm{Delta}_\sigma} \tag{3} $$

Here $\mathrm{Delta}_\mu$ and $\mathrm{Delta}_\sigma$ are the mean and standard deviation of the Delta scores. Given the prior description, when a text written by a candidate author differs less from the target text in question, it yields a smaller Delta; with the mean and standard deviation fixed, the numerator $\mathrm{Delta} - \mathrm{Delta}_\mu$ becomes small and negative. That is to say, the author with the smallest Delta-z score among the competing authors is most likely to be the authentic author of the text in question, because the variation in the use of function words is the smallest.

$$ \mathrm{Zeta} = \frac{\text{total word tokens that satisfy the Zeta word type}}{\text{total word length of the texts}} \tag{4} $$

$$ \mathrm{Iota} = \frac{\text{total word tokens that satisfy the Iota word type}}{\text{total word length of the texts}} \tag{5} $$

Compared with the Delta score, the Zeta and Iota scores are relatively simple. Apart from the more intricate step of choosing suitable words for the word lists of the two categories, their computation is simply the occurrence rate of the words that meet the predefined criteria of each category. In order to exclude the function words, Burrows designed the Zeta category to capture the common words of the specific author, which appear less often than function words and belong to frequency stratum 2. The Iota category, according to Burrows, aims to capture the rare words of the specific author. The criteria for choosing the word lists for the different categories are listed in Table 2.
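The Delta and Delta-z computations in equations (1) to (3) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not Burrows’ own implementation: the function names are hypothetical, texts are assumed to be already tokenized into word lists, and the sample standard deviation (`statistics.stdev`) is an assumed choice where the description above does not specify one.

```python
from statistics import mean, stdev


def occurrence_rates(tokens, word_list):
    """Occurrence rate (count / text length) of each listed word in one text."""
    total = len(tokens)
    return {w: tokens.count(w) / total for w in word_list}


def norm_from_reference(reference_texts, word_list):
    """Mean and standard deviation of each word's rate over the reference corpus.

    Assumption: sample standard deviation; a word whose rate is identical in
    every reference text would give sigma = 0 and must be dropped beforehand.
    """
    per_text = [occurrence_rates(t, word_list) for t in reference_texts]
    mu = {w: mean([r[w] for r in per_text]) for w in word_list}
    sigma = {w: stdev([r[w] for r in per_text]) for w in word_list}
    return mu, sigma


def z_scores(tokens, word_list, mu, sigma):
    """Equation (1): z-score of every listed word in one text."""
    rates = occurrence_rates(tokens, word_list)
    return {w: (rates[w] - mu[w]) / sigma[w] for w in word_list}


def delta(z_candidate, z_target):
    """Equation (2): mean absolute difference between two z-score profiles."""
    return mean([abs(z_candidate[w] - z_target[w]) for w in z_candidate])


def delta_z(deltas):
    """Equation (3): normalize each Delta by the mean and sd of all Deltas."""
    values = list(deltas.values())
    mu, sd = mean(values), stdev(values)
    return {author: (d - mu) / sd for author, d in deltas.items()}


# Toy z-score profiles over a two-word list (invented numbers, illustration only).
z_target = {"the": 1.0, "of": -1.0}
profiles = {
    "A": {"the": 0.5, "of": -0.5},
    "B": {"the": 0.0, "of": 1.0},
    "C": {"the": -2.0, "of": 1.0},
}
deltas = {author: delta(z, z_target) for author, z in profiles.items()}
scores = delta_z(deltas)
best = min(scores, key=scores.get)  # candidate with the smallest Delta-z
print(deltas, scores, best)
```

In this toy run the candidate whose profile lies closest to the target receives the most negative Delta-z, which is the decision rule described above.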

Table 2. The measures of Delta, Zeta, and Iota by Burrows (2002, 2003, 2007)

| Task | Index | Include | Exclude | Measured unit |
|---|---|---|---|---|
| One vs. one | Delta | Most common 150 words | None | See formulas (1), (2), (3) above |
| One vs. one | Zeta | Base-set: > 2 of 5 segs. | Counter-set: frequency < 3 | See formula (4) above |
| One vs. one | Iota | Base-set: < 3 of 5 segs. | Counter-set: frequency 0 | See formula (5) above |
| One vs. multiple | Zeta | Base-set: > 2 of 5 segs. | Counter-set: < 23 poets | See formula (4) above |
| One vs. multiple | Iota | Base-set: < 3 of 5 segs. | Counter-set: < 11 poets | See formula (5) above |

The choice of the most common words, which are words in frequency stratum 1, is as described in the Delta experiment. For the Zeta and Iota experiments, he divided the target piece, which has undisputed authorship, into 5 segments. Words appearing in three or more segments of the target piece and not appearing in more than 22/24 of the other authors’ works were categorized as the common words, that is, words of frequency stratum 2. The criterion of Iota is to include words appearing in fewer than 3 segments of the target piece and to exclude those appearing in more than 11/24 of the other authors’ works. The definitions of the different categories and their possible limitations are summarized below.

Table 3. The definition and possible limitations of Delta, Zeta, and Iota

Delta
Measurement: The occurrence rate of the most common words (frequency stratum 1; usually function words).
Possible limitations:
1. The summed deviation of all the function words cannot reflect each word’s individual difference and distribution.
2. Its application to shorter texts is questionable, as the full set of word-variables is difficult to obtain in a shorter text.

Zeta
Measurement: The occurrence rate of the common words, excluding the most common words (frequency stratum 2; with the criteria listed in Table 2, words in this category are most likely to be content words which are highly associated with the topic of the specific text).
Possible limitations:
1. The criteria for words in this category are so strict that only those words that were not used by 22/24 authors (0 occurrence) are included.
2. These criteria exclude not only the function words but also some common words. Some words have a wide distribution among different pieces, though they are not the most common ones and do not share the extremely high frequency of the words in frequency stratum 1.

Iota
Measurement: The occurrence rate of the not so common words (frequency stratum 3; not the topic words in the texts; they could be rare words when no more than 10/24 authors used them in their texts).
Possible limitations:
1. Though defined as ‘not so common words’, words in this category should appear in at least 2 segments out of 5 in the target author’s piece of text, which is still a considerably wide distribution, especially when the text is lengthy. The category could hardly capture the real ‘rare words’.

The table gives an overview of the definition and possible limitations of the three categories. Burrows also mentioned that a limitation of these studies is that the tasks need to be performed independently, so the accumulated effect of all the words in the three categories cannot be observed under this procedure. Indeed, although these independent experiments show considerably high accuracy in most of the prediction tasks, on some occasions, especially when the text length varies, these indexes cannot always precisely differentiate every author’s writing habits by simply comparing the occurrence rates of words from different frequency strata. Therefore, he suggested that when the text is long enough, the relative occurrence rate of the word-types is a valuable differentia, but when the texts in comparison are of different lengths, the occurrence rate of the word-tokens is a more useful measurement.
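The word-list selection behind Zeta and Iota, and the occurrence rates of formulas (4) and (5), can be sketched as follows. This is a simplified illustration rather than Burrows’ procedure in full: the function names are hypothetical, texts are assumed to be tokenized, and the toy example uses only three comparison authors, so the author thresholds from Table 2 (out of 24 other poets) are scaled down by assumption.

```python
def split_segments(tokens, k=5):
    """Split a token list into k consecutive, nearly equal segments."""
    size, extra = divmod(len(tokens), k)
    segments, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        segments.append(tokens[start:end])
        start = end
    return segments


def select_words(target_tokens, other_texts, min_segs, max_segs,
                 max_other_authors, k=5):
    """Keep target words that occur in min_segs..max_segs of the k segments
    and are used by at most max_other_authors of the comparison authors."""
    segments = [set(s) for s in split_segments(target_tokens, k)]
    others = [set(t) for t in other_texts]
    chosen = set()
    for word in set(target_tokens):
        in_segs = sum(word in s for s in segments)
        in_others = sum(word in o for o in others)
        if min_segs <= in_segs <= max_segs and in_others <= max_other_authors:
            chosen.add(word)
    return chosen


def occurrence_rate(tokens, word_list):
    """Formulas (4)/(5): tokens belonging to the word list / text length."""
    return sum(t in word_list for t in tokens) / len(tokens)


# Toy data (invented): one target text of 10 tokens and three other authors.
target = "sea tide sea moon sea tide sea star sea tide".split()
others = [["sea", "moon"], ["sea", "rose"], ["sea", "tide"]]

# Zeta-like list: words in 3 or more of 5 segments, not used by all 3 others.
zeta_words = select_words(target, others, min_segs=3, max_segs=5,
                          max_other_authors=2)
# Iota-like list: words in fewer than 3 segments, used by none of the others.
iota_words = select_words(target, others, min_segs=1, max_segs=2,
                          max_other_authors=0)

print(zeta_words, occurrence_rate(target, zeta_words))
print(iota_words, occurrence_rate(target, iota_words))
```

Here "sea" behaves like a stratum-1 word (used by every author) and is filtered out of both lists, which mirrors the intent of excluding the most common words from the Zeta category.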

2.1.3. Discussion on Quantitative-Statistical Approach

Regarding how well the measurement of word types (vocabulary richness) performs in authorship attribution and to what extent it is affected by text length, Tweedie and Baayen (1998) conducted a comprehensive study of a variety of alternative measures of vocabulary richness or repeat rate, which were popular constants used in authorship attribution studies to discriminate individual authors. Many of these measures of lexical richness were claimed to be independent, or roughly independent, of text length. However, after testing both theoretically and empirically, they concluded that most of these indexes were highly dependent on text length. They stated in their report that almost all proposed measures of these ‘constants’ of an author’s style change considerably and systematically with the length of the text. Their results thus question the efficacy of including various ‘constants’ in authorship attribution studies (e.g. Holmes, 1992; Holmes and Forsyth, 1995). They conclude that, despite the large number of measures available, none of the common measures of vocabulary richness is truly constant with respect to text length. Intuitively, this is not difficult to understand. When the text length increases, the possibility that a text covers a wider range of topics, and thus contains more word types, also increases. Therefore, text length is an important yet difficult factor to specify. No one can say which length is more appropriate and efficient than the others and should be the norm when the variety of word types (or lexical richness) serves as the ‘constant’, and the only measurement, of authorship attribution. Therefore, Burrows’ relative frequency of words from different frequency spectrums has become the popular analysis method in the Stylometry discipline, especially the measures of function words.
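The text-length dependence reported by Tweedie and Baayen can be illustrated with the simplest richness measure, the type-token ratio. The sketch below is only a toy demonstration under a strong assumption: the token stream is synthetic and is constructed so that the number of word types grows with the square root of the text length (sublinear vocabulary growth, in the spirit of Heaps’ law). It is not data from their study, and the function names are hypothetical.

```python
from math import isqrt


def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)


def toy_stream(n):
    """Synthetic tokens: the i-th token is 'w' + floor(sqrt(i)), so the first
    n tokens contain about sqrt(n) distinct types (an assumption made purely
    for illustration)."""
    return [f"w{isqrt(i)}" for i in range(n)]


# The same 'author' sampled at three text lengths.
for n in (100, 400, 1600):
    print(n, type_token_ratio(toy_stream(n)))
```

Although the generating process is identical at every length, the ratio falls as the sample grows, so two texts by the same author but of different lengths would receive different richness values.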
Despite the fact that function words prove

to be very effective in many stylometric studies, it is a pity that in Burrows’ studies, as he himself stated, words of different frequency strata could not be combined and integrated to provide a more detailed picture of the author’s writing preference under this method. In addition, the fact that the occurrence rate of word-types and that of word-tokens cannot work together under this kind of single-index measurement is also a defect. On the one hand, when observing an individual’s writing preference, the distribution of the most common function words, the occurrence of common content words, and the use of rare words are all important and interesting aspects of writing style. If the behaviors of these different words cannot be measured under the same scope, we can only observe and analyze an author’s writing style from one specific angle, and may lose the opportunity to obtain the accumulated effect that only emerges when every trivial variation is taken into consideration. On the other hand, when analyzing an author’s style and comparing it to others’, whether the author adopted certain words and how often he or she used these words in the texts are equally important. If we can only choose either the occurrence rate of the word-types (the distinctive types of words) or that of word-tokens (the frequency of words), again, some crucial information will be lost because of the restriction of the experiment design. Therefore, we need a method that includes all words in the author’s text, both function words and content words, no matter how frequent they are, and that simultaneously measures the appearance of words (word-type) and the frequency of words (word-token).

Briefly, the main limitation of the quantitative-statistical approach I have discussed so far can be further criticized from two perspectives. The first one is the need of constructing a ‘norm’ model for every tested item to compare with. The norm

has to be a huge collection of texts to guarantee that it represents an unbiased standard, so that any biased use of words, which results from the author’s special writing style, can be detected through comparison with the ‘norm’. Nevertheless, in practice, the construction of a ‘qualified norm’ is difficult and questionable, because nobody can precisely describe the criteria for building a proper norm. Second, the Delta approach Burrows proposed assumes that every individual author has special patterns in the use of every function word (the z-score is calculated from the difference between the local frequency of the function words and their norm behavior). This is to assume that every author has an individualized frame of using function words (based on frequency), and that this frame will recur and will not alter in his or her other works. This approach could probably work for lengthy literary texts, where the frame of a set of function words’ behavior is observable. However, we can confidently infer that it cannot work on modern genres of text, for example texts collected in electronic form, which are typically shorter and more casual than literary works. Therefore, another discipline, inspired by and descended from Stylometry [2], arose from the Information Theory field to tackle the authorship identification issue and to meet the need of practical applications in the classification of modern texts. In the following section, other methods of tackling authorship identification on modern texts will be introduced.

[2] Readers who are interested in other intriguing stylometry research can refer to Whissell’s (1996) emotional stylometry tested on John Lennon’s and McCartney’s lyrics during the years 1962-1970, and Burrows’ (1987) authorial and chronological tests on Jane Austen’s text segments.

2.2 Authorship Analysis

Following the long history of stylometry research, a rather new research field called Authorship Analysis has emerged in a similar fashion but differs slightly in research topic. This research field is mainly grounded in the Information Retrieval and Data Mining community. With the growing need to cope with the tremendous data flow of the digital age, manual investigation of every digital text is not practical or feasible; we need an automatic technique to help us verify the possible suspect behind a fraudulent e-mail, no matter which of many IDs he or she chose as a disguise, and to filter possible spamming and phishing by automatic computational classification. Sharing the same goal with Stylometry, authorship analysis investigates authorial writing styles and makes heavy use of linguistic attributes to distinguish one style from another. Nevertheless, authorship analysis places more emphasis on real-world applications. The new discipline is not only interested in disputed literary authorship; it also extends its reach to different categories of items with disputed authorship, including software code, musical patterns, documents, and CMC (computer mediated communication) messages, in order to meet the new demands of the digital age. Thus, to suit the nature of the new genre of online texts, the main difference between stylometric study and authorship analysis lies in the attributes or clues investigated. The attributes are no longer limited to linguistic features; structural and idiosyncratic characteristics belonging to every individual’s personal writing habits are considered as well. For instance, in addition to linguistic features such as ‘vocabulary richness’, ‘the mean length of sentence or paragraph’, and ‘the use of punctuation and function words’, research in this field also

incorporates structural, paragraph-based, or digit-based features such as ‘total number of digit characters’, ‘has a greeting’, ‘has quoted content’, or ‘use e-mail as signature’, in order to gather as much evidence as possible (Iqbal et al., 2010; J. Ma et al., 2009, 2008). Among the broad interests in the field of Authorship Analysis [3], authorship identification is the most similar to stylometric study, as they share the same purpose of exploring an author’s style and identifying anonymous texts. In the subsections of 2.2, more details about the conventional approaches in this research field, including the different text genres, the features concerned, and the classification techniques, will be introduced.

[3] The broad interest in the field of authorship analysis can be further divided into three fields, namely ‘authorship identification/verification’, ‘authorship characterization/profiling’, and ‘similarity/plagiarism detection’.

2.2.1. Authorship Identification

The goal of the authorship identification task has been clarified by Iqbal et al. (2010). A disputed textual document is not the only object of the identification task. The object in question could be a piece of computational code, a textual document, or an online message. Because both the objects in question and the methodologies are diverse, the accuracy of a single authorship verification task should not be the only aim to consider, as the criteria and the settings vary from task to task. Differences in text pool size, sources, features extracted, feature selection, the number of examined authors and texts, and the classification technique adopted are all factors that play a role. Although it is impossible to unify all the testing grounds, at least given experiments controlled in

the same setting, we could provide a close scrutiny of the relation between the performances of different levels of feature sets and the underlying implications they might suggest. Therefore, the current study will examine different levels of linguistic feature sets and discuss how these feature sets are represented in Chinese texts. In the following section, we start by introducing the characteristics of studies in authorship identification.

2.2.2. Characteristics of authorship identification

Different purposes lead to different ways of research. Research on authorship identification descended from literary stylometry and investigates the probability of authentic authorship of specific pieces of writing by examining the linguistic fingerprint that an author leaves in his or her words. The main difference between authorship identification and stylometric study lies in the genre of the target texts and the classification techniques. While most stylometry studies adopted statistics and focused on the idiosyncratic writing styles of well-known writers, such as Shakespeare, studies of authorship identification aim at individuals’ texts generated in daily activities, including e-mails, posts on the Internet, etc., and fulfill practical purposes such as cybercrime detection, personal language style analysis, and text classification by applying machine learning methods as the examination tools. Text length and writing format are other important characteristics that distinguish authorship identification from stylometric study. The target texts in the literary field were usually lengthy, written with careful thought, and composed in a neat format, and they were studied across the author’s different writing periods to trace chronological style development. On the other hand, authorship analysis focuses

mainly on comparatively shorter texts, for instance e-mails, blogs, BBS posts, internet advertisements, and other daily activities that involve the use of language. They are generally produced in a less formal setting, with more casual language use, more frequent speech errors or typos, and more unconscious habitual writing preferences; for example, the use of punctuation or the greeting in the opening is deemed idiosyncratic evidence of individual writing style. Owing to these distinctive characteristics, feature extraction in authorship identification takes a different perspective. Stylometric study cares more about a writer’s linguistic features (the specific topic vocabulary, the choice of diction, the preference for word categories, the semantic perspective of subjectivity and objectivity, and those mentioned in the previous section) in order to compare the writing styles of noted authors. By providing explicit, comprehensive, and objective linguistic features, it gives each writing style a quantitative observation instead of a critique that is generally based on the viewer’s linguistic intuition. In contrast, authorship identification targets a slightly different objective, namely identifying the authorship of a specific piece of work by discovering the author’s unconscious writing habits, so idiosyncratic features, rather than general linguistic features, serve as the primary factor. Abbasi and Chen (2008) provided a detailed and comprehensive overview of the feature sets of previous studies in online stylometric analysis, and classified the previous research with respect to tasks (e.g. author identification and similarity detection), domains (e.g. asynchronous CMC, synchronous CMC, documents, and program code), features (e.g. number of categories, number of features, and feature-set type), and classes (the maximum number of classes in experiments).
The feature sets they proposed in their study were categorized into 5 conventional groups: Lexical, Syntactic, Structural, Content, and Idiosyncratic. Lexical

features were mainly composed of the frequency or percentage of characters, words in different units, and bigrams and trigrams of characters. Syntactic features were formed by function words and POS information. Other than the abovementioned sequential linguistic information, the structural category contained the use of greetings, URLs, quotations, and technical structure, such as whether a file is attached. The content category was measured by word information (e.g. manually selected keywords). The final idiosyncratic category picked out misspellings. Among the various studies, Zheng, Li, Chen & Huang (2006) conducted a pioneering study of authorship identification on short online messages in both English and Chinese. The features they extracted are organized as follows:

Table 4. Feature extraction in Zheng et al. (2006)

F1: Lexical features
Character-based features:
1. Total number of characters / alphabetic characters / upper-case characters / digit characters / white-space characters / tab spaces
2. Frequency of letters / special characters
Word-based features:
1. Total number of words / short words (shorter than 4 characters) / characters in words
2. Average word length / sentence length in terms of characters / sentence length in terms of words
3. Total different words
4. Hapax legomena (frequency of once-occurring words)
5. Hapax dislegomena (frequency of twice-occurring words)
6. Different measurements of vocabulary richness (5 kinds of vocabulary richness value)
7. Word length frequency distribution

F2: Syntactic features
1. Frequency of punctuation
2. Frequency of function words

F3: Structural features
1. Total number of lines / sentences / paragraphs
2. Number of sentences per paragraph / words per paragraph / characters per paragraph
3. Has a greeting
4. Has separators between paragraphs
5. Has quoted content
6. Position of quoted content
7. Indentation of paragraph
8. Use e-mail as signature
9. Use telephone as signature
10. Use URL as signature

F4: Content-specific features
1. Frequency of content-specific keywords (11 features, manually chosen: “deal”, “sale”, “wtb”, etc.)

Zheng et al. adopted three different machine learning algorithms, C4.5 (decision tree), NN (backpropagation neural networks), and SVM (Support Vector Machine), to evaluate both English and Chinese short online messages from 5 to 20 authors in each language. The number of messages ranged from 10 to 30

for every individual author. The texts were retrieved from online bulletin boards and forums, with an average length of 169 and 807 words in English and Chinese respectively. They reported the best performance with the SVM classification technique, and the following discussion is based on the SVM result. The accuracy rate was outstanding in English, and had room for improvement in Chinese.

Table 5. The accuracy rate in both languages tested by SVM

| Feature set | English data set | Chinese data set |
|---|---|---|
| F1 | 89.36% | 57.78% |
| F2 | 90.03% | 69.16% |
| F3 | 94.66% | 82.77% |
| F4 | 97.69% | 88.33% |

As seen in Table 5, lexical features (F1) provided a convincing accuracy of 89.36% in the English data set; in contrast, the accuracy in the Chinese data set was barely acceptable at 57.78%. As the different layers of feature sets accumulated, the accuracy in the Chinese data set increased sharply from 57.78% to 88.33%. Among the four feature sets, Syntactic (F2) and Structural features (F3) contributed significantly to the accuracy in the Chinese case. Although the final accuracy rate of 88.33% in Chinese seems impressive, several problems are left to be discussed. First, in their experiment, they avoided the use of POS in their data by claiming that POS tagging was still immature for some languages such as Chinese. They also noted that word boundaries do not exist and words are adjacent to each other in Chinese sentences. Without mentioning the use of

segmentation in the Chinese data set, it seems that the lexical features they applied to Chinese text were merely character-based, treating every Chinese character as a word and neglecting the fact that the socio-psychological status of the word does exist in Chinese. The word boundary issue in Chinese is very complicated and open to debate; however, segmentation systems for Chinese are well developed and widely used. Most systems have reached a promising accuracy, such as CKIP [4], provided by Academia Sinica, and ICTCLAS [5], developed by the Chinese Academy of Sciences. The lack of word segmentation in their experiment resulted in different test grounds for the English and Chinese data: several F1 features were omitted in the Chinese data, for instance the total number of short words (shorter than four characters), the total number of characters in words, the average word length, the average sentence length in terms of characters in words, and so on. Also, due to the different test grounds, the average message length was 169 words in the English data set but 807 words in the Chinese data set. It is puzzling that the richer text base, which was supposed to contain more clues or characteristics to extract, nevertheless performed worse than the more confined one. This could explain why the accuracy improved significantly when the F2 feature set, which contained predefined function words, came into play, because it was actually Chinese word units that came into play. Therefore, a word-based feature set could be an important linguistic clue that deserves further investigation.

[4] CKIP: http://ckipsvr.iis.sinica.edu.tw/
[5] ICTCLAS 2013: http://ictclas.nlpir.org/

The second issue is a general phenomenon seen in most authorship identification experiments, that is, the lack of examination of intrinsic linguistic features in Chinese texts. This could possibly be accounted for by the interdisciplinary nature of this research field. Most research of this kind came from the Information Science

field, and very little came from the linguistic domain. Thus, the plausible linguistic feature sets proposed in previous research, categorized by a linguistic taxonomy into lexical, syntactic, structural, and content-specific features, may require more careful inspection or definition from a linguistic point of view, and could benefit from different perspectives. Feature selection in this field remains an open question, since there is no consensus about the optimal number of feature sets. In the following section we explore the feature selection process further.

2.2.3. Feature Selection

Feature selection is an essential yet complicated issue in authorship identification. It has long been a research question open to debate, and no consensus on the optimal feature sets has been reached. As Diederich, Kindermann, Leopold & Paass (2003) noted, even when 1,000 style markers have been specified (Rudman, 1998), there is no consensus on the significant style markers. They conclude that almost all words contain some information for text categorization. They also mention a carefully conducted study by Joachims (1998) examining the information contribution of different numbers of features:

“Joachims ranked 10,000 word stems of a large corpus according to their information gain with regard to some classification. It turned out that a model using features with ranks 201-500 performed nearly as well as the best features in the top 1-200, and similar to the feature set 4001-9962.”

From Joachims’ experiment, even when information gain (a method widely used in

classification to calculate the additional information that one more feature carries and contributes to the result) was taken into consideration, features ranked 1-200 performed only marginally better than those ranked 201-500 or 4001-9962. Numerous reasons can account for the difficulty of feature selection, but the main one is that language itself is sophisticated and complex, comprising lexical, syntactic, semantic, structural, and cognitive levels, among others. Many features at different linguistic levels jointly contribute to an author’s language use. Features can be further divided into two categories, static and dynamic (Abbasi, 2008).

Static and dynamic features

Static features are those that can be calculated for every author’s texts. Common static features are mean word length, mean sentence length in terms of words and characters, number of words, number of lines, vocabulary richness, hapax legomena (the number of words that occur only once), etc. These features can be applied to every training and testing text, because they are simply numerical characteristics representing each text based on frequency, length, and average frequency or length (in terms of rate). Dynamic features, on the other hand, are not as intuitive as static ones. In contrast to static features, dynamic features vary with each author’s distinctive writing habits. Common dynamic features in previous research are n-grams (Abbasi, et al., 2008; Houvards & Stamatatos, 2006; Diederich et al. 2003; Keselj, Peng, Cercone, Thomas, 2003; Peng, Schuurmans, Wang & Keselj, 2003; William & John, 1994), POS-grams (Diederich, et al. 2003; Abbasi, et al. 2008), and specific keywords, which are used to detect contextual information and an author’s preference for specific topics or vocabulary, including both content words (Abbasi, et al. 2008; R.

Zheng, et al. 2006; Diederich, et al., 2003; Martindale & McKenzie, 1995) and function words (Mosteller & Wallace, 1964; Martindale & McKenzie, 1995; R. Zheng, et al. 2006). Generally speaking, if we view the whole text database of one specific author as a country and the words in it as its population, then the static features serve as demographics, giving a statistical description of the country, while the dynamic features aim to uncover the relationships within the population, for example, neighborhood relationships: an author might prefer certain sequences of words, consciously or unconsciously, and this is part of what is known as the author’s idiolect.

N-gram

Among the dynamic features, the n-gram is worthy of attention. The n-gram technique has been one of the most commonly adopted feature sets in previous authorship attribution tasks. For instance, Abbasi, et al. (2008) integrated word-level and character-level features (unigrams, bigrams, and trigrams) as well as syntactic and structural features. Some studies used only a bag of n-grams as features, extracting numerous n-grams of different lengths from an author’s texts and neglecting all other features, as Houvards et al (2006) did in “N-gram feature selection for authorship identification”, Keselj, et al (2003) in “N-gram-based author profiles for authorship attribution”, William and John (1994) in “N-gram-based text categorization”, and many others (Stamatatos, Fakotakis, Kokkinakis, 2000, 1999). Although each study has slightly different parameters and weighting, and even differs in the computing method, ranging from the simplest vector distance comparison (Bennett, 1976) and Naïve Bayes probability theory (Peng et al., 2003) to the sophisticated

SVM classifier (Houvards et al, 2006) in the n-gram experiments, the main purpose and assumption are the same: the n-gram approach can help extract contextual information, and the more frequent sequences better represent an author’s writing style or topics of interest. As for the variation, experiments differ in n-gram length and profile size. Keselj, et al. reported their best results for 1000 ≤ L ≤ 5000 and 3 ≤ n ≤ 5, where L is the profile size (the number of n-grams kept) and n is the n-gram length (3-gram to 5-gram). Tsuboi & Matsumoto (2002) used unigrams, bigrams and trigrams as the feature set for Japanese e-mail documents and obtained satisfactory performance. Other studies did not predefine the n-gram length; instead, they adopted frequent pattern mining (a data mining technique generally used in transaction mining to discover sets of frequently purchased items) to discover the longest possible frequent sequences (Ma, et al 2008).

Feature Sets in Chinese Authorship Identification

The target language also plays an important role in the choice of features. Since Chinese has no natural delimiter between words, either a robust and accurate segmentation technique must be applied to separate words and obtain word units, or, alternatively, every Chinese character must be treated as a unigram, with n-grams of unspecified length used to compensate for the loss of word information. Ma, et al (2008) tried the segmented and the unsegmented strategy in two works, “Identifying Chinese E-mail Documents’ Authorship for the Purpose of Computer Forensic” and “Sequential Pattern Mining for Chinese E-mail Authorship Identification”, respectively, and both reached satisfactory results. In the former study, 150 emails written by 5 persons were collected, 20 emails from each author as training data and 10 as testing data, and they adopted different levels of feature

types, by their definition, including 1,000 linguistic features (1,000 words ranked by information gain), structural features (mainly static features, i.e., the mean and rate of sentence/paragraph length), and format features (for instance, the use of a greeting, whether a signature is included, etc.). Their results showed the accuracy of different combinations of feature sets, as in Table 6.

Table 6. The experimental results of different feature set combinations

Feature set                              Mean F score
F1 (linguistic)                          83.04%
F1 + F2 (linguistic + structural)        92.88%
F2 + F3 (structural + format)            97.59%
F1 + F2 + F3 (all)                       98.36%

From their results, we can see that although F1 (1,000 words) performed well, F2 + F3 (structural + format) outperformed the bag of words. This may result from the choice of a particular text format, emails, which have a relatively fixed format and thus provide abundant format information. We can compare this result with Ma, et al (2008), “Sequential Pattern Mining for Chinese E-mail Authorship Identification”: given the same text format (emails) and the same target language, they adopted only frequent word sequences as the feature set. Although the result was still sound, the accuracy varied with “the distinctness of author’s pattern features”, according to the authors. In that experiment only the contextual information in the emails was used, and the special format information of emails was therefore sacrificed. In sum, the n-gram technique has been especially popular in Chinese authorship identification and has performed well in such

tasks, but additional information, such as static or format features, can still help to improve the final results.

2.2.4. Feature Weighting

As previously noted, it is difficult to evaluate the contribution of each feature individually, which makes it hard to design a scale or numerical weighting for each feature for calculation purposes. In current machine learning applications, whatever algorithm is chosen (SVM, Neural Network, Decision Tree, PCA, etc.), a numerical representation of each document is needed, so how to weight each feature and assign it an appropriate value becomes a crucial issue. Many weighting schemes are used in practice. Though they vary according to different concerns, most are based on the frequency of each feature. The following introduces several popular weighting techniques used in this field.

tf-idf in VSM (Vector Space Model)

The Vector Space Model (VSM) is widely adopted in the Information Retrieval domain and is used to evaluate the similarity of two documents in terms of vectors. The method gained its popularity through its implementation in search engines. When a search engine needs to calculate the similarity between two documents, which might share the same topic and be of interest to users, the words in the two documents are represented by weights based on their frequency. The whole document is represented as term and weight pairs. For example, V1 = ((t1, w1), (t2, w2), (t3, w3), …, (tn, wn)), where V1 is the vector of document 1, represented by terms 1 to n and the weights of terms 1 to n. Then vector 2, representing document 2, can be formed