Chapter 4 Result and Discussion

4.5 Linguistic Discussion

Last but not least, the unique characteristics of Chinese play an important role in the choice of linguistic features. Unlike alphabetic languages, in which words are formed by sequences of sounds, Chinese words are composed of meaningful units: characters. Because of this, fewer characters are needed in Chinese than in an alphabetic language to carry the same amount of information, and the mean length of Chinese words is shorter than that of English words. As a result, the non-segmented (character-based) unigram in Chinese provides more information and gives a barely satisfactory result: around 60% accuracy on a 5-6 category classification task, in both Zheng et al. (2006) and the current study. In the English case, Keselj et al. (2003) reported their best results for

1000 ≤ L ≤ 5000 and 3 ≤ n ≤ 5,

where L is the profile size (the number of n-grams kept per author) and n is the n-gram length (3-grams to 5-grams).
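To make the two parameters concrete, the following minimal Python sketch builds a character n-gram profile in the spirit of Keselj et al. (2003): the L most frequent n-grams of a text together with their relative frequencies. The function name ngram_profile and the toy input are illustrative assumptions and are not taken from either study.

```python
from collections import Counter


def ngram_profile(text, n, L):
    """Return the L most frequent character n-grams of `text`,
    mapped to their relative frequencies (hypothetical helper)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.most_common(L)}


# Toy English input; real profiles would use L in the thousands.
profile = ngram_profile("the cat sat on the mat", n=3, L=5)
```

With real data, one such profile would be built per author, and an unseen text would be assigned to the author whose profile is most similar to its own.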

Compared with English, the character-based unigram in Chinese is therefore more powerful and efficient than the unigram in an alphabetic language. Accuracy in the Chinese case can be further improved by word segmentation. Although each Chinese character bears its own meaning, characters combine with their neighbors to derive a further, related meaning and form what is socio-psychologically perceived as a word unit. The segmented (word-based) unigram gave better prediction results than the character-based unigram. The results also show that the word-based bigram is not a necessary feature, since the useful contextual information is already captured by unigrams owing to the nature of Chinese morphology.
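The difference between the feature types discussed in this section can be illustrated with a short Python sketch. The example sentence and its segmentation are hypothetical and hand-made; the word segmenter actually used in the experiments is not reproduced here.

```python
def ngrams(units, n):
    """Return all contiguous n-grams over a sequence of units."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


sentence = "我今天很開心"                 # toy post (assumption)
words = ["我", "今天", "很", "開心"]      # hypothetical manual segmentation

char_unigrams = ngrams(list(sentence), 1)  # character-based unigrams
word_unigrams = ngrams(words, 1)           # word-based (segmented) unigrams
word_bigrams = ngrams(words, 2)            # word-based bigrams
```

Here the word-based unigrams keep 今天 and 開心 intact as units, whereas the character-based unigrams split them into single characters.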

Chapter 5 Conclusion

The detection of individual writing style, also termed authorship attribution, authorial style, writeprints, or stylometry, has been a popular research interest across disciplines. Researchers in stylometry adopt different measures (e.g. indices of vocabulary richness, relative frequencies of function words) to capture the authorial style of well-known authors. Those in the information field use classification techniques (e.g. PCA, SVM, neural networks) to meet the heavy demand for text classification and online fraud detection. Cognitive linguists are interested in expressions that reflect one's beliefs and cognition, and sociolinguists in the different language uses among subcultural groups. Techniques for detecting individual language differences are therefore important in many disciplines.

The current thesis examines individual language differences in casual Chinese contexts, where authors do not display an obvious, strong writing style and a more subconscious use of language can be observed instead. The experiments performed an authorship identification task on Facebook posts from six chosen authors, using the SVM classification software package LIBLINEAR. Among the five levels of feature sets, the Linguistic feature set (F1), which comprises a group of lexicons, yields accuracy rates of 59% (character-based) and 66% (word-based).

Segmented words outperformed characters in the experiment. In addition, the Punctuation/Symbols feature set (F2) is an important set of individual features, as it embeds the emoticon information widely used by Facebook users. With the F2 feature set incorporated, accuracy reaches up to 71.85%. As for the other feature sets, i.e., the Structure feature set (F3), the Subjectivity feature set (F4), and the Emotion feature set (F5): F3 contributed little to authorial style detection in the genre of casually written short texts, because a large proportion of Facebook posts are short and the structural information of short texts is limited. F4 and F5 did not perform as well as expected, because the features in these categories that were selected by the Information Gain (IG) technique are already included in the F1+F2 set. The inclusion of subjectivity and emotion features in the IG feature selection implies that the degree of subjectivity and the use of emotional expressions are important indicators for measuring individual differences in a CMC corpus.
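For readers unfamiliar with Information Gain, the following Python sketch computes it in its common presence/absence form, IG(f) = H(C) - P(f)H(C|f) - P(not f)H(C|not f), on a toy set of posts. Whether the experiments used exactly this variant is an assumption, and the posts, labels, and function names are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(docs, labels, feature):
    """IG of the presence/absence of `feature` with respect to the labels."""
    present = [y for d, y in zip(docs, labels) if feature in d]
    absent = [y for d, y in zip(docs, labels) if feature not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


# Toy posts as sets of features; author labels are hypothetical.
docs = [{"!", "好"}, {"!", "我"}, {"...", "好"}, {"...", "我"}]
labels = ["A1", "A1", "A6", "A6"]

ig_exclamation = information_gain(docs, labels, "!")   # separates the authors
ig_hao = information_gain(docs, labels, "好")          # carries no author signal
```

In this toy case the punctuation mark fully separates the two authors while 好 does not, which mirrors why punctuation features can rank high in an IG-ordered list.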

Future work on authorship identification in Chinese CMC corpora can consider several directions.

1 For those who want to investigate purely linguistic discrimination among individuals: since bigrams did not give good results in the experiment, non-adjacent, frequently co-occurring words could be considered instead, because some idiolects are formed by special sentence structures in which the signature words are not always adjacent to each other.

2 For those who want to achieve higher performance in text classification: meta-information (e.g. time of posting, location, friends who reply to the post) can be added to increase accuracy.
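The first direction could be prototyped along the following lines. This is only a sketch of one possible reading of "non-adjacent frequent co-occurring words": ordered word pairs separated by at least one other word and at most max_gap positions, kept if they occur in at least min_posts posts. The function names, both thresholds, and the toy posts are assumptions.

```python
from collections import Counter


def nonadjacent_pairs(tokens, max_gap=4):
    """Ordered word pairs that are 2..max_gap positions apart in one post."""
    pairs = set()
    for i, left in enumerate(tokens):
        for j in range(i + 2, min(i + max_gap + 1, len(tokens))):
            pairs.add((left, tokens[j]))
    return pairs


def frequent_pairs(posts, min_posts=2, max_gap=4):
    """Pairs that appear (non-adjacently) in at least `min_posts` posts."""
    counts = Counter()
    for tokens in posts:
        counts.update(nonadjacent_pairs(tokens, max_gap))
    return {pair: c for pair, c in counts.items() if c >= min_posts}


# Two hypothetical segmented posts sharing the frame 雖然 ... 但是.
posts = [
    ["雖然", "今天", "下雨", "但是", "開心"],
    ["雖然", "很", "累", "但是", "值得"],
]
signature = frequent_pairs(posts)
```

In this toy input the only pair shared by both posts is the discontinuous frame 雖然 ... 但是, which a bigram feature would miss.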

Bibliography

Abbasi, A., & Hsinchun, Chen. (2005). Applying authorship analysis to extremist-group Web forum messages. Intelligent Systems, IEEE, 20(5), 67-75. doi: 10.1109/MIS.2005.81

Abbasi, Ahmed, & Hsinchun, Chen. (2008). Writeprints: A Stylometric Approach to Identity-Level Identification and Similarity Detection in Cyberspace. ACM Transactions on Information Systems, 26(2), 7:1-7:29.

Argamon, Shlomo, Šarić, Marin, & Stein, Sterling S. (2003). Style mining of electronic messages for multiple authorship discrimination: first results. Paper presented at the Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining.

Auria, L., & Moro, R. A. (2007). Advantages and Disadvantages of Support Vector Machines (SVMs). Paper presented at the Credit Risk Assessment Revisited: Methodological Issues and Practical Implications.

Baayen, R.H., Van Halteren, H., & Tweedie, F.J. (1996). Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11(3), 121-132. doi: 10.1093/llc/11.3.121

Bennett, William Ralph. (1976). Scientific and engineering problem-solving with the computer: Prentice Hall PTR.

Biber, Douglas. (1991). Variation across speech and writing: Cambridge University Press.

Burrows, J.F. (1989). "An ocean where each kind...": Statistical analysis and some major determinants of literary style. Computers and the Humanities, 23, 309-321.

Burrows, John. (2002). 'Delta': a Measure of Stylistic Difference and a Guide to Likely Authorship. Lit Linguist Computing, 17(3), 267-287. doi: 10.1093/llc/17.3.267

Burrows, John. (2003). Questions of Authorship: Attribution and Beyond. Computers and the Humanities, 5-32.

Burrows, John. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Lit Linguist Computing, 22(1), 27-47.

Burrows, John F. (1987). Word-patterns and story-shapes: The statistical analysis of narrative style. Literary and Linguistic Computing, 2(2), 61-70.

Chaikin, David. (2006). Network investigations of cyber attacks: the limits of digital evidence. Crime, Law and Social Change, 46(4-5), 239-256. doi: 10.1007/s10611-007-9058-4

Chang, Chih-Chung, & Lin, Chih-Jen. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27:1-27:27.

Chen, Keh-Jiann, & Liu, Shing-Huan. (1992). Word identification for Mandarin Chinese sentences. Paper presented at the Proceedings of the 14th conference on Computational linguistics-Volume 1.

D.L.Wallace, F. Mosteller and. (1998). Text categorization with support vector machines: Learning with many relevant features. Paper presented at the European Conference on Machine Learning (ECML).

Diederich, J., Kindermann, J., Leopold, E., & Paass, G. (2003). Authorship attribution with support vector machines. Applied Intelligence, 19(1-2), 109-123.

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. (2008). LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9, 1871-1874.

Hadjidj, Rachid, Debbabi, Mourad, Lounis, Hakim, Iqbal, Farkhund, Szporer, Adam, & Benredjem, Djamel. (2009). Towards an integrated e-mail forensic analysis framework. Digital Investigation, 5(3), 124-137.

Holmes, David I. (1992). A stylometric analysis of Mormon scripture and related texts. Journal of the Royal Statistical Society. Series A (Statistics in Society), 91-120.

Holmes, David I, & Forsyth, Richard S. (1995). The Federalist revisited: New directions in authorship attribution. Literary and Linguistic Computing, 10(2), 111-127.

Hoover, David L. (2004). Testing Burrows's delta. Literary and Linguistic Computing, 19(4), 453-475.

Hope, Jonathan. (1994). The Authorship of Shakespeare's Plays: A socio-linguistic study: Cambridge University Press.

Houvardas, John, & Stamatatos, Efstathios. (2006). N-gram feature selection for authorship identification. Artificial Intelligence: Methodology, Systems, and Applications (pp. 77-86): Springer.

Iqbal, Farkhund, Binsalleeh, Hamad, Fung, Benjamin, & Debbabi, Mourad. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1), 56-64.

Iqbal, Farkhund, Hadjidj, Rachid, Fung, Benjamin, & Debbabi, Mourad. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5, S42-S51.

Iqbal, Farkhund, Khan, Liaquat A., Fung, Benjamin C. M., & Debbabi, Mourad. (2010). E-mail authorship verification for forensic investigation. Paper presented at the Proceedings of the 2010 ACM Symposium on Applied Computing, Sierre, Switzerland.

Jaynes, J. T. (1980). A search for trends in the poetic style of W.B. Yeats. Association for Literary and Linguistic Computing Journal, 1, 11-19.

Jianbin Ma, Guifa Teng, Shuhui Chang, Xiaoru Zhang, Ke Xiao. (2011). Social Network Analysis Based on Authorship Identification for Cybercrime Investigation. In M. Chau, G. A. Wang, X. Zheng, H. Chen, D. Zeng & W. Mao (Eds.), Intelligence and Security Informatics (Vol. 6749, pp. 27-35): Springer Berlin Heidelberg.

Jianbin Ma, Ying Li, and Guifa Teng. (2008). Identifying Chinese E-mail Documents' Authorship for the Purpose of Computer Forensics.

Jianbin Ma, Ying Li, Guifa Teng, Fang Wang, Yang Zhao. (2008). Sequential Pattern Mining for Chinese E-mail Authorship Identification. Paper presented at the Innovative Computing Information and Control, 2008. ICICIC '08. 3rd International Conference on.

Joachims, Thorsten. (1998). Text categorization with support vector machines: Learning with many relevant features: Springer.

Keselj, Vlado, Peng, Fuchun, Cercone, Nick, & Thomas, Calvin. (2003). N-gram-based author profiles for authorship attribution. Paper presented at the Proceedings of the Pacific Association for Computational Linguistics.

Koppel, Moshe, Argamon, Shlomo, & Shimoni, Anat Rachel. (2002). Automatically Categorizing Written Texts by Author Gender. Literary and Linguistic Computing, 17(4), 401-412. doi: 10.1093/llc/17.4.401

Ma, Jianbin, Teng, Guifa, Zhang, Yuxin, Li, Yueli, & Li, Ying. (2009). A cybercrime forensic method for Chinese web information authorship analysis. Intelligence and Security Informatics (pp. 14-24): Springer.

Martindale, C., & McKenzie, D. (1995). On the utility of content analysis in author attribution: The Federalist. Computers and the Humanities, 29, 259-270.

Opas, Lisa Lena. (1996). A Multi-Dimensional Analysis of Style in Samuel Beckett's Prose Works. Research in Humanities Computing, 4, 81-114.

Peng, Fuchun, Schuurmans, Dale, Wang, Shaojun, & Keselj, Vlado. (2003). Language independent authorship attribution using character level language models. Paper presented at the Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics-Volume 1.

Rong Zheng, Jiexun Li, Hsinchun Chen, Zan Huang. (2006). A Framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science & Technology, 57(3), 378-393. doi: 10.1002/asi.v57:3

Rudman, J. (1998). The state of authorship attribution studies: Some problems and solutions. Computers and the Humanities, 31, 351-365.

Stamatatos, Efstathios, Fakotakis, Nikos, & Kokkinakis, George. (1999). Automatic authorship attribution. Paper presented at the Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics.

Stamatatos, Efstathios, Fakotakis, Nikos, & Kokkinakis, George. (2000). Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4), 471-495.

Tsuboi, Yuta, & Matsumoto, Yuji. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, 17-24.

Tweedie, F.J., & Baayen, R.H. (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32, 323-352.

Vel, O. de, Anderson, A., Corney, M., & Mohay, G. (2001). Mining e-mail content for author identification forensics. SIGMOD Rec., 30(4), 55-64. doi: 10.1145/604264.604272

Whissell, Cynthia. (1996). Traditional and emotional stylometric analysis of the songs of Beatles Paul McCartney and John Lennon. Computers and the Humanities, 30(3), 257-265.

William B. Cavnar, John M. Trenkle. (1994). N-gram-based text categorization. Paper presented at the Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval.

Yang, Yiming. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1-2), 69-90.

Yu, H.-F., Ho, C.-H., Juan, Y.-C., & Lin, C.-J. (2013). LibShortText: A Library for Short-text Classification and Analysis.

Appendix A: The topmost IG-3000 words

Frequency >1000

IG rank | Word | Frequency
2917 | 流汗 | 2
2918 | 姆額 | 2
2920 | 問號 | 2
2923 | 吉 | 2
2930 | kiss | 2
2931 | TEARS | 2
2935 | 抬頭 | 2
2936 | DO | 2
2937 | 原先 | 2
2941 | 值班 | 2
2942 | 頭腦 | 2
2943 | 起勁 | 2
2948 | 闔上 | 2
2951 | 胸 | 2
2954 | NEED | 2
2957 | 飲 | 2

Appendix B: Error analysis

The following feature tables were generated from the wrongly predicted posts; they highlight the features that differ strongly between a given author and the rest of the authors. The model file produced by LIBLINEAR provides the contribution score of each feature for each author category. Therefore, by subtracting, for each feature in a wrongly predicted post, the contribution score for the original author A_k from the score for the wrongly predicted author A_j, we obtain a ranked list of the strong features of A_j that misled the classifier into choosing A_j as the author. To save space, only features whose difference value exceeds 0.65 are shown.

The format reads as follows:

diff value: f_i_A_j - f_i_A_k (>0.65) | f_i_IG | f_i | total_frequency | local_frequency
f_i_A1 | f_i_A2 | f_i_A3 | f_i_A4 | f_i_A5 | f_i_A6

where f_i is the feature, f_i_IG is its IG rank, total_frequency is the total frequency of the word in the text pool, and local_frequency is the frequency of the word in the wrongly predicted post. A_j is the wrongly predicted author, and A_k is the original author. The second row shows the six contribution values that the feature contributes to the author categories A1 to A6.
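The diff value can be reproduced from the second row of each entry. The Python sketch below assumes the per-author contribution scores have already been read from the model file into a dictionary (the parsing step, the function name, and the data layout are assumptions); the two score rows are copied from the tables in this appendix.

```python
def misleading_features(weights, predicted, actual, threshold=0.65):
    """Rank features by score(predicted author) - score(original author).

    `weights` maps a feature to its six per-author contribution scores;
    authors are numbered from 1, as in the tables.
    """
    rows = []
    for feature, scores in weights.items():
        diff = scores[predicted - 1] - scores[actual - 1]
        if diff > threshold:
            rows.append((diff, feature))
    return sorted(rows, reverse=True)


# Score rows as printed for the features ")" and 祝 (rounded to 4 decimals).
weights = {
    ")": [0.595, -0.1362, 0.262, 0.133, -0.294, -0.945],
    "祝": [0.076, -0.1975, -0.124, -0.1066, 0.4586, -0.2509],
}

# A post by author 5 that was wrongly predicted as author 1.
rows = misleading_features(weights, predicted=1, actual=5)
```

For the feature ")" this gives 0.595 - (-0.294) = 0.889, which agrees with the listed diff of 0.888962975147 up to the rounding of the printed scores; 祝 falls below the threshold for this author pair and is dropped.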

Author 1

Table 24. Features that cause posts by author 2 to be wrongly predicted as author 1

diff:1.30665498671 2 ! 7131 freq 2

Table 25. Features that cause posts by author 3 to be wrongly predicted as author 1

diff:1.46357669305 9 ！ 1561 freq 2

Table 26. Features that cause posts by author 4 to be wrongly predicted as author 1

Table 27. Features that cause posts by author 5 to be wrongly predicted as author 1

diff:0.888962975147 8 ) 1364 freq 4

0.595 -0.1362 0.262 0.133 -0.294 -0.945

Table 28. Features that cause posts by author 6 to be wrongly predicted as author 1

diff:2.24001206941 2 ! 7131 freq 6

Author 2

Table 29. Features that cause posts by author 1 to be wrongly predicted as author 2

diff:1.08838317804 49 突然 92 freq 1

Table 30. Features that cause posts by author 3 to be wrongly predicted as author 2

diff:0.858864711769 60 歌 102 freq 1

Table 31. Features that cause posts by author 4 to be wrongly predicted as author 2

diff:1.11153206506 117 適合 40 freq 1

-0.4007 0.9485 -0.1087 -0.163 -0.1884 -0.2538

diff:0.949586003503 60 歌 102 freq 1

-0.295 0.7535 -0.1054 -0.1961 0.0421 -0.2417

diff:0.74133671054 49 突然 92 freq 1

-0.3343 0.7541 -0.0188 0.0127 -0.1877 -0.3952

Table 32. Features that cause posts by author 5 to be wrongly predicted as author 2

diff:0.828858452521 7 ， 7084 freq 1

Table 33. Features that cause posts by author 6 to be wrongly predicted as author 2

diff:1.24914156742 7 ， 7084 freq 5

diff:0.654473326266 120 要 1506 freq 2

0.6584 -0.0179 -0.1937 -0.1594 0.2435 -0.6723

diff:0.650231839206 153 今日 52 freq 1

0.0429 0.5536 -0.2057 -0.3126 -0.1579 -0.0966

Author 3

Table 34. Features that cause posts by author 1 to be wrongly predicted as author 3

diff:0.956372908782 2481 颱風 16 freq 1

-0.3663 0.0343 0.5901 -0.1107 0.0549 -0.2093

diff:0.887970801791 835 餓 22 freq 1

-0.2312 -0.0702 0.6567 -0.1092 -0.1062 -0.2062

diff:0.7823516047 118 真 360 freq 1

-0.5691 -0.1851 0.2133 -0.0981 0.1367 0.4883

diff:0.769295624706 390 QQ 41 freq 1

-0.2039 -0.0215 0.5654 -0.2792 -0.2458 -0.3396

Table 35. Features that cause posts by author 2 to be wrongly predicted as author 3

diff:0.937099579439 1 : 1082 freq 1

-0.2857 -0.3576 0.5795 -0.1641 0.295 -0.4204

diff:0.745151918042 15 ? 670 freq 1

-0.6638 -0.0917 0.6535 0.6422 0.5312 -1.249

Table 36. Features that cause posts by author 4 to be wrongly predicted as author 3

diff:0.844583860266 390 QQ 41 freq 1

-0.2039 -0.0215 0.5654 -0.2792 -0.2458 -0.3396

diff:0.664421700314 149 好 1455 freq 2

0.2446 -0.0309 0.5315 -0.133 -0.2517 -0.5358

Table 37. Features that cause posts by author 5 to be wrongly predicted as author 3

None

Table 38. Features that cause posts by author 6 to be wrongly predicted as author 3

diff:1.06730632191 149 好 1455 freq 1

Author 4

Table 39. Features that cause posts by author 1 to be wrongly predicted as author 4

diff:1.1226545405 5 。 4137 freq 1

Table 40. Features that cause posts by author 2 to be wrongly predicted as author 4

diff:1.48326436548 5 。 4137 freq 1

0.2467 -0.1139 -0.5344 1.3694 -0.8845 -0.4593

Table 41. Features that cause posts by author 3 to be wrongly predicted as author 4

diff:1.90377858134 5 。 4137 freq 3

Table 42. Features that cause posts by author 5 to be wrongly predicted as author 4

diff:1.03268721414 89 ... 117 freq 1

-0.3743 -0.4866 0.0039 1.1147 0.082 -0.5052

diff:0.917536606694 94 . 739 freq 1

-0.1669 -0.3296 -0.1318 0.4841 -0.4335 0.3674

diff:0.785627958427 54 ~ 1434 freq 1

0.0251 0.265 0.1496 0.4914 -0.2942 -0.9064

Table 43. Features that cause posts by author 6 to be wrongly predicted as author 4

diff:1.82868129581 5 。 4137 freq 3

diff:0.824997334232 35 我 4732 freq 2

-0.1354 0.4315 0.0195 0.2667 -0.2327 -0.5583

diff:0.737771644539 61 和 476 freq 1

-0.2181 0.0975 0.005 0.2787 0.0842 -0.459

diff:0.701832324608 162 愛 228 freq 1

0.2563 -0.0494 -0.1597 0.2694 0.0348 -0.4324

Author 5

Table 44. Features that cause posts by author 1 to be wrongly predicted as author 5

diff:1.03009234522 4 DD 264 freq 1

Table 45. Features that cause posts by author 2 to be wrongly predicted as author 5

diff:0.656120381703 490 祝 25 freq 1

0.076 -0.1975 -0.124 -0.1066 0.4586 -0.2509

Table 46. Features that cause posts by author 3 to be wrongly predicted as author 5

diff:0.92074197507 24 明天 228 freq 1

-0.2522 -0.1305 -0.216 -0.0975 0.7047 -0.1659

diff:0.651048337142 22 大家 568 freq 2

-0.4735 0.0902 -0.2451 0.3484 0.406 -0.2252

Table 47. Features that cause posts by author 4 to be wrongly predicted as author 5

None

Table 48. Features that cause posts by author 6 to be wrongly predicted as author 5

None

Author 6

Table 49. Features that cause posts by author 1 to be wrongly predicted as author 6

diff:3.63795036759 3 ... 1517 freq 9

-0.7437 -1.2295 -0.1866 -0.6755 -0.9828 2.8943

diff:2.52333633807 6 , 2069 freq 5

-0.7881 -0.8687 0.0526 -0.6113 -0.0434 1.7353

diff:2.12522189498 19 阿 434 freq 2

-0.4668 0.1393 -0.8973 -0.7217 -0.0847 1.6585

diff:1.12136490307 1518 than 17 freq 3

-0.5004 -0.0488 -0.074 -0.1448 0.0829 0.621

Table 50. Features that cause posts by author 2 to be wrongly predicted as author 6

diff:4.12377135027 3 ... 1517 freq 3

Table 51. Features that cause posts by author 3 to be wrongly predicted as author 6

diff:3.08092418945 3 ... 1517 freq 6

-0.7437 -1.2295 -0.1866 -0.6755 -0.9828 2.8943

diff:2.5557665726 19 阿 434 freq 1

-0.4668 0.1393 -0.8973 -0.7217 -0.0847 1.6585

Table 52. Features that cause posts by author 4 to be wrongly predicted as author 6

diff:3.56982113006 3 ... 1517 freq 2

Table 53. Features that cause posts by author 5 to be wrongly predicted as author 6

diff:1.77863087152 6 , 2069 freq 2

-0.1669 -0.3296 -0.1318 0.4841 -0.4335 0.3674

diff:0.707352109516 18 的 8355 freq 5

-0.2963 0.396 -0.2818 -0.099 -0.394 0.3134

diff:0.702001133705 112 有趣 112 freq 1

-0.3184 -0.0348 0.0618 -0.2023 -0.2194 0.4826

From the document: A Study of Authorship Identification in Chinese Texts: The Case of the Social Networking Site Facebook (pages 100-0)