Chapter 4 Result and Discussion
4.5 Linguistic Discussion
Last but not least, the characteristics of Chinese play an important role in the choice of linguistic features. Unlike alphabetic languages, in which words are formed from sequences of letters that represent sounds, Chinese words are composed of characters, each of which is itself a meaningful unit. Because of this, fewer characters are needed in Chinese than in an alphabetic language to carry the same amount of information, and the mean length of Chinese words is shorter than that of English words. As a result, the non-segmented (character-based) unigram in Chinese is comparatively informative and gives a barely satisfactory result: in both Zheng et al. (2006) and the current study, accuracy is around 60% on a 5-6 category classification task. In the English case, Keselj et al. (2003) reported their best results for
1000 ≤ L ≤ 5000 and 3 ≤ n ≤ 5,
where L is the size of the n-gram profile and n is the n-gram length (3-gram to 5-gram). Compared with English, the character-based unigram in Chinese is therefore more powerful and efficient than a unigram in an alphabetic language. The accuracy in the Chinese case can be further improved by word segmentation. Although each Chinese character bears its own meaning, characters combine with their neighbors to derive a further, related meaning and to form what speakers perceive, socially and psychologically, as word units. The segmented (word-based) unigram gave better prediction results than the character-based unigram. The results also suggest that the word-based bigram is not a necessary feature, since the useful contextual information is already captured by unigrams owing to the nature of Chinese morphology.
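To make the three feature types compared above concrete, the following minimal Python sketch counts character-based unigrams, word-based unigrams, and word-based bigrams for a single post. It assumes the post has already been segmented into words by some external segmenter; the function names and the example post are hypothetical and are not taken from the experiments.

```python
from collections import Counter


def char_ngrams(text, n=1):
    """Count character n-grams in a string, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))


def word_ngrams(tokens, n=1):
    """Count word n-grams (as tuples) in an already segmented token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical pre-segmented post; the segmentation itself is assumed given.
tokens = ["明天", "颱風", "放假"]
post = "".join(tokens)

char_uni = char_ngrams(post, 1)   # character-based unigrams
word_uni = word_ngrams(tokens, 1)  # word-based unigrams
word_bi = word_ngrams(tokens, 2)   # word-based bigrams
```

In this toy post the six characters yield six character unigrams, while segmentation groups them into three word unigrams and two word bigrams, which illustrates how much sparser the bigram features become on short texts.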
Chapter 5 Conclusion
The detection of individual writing style, also termed authorship attribution, authorial style, writeprints, or stylometry, has been a popular research interest across disciplines. Researchers in stylometry adopt different measures (e.g. indices of vocabulary richness, relative frequencies of function words) to capture the authorial style of well-known authors. Those in the information field use classification techniques (e.g. PCA, SVM, neural networks) to meet the heavy demand for text classification and online fraud detection. Cognitive linguists are interested in expressions that reflect a person's beliefs and cognition, and sociolinguists in how language use differs among subcultural groups. Techniques for detecting individual language differences are therefore important in many disciplines.
The current thesis examines individual language differences in casual Chinese writing, where a strong authorial style is not obvious and a more subconscious use of language can instead be observed. The experiments performed an authorship identification task on Facebook posts from 6 chosen authors, using LIBLINEAR, an SVM classification software package. Among the five levels of feature sets, the Linguistic feature set (F1), which comprises a group of lexicons, yields accuracy rates of 59% (character-based) and 66% (word-based).
Segmented words outperformed characters in the experiment. In addition, the Punctuation/Symbols feature set (F2) is an important set of individual features, as it captures the emoticons widely used by Facebook users. With the F2 feature set incorporated, accuracy reaches up to 71.85%. As for the other feature sets, i.e. the Structure feature set (F3), the Subjectivity feature set (F4), and the Emotion feature set (F5): F3 contributed little to authorial style detection in the genre of casual short texts, since a large proportion of Facebook posts are short and the structural information in short texts is limited. F4 and F5 did not perform as well as expected, because the features in these categories that were selected by the Information Gain (IG) technique were already included in the F1+F2 set. The fact that IG feature selection picked subjectivity and emotion features implies that the degree of subjectivity and the use of emotional expressions are important indicators for measuring individual differences in a CMC corpus.
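As an illustration of the Information Gain criterion mentioned above, the sketch below computes the textbook IG of a term's presence or absence with respect to the author label. It is an assumption that this binary-presence formulation corresponds to the variant used in the experiments; the function names, the toy posts, and the author labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in the post' w.r.t. the labels.

    docs is a list of token sets, labels the matching list of author labels.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part)
                      for part in (with_term, without_term) if part)
    return entropy(labels) - conditional


# Hypothetical toy corpus: two posts each from two authors.
docs = [{"!", "好"}, {"!", "真"}, {"。", "好"}, {"。", "真"}]
labels = ["A1", "A1", "A4", "A4"]

ig_exclamation = information_gain(docs, labels, "!")
ig_hao = information_gain(docs, labels, "好")
```

Here the exclamation mark separates the two authors perfectly and so receives the maximum gain of one bit, whereas 好 occurs equally often for both authors and receives a gain of zero. Ranking all features by this score and keeping the top ones is one common way to obtain a list such as the IG-3000 words.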
Future work on authorship identification in Chinese CMC corpora can consider several directions.
1 For those who want to investigate purely linguistic discrimination among individuals: since bigrams did not give good results in the experiment, frequently co-occurring non-adjacent words could be considered instead, because some idiolects are formed by special sentence structures in which the signature words are not always adjacent to each other.
2 For those who want to achieve higher performance in text classification, meta-information (e.g. time of posting, location, friends who reply to the post) can be added to increase accuracy.
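The first direction could be prototyped along the following lines. This is only a sketch of one possible reading of "non-adjacent frequently co-occurring words": it counts unordered word pairs that appear in the same post, regardless of distance, and keeps pairs seen in at least a minimum number of posts. The function name, the threshold, and the example posts are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurring_pairs(posts, min_count=2):
    """Return unordered word pairs that co-occur in at least min_count posts.

    posts is a list of already segmented token lists. Each pair is counted
    once per post, whether or not the two words are adjacent.
    """
    counts = Counter()
    for tokens in posts:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}


# Hypothetical segmented posts; 雖然 ... 但是 is a non-adjacent construction.
posts = [
    ["雖然", "今天", "下雨", "但是", "開心"],
    ["雖然", "很", "累", "但是", "值得"],
    ["今天", "開心"],
]
pairs = cooccurring_pairs(posts)
```

The surviving pairs could then be added as features alongside the unigrams. Counting each pair once per post keeps a long post from dominating the counts, which is a design choice made for this sketch, not something prescribed by the experiments.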
Bibliography
Abbasi, A., & Chen, H. (2005). Applying authorship analysis to
extremist-group Web forum messages. IEEE Intelligent Systems, 20(5), 67-75. doi:
10.1109/MIS.2005.81
Abbasi, Ahmed, & Chen, Hsinchun. (2008). Writeprints: A Stylometric Approach to
Identity-Level Identification and Similarity Detection in Cyberspace. ACM
Transactions on Information Systems, 26(2), 7:1-7:29.
Argamon, Shlomo, Šarić, Marin, & Stein, Sterling S. (2003). Style mining of
electronic messages for multiple authorship discrimination: first results. Paper
presented at the Proceedings of the ninth ACM SIGKDD international conference
on Knowledge discovery and data mining.
Auria, L., & Moro, R. A. (2007). Advantages and Disadvantages of Support Vector
Machines (SVMs). Paper presented at the Credit Risk Assessment Revisited:
Methodological Issues and Practical Implications.
Baayen, R.H., Van Halteren, H., & Tweedie, F.J. (1996). Outside the cave of shadows:
Using syntactic annotation to enhance authorship attribution. Literary and
Linguistic Computing, 11(3), 121-132. doi: 10.1093/llc/11.3.121
Bennett, William Ralph. (1976). Scientific and engineering problem-solving with the
computer: Prentice Hall PTR.
Biber, Douglas. (1991). Variation across speech and writing: Cambridge University
Press.
Burrows, J.F. (1989). "An ocean where each kind...": Statistical analysis and some
major determinants of literary style. Computers and the Humanities, 23, 309-321.
Burrows, John. (2002). 'Delta': a Measure of Stylistic Difference and a Guide to
Likely Authorship. Literary and Linguistic Computing, 17(3), 267-287. doi:
10.1093/llc/17.3.267
Burrows, John. (2003). Questions of Authorship: Attribution and Beyond. Computers
and the Humanities, 5-32.
Burrows, John. (2007). All the Way Through: Testing for Authorship in Different
Frequency Strata. Literary and Linguistic Computing, 22(1), 27-47.
Burrows, John F. (1987). Word-patterns and story-shapes: The statistical analysis of
narrative style. Literary and Linguistic Computing, 2(2), 61-70.
Chaikin, David. (2006). Network investigations of cyber attacks: the limits of digital
evidence. Crime, Law and Social Change, 46(4-5), 239-256. doi:
10.1007/s10611-007-9058-4
Chang, Chih-Chung, & Lin, Chih-Jen. (2011). LIBSVM : a library for support vector
machines. ACM Transactions on Intelligent Systems and Technology, 2(3),
27:1-27:27.
Chen, Keh-Jiann, & Liu, Shing-Huan. (1992). Word identification for Mandarin
Chinese sentences. Paper presented at the Proceedings of the 14th conference on
Computational linguistics-Volume 1.
D.L.Wallace, F. Mosteller and. (1998). Text categorization with support vector
machines: Learning with many relevant features. Paper presented at the European
Conference on Machine Learning (ECML).
Diederich, J., Kindermann, J., Leopold, E., & Paass, G. (2003). Authorship
attribution with support vector machines. Applied Intelligence, 19(1-2),
109-123.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. (2008).
LIBLINEAR: A Library for Large Linear Classification. Journal of Machine
Learning Research, 9, 1871-1874.
Hadjidj, Rachid, Debbabi, Mourad, Lounis, Hakim, Iqbal, Farkhund, Szporer, Adam,
& Benredjem, Djamel. (2009). Towards an integrated e-mail forensic analysis
framework. Digital Investigation, 5(3), 124-137.
Holmes, David I. (1992). A stylometric analysis of Mormon scripture and related texts.
Journal of the Royal Statistical Society. Series A (Statistics in Society), 91-120.
Holmes, David I, & Forsyth, Richard S. (1995). The Federalist revisited: New
directions in authorship attribution. Literary and Linguistic Computing, 10(2),
111-127.
Hoover, David L. (2004). Testing Burrows's delta. Literary and Linguistic Computing,
19(4), 453-475.
Hope, Jonathan. (1994). The Authorship of Shakespeare's Plays: A socio-linguistic
study: Cambridge University Press.
Houvardas, John, & Stamatatos, Efstathios. (2006). N-gram feature selection for
authorship identification. In Artificial Intelligence: Methodology, Systems, and
Applications (pp. 77-86): Springer.
Iqbal, Farkhund, Binsalleeh, Hamad, Fung, Benjamin, & Debbabi, Mourad. (2010).
Mining writeprints from anonymous e-mails for forensic investigation. Digital
Investigation, 7(1), 56-64.
Iqbal, Farkhund, Hadjidj, Rachid, Fung, Benjamin, & Debbabi, Mourad. (2008). A
novel approach of mining write-prints for authorship attribution in e-mail forensics.
Digital Investigation, 5, S42-S51.
Iqbal, Farkhund, Khan, Liaquat A., Fung, Benjamin C. M., & Debbabi, Mourad.
(2010). E-mail authorship verification for forensic investigation. Paper presented at
the Proceedings of the 2010 ACM Symposium on Applied Computing, Sierre,
Switzerland.
Jaynes, J. T. (1980). A search for trends in the poetic style of W.B. Yeats. Association
for Literary and Linguistic Computing Journal, 1, 11-19.
Jianbin Ma, Guifa Teng, Shuhui Chang, Xiaoru Zhang, Ke Xiao. (2011). Social
Network Analysis Based on Authorship Identification for Cybercrime Investigation.
In M. Chau, G. A. Wang, X. Zheng, H. Chen, D. Zeng & W. Mao (Eds.),
Intelligence and Security Informatics (Vol. 6749, pp. 27-35): Springer Berlin
Heidelberg.
Jianbin Ma, Ying Li, and Guifa Teng. (2008). Identifying Chinese E-mail Documents'
Authorship for the Purpose of Computer Forensics.
Jianbin Ma, Ying Li, Guifa Teng, Fang Wang, Yang Zhao (2008). Sequential Pattern
Mining for Chinese E-mail Authorship Identification. Paper presented at the
Innovative Computing Information and Control, 2008. ICICIC '08. 3rd
International Conference on.
Joachims, Thorsten. (1998). Text categorization with support vector machines:
Learning with many relevant features: Springer.
Keselj, Vlado, Peng, Fuchun, Cercone, Nick, & Thomas, Calvin. (2003).
N-gram-based author profiles for authorship attribution. Paper presented at the
Proceedings of the Pacific Association for Computational Linguistics.
Koppel, Moshe, Argamon, Shlomo, & Shimoni, Anat Rachel. (2002). Automatically
Categorizing Written Texts by Author Gender. Literary and Linguistic Computing,
17(4), 401-412. doi: 10.1093/llc/17.4.401
Ma, Jianbin, Teng, Guifa, Zhang, Yuxin, Li, Yueli, & Li, Ying. (2009). A cybercrime
forensic method for Chinese web information authorship analysis. In Intelligence and
Security Informatics (pp. 14-24): Springer.
Martindale, C., & McKenzie, D. (1995). On the utility of content analysis in author
attribution: The Federalist. Computers and the Humanities, 29, 259-270.
Opas, Lisa Lena. (1996). A Multi-Dimensional Analysis of Style in Samuel
Beckett's Prose Works. Research in Humanities Computing, 4, 81-114.
Peng, Fuchun, Schuurmans, Dale, Wang, Shaojun, & Keselj, Vlado. (2003). Language
independent authorship attribution using character level language models. Paper
presented at the Proceedings of the tenth conference on European chapter of the
Association for Computational Linguistics-Volume 1.
Rong Zheng, Jiexun Li, Hsinchun Chen, Zan Huang. (2006). A Framework for
authorship identification of online messages: Writing-style features and
classification techniques. Journal of the American Society for Information Science
& Technology, 57(3), 378-393. doi: 10.1002/asi.v57:3
Rudman, J. (1998). The state of authorship attribution studies: Some problems and
solutions. Computers and the Humanities, 31, 351-365.
Stamatatos, Efstathios, Fakotakis, Nikos, & Kokkinakis, George. (1999). Automatic
authorship attribution. Paper presented at the Proceedings of the ninth conference
on European chapter of the Association for Computational Linguistics.
Stamatatos, Efstathios, Fakotakis, Nikos, & Kokkinakis, George. (2000). Automatic
text categorization in terms of genre and author. Computational linguistics, 26(4),
471-495.
Tsuboi, Yuta, & Matsumoto, Yuji. (2002). Authorship identification for heterogeneous
documents. IPSJ SIG Notes, 17-24.
Tweedie, F.J., & Baayen, R.H. (1998). How variable may a constant be? Measures of
lexical richness in perspective. Computers and the Humanities, 32, 323-352.
Vel, O. de, Anderson, A., Corney, M., & Mohay, G. (2001). Mining e-mail content for
author identification forensics. SIGMOD Rec., 30(4), 55-64. doi:
10.1145/604264.604272
Whissell, Cynthia. (1996). Traditional and emotional stylometric analysis of the songs
of Beatles Paul McCartney and John Lennon. Computers and the Humanities,
30(3), 257-265.
William B. Cavnar, John M. Trenkle. (1994). N-gram-based text categorization. Paper
presented at the Proceedings of SDAIR-94, 3rd Annual Symposium on
Document Analysis and Information Retrieval.
Yang, Yiming. (1999). An evaluation of statistical approaches to text categorization.
Information retrieval, 1(1-2), 69-90.
Yu, H.-F., Ho, C.-H., Juan, Y.-C., & Lin, C.-J. (2013). LibShortText: A Library for
Short-text Classification and Analysis.
Appendix A: The topmost IG-3000 words
Frequency >1000
2917 流汗 2 freq
2918 姆額 2 freq
2920 問號 2 freq
2923 吉 2 freq
2930 kiss 2 freq
2931 TEARS 2 freq
2935 抬頭 2 freq
2936 DO 2 freq
2937 原先 2 freq
2941 值班 2 freq
2942 頭腦 2 freq
2943 起勁 2 freq
2948 闔上 2 freq
2951 胸 2 freq
2954 NEED 2 freq
2957 飲 2 freq
Appendix B: Error analysis
The following feature tables were generated from the wrongly predicted posts; they show the features that differ most strongly between each author and the rest of the authors. The model file produced by LIBLINEAR provides the contribution score of each feature for each category. Therefore, for each feature in a wrongly predicted post, subtracting the contribution score for the original author from the score for the wrongly predicted author gives a ranked list of the strong features of author i that misled the classifier into choosing author i as the correct author. To save space, only features whose difference value exceeds 0.65 are shown.
The format reads as follows:
diff value: fi_Aj - fi_Ak (>0.65) | fi_IG | fi | total_frequency | local_frequency
fi_A1 | fi_A2 | fi_A3 | fi_A4 | fi_A5 | fi_A6
where fi is the feature, fi_IG is its IG rank, total_frequency is the total frequency of the word in the text pool, and local_frequency is the frequency of the word in the wrongly predicted post. Aj is the wrongly predicted author, and Ak is the original author. The second row shows the six contribution values that the feature contributes to each author category.
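The diff value described above can be reproduced with a few lines of Python. The sketch below is a hypothetical helper, not the script used to produce the tables; it takes the six per-author contribution values of each feature and returns the features whose difference between the wrongly predicted author and the original author exceeds the threshold. The weights are the rounded values listed in Table 31 (posts by author 4 predicted as author 2).

```python
def misleading_features(weights, predicted, actual, threshold=0.65):
    """Rank features by contribution(predicted author) - contribution(original author).

    weights maps each feature to its six per-author contribution values;
    authors are numbered from 1, as in the tables.
    """
    rows = []
    for feature, per_author in weights.items():
        diff = per_author[predicted - 1] - per_author[actual - 1]
        if diff > threshold:
            rows.append((diff, feature))
    return sorted(rows, reverse=True)


# Rounded contribution values as printed in Table 31.
weights = {
    "適合": [-0.4007, 0.9485, -0.1087, -0.163, -0.1884, -0.2538],
    "歌": [-0.295, 0.7535, -0.1054, -0.1961, 0.0421, -0.2417],
    "突然": [-0.3343, 0.7541, -0.0188, 0.0127, -0.1877, -0.3952],
}
rows = misleading_features(weights, predicted=2, actual=4)
```

With these rounded weights the computed differences agree with the diff values printed in Table 31 to about three decimal places, which suggests that local_frequency is reported for reference only and does not scale the diff value.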
Author 1
Table 24. Features that cause posts by author 2 to be misattributed to author 1
diff:1.30665498671 2 ! 7131 freq 2
Table 25. Features that cause posts by author 3 to be misattributed to author 1
diff:1.46357669305 9 ! 1561 freq 2
Table 26. Features that cause posts by author 4 to be misattributed to author 1
Table 27. Features that cause posts by author 5 to be misattributed to author 1
diff:0.888962975147 8 ) 1364 freq 4
0.595 -0.1362 0.262 0.133 -0.294 -0.945
Table 28. Features that cause posts by author 6 to be misattributed to author 1
diff:2.24001206941 2 ! 7131 freq 6
Author 2
Table 29. Features that cause posts by author 1 to be misattributed to author 2
diff:1.08838317804 49 突然 92 freq 1
Table 30. Features that cause posts by author 3 to be misattributed to author 2
diff:0.858864711769 60 歌 102 freq 1
Table 31. Features that cause posts by author 4 to be misattributed to author 2
diff:1.11153206506 117 適合 40 freq 1
-0.4007 0.9485 -0.1087 -0.163 -0.1884 -0.2538
diff:0.949586003503 60 歌 102 freq 1
-0.295 0.7535 -0.1054 -0.1961 0.0421 -0.2417
diff:0.74133671054 49 突然 92 freq 1
-0.3343 0.7541 -0.0188 0.0127 -0.1877 -0.3952
Table 32. Features that cause posts by author 5 to be misattributed to author 2
diff:0.828858452521 7 , 7084 freq 1
Table 33. Features that cause posts by author 6 to be misattributed to author 2
diff:1.24914156742 7 , 7084 freq 5
diff:0.654473326266 120 要 1506 freq 2
0.6584 -0.0179 -0.1937 -0.1594 0.2435 -0.6723
diff:0.650231839206 153 今日 52 freq 1
0.0429 0.5536 -0.2057 -0.3126 -0.1579 -0.0966
Author 3
Table 34. Features that cause posts by author 1 to be misattributed to author 3
diff:0.956372908782 2481 颱風 16 freq 1
-0.3663 0.0343 0.5901 -0.1107 0.0549 -0.2093
diff:0.887970801791 835 餓 22 freq 1
-0.2312 -0.0702 0.6567 -0.1092 -0.1062 -0.2062
diff:0.7823516047 118 真 360 freq 1
-0.5691 -0.1851 0.2133 -0.0981 0.1367 0.4883
diff:0.769295624706 390 QQ 41 freq 1
-0.2039 -0.0215 0.5654 -0.2792 -0.2458 -0.3396
Table 35. Features that cause posts by author 2 to be misattributed to author 3
diff:0.937099579439 1 : 1082 freq 1
-0.2857 -0.3576 0.5795 -0.1641 0.295 -0.4204
diff:0.745151918042 15 ? 670 freq 1
-0.6638 -0.0917 0.6535 0.6422 0.5312 -1.249
Table 36. Features that cause posts by author 4 to be misattributed to author 3
diff:0.844583860266 390 QQ 41 freq 1
-0.2039 -0.0215 0.5654 -0.2792 -0.2458 -0.3396
diff:0.664421700314 149 好 1455 freq 2
0.2446 -0.0309 0.5315 -0.133 -0.2517 -0.5358
Table 37. Features that cause posts by author 5 to be misattributed to author 3
None
Table 38. Features that cause posts by author 6 to be misattributed to author 3
diff:1.06730632191 149 好 1455 freq 1
Author 4
Table 39. Features that cause posts by author 1 to be misattributed to author 4
diff:1.1226545405 5 。 4137 freq 1
Table 40. Features that cause posts by author 2 to be misattributed to author 4
diff:1.48326436548 5 。 4137 freq 1
0.2467 -0.1139 -0.5344 1.3694 -0.8845 -0.4593
Table 41. Features that cause posts by author 3 to be misattributed to author 4
diff:1.90377858134 5 。 4137 freq 3
Table 42. Features that cause posts by author 5 to be misattributed to author 4
diff:1.03268721414 89 ... 117 freq 1
-0.3743 -0.4866 0.0039 1.1147 0.082 -0.5052
diff:0.917536606694 94 . 739 freq 1
-0.1669 -0.3296 -0.1318 0.4841 -0.4335 0.3674
diff:0.785627958427 54 ~ 1434 freq 1
0.0251 0.265 0.1496 0.4914 -0.2942 -0.9064
Table 43. Features that cause posts by author 6 to be misattributed to author 4
diff:1.82868129581 5 。 4137 freq 3
Author 5
Table 44. Features that cause posts by author 1 to be misattributed to author 5
diff:1.03009234522 4 DD 264 freq 1
diff:0.824997334232 35 我 4732 freq 2
-0.1354 0.4315 0.0195 0.2667 -0.2327 -0.5583
diff:0.737771644539 61 和 476 freq 1
-0.2181 0.0975 0.005 0.2787 0.0842 -0.459
diff:0.701832324608 162 愛 228 freq 1
0.2563 -0.0494 -0.1597 0.2694 0.0348 -0.4324
Table 45. Features that cause posts by author 2 to be misattributed to author 5
diff:0.656120381703 490 祝 25 freq 1
0.076 -0.1975 -0.124 -0.1066 0.4586 -0.2509
Table 46. Features that cause posts by author 3 to be misattributed to author 5
diff:0.92074197507 24 明天 228 freq 1
-0.2522 -0.1305 -0.216 -0.0975 0.7047 -0.1659
diff:0.651048337142 22 大家 568 freq 2
-0.4735 0.0902 -0.2451 0.3484 0.406 -0.2252
Table 47. Features that cause posts by author 4 to be misattributed to author 5
None
Table 48. Features that cause posts by author 6 to be misattributed to author 5
None
Author 6
Table 49. Features that cause posts by author 1 to be misattributed to author 6
diff:3.63795036759 3 ... 1517 freq 9
-0.7437 -1.2295 -0.1866 -0.6755 -0.9828 2.8943
diff:2.52333633807 6 , 2069 freq 5
-0.7881 -0.8687 0.0526 -0.6113 -0.0434 1.7353
diff:2.12522189498 19 阿 434 freq 2
-0.4668 0.1393 -0.8973 -0.7217 -0.0847 1.6585
diff:1.12136490307 1518 than 17 freq 3
-0.5004 -0.0488 -0.074 -0.1448 0.0829 0.621
Table 50. Features that cause posts by author 2 to be misattributed to author 6
diff:4.12377135027 3 ... 1517 freq 3
Table 51. Features that cause posts by author 3 to be misattributed to author 6
diff:3.08092418945 3 ... 1517 freq 6
-0.7437 -1.2295 -0.1866 -0.6755 -0.9828 2.8943
diff:2.5557665726 19 阿 434 freq 1
-0.4668 0.1393 -0.8973 -0.7217 -0.0847 1.6585
Table 52. Features that cause posts by author 4 to be misattributed to author 6
diff:3.56982113006 3 ... 1517 freq 2
Table 53. Features that cause posts by author 5 to be misattributed to author 6
diff:1.77863087152 6 , 2069 freq 2
-0.1669 -0.3296 -0.1318 0.4841 -0.4335 0.3674
diff:0.707352109516 18 的 8355 freq 5
-0.2963 0.396 -0.2818 -0.099 -0.394 0.3134
diff:0.702001133705 112 有趣 112 freq 1
-0.3184 -0.0348 0.0618 -0.2023 -0.2194 0.4826