Chapter 2 Literature Review
2.2 Authorship Analysis
2.2.3 Feature Selection
Feature selection is an essential yet complicated issue in authorship identification.
Which feature sets are optimal has long been debated, and no consensus has been reached. As Diederich, Kindermann, Leopold & Paass (2003) noted, even though some 1,000 style markers have been proposed (Rudman, 1998), there is no consensus on which style markers are significant. They conclude that almost all words carry some information for text categorization. They also cite a carefully conducted study by Joachims (1998), which examined how much information different numbers of features contribute:
“Joachims ranked 10,000 word stems of a large corpus according to their information gain with regard to some classification. It turned out that a model using features with ranks 201-500 performed nearly as well as the best features in the top 1-200, and similar to the feature set 4001-9962.”
Joachims’s experiment shows that even when features were ranked by information gain (a measure widely used in classification to quantify how much knowing a feature reduces uncertainty about the class), the features ranked 1-200 did not substantially outperform those ranked 201-500 or those ranked 4001-9962.
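To make the notion of information gain concrete, the following is a minimal sketch rather than the procedure Joachims used. It assumes a binary feature (whether a word occurs in a document) and defines information gain as the reduction in class entropy; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(has_feature, labels):
    """Reduction in class entropy from knowing whether a feature is present.

    has_feature: list of booleans, one per document
    labels:      list of class labels (e.g. author names), one per document
    """
    total = len(labels)
    present = [y for f, y in zip(has_feature, labels) if f]
    absent = [y for f, y in zip(has_feature, labels) if not f]
    conditional = 0.0
    for subset in (present, absent):
        if subset:  # skip empty partitions to avoid dividing by zero
            conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


# Toy data (hypothetical): four documents written by two authors.
labels = ["A", "A", "B", "B"]
perfect = [True, True, False, False]  # word occurs only in A's documents
useless = [True, False, True, False]  # word occurs equally often for both

print(information_gain(perfect, labels))
print(information_gain(useless, labels))
```

In this toy setting, a word that separates the two authors perfectly yields the maximum gain of one bit, while a word that is distributed evenly yields none. Ranking features by this value gives the kind of ordering referred to in the quotation.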
Numerous factors account for the difficulty of feature selection, but the main one is that language itself is complex, comprising lexical, syntactic, semantic, structural, and cognitive levels, among others. Features at many different linguistic levels jointly shape an author’s language use. Features can be further divided into two categories, static and dynamic (Abbasi, 2008).
Static and dynamic features
Static features are those that can be calculated for every author’s texts. Common static features include mean word length, mean sentence length in words and in characters, number of words, number of lines, vocabulary richness, and the number of hapax legomena (words that occur only once). These features can be applied to every training and test text because they are simply numerical characteristics of a text, based on frequencies, lengths, and their averages or rates.
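As an illustration, several of the static features listed above can be computed with a few lines of code. This is only a sketch under simplifying assumptions: sentences are split naively at ".", "!" and "?", words are runs of Latin letters, and the type-token ratio stands in for vocabulary richness, which the cited studies may measure differently. The function name is hypothetical.

```python
import re
from collections import Counter


def static_features(text):
    """Compute a few frequency- and length-based style markers for one text."""
    # Naive tokenisation (assumption): sentences end at . ! or ?,
    # and words are runs of letters and apostrophes.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        raise ValueError("text contains no words")
    counts = Counter(words)
    hapaxes = [w for w, c in counts.items() if c == 1]
    return {
        "num_words": len(words),
        "num_lines": len(text.splitlines()),
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "mean_sentence_length_words": len(words) / len(sentences),
        # Type-token ratio: one simple measure of vocabulary richness.
        "type_token_ratio": len(counts) / len(words),
        "hapax_legomena": len(hapaxes),
    }


sample = "The cat sat. The dog sat too."
features = static_features(sample)
for name, value in features.items():
    print(name, value)
```

Because every quantity is a count or an average, the same function can be applied unchanged to any training or test text, which is what makes these features static.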
Dynamic features, on the other hand, are less intuitive than static ones. Unlike static features, dynamic features vary with each author’s distinctive writing habits. Common dynamic features in previous research are n-grams (Abbasi et al., 2008; Houvardas & Stamatatos, 2006; Diederich et al., 2003; Keselj, Peng, Cercone & Thomas, 2003; Peng, Schuurmans, Wang & Keselj, 2003; William & John, 1994), POS-grams (Diederich et al., 2003; Abbasi et al., 2008), and specific keywords, which are used to detect contextual information and an author’s preference for particular topics or vocabulary. Keywords include both content words (Abbasi et al., 2008; R. Zheng et al., 2006; Diederich et al., 2003; Martindale & McKenzie, 1995) and function words (Mosteller & Wallace, 1964; Martindale & McKenzie, 1995; R. Zheng et al., 2006).
Generally speaking, if we view the whole body of texts by one author as a country and the words in it as its population, then the static features serve as demographics, giving a statistical description of the country, while the dynamic features aim to discover relationships within the population, such as neighborhood relationships. An author may prefer certain sequences of words, consciously or unconsciously, and this preference is part of what is known as the author’s idiolect.
N-gram
Among the dynamic features, the n-gram deserves particular attention. N-grams have been among the most commonly adopted feature sets in previous authorship attribution work. For instance, Abbasi et al. (2008) integrated word-level and character-level n-grams (unigrams, bigrams, and trigrams) with syntactic and structural features. Some studies used nothing but a bag of n-grams, extracting numerous n-grams of different lengths from an author’s texts and ignoring all other features, as Houvardas et al. (2006) did in “N-gram feature selection for authorship identification”, Keselj et al. (2003) in “N-gram-based author profiles for authorship attribution”, William and John (1994) in “N-gram-based text categorization”, and many others (Stamatatos, Fakotakis & Kokkinakis, 1999, 2000).
Although the studies differ slightly in their parameters, weighting, and even computational method, ranging from the simplest vector distance comparison (Bennett, 1976) and Naïve Bayes probability models (Peng et al., 2003) to the more sophisticated SVM classifier (Houvardas et al., 2006), the main purpose and assumption of the n-gram experiments are the same: the n-gram approach helps extract contextual information, and the more frequent sequences better represent an author’s writing style or topics of interest. The experiments vary mainly in the n-gram length and in the number of n-grams used. Keselj et al. reported their best results for 1000 ≤ L ≤ 5000 and 3 ≤ n ≤ 5, where L is the number of n-grams retained (the profile size) and n is the n-gram length (3-grams to 5-grams). Tsuboi & Matsumoto (2002) used unigrams, bigrams, and trigrams as the feature set for Japanese e-mail documents and obtained satisfactory performance. Other studies did not predefine the n-gram length; instead, they adopted frequent pattern mining (a data mining technique generally used in transaction mining to discover sets of frequently purchased items) to discover the longest possible frequent sequences (Ma et al., 2008).
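The profile-based use of n-grams described above can be sketched as follows. This is an illustrative simplification, not a reimplementation of any cited study: it assumes character n-grams, keeps the most frequent ones as an author profile, and compares profiles with a squared relative-difference measure chosen here for illustration. All function names, parameter names, and the toy texts are hypothetical.

```python
from collections import Counter


def ngram_profile(text, n=3, profile_size=1000):
    """Return the most frequent character n-grams with their relative frequencies."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    return {g: c / total for g, c in counts.most_common(profile_size)}


def dissimilarity(p, q):
    """Sum of squared relative frequency differences over n-grams in either profile."""
    score = 0.0
    for g in set(p) | set(q):
        fp, fq = p.get(g, 0.0), q.get(g, 0.0)
        # fp + fq is never zero: g occurs in at least one of the two profiles.
        score += ((fp - fq) / ((fp + fq) / 2)) ** 2
    return score


def attribute(unknown_text, author_texts, n=3, profile_size=1000):
    """Pick the author whose profile is least dissimilar to the unknown text."""
    target = ngram_profile(unknown_text, n, profile_size)
    return min(
        author_texts,
        key=lambda a: dissimilarity(
            target, ngram_profile(author_texts[a], n, profile_size)
        ),
    )


# Toy data (hypothetical): one short training text per author.
author_texts = {
    "alice": "the rain in spain stays mainly in the plain",
    "bob": "zebra quartz jumps vexed by foxy wizards",
}
unknown = "the plain rain stays in spain"
print(attribute(unknown, author_texts))
```

Here `profile_size` plays the role of L and `n` the role of the n-gram length in the ranges quoted above. With realistic corpora both would be tuned on held-out data rather than fixed in advance.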
Feature Sets in Chinese Authorship Identification
The target language also plays an important role in the choice of features. Since Chinese has no natural delimiter between words, either a robust and accurate segmentation technique must be applied to obtain word units, or, alternatively, every Chinese character is treated as a unigram and n-grams of unspecified length are used to compensate for the loss of word information. Ma et al. (2008) tried both the segmented and the unsegmented strategy in two works, “Identifying Chinese E-mail Documents’ Authorship for the Purpose of Computer Forensic” and “Sequential Pattern Mining for Chinese E-mail Authorship Identification”, respectively, and both reached satisfactory results. In the former study, 150 emails written by 5 persons were collected; for each author, 20 emails served as training data and 10 as test data. The authors adopted three types of features, by their own definition: 1,000 linguistic features (1,000 words ranked by information gain), structural features (mainly static features, i.e., the mean and rate of sentence and paragraph length), and format features (for instance, the use of a greeting or the presence of a signature). Their results show the accuracy of different feature set combinations, as in Table 6.
Table 6. Experimental results of different feature set combinations
Feature set                           Mean F score
F1 (linguistic)                       83.04%
F1 + F2 (linguistic + structural)     92.88%
F2 + F3 (structural + format)         97.59%
F1 + F2 + F3 (all)                    98.36%
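For readers unfamiliar with the metric in the table, the sketch below shows one common way to obtain a mean F score. It assumes that the mean is the unweighted average of per-author F1 scores (macro-averaging); the source does not state how the mean was computed. The function names and the toy predictions are hypothetical.

```python
def f1_per_class(true_labels, predicted_labels):
    """Return {label: F1}, computed from per-class precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for label in set(true_labels) | set(predicted_labels):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0
    return scores


def mean_f1(true_labels, predicted_labels):
    """Unweighted (macro) average of the per-class F1 scores."""
    scores = f1_per_class(true_labels, predicted_labels)
    return sum(scores.values()) / len(scores)


# Toy data (hypothetical): four test emails, two authors, one misattribution.
true_labels = ["A", "A", "B", "B"]
predicted = ["A", "B", "B", "B"]
print(f1_per_class(true_labels, predicted))
print(mean_f1(true_labels, predicted))
```

Macro-averaging weights every author equally, which suits a balanced design such as ten test emails per author. Other averaging schemes would give different values on unbalanced data.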
From these results, we can see that although F1 (1,000 words) performed well, F2 + F3 (structural + format) outperformed the bag of words. This may result from the choice of a special text type, namely emails, which have a relatively fixed format and thus provide abundant format information. We can compare this result with Ma et al. (2008), “Sequential Pattern Mining for Chinese E-mail Authorship Identification”, which used the same text type (emails) and the same target language but adopted only frequent word sequences as the feature set. Although the results were still sound, the accuracy varied with “the distinctness of author’s pattern features”, according to the authors. In that experiment only the contextual information in the emails was used, so the special format information of emails was sacrificed. In sum, the n-gram technique has been especially popular in Chinese authorship identification and has performed well in such tasks, but additional information, such as static or format features, can still help to improve the final results.
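The unsegmented strategy discussed in this subsection, in which every Chinese character is treated as a unigram and longer n-grams recover some of the lost word information, can be sketched as follows. This is a minimal illustration, not the method of Ma et al. It assumes that only characters in the main CJK Unified Ideographs block (U+4E00 to U+9FFF) count as Chinese characters and that n-grams do not cross punctuation or other non-Chinese characters. The function names and the sample sentence are hypothetical.

```python
from collections import Counter


def is_cjk(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def chinese_char_ngrams(text, max_n=3):
    """Count character n-grams (n = 1..max_n) within unbroken runs of Chinese characters."""
    counts = Counter()
    run = []
    for ch in text + " ":  # the trailing space flushes the final run
        if is_cjk(ch):
            run.append(ch)
            continue
        # A non-Chinese character ends the current run: count its n-grams.
        for n in range(1, max_n + 1):
            for i in range(len(run) - n + 1):
                counts["".join(run[i:i + n])] += 1
        run = []
    return counts


counts = chinese_char_ngrams("我喜歡寫信，我喜歡讀書。")
for gram, freq in counts.most_common(5):
    print(gram, freq)
```

No word segmenter is needed: frequent multi-character sequences such as a repeated three-character phrase surface directly in the counts and can then be ranked or filtered like any other n-gram feature.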