Highlights 💡:
Source: @stamatatos2009
Added on 2022-11-06
■detailed study by Mosteller and Wallace (1964) on the authorship of “The Federalist Papers” (p. 1)
■until the late 1990s, research in authorship attribution was dominated by attempts to define features for quantifying writing style, a line of research known as “stylometry” (Holmes, 1994, 1998) (p. 1)
■ In certain cases, there were methods that achieved impressive preliminary results and made many people think that the solution of this problem was too close. (p. 1)
■ the plethora of available electronic texts revealed the potential of authorship analysis in various applications (Madigan, Genkin, Lewis, Argamon, Fradkin, & Ye, 2005) (p. 2)
■ factors playing a crucial role in the accuracy of the produced models are examined, such as the training text size (Hirst & Feiguina, 2007; Marton, Wu, & Hellerstein, 2005), the number of candidate authors (Koppel, Schler, Argamon, & Messeri, 2006), and the distribution of training texts over the candidate authors (Stamatatos, 2008) (p. 2)
- [b] Types of stylometric features and the tools required to measure them
■ In every authorship-identification problem, there is a set of candidate authors, a set of text samples of known authorship covering all the candidate authors (training corpus), and a set of text samples of unknown authorship (test corpus) (p. 8)
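The three components named in the p. 8 highlight can be written down as a plain data structure. This is only a minimal sketch: the class name `AttributionProblem`, its field and method names, and the toy texts are illustrative assumptions, not taken from the source.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AttributionProblem:
    """Hypothetical container for one authorship-identification problem."""

    # The set of candidate authors.
    candidate_authors: list[str]
    # Training corpus: texts of known authorship, keyed by author.
    training_corpus: dict[str, list[str]] = field(default_factory=dict)
    # Test corpus: texts of unknown authorship.
    test_corpus: list[str] = field(default_factory=list)

    def uncovered_authors(self) -> list[str]:
        """Candidate authors without any training text (should be empty)."""
        return [a for a in self.candidate_authors
                if not self.training_corpus.get(a)]

    def training_distribution(self) -> dict[str, int]:
        """Number of training texts per candidate author."""
        return {a: len(self.training_corpus.get(a, []))
                for a in self.candidate_authors}


# Toy data, invented for illustration only.
problem = AttributionProblem(
    candidate_authors=["author_a", "author_b"],
    training_corpus={
        "author_a": ["toy text one", "toy text two", "toy text three"],
        "author_b": ["another toy text"],
    },
    test_corpus=["a disputed toy text"],
)

print(problem.uncovered_authors())
print(problem.training_distribution())
```

`training_distribution` makes it easy to see whether the training texts are spread evenly over the candidate authors.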
- [b] profile-based vs. instance-based approaches to authorship attribution (AA)
■ From a marginal scientific area dealing only with famous cases of disputed or unknown authorship of literary works, authorship attribution now provides robust methods able to handle real-world texts with relatively high-accuracy results. (p. 16)
■ Authorship attribution can be viewed as a typical text-categorization task (p. 16)
■ in style-based text categorization, the most significant features are the most frequent ones (Houvardas & Stamatatos, 2006; Koppel, Akiva, & Dagan, 2006) while in topic-based text categorization, the best features should be selected based on their discriminatory power (Forman, 2003) (p. 16)
■ extremely limited training text material (p. 16)
■ in most cases, the distribution of training texts over the candidate authors is imbalanced (p. 16)
■ How long should a text be so that we can adequately capture its stylistic properties? Various studies have reported promising results dealing with short texts (p. 16)
■ what are the minimum requirements in training text (p. 16)
■ Another important question is how to discriminate between the three basic factors: authorship, genre, and topic. (p. 16)
■ The accuracy of current authorship attribution technology depends mainly on the number of candidate authors, the size of texts, and the amount of training texts. (p. 17)
■An important obstacle is that it is not yet possible to explain the differences between the authors’ style. (p. 17)
■ in the framework of forensic applications, the open-set classification setting is the most suitable (p. 17)
■ A significant advance of the authorship attribution technology during the last years was the adoption of objective evaluation criteria and the comparison of different methodologies using the same benchmark corpora, following the practice of thematic text categorization. (p. 17)
■ A crucial issue is to increase the available benchmark corpora so that they cover many natural languages and text domains. It also is very important for the evaluation corpora to offer control over genre, topic, and demographic criteria. (p. 17)
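To make the profile-based vs. instance-based note concrete, here is a rough sketch of both approaches on toy data. It follows the p. 16 highlight by using the most frequent words as features. Everything else is an assumption made for illustration: the function names, the cosine similarity, the nearest-neighbour rule standing in for a trained classifier, and the invented texts. It assumes the usual reading of the two terms: a profile-based method concatenates all training texts of an author into one profile, while an instance-based method treats each training text as a separate instance.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(texts, n):
    """The n most frequent words over all given texts."""
    counts = Counter(w for t in texts for w in tokenize(t))
    return [w for w, _ in counts.most_common(n)]


def vector(text, vocab):
    """Relative frequency of each vocabulary word in the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in vocab]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def profile_based(training, unknown, n=50):
    """One cumulative profile per author, built from all their texts."""
    vocab = top_words([t for ts in training.values() for t in ts], n)
    target = vector(unknown, vocab)
    profiles = {a: vector(" ".join(ts), vocab) for a, ts in training.items()}
    return max(profiles, key=lambda a: cosine(profiles[a], target))


def instance_based(training, unknown, n=50):
    """Each training text is its own instance; pick the nearest one."""
    vocab = top_words([t for ts in training.values() for t in ts], n)
    target = vector(unknown, vocab)
    best_author, best_sim = None, -1.0
    for author, texts in training.items():
        for text in texts:
            sim = cosine(vector(text, vocab), target)
            if sim > best_sim:
                best_author, best_sim = author, sim
    return best_author


# Toy corpus, invented for illustration only.
training = {
    "author_a": ["the cat and the dog and the bird",
                 "the sun and the moon and the sea"],
    "author_b": ["a cat or a dog or a bird",
                 "a sun or a moon or a sea"],
}
unknown = "the hill and the river and the road"

print(profile_based(training, unknown))   # author_a
print(instance_based(training, unknown))  # author_a
```

Both functions always return one of the candidate authors, which is the closed-set setting. An open-set variant would additionally need a way to answer "none of the candidates", for example a similarity threshold, which this sketch does not attempt.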