Source: @bevendorff2019
Added on 2022-11-01
■ To obfuscate, we explore the huge space of textual variants in order to find a paraphrased version of the to-be-obfuscated text that has a sufficient Jensen-Shannon distance at minimal costs in terms of text quality loss. (p. 1)
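For intuition, the Jensen-Shannon distance here is the square root of the Jensen-Shannon divergence between the two texts' character n-gram distributions. A minimal sketch of that measure (my own illustration, not the paper's code; the trigram order and raw-count normalization are assumptions):

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts of a text (n=3 is an assumption)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def js_distance(text_a, text_b, n=3):
    """Jensen-Shannon distance between the character n-gram
    distributions of two texts: sqrt of the JS divergence,
    bounded in [0, 1] with base-2 logs."""
    p, q = char_ngrams(text_a, n), char_ngrams(text_b, n)
    p_total, q_total = sum(p.values()), sum(q.values())
    divergence = 0.0
    for gram in set(p) | set(q):
        pi = p[gram] / p_total
        qi = q[gram] / q_total
        mi = (pi + qi) / 2
        if pi > 0:
            divergence += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            divergence += 0.5 * qi * math.log2(qi / mi)
    return math.sqrt(divergence)

print(js_distance("the quick brown fox", "a slow red fox"))
```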
■ Rule-based approaches are neither flexible, nor is stylometry understood well enough to compile rule sets that specifically target author style. Monolingual machine translation-based approaches suffer from a lack of training data, whereas applying multilingual translation in a cyclic manner as a workaround has proved to be ineffective. (p. 1)
■ Authorship analysis dates back over 120 years (Bourne, 1897) (p. 1)
■ Abbasi and Chen (2008) proposed writeprints, a set of over twenty lexical, syntactic, and structural text feature types, which has gained some notoriety within attribution, verification, but also for “anonymizing” texts (Zheng et al., 2006; Narayanan et al., 2012; Iqbal et al., 2008; McDonald et al., 2012). (p. 2)
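For a rough feel of what such features look like, a few lexical and structural measurements in the spirit of writeprints (an illustrative selection, not Abbasi and Chen's exact feature set, which spans over twenty feature types):

```python
def writeprints_style_features(text):
    """A handful of lexical/structural features in the spirit of
    writeprints; purely illustrative."""
    words = text.split()
    return {
        "char_count": len(text),
        "word_count": len(words),
        "avg_word_length": sum(map(len, words)) / max(len(words), 1),
        "digit_ratio": sum(c.isdigit() for c in text) / max(len(text), 1),
        "uppercase_ratio": sum(c.isupper() for c in text) / max(len(text), 1),
        "vocab_richness": len(set(w.lower() for w in words)) / max(len(words), 1),
    }
```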
■ Teahan and Harper (2003) and Khmelev and Teahan (2003) use compression as an indirect means to measure stylistic difference; later adapted and improved by Halvani et al. (2017). (p. 2)
- [b] authorship-attribution Compression
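One common way to use compression as a style proxy is the normalized compression distance (NCD); a sketch with zlib (NCD is a stand-in here, not necessarily the exact measure those papers use, and the compressor choice matters in practice):

```python
import zlib

def ncd(text_a, text_b):
    """Normalized compression distance: if two texts share
    stylistic regularities, compressing their concatenation
    costs less than compressing them separately."""
    a = zlib.compress(text_a.encode())
    b = zlib.compress(text_b.encode())
    ab = zlib.compress((text_a + text_b).encode())
    return (len(ab) - min(len(a), len(b))) / max(len(a), len(b))
```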
■ Koppel and Schler (2004) developed the unmasking approach based on the 250 most frequent function words, which are iteratively removed, effectively reducing the differentiability between the texts. (p. 2)
- [b] authorship-attribution Unmasking
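Schematically, unmasking trains a classifier to tell chunks of the two texts apart, then repeatedly removes the most discriminative features and retrains; same-author pairs show a steep accuracy drop. A sketch assuming scikit-learn and a precomputed chunk-by-feature matrix (all names and the drop rate are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def unmasking_curve(X, y, rounds=10, drop_per_round=3):
    """X: chunks x function-word frequencies, y: which text each
    chunk came from. Returns per-round accuracies; a steep drop
    suggests the two texts differ only in a few superficial
    features, i.e. a same-author pair."""
    X = X.copy()
    accuracies = []
    for _ in range(rounds):
        clf = LinearSVC(dual=False)
        accuracies.append(cross_val_score(clf, X, y, cv=5).mean())
        clf.fit(X, y)
        # zero out the features carrying the largest absolute weights
        top = np.argsort(np.abs(clf.coef_[0]))[-drop_per_round:]
        X[:, top] = 0.0
    return accuracies
```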
■ Rao and Rohatgi (2000), who used cyclic machine translation (p. 2)
■ Xu et al. (2012) proposed within-language machine translation (p. 2)
■ Kacmarcik and Gamon (2006) directly target Koppel and Schler's unmasking, albeit at the cost of rather unreadable texts (p. 2)
■ Our new authorship obfuscation approach is inspired by Stein et al. (2014)’s heuristic paraphrasing idea for “encoding” an acrostic in a given text and by Kacmarcik and Gamon’s observation that changing rather few text passages may successfully obfuscate authorship. (p. 2)
■ Assuming that the observed JS∆-to-length relationship generalizes to other text pairs of similar length (a hypothesis that merits further investigation in future work), we measure style distance in JS∆@L (Jensen-Shannon distance at length) and fit threshold lines to define obfuscation levels. (p. 3)
■ The ε0 threshold serves as an obfuscation baseline: a same-author case counts as unobfuscated if the JS∆ between its documents stays below this threshold. (p. 3)
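Reading JS∆@L as "is the observed JS∆ above the threshold line fitted for this text length", the level check might be sketched like this (the linear threshold form and the coefficient encoding are assumptions for illustration; the paper fits its own threshold lines to corpus data):

```python
def obfuscation_level(js_delta, length, thresholds):
    """thresholds: mapping from level k to the (slope, intercept)
    of a fitted threshold line eps_k(L). Returns the highest level
    whose line lies at or below the observed JS-delta for this
    length; anything below eps_0 counts as unobfuscated (level 0)."""
    level = 0
    for k in sorted(thresholds):
        slope, intercept = thresholds[k]
        if js_delta >= slope * length + intercept:
            level = k
    return level
```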
■ The optimization goals can be summarized as follows: 1. Maximize the obfuscation as per the JS∆ beyond a given εk without “over-obfuscating.” 2. Minimize the accumulated text quality loss from consecutive paraphrasing operations. 3. Minimize the number of text operations. Heuristic search is our choice to tackle this optimization problem. (p. 4)
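A minimal sketch of such a heuristic search over paraphrasing operations, as best-first search (the operator set, cost model, and goal test are placeholders; the paper's actual operators and heuristic are more elaborate):

```python
import heapq

def obfuscate(text, operators, js_to_source, eps_k,
              quality_loss, max_steps=1000):
    """Best-first search: always expand the cheapest partial
    solution, where cost accumulates quality loss plus a unit
    cost per operation, until the JS-delta to the source text
    exceeds the target threshold eps_k. All names illustrative."""
    frontier = [(0.0, 0, text)]  # (cost, num_ops, candidate)
    seen = {text}
    for _ in range(max_steps):
        if not frontier:
            break
        cost, num_ops, current = heapq.heappop(frontier)
        if js_to_source(current) >= eps_k:
            return current  # goal: sufficiently obfuscated
        for op in operators:
            candidate = op(current)
            if candidate in seen:
                continue
            seen.add(candidate)
            new_cost = cost + quality_loss(current, candidate) + 1.0
            heapq.heappush(frontier, (new_cost, num_ops + 1, candidate))
    return None  # no solution within the step budget
```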
■ some authors' texts are easier to obfuscate than others (p. 5)
■ Given a longer text (one page or more), the number of potential operator applications is high. (p. 6)
■ the main challenge is to find a sensible middle ground between accepting a non-optimal solution too quickly and not finding a solution at all (p. 6)
■ Our experiments are based on PAN authorship corpora and our new Webis Authorship Verification Corpus 2019 of 262 authorship verification cases (Bevendorff et al., 2019), half of them same-author cases, the other half different-authors cases (each a pair of texts of about 23,000 characters / 4,000 words). (p. 6)
■ Since the greedy obfuscation approach cannot choose among different operators, it must rely on the most effective one to achieve the obfuscation goal, incurring significant path costs. (p. 7)
■ obfuscation operations need to be distributed across the whole text and progress needs to be measured on smaller parts of it to ensure uniform obfuscation of everything and avoid obfuscation “hot spots” (p. 8)
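One way to operationalize this: track obfuscation progress per chunk and always edit the least-obfuscated chunk next. A sketch under that assumption (word-count chunking and the selection rule are my own simplification; reuses js_distance from the sketch above):

```python
def least_obfuscated_chunk(original, current, chunk_words=500):
    """Split both text versions into aligned word chunks, measure
    the JS-distance per chunk, and return the index of the chunk
    that has moved least from the original, i.e. the next target
    for a paraphrasing operation to avoid obfuscation hot spots."""
    def chunks(text):
        words = text.split()
        return [" ".join(words[i:i + chunk_words])
                for i in range(0, len(words), chunk_words)]
    orig_chunks, curr_chunks = chunks(original), chunks(current)
    distances = [js_distance(o, c)
                 for o, c in zip(orig_chunks, curr_chunks)]
    return min(range(len(distances)), key=distances.__getitem__)
```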
■ neural editing and paraphrasing (Grangier and Auli, 2017; Guu et al., 2017) (p. 9)
■ Our study opens up interesting avenues for future research: obfuscation by addition instead of by reduction, development of more powerful, targeted paraphrasing operators, and theoretical analysis of the search space properties. (p. 9)