A Survey on Text Simplification.


As far as I can tell, this paper was never actually published in an ACM venue, and the DOI given for it does not appear to belong to it. It has nonetheless been cited quite often.

Metadata (PDF)

  • Summary:: This is a survey of text simplification approaches, datasets and evaluation metrics.
  • Motivation::
  • Results:: The paper divides the approaches into abstractive and extractive simplification. It then takes a closer look at lexical-simplification and novel text generation.


Imported: 2023-03-03 18:59

✅ Useful

  • “reducing the linguistic complexity of a text, while still retaining the original information and meaning” (p. 1111)
  • lexical-simplification pipeline:
  • “syntactic simplification seeks to identify grammatically complex text, and rewrite it so that it is easier to comprehend. This may involve splitting long sentences into shorter, more digestable chunks, changing passive voice usage to active, and resolving ambiguities and anaphora [76].” (p. 1122)
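
A minimal sketch of the lexical-simplification pipeline in the four stages the survey names (Complex Word Identification, Substitution Generation, Substitution Selection, Substitution Ranking). The frequency table, the synonym list, the threshold and all function names are made up for illustration and do not come from the paper; real systems would use corpus statistics, thesauri or embeddings, and proper context models instead.

```python
# Toy lexical-simplification pipeline: CWI -> SG -> SS -> SR.
# All data below (frequencies, synonyms, threshold) is hypothetical.

WORD_FREQ = {  # assumed corpus frequencies; higher = more familiar
    "the": 1000, "doctor": 300, "will": 800, "your": 850,
    "help": 500, "aid": 120, "assist": 90, "facilitate": 5,
    "recovery": 60,
}
SYNONYMS = {  # assumed thesaurus entries
    "facilitate": ["help", "aid", "assist", "ease"],
}
COMPLEX_BELOW = 10  # assumed frequency threshold for "complex"


def identify_complex_words(tokens):
    """CWI: flag tokens whose toy frequency falls below the threshold."""
    return [t for t in tokens if WORD_FREQ.get(t, 0) < COMPLEX_BELOW]


def generate_substitutions(word):
    """SG: look up candidate replacements for a complex word."""
    return SYNONYMS.get(word, [])


def select_substitutions(candidates, tokens):
    """SS: keep candidates that fit the context.

    Toy rule: the candidate must be a known word and must not
    already occur in the sentence.
    """
    return [c for c in candidates if c in WORD_FREQ and c not in tokens]


def rank_substitutions(candidates):
    """SR: order candidates from simplest to hardest (frequency as proxy)."""
    return sorted(candidates, key=lambda c: WORD_FREQ[c], reverse=True)


def simplify(sentence):
    """Run all four stages and replace each complex word with its top candidate."""
    tokens = sentence.lower().split()
    complex_words = set(identify_complex_words(tokens))
    output = []
    for token in tokens:
        if token in complex_words:
            candidates = generate_substitutions(token)
            candidates = select_substitutions(candidates, tokens)
            ranked = rank_substitutions(candidates)
            output.append(ranked[0] if ranked else token)
        else:
            output.append(token)
    return " ".join(output)


print(simplify("The doctor will facilitate your recovery"))
# prints: the doctor will help your recovery
```

Keeping the stages as separate functions mirrors how the survey treats them: each one can be swapped out (for example, a classifier for CWI or a language model for SS) without touching the others.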

⭐ Main ideas

  • “Extractive text simplification involves simply selecting sentences from a paragraph or document that convey the most “meaning.”” (p. 1113)
  • “Abstractive text simplification involves generation of new and novel text, which is lexically and/or syntactically simpler than the original. Abstractive approaches have mostly focused on lexical or phrasal substitutions for sentence-level simplification [63].” (p. 1114)
  • “lexical-simplification (LS), first introduced in the work of Devlin and Tait (1998) [25], is a form of abstractive TS that aims to replace complex words, which may challenge certain audience, by simpler alternatives [60].” (p. 1115)
  • “starting with Complex Word Identification, and followed by Substitution Generation, Substitution Selection and finally Substitution Ranking” (p. 1115)
  • “CWI involves determining which words to simplify, given a target audience” (p. 1115)
  • “The goal of SG is to generate possible candidates for replacement of complex words.” (p. 1118)
  • “The aim of SS is to choose which of candidates produced during SG would fit the context of the sentence being simplified, with respect to grammatical construction and meaning” (p. 1119)
  • “Substitution Ranking (SR)…involves ranking and deciding which of the substitutes generated produce the simplest output, in the given context” (p. 1120)
  • “automatic creation of rewrite rules for simplifying text” (p. 1122)
  • “The text is first analyzed to identify its structure and create a parse tree.” (p. 1122)
  • “In the transformation phase, the parse tree is modified according to a set of rewrite rules, which perform the simplification operations, such as sentence splitting [25], clause rearrangement [78] and clause dropping [83].” (p. 1122)
  • “a regeneration phase may also be carried out, during which further modifications are made to the text to improve cohesion, relevance and readability” (p. 1122)
  • “Statistical machine-translation (SMT)” (p. 1123)
  • “converting the problem of simplification to a case of monolingual text-to-text generation” (p. 1123)
  • “6.1 Evaluating Novel Text Generation” (p. 1129)
  • “SARI - Developed by Xu et al. in 2016, SARI is currently the main metric used for simplification model [101]” (p. 1129)
  • “this metric requires several simplified reference sentences to be compared against” (p. 1129)
  • “Readability Indices” (p. 1129)
  • “Flesch Reading Ease, Flesch Kincaid Grade (based on words, sentences and syllables), Coleman Liau (relies on characters per word), Automated Readability Index (based on characters, words and sentences), Linsear Write Formula (based on “easy” and “hard” words per sentence), and Gunning FOG (based on regular and complex words per sentence)” (p. 1129)
  • “SAMSA - A new metric proposed by Sulem et al.” (p. 1130)
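
To make the readability indices above concrete, here is a small sketch of the two Flesch scores, which are computed from word, sentence and syllable counts. The formulas are the standard published ones; the syllable counter and the regex tokenization are naive assumptions of mine (one syllable per run of vowels), so scores on real text will only be approximate.

```python
import re


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word):
    """Naive heuristic (assumption): one syllable per run of vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_counts(text):
    """Return (words, sentences, syllables) using simple regex tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


words, sentences, syllables = text_counts("The cat sat. The dog ran.")
print(flesch_reading_ease(words, sentences, syllables))
print(flesch_kincaid_grade(words, sentences, syllables))
```

Both scores only look at surface counts, so they say nothing about whether a simplification preserved the meaning; that is one reason the survey lists them alongside reference-based metrics such as SARI.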

🧩 Methodology

  • “There were two significant issues with early seq2seq models” (p. 1124)
  • “Reproduce inaccurate output” (p. 1124)
  • “Repetitions in output” (p. 1124)

📚 Investigate

  • “David Klaper, Sarah Ebling, and Martin Volk. 2013. Building a German/Simple German Parallel Corpus for Automatic Text Simplification. Proceedings of the Second Workshop on Predicting and Improving Text Readability for Target Reader Populations (2013), 11–19. W13-2902” (p. 1132)