Problems in Current Text Simplification Research: New Data Can Help

Metadata (PDF)

  • Summary:: Xu et al. aim to show that the most popular dataset in English text simplification, the Simple Wikipedia corpus, is unfit for training or evaluating simplification systems. Instead, they offer advice to the community on future research, and they create and evaluate the Newsela corpus.
  • Motivation:: The lack of proper, high-quality, diverse data
  • Results:: The Newsela corpus and insights into how experts adapt texts to different reading grade levels


Imported: 2023-03-04 13:54

✅ Useful

  • Between roughly 2010 and 2015, Simple Wikipedia was the main dataset. It is still used a lot, so I wonder whether this claim still holds:
    • datasets: “Simple Wikipedia has dominated simplification research in the past 5 years.” (p. 283)
  • “it can provide reading aids for people with disabilities (Carroll et al., 1999; Canning et al., 2000; Inui et al., 2003), low-literacy (Watanabe et al., 2009; De Belder and Moens, 2010), non-native backgrounds (Petersen and Ostendorf, 2007; Allen, 2009) or non-expert knowledge (Elhadad and Sutaria, 2007; Siddharthan and Katsos, 2010)” (p. 283)
  • datasets: “The Parallel Wikipedia Simplification (PWKP) corpus (Zhu et al., 2010) contains approximately 108,000 automatically aligned sentence pairs from cross-linked articles between Simple and Normal English Wikipedia.” (p. 284)
  • “The most widely practiced evaluation methodology is to have human judges rate on grammaticality (or fluency), simplicity, and adequacy (or meaning preservation) on a 5-point Likert scale.” (p. 292)
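
Note to self: the human evaluation setup quoted above (three criteria, 5-point Likert scale) could be aggregated roughly like this. This is only a minimal sketch; the ratings and the `mean_ratings` helper are made up for illustration and do not come from the paper.

```python
from statistics import mean

# Hypothetical ratings: one dict per judge for a single system output,
# each criterion scored on a 1-5 Likert scale.
ratings = [
    {"grammaticality": 5, "simplicity": 3, "adequacy": 4},
    {"grammaticality": 4, "simplicity": 4, "adequacy": 4},
    {"grammaticality": 5, "simplicity": 2, "adequacy": 3},
]

def mean_ratings(ratings):
    """Average each criterion over all judges."""
    criteria = ratings[0].keys()
    return {c: mean(r[c] for r in ratings) for c in criteria}

scores = mean_ratings(ratings)
for criterion, score in scores.items():
    print(f"{criterion}: {score:.2f}")
```

Averaging Likert scores is a common simplification. Whether the mean is the right summary for ordinal ratings is a separate question that this sketch does not address.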

⭐ Main ideas

  • “1) The Simple Wikipedia was created by volunteer contributors with no specific objective; 2) Very rarely are the simple articles complete re-writes of the regular articles in Wikipedia (Coster and Kauchak, 2011), which makes automatic sentence alignment errors worse; 3) As an encyclopedia, Wikipedia contains many difficult sentences with complex terminology.” (p. 285)
  • “even the simple side of the PWKP corpus contains an extensive English vocabulary of 78,009 unique words” (p. 285)
  • “we have assembled a new simplification dataset that consists of 1,130 news articles. Each article has been re-written 4 times for children at different grade levels by editors at Newsela” (p. 286)
  • “most recent automatic simplification systems are developed and evaluated with little consideration of target reader population” (p. 292)
  • “The Newsela corpus allows us to target children at different grade levels.” (p. 292)
  • “It is widely accepted that sentence simplification involves three different elements: splitting, deletion and paraphrasing (Feng, 2008; Narayan and Gardent, 2014).” (p. 293)
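
Note to self: the vocabulary-size observation above (78,009 unique words on the simple side of PWKP) could be checked on any corpus with a count like the one below. This is a rough sketch: the lowercase, letters-only tokenization is my assumption, not necessarily how the paper counted words, and `vocabulary_size` and the sample sentences are hypothetical.

```python
import re

def vocabulary_size(sentences):
    """Count unique lowercase word types across a list of sentences."""
    vocab = set()
    for sentence in sentences:
        # Assumed tokenization: runs of letters, optionally with one apostrophe part.
        vocab.update(re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower()))
    return len(vocab)

# Made-up sample standing in for the "simple" side of a parallel corpus.
sample = ["The cat sat on the mat.", "A cat is a small animal."]
print(vocabulary_size(sample))
```

Running the same count on the simple and the normal side of a corpus would give a quick sense of how much the vocabulary actually shrinks.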

📚 Investigate