The (un)suitability of automatic evaluation metrics for text simplification

  • 862 | most common evaluation metrics for simplification capture only ONE of the multiple possible operations applied to a text
    • SARI → lexical paraphrasing
    • “multi-operation simplification”
  • 884 | new dataset for evaluation: 600 sentence simplifications from six systems
    • Direct Assessment method to crowdsource ratings of fluency, meaning preservation, and simplicity
  • 884 | low-quality simplifications may be easier to spot with basic metrics, but correlations with human judgments are not high overall; metrics correlate better on output from neural models
  • 885 | Recommendation: BERTScore should be used first
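The "SARI → lexical paraphrasing" point above can be made concrete with a toy sketch: a unigram-only, single-reference version of SARI's keep/add/delete idea. The real metric averages over 1-to-4-grams and multiple references, so `unigram_sari` below is an illustrative name and simplification, not the official implementation.

```python
def unigram_sari(source: str, output: str, reference: str) -> float:
    """Toy, unigram-only, single-reference sketch of the SARI idea."""
    src, out, ref = (set(s.lower().split()) for s in (source, output, reference))

    def f1(p: float, r: float) -> float:
        return 2 * p * r / (p + r) if p + r else 0.0

    # ADD: words the system introduced that the reference also added
    added = out - src
    good_add = added & (ref - src)
    add_p = len(good_add) / len(added) if added else 0.0
    add_r = len(good_add) / len(ref - src) if (ref - src) else 0.0

    # KEEP: words retained from the source that the reference also kept
    kept = out & src
    good_keep = kept & ref
    keep_p = len(good_keep) / len(kept) if kept else 0.0
    keep_r = len(good_keep) / len(src & ref) if (src & ref) else 0.0

    # DELETE: words dropped from the source that the reference also dropped
    # (original SARI scores deletion by precision only)
    deleted = src - out
    good_del = deleted & (src - ref)
    del_p = len(good_del) / len(deleted) if deleted else 0.0

    return (f1(add_p, add_r) + f1(keep_p, keep_r) + del_p) / 3
```

Because every component compares individual words against the source and references, the score rewards word substitutions (lexical paraphrasing) but is largely blind to operations like sentence splitting or reordering, which is the paper's core criticism.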

Metadata (PDF)

  • Summary:: Examines common evaluation metrics in text simplification to show which ones may not be suitable for the task; identifies which metrics are applicable for which specific perspective and dataset; introduces a new evaluation dataset.
  • Motivation:: Common metrics in this field were developed either for machine translation tasks OR for the assessment of human-written texts.
  • Results:: BERTScore is the recommended metric for simplification and should be applied first.


Imported: 2023-03-16 14:49

⭐ Main ideas

  • “Alva-Manchego et al. (2020) showed that, for the same set of original sentences, human judges preferred manual simplifications where multiple edit operations had been applied over those where only one operation had been performed (i.e., only lexical paraphrasing or only splitting).” (p. 862)
  • “we: (1) create a new data set with direct assessments of simplicity; (2) perform the first meta-evaluation of automatic metrics for sentence-level Text Simplification, focused on their correlation with human judgments on simplicity; and (3) propose a set of guidelines for automatic evaluation of sentence-level simplifications, seeking to improve the interpretation of automatic scores, especially for multi-operation simplifications” (p. 862)
  • “the first meta-evaluation study of automatic metrics in Sentence Simplification” (p. 884)

📚 Investigate

  • “Simplicity Gain data set” (p. 884)