Assessing Readability by Filling Cloze Items with Transformers

  • 307 | Readability can be seen as a masked language modelling task, using well-known cloze tests as the basis
    • The standard cloze-based readability assessment is called "nth-deletion"
    • Every nth word is deleted and replaced with a blank of fixed size
    • More correct completions = easier to read
  • 308 | Paradoxical but true: Simpler measures capture readability quite well, across genres and especially on lengthy documents
  • 309 | Using T5-large to fill cloze blanks over the first 250 words of each text (every 5th word masked, five versions with different offsets), then predicting the deleted words
  • 310 | Three datasets: Bormuth passages, Newsela, OneStopEnglish
  • 317 | Claim: Using a pretrained model is similar to a human interpreting a text since there is additional "world knowledge" available
    • 👍 Pro: humans also complete familiar phrases largely from statistical exposure, much like the model 🤷
    • 👎 Contra: the octopus paper @bender2020
      • No extrapolation possible because no understanding!
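
The nth-deletion procedure described above can be sketched in a few lines. A minimal sketch; the function name and the 5-character blank width are my own illustrative choices, not from the paper:

```python
def nth_deletion(text: str, n: int = 5, blank: str = "_____") -> tuple[str, list[str]]:
    """Replace every nth word with a fixed-size blank.

    Returns the clozed text plus the list of deleted words (the answer key).
    """
    words = text.split()
    answers = []
    # Delete words at positions n, 2n, 3n, ... (1-based counting).
    for i in range(n - 1, len(words), n):
        answers.append(words[i])
        words[i] = blank
    return " ".join(words), answers
```

The paper also produces five clozed versions per text by shifting the starting offset, so that every word is eventually deleted in one of the versions.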

Metadata (PDF)

  • Summary:: Compares cloze readability to other, perhaps superficial, readability metrics like Flesch Reading Ease
  • Motivation:: “Reproduce” Bormuth’s theories about cloze tests for readability measurements using SOTA DL models (T5-large)
  • Results:: T5 cloze results were more strongly correlated with human cloze scores than with expert-assigned grade levels, which suggests it’s a more “natural” approach
    • could be attributed to the fact that T5 will have learned similar contexts
    • OneStopEnglish (OSE) scores were really bad though, correlating with neither human scoring nor reading-ease scores
    • no transferability!


Imported: 2023-03-03 14:52

⭐ Main ideas

  • “The standard approach to assessing readability with cloze items is called nth deletion, where every nth word in a text is deleted and replaced with a blank of fixed size.” (p. 307)
  • “The seeming paradox that the simplest measures would be the best predictors of readability was addressed by Bormuth, who described it as a trade-off between face validity and predictive validity [3]: many linguistic variables correlate with readability, so a metric with face validity would include many linguistic variables; however, the measurement error associated with these variables means that a metric with fewer variables has better predictive power when applied to unseen texts.” (p. 308)
  • “T5 cloze scores potentially have some additive benefit to ASL and AWL and can be used almost interchangeably for this task” (p. 312)
  • “T5 cloze scores were much more strongly correlated with human cloze scores (study 1) than with expert-assigned grade levels (studies 2 and 3)” (p. 317)
  • "This capability is analogous to a human reader bringing to bear background knowledge in order to understand a text, and it is something that isn’t captured by word- or sentence-length metrics." (p. 317)

✅ Useful

  • “It has long been known that a higher number of correct completions on nth deletion cloze tests is a strong indicator of higher readability (low difficulty) and aligns with well-known readability metrics like Flesch reading ease and Dale-Chall readability, aptitude tests, and standard comprehension questions [2,4, 16,17].” (p. 307)

📚 Investigate

  • [b] “The idea of using Transformers to directly measure cloze difficulty was first investigated by Benzahra & Yvon, unfortunately without much success [1].” (p. 308)

🧩 Methodology

  • “All of the following studies use a Transformer called T5 [13]” (p. 309)
  • 👎 “only the first 250 words of each text are subjected to nth deletion (n = 5)” (p. 309)
  • “Five clozed versions of each text” (p. 309)
    • “different offsets” (p. 309)
  • “it was discovered that the T5 model used produces degenerate responses to cloze items after the 27th item. Therefore, each text was split into two chunks, each representing 125 words and 25 cloze items” (p. 309)
  • “Correct at rank 1 was defined by an exact match between the top prediction and the original word, normalized for case and leading/trailing whitespace.” (p. 310)
  • “all spelling errors were corrected” (p. 310)
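
The chunking and scoring steps above can be sketched as follows. A minimal sketch under assumptions: the paper scores T5-large predictions, and the function names here are my own; only the 250-word cutoff, the two 125-word chunks, and the case/whitespace normalization come from the paper:

```python
def chunk_text(text: str, words_per_chunk: int = 125, max_words: int = 250) -> list[str]:
    """Keep only the first 250 words and split them into two 125-word chunks,
    staying under the reported 27-item limit (25 cloze items per chunk at n = 5)."""
    words = text.split()[:max_words]
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def correct_at_rank_1(prediction: str, original: str) -> bool:
    """Exact match between the top prediction and the original word,
    normalized for case and leading/trailing whitespace (p. 310)."""
    return prediction.strip().lower() == original.strip().lower()

def cloze_score(predictions: list[str], originals: list[str]) -> float:
    """Fraction of cloze items answered correctly at rank 1."""
    hits = sum(correct_at_rank_1(p, o) for p, o in zip(predictions, originals))
    return hits / len(originals)
```

For example, predictions `["Cat ", "dog"]` against originals `["cat", "bird"]` yield a cloze score of 0.5, since the first item matches after normalization and the second does not.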