On the State of German (Abstractive) Text Summarization

  • 2 | definition of extractive vs abstractive
  • 2 | ROUGE is the most used evaluation standard (n-gram co-occurrences)
  • 3 | supposedly abstractive models often produce copied text
  • 5f | survey of all existing (or rather known) German or multilingual datasets and shared models
  • 6 | low reproducibility
  • 8 | evaluation metrics: often compared to ONE gold standard, no multiple possible targets
  • 9f | data cleaning: empty, too short, duplicates
  • 12f | there are almost no extractive systems for German
  • 13 | evaluation metrics for extractive summ
    • lead bias: use the first X words of the text (different across domains)
  • 14ff | how are current baselines and state-of-the-art results influenced by the low-quality data?
  • 19 | even in-domain samples from the same dataset can lead to huge semantic changes
  • 20 | conclusion
    • most models are unavailable to the public
    • there is a huge focus on news still
    • datasets are of bad quality, contain empty results or unparsed artifacts
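The two evaluation notions from the outline (ROUGE as n-gram co-occurrence, and the lead baseline) can be sketched minimally. This assumes plain whitespace tokenization; the function names are illustrative, not from the paper, and real ROUGE implementations add stemming and ROUGE-2/L variants:

```python
from collections import Counter

def rouge_1_f(candidate: str, reference: str) -> float:
    """Naive ROUGE-1 F1: unigram co-occurrence between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection of unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def lead_baseline(text: str, n_words: int = 30) -> str:
    """Lead baseline (exploiting lead bias): the first n words of the input text.
    The optimal cutoff differs across domains, as the notes point out."""
    return " ".join(text.split()[:n_words])
```

Because news articles front-load key information, the lead baseline is often surprisingly strong under ROUGE, which is part of why the paper asks authors to compare against it.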

Metadata (PDF)

  • Summary::
  • Motivation::
  • Results::


Imported: 2023-03-04 13:06

⭐ Main ideas

  • “We confirm that for the most popular training dataset, MLSUM, over 50% of the training set is unsuitable for abstractive summarization purposes.” (p. 1)
  • “Most works rely entirely on 𝑛-gram-based analysis of system summaries, such as ROUGE [Li04], which cannot accurately judge the truthfulness of a generated summary, i.e., how accurately the original text’s factual statements are represented in the generated summary.” (p. 2)
  • “Key Finding 1: German subsets of two popular multilingual resources (MLSUM and MassiveSumm) have extreme data quality issues, affecting more than 25% of samples across all splits.” (p. 14)
  • “Key Finding 2: Existing evaluation scores are hard (if not outright impossible) to reproduce, even with model weights publicly available.
    Key Finding 3: Authors frequently fail to put scores into context, not comparing their own results against baseline methods for further scrutiny.” (p. 15)
  • “Key Finding 4: After filtering, scores can drop by more than 20 ROUGE-1 points on the MLSUM test set.” (p. 16)
  • “Key Finding 5: With the exception of one work [Ak20], no publicly available system performs experiments beyond simple ROUGE score computation.
    Key Finding 6: Despite high reported scores, catastrophic failures can be observed in some systems.
    Key Finding 7: All utilized architectures only work with a relatively limited context, proving to be incapable of dealing with long-form summarization.” (p. 18)
  • “around half of the currently known German summarization systems still remain inaccessible to the public” (p. 20)
  • “a prominent focus on news summarization is still persisting” (p. 20)
  • “the most prominent dataset contains severe flaws in the sample quality” (p. 20)

✅ Useful

  • “extractive systems provide summaries by simply copying text snippets from the original input, which is efficient to compute, but comes at the cost of lower textual fluency. On the other hand, abstractive summarization systems may introduce new phrases, or even full sentences, which are not present in the original document.” (p. 2)
  • “Another key metric used in summarization research is the Compression Ratio (CR), defined as the relation between reference text length and summary length. We follow the definition by Grusky et al. [GNA18].” (p. 10) — i.e., CR(A, S) = |A| / |S|, the reference text length divided by the summary length
  • “It can be argued, however, that samples with summaries longer (or equal) than their respective references (i.e., CR ≤ 1.0) always pose an inadequate sample and must be filtered.” (p. 10)
  • “extractive summaries are guaranteed to ensure a more factually consistent summary, and have high intra-sentence coherence” (p. 13)

📚 Investigate

  • “Aumiller, Dennis; Gertz, Michael: Klexikon: A German Dataset for Joint Summarization and Simplification. In: Proceedings of the Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, pp. 2693–2701, June 2022.” (p. 21)
  • “Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10:50–72, 2022.” (p. 23)
  • “Mitchell, Margaret; Wu, Simone; Zaldivar, Andrew; Barnes, Parker; Vasserman, Lucy; Hutchinson, Ben; Spitzer, Elena; Raji, Inioluwa Deborah; Gebru, Timnit: Model Cards for Model Reporting. In (danah boyd; Morgenstern, Jamie H., eds): Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* 2019, Atlanta, GA, USA, January 29-31, 2019. ACM, pp. 220–229, 2019.” (p. 23)