motivation

  • managing-information-is-one-of-the-key-issues-of-todays-society
    • In a world where vast amounts of textual information are readily available to a large set of the human population, efficiently managing the storage, organisation, and retrieval of this data has become a key problem to address.
    • Natural Language Processing and Generation play an important role in providing access to the knowledge stored and woven into this sea of texts
      • NLP researchers can genuinely help people by developing narrowly-scoped applications, e.g. by improving existing accommodations for disabled and neurodivergent people
    • huge prevalence of generative language models, which are known to produce fluent but unattributed (and sometimes fabricated) output
  • supposing that the main goal of generating text is to provide trustworthy and reliable access to knowledge, it follows that training datasets should contain source attributions
    • Editing for attribution
    • for scientific articles in particular, researchers seem to be very careful to provide this attribution in abstracts but the situation becomes less clear in the rest of the document (^96d40c)
  • extractive AND abstractive methods
    • where extractive means that sequences are taken verbatim from the source document, while abstractive methods rephrase the content in newly generated wording
  • single-document summarization
  • supervised learning

methodology

  • one-to-one sentence alignment
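
A minimal sketch of the one-to-one constraint, assuming greedy pairing by bag-of-words cosine similarity (any of the candidate features below could be substituted as the pairing score):

```python
from collections import Counter
import math

def cosine(a, b):
    # cosine similarity between two bag-of-words Counters
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def align_one_to_one(summary_sents, source_sents):
    # greedily pair each summary sentence with its most similar,
    # not-yet-used source sentence (enforcing the one-to-one constraint)
    used = set()
    alignment = {}
    for i, s in enumerate(summary_sents):
        s_vec = Counter(s.lower().split())
        best, best_sim = None, -1.0
        for j, t in enumerate(source_sents):
            if j in used:
                continue
            sim = cosine(s_vec, Counter(t.lower().split()))
            if sim > best_sim:
                best, best_sim = j, sim
        if best is not None:
            used.add(best)
            alignment[i] = best
    return alignment
```

Greedy matching is only one option here; an optimal one-to-one assignment could instead be found with the Hungarian algorithm over the pairwise similarity matrix.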

possible features

  • sentence length in characters
  • cosine similarity between sentence vectors → word embeddings pooled (e.g. stacked or averaged) per sentence
  • sentiment similarity
  • word overlap, phrase overlap, bigram/trigram overlap
  • LDA/topic-modelling
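
The surface-level features above (lengths, word overlap, bigram/trigram overlap) need nothing beyond the standard library; a sketch, using Jaccard overlap as one plausible normalisation:

```python
def ngrams(tokens, n):
    # set of contiguous n-grams over a token list
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(a, b):
    # Jaccard overlap between two sets
    return len(a & b) / len(a | b) if a | b else 0.0

def pair_features(sent_a, sent_b):
    # feature vector for one candidate sentence pair
    ta, tb = sent_a.lower().split(), sent_b.lower().split()
    return {
        "len_chars_a": len(sent_a),
        "len_chars_b": len(sent_b),
        "word_overlap": overlap(set(ta), set(tb)),
        "bigram_overlap": overlap(ngrams(ta, 2), ngrams(tb, 2)),
        "trigram_overlap": overlap(ngrams(ta, 3), ngrams(tb, 3)),
    }
```

The embedding-, sentiment-, and LDA-based features would be computed analogously per pair but require external models, so they are omitted from the sketch.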

potential improvements

  • creating a diff of which sequences overlap and which don’t even exist in the source
    • visibly annotating similarities/source references
    • creating a web interface to be used alongside generative language models to assess the validity of their output
  • multi-document summarization
  • many-to-many
    • PRO: fine-grained backtracing
    • CON: much larger search space; alignments become harder to annotate and evaluate
  • ranking + human annotators could be used for reinforcement learning from human feedback (RLHF)
    • hypothesis: humans have an intuitive understanding of sentence alignment and may be able to better judge the semantic weight/relevance of specific word sequences in a sentence
  • (exploratory, interactive summarization?)
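
The diff idea above (which summary sequences overlap the source verbatim and which do not even exist there) can be sketched with Python's stdlib difflib; the returned spans are verbatim overlaps, and every summary token outside them is novel:

```python
import difflib

def verbatim_spans(summary, source):
    # find maximal contiguous token sequences of the summary
    # that also occur verbatim in the source document
    src_tokens = source.lower().split()
    sum_tokens = summary.split()
    sm = difflib.SequenceMatcher(a=src_tokens,
                                 b=[t.lower() for t in sum_tokens])
    spans = []
    for block in sm.get_matching_blocks():
        if block.size:  # the final block is a zero-size sentinel
            spans.append(" ".join(sum_tokens[block.b:block.b + block.size]))
    return spans
```

These spans could drive the visible annotation layer: highlight them as sourced, flag everything else for verification in the proposed web interface.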