motivation §
- managing information is one of the key issues of today's society
- In a world where vast amounts of textual information are readily available to a large share of the population, efficiently managing the storage, organisation, and retrieval of this data has become a key problem.
- Natural Language Processing and Generation play an important role in providing access to the knowledge stored and woven into this sea of texts
- NLP researchers play an important role in directly helping people by developing narrowly-scoped applications, e.g. by improving on existing accommodations for disabled and neurodivergent people
- huge prevalence of generative language models which are known to
- supposing that the main goal of generating text is to provide trustworthy and reliable access to knowledge, it follows that training datasets should contain source attributions
- Editing for attribution
- for scientific articles in particular, researchers seem to be very careful to provide this attribution in abstracts but the situation becomes less clear in the rest of the document (^96d40c)
- extractive AND abstractive methods
- where extractive means that sequences were directly taken from the source document
- single-document summarization
- supervised learning
methodology §
- one-to-one sentence alignment
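
A minimal sketch of what one-to-one alignment could look like, assuming a naive regex tokenizer and Jaccard word overlap as the similarity score. Both are placeholders chosen for illustration and not a decided method. The names `tokens`, `jaccard` and `align` are hypothetical, and the source sentence list is assumed to be non-empty.

```python
import re


def tokens(sentence):
    """Lowercased set of word tokens (deliberately naive tokenizer)."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align(summary_sentences, source_sentences):
    """For each summary sentence, return (best source index, score).

    Assumes source_sentences is non-empty. Ties go to the earliest
    source sentence.
    """
    source_tokens = [tokens(s) for s in source_sentences]
    alignment = []
    for sentence in summary_sentences:
        sentence_tokens = tokens(sentence)
        scores = [jaccard(sentence_tokens, st) for st in source_tokens]
        best = max(range(len(scores)), key=scores.__getitem__)
        alignment.append((best, scores[best]))
    return alignment


source = [
    "The cat sat on the mat.",
    "Stocks fell sharply on Monday.",
    "Rain is expected tomorrow.",
]
summary = ["Stocks fell on Monday.", "The cat sat."]
alignment = align(summary, source)
```

Each summary sentence gets exactly one source sentence, which is what makes this one-to-one. A low best score could serve as a hint that the sentence has no real counterpart in the source.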
possible features §
- sentence length in characters
- cosine similarity between vectors → word embeddings stacked for each sentence
- sentiment similarity
- word overlap, phrase overlap, bigram/trigram overlap
- LDA/topic-modelling
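
A few of the features above can be sketched with the standard library alone. Note the assumptions: cosine similarity is computed over bag-of-words count vectors as a stand-in for stacked word embeddings, the tokenizer is a naive regex, and sentiment similarity and LDA are left out because they need external models. The function names and feature keys are hypothetical.

```python
import math
import re
from collections import Counter


def words(sentence):
    """Lowercased word tokens (naive regex tokenizer)."""
    return re.findall(r"\w+", sentence.lower())


def ngrams(token_list, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(token_list[i:i + n]) for i in range(len(token_list) - n + 1)}


def ngram_overlap(a, b, n):
    """Fraction of a's distinct n-grams that also occur in b."""
    grams_a, grams_b = ngrams(words(a), n), ngrams(words(b), n)
    return len(grams_a & grams_b) / len(grams_a) if grams_a else 0.0


def cosine_bow(a, b):
    """Cosine similarity of word-count vectors (stand-in for embeddings)."""
    va, vb = Counter(words(a)), Counter(words(b))
    dot = sum(count * vb[word] for word, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def features(summary_sentence, source_sentence):
    """Feature dict for one (summary sentence, source sentence) pair."""
    return {
        "summary_len_chars": len(summary_sentence),
        "source_len_chars": len(source_sentence),
        "unigram_overlap": ngram_overlap(summary_sentence, source_sentence, 1),
        "bigram_overlap": ngram_overlap(summary_sentence, source_sentence, 2),
        "trigram_overlap": ngram_overlap(summary_sentence, source_sentence, 3),
        "cosine_bow": cosine_bow(summary_sentence, source_sentence),
    }


f = features("the cat sat", "the cat sat on the mat")
```

The overlap scores are asymmetric on purpose: they measure how much of the summary sentence is covered by the source sentence, not the reverse.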
potential improvements §
- creating a diff of which sequences overlap and which don’t even exist in the source
- visibly annotating similarities/source references
- creating a web interface to be used alongside generative language models to assess the validity of their output
- multi-document summarization
- many-to-many
- PRO: fine-grained backtracing
- CON:
- ranking + human annotators could be used for reinforcement learning from human feedback (RLHF)
- hypothesis: humans have an intuitive understanding of sentence alignment and may be able to better judge the semantic weight/relevance of specific word sequences in a sentence
- (exploratory, interactive summarization?)
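
The diff idea from the list above could be prototyped with Python's `difflib`. This is a rough sketch under simple assumptions: whitespace tokenization, exact word matching, and a single source text. The function name `diff_against_source` and the labels `in_source` / `novel` are hypothetical.

```python
import difflib


def diff_against_source(summary, source):
    """Split the summary into spans labeled 'in_source' or 'novel'.

    'in_source' spans are word sequences matched in the source by
    SequenceMatcher; 'novel' spans have no match in the source.
    """
    summary_words = summary.split()
    source_words = source.split()
    matcher = difflib.SequenceMatcher(
        a=source_words, b=summary_words, autojunk=False
    )
    spans = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if j1 == j2:
            continue  # 'delete': source words that the summary dropped
        label = "in_source" if tag == "equal" else "novel"
        spans.append((label, " ".join(summary_words[j1:j2])))
    return spans


spans = diff_against_source(
    summary="the cat slept on the mat",
    source="the cat sat on the mat",
)
```

Because `SequenceMatcher` produces an ordered diff, content that the summary reorders relative to the source would be labeled `novel` here. A real version would probably need an order-independent n-gram lookup instead.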