Data selection
- SciSumm Dataset, version from 2019
- computational linguistics (CL) papers in XML format, each with gold summaries from human annotators as plaintext files
- citation information provided but not used
- original papers were annotated with the abstract and sentence IDs; some were also split up into an abstract and titled sections
- the first line of text (the first S element in the original paper files, the first line in the summaries) always contained the title of the paper
- compared the two titles to rule out processing errors
- some files had to be discarded because the original paper was completely illegible or the source file was empty
- C94-2174:

- C02-1139:
- same thing for C04-1073
- C04-1046
- checked for empty titles using the XPath expression that matched the most files
- J05-1004 is missing its title in the source
- built a preliminary DataFrame to store the data
- the original files still need some preprocessing and cleanup
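
The title check and the preliminary table described in these notes could look roughly like the sketch below. It is not the code actually used. It assumes the title sits in the first `S` element of the paper XML and on the first line of the plaintext summary, as noted above. The function names, the record fields, and the paper IDs `X00-0001` to `X00-0003` are hypothetical placeholders, not entries from the dataset.

```python
import xml.etree.ElementTree as ET


def first_sentence_title(xml_text):
    """Return the text of the first <S> element, or None if it is unusable."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None  # empty or unparseable source file
    first = next(root.iter("S"), None)  # first <S> in document order
    if first is None:
        return None
    title = "".join(first.itertext()).strip()
    return title or None  # treat an empty title as missing


def normalize(text):
    """Collapse whitespace and lowercase so trivial differences do not count."""
    return " ".join(text.split()).lower()


def build_record(paper_id, xml_text, summary_text):
    """Compare the title in the paper XML with the first line of the summary."""
    source_title = first_sentence_title(xml_text)
    lines = summary_text.splitlines()
    summary_title = lines[0].strip() if lines else ""
    return {
        "paper_id": paper_id,
        "source_title": source_title,
        "summary_title": summary_title or None,
        "titles_match": (
            source_title is not None
            and bool(summary_title)
            and normalize(source_title) == normalize(summary_title)
        ),
    }


# Hypothetical inputs: a well-formed paper, an empty source file, an empty title.
records = [
    build_record(
        "X00-0001",
        '<PAPER><S sid="0">A Sample Title</S>'
        '<ABSTRACT><S sid="1">Text.</S></ABSTRACT></PAPER>',
        "A Sample Title\nSummary text.",
    ),
    build_record("X00-0002", "", "Another Title\nSummary text."),
    build_record("X00-0003", '<PAPER><S sid="0"></S></PAPER>', "Third Title\nText."),
]

# Papers whose source title is missing are candidates for discarding or manual review.
to_review = [r["paper_id"] for r in records if r["source_title"] is None]
```

The resulting list of dicts can be passed to `pandas.DataFrame(records)` to get the preliminary table. Comparing normalized titles rather than raw strings keeps whitespace or casing differences from being reported as processing errors.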