Data selection

  • SciSumm Dataset, version from 2019
    • CL papers in XML format, each with gold summaries from human annotators as plaintext files
    • citation information provided but not used
    • original papers are annotated with the abstract and sentence IDs; some are also split into an abstract and titled sections
    • the first line of text (in the papers, the first S element) always contained the title of the paper, both in the original paper files and in the summaries
      • the two titles were compared to rule out processing errors
      • some files had to be discarded because the original paper was completely illegible or the source file was empty
        • C94-2174:
        • C02-1139:
          • same thing for C04-1073
          • C04-1046
    • checked for empty titles using the XPath expression that matched the most files
      • J05-1004 is missing its title in the source
    • built a preliminary DataFrame to store the data
    • I still need to preprocess and clean up the original files
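
The title checks above can be sketched roughly as follows. This is a minimal illustration, not the code actually used: the sample XML layout, the `.//S` lookup, the paper IDs (`X00-0000` etc.) and all function and column names are assumptions made for the example. It reads the first S element of a paper, reads the first line of its summary, compares the two, and flags papers that cannot be parsed or have an empty title.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for one paper and its summary; real files may differ.
SAMPLE_PAPER = """<PAPER>
  <S sid="0">A Sample Paper Title</S>
  <ABSTRACT>
    <S sid="1">First abstract sentence.</S>
  </ABSTRACT>
</PAPER>"""
SAMPLE_SUMMARY = "A Sample Paper Title\nThe paper proposes a sample method.\n"


def normalize(text):
    """Collapse whitespace and lowercase so trivial differences are ignored."""
    return " ".join(text.split()).lower()


def paper_title(xml_text):
    """Text of the first S element in document order; None if unparseable."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None  # empty or illegible source file
    first = root.find(".//S")
    if first is None:
        return ""
    return "".join(first.itertext()).strip()


def summary_title(summary_text):
    """First non-empty line of a plaintext summary, or an empty string."""
    for line in summary_text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def build_record(paper_id, xml_text, summary_text):
    """One row per paper, with flags for the checks described in the notes."""
    p_title = paper_title(xml_text)
    s_title = summary_title(summary_text)
    parse_failed = p_title is None
    return {
        "paper_id": paper_id,
        "paper_title": p_title,
        "summary_title": s_title,
        "parse_failed": parse_failed,
        "title_missing": (not parse_failed) and p_title == "",
        "titles_match": bool(p_title) and normalize(p_title) == normalize(s_title),
    }


# Placeholder IDs: a clean paper, an empty source file, and an empty title.
records = [
    build_record("X00-0000", SAMPLE_PAPER, SAMPLE_SUMMARY),
    build_record("X00-0001", "", ""),
    build_record("X00-0002", '<PAPER><S sid="0"></S></PAPER>', "Some Title\n"),
]
```

The resulting list of dicts can be passed to `pandas.DataFrame(records)` to get one row per paper; rows with `parse_failed` or `title_missing` set are the candidates for manual inspection or discarding.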