Representation Problems in Linguistic Annotations: Ambiguity, Variation, Uncertainty, Error and Bias
- 60 | Describing and solving common problems which can negatively impact the usefulness of a corpus
- The problems are: ambiguity, variation, uncertainty, error and bias
- 60 | The annotation process requires a lot of effort. When creating a corpus, a balance has to be found between linguistic (theoretical) accuracy and accessibility, both for users and for processing tools.
- ! Important to keep in mind: There is no perfect annotation. There will always be a certain degree of uncertainty and inter-annotator disagreement.
- For a corpus to be useful, we must consider several possible pitfalls that can degrade the quality of the results we’re trying to achieve
- Ambiguity and variation are inherent properties of linguistic data
- Uncertainty develops during the annotation process, when an annotator is lacking contextual information or specific knowledge
- Errors are introduced in the annotations, e.g. through wrong attributions and misspellings
- Biases are a “property of the complete annotation system”
- So, what should we do if annotation is so fraught with issues?
- 62 | In several studies, researchers tried to introduce some kind of qualitative metric to help annotators decide between two or more possible annotations
- Reflect whether it is more likely that option A or option B occurs in the data and in the context you’re finding it in
- Rating their own confidence in their decision
- Ultimately, what happens is that researchers choose one of the following three options
- take stochastic measures → choose the most likely option
- assign to other/misc categories
- leave parts of the data unannotated
- INFO Previous work hasn’t applied a universal framework; rather, the authors added patches for various very specific questions and issues. Beck et al. try to solve this.
- In their paper, they first try to […]
- 62-63 | Three stages of corpus development
- Phase I: Data Selection and Processing
- Digitization, Normalization
- Phase II: Data Annotation
- What needs to be annotated?
- How can we classify our annotations? Which labels should we use?
- Phase III: Interpretation
- These steps inform each other. Interpretation and inference are only possible if the data is properly selected and prepared.
- Phase I: Data Selection and Processing
- INFO Some ways to combat this are to a) define a common annotation guideline used by every annotator, and b) review regularly and correct iteratively.
Metadata – PDF
Beck, Christin, Hannah Booth, Mennatallah El-Assady & Miriam Butt. 2020. Representation Problems in Linguistic Annotations: Ambiguity, Variation, Uncertainty, Error and Bias. In Proceedings of the 14th Linguistic Annotation Workshop, 60–73. Barcelona, Spain: Association for Computational Linguistics. https://aclanthology.org/2020.law-1.6. (19 April, 2023).
Imported: 2023-05-06 14:45
we identify and discuss five sources of representation problems, which are independent though interrelated: ambiguity, variation, uncertainty, error and bias
Note (p. 60)
Annotating data, especially historical records, always leads to some degree of uncertainty and inter-annotator disagreement.
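One common way to put a number on inter-annotator disagreement is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The paper excerpts above do not name a specific measure, so this is only a minimal sketch of a standard technique; the function name `cohen_kappa` and the toy POS labels are assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: two annotators tagging six tokens.
annotator_1 = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
annotator_2 = ["NOUN", "VERB", "ADJ", "ADJ", "NOUN", "NOUN"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance. The value says nothing about why annotators disagree, i.e. whether the source is ambiguity, uncertainty or error.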
developing a robust framework which explicitly treats these problems
Representation problems in linguistic annotations come from five distinct sources: (i) Ambiguities are an inherent property of the data. (ii) Variation is also part of the data and can, e.g., occur across documents. (iii) Uncertainty is introduced by an annotator’s lack of knowledge or information. (iv) Errors can be found in the annotations. (v) Biases are a property of the complete annotation system.
Existing approaches typically treat representation problems in one of three ways in linguistic annotation processes: (i) stochastic treatment, (ii) assignment of an ‘other/miscellaneous’ category, (iii) left unannotated…among the possibilities the option which is most likely is chosen…mark entities about whose interpretation an annotator is uncertain with a specific tag (‘other/miscellaneous’ category), signaling that no adequate annotation is available…third option often employed is to leave uncertain material unannotated
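The three treatments can be contrasted in a small sketch. It assumes that each unit comes with candidate labels and a score per candidate; the scores, the `threshold` parameter and the `OTHER` tag name are hypothetical and not taken from the paper.

```python
from typing import Optional

OTHER = "OTHER"  # hypothetical name for the 'other/miscellaneous' tag


def resolve_stochastic(candidates: dict[str, float]) -> str:
    """(i) Stochastic treatment: choose the most likely candidate."""
    return max(candidates, key=candidates.get)


def resolve_other(candidates: dict[str, float], threshold: float = 0.6) -> str:
    """(ii) Fall back to a catch-all tag when no candidate is clear enough."""
    best = resolve_stochastic(candidates)
    return best if candidates[best] >= threshold else OTHER


def resolve_unannotated(
    candidates: dict[str, float], threshold: float = 0.6
) -> Optional[str]:
    """(iii) Leave the unit unannotated (None) when no candidate is clear enough."""
    best = resolve_stochastic(candidates)
    return best if candidates[best] >= threshold else None


# Hypothetical scores for one token with two plausible readings.
scores = {"NOUN": 0.55, "VERB": 0.45}
print(resolve_stochastic(scores))
print(resolve_other(scores))
print(resolve_unannotated(scores))
```

All three functions return a single value or nothing, so the information that two readings were nearly equally plausible is lost in each case. None of them records which source the problem came from.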
We see the important barrier to overcome here as a conceptual one, in the sense that concepts like ‘uncertainty’ and ‘ambiguity’ are often used interchangeably, despite there being inherent differences between the sources of these representation problems.
Previous work hasn’t applied a universal framework; rather, the authors added patches for various very specific questions and issues. Beck et al. try to solve this.
The corpus development workflow consists of data selection and processing, typically including digitization, normalization and automatic pre-processing, and cycles of annotation. We see two major parts of a typical corpus development process: Data Selection and Processing (Phase I) and Annotation (Phase II). Following the corpus development, there is a third part, Interpretation (Phase III), which pertains to corpus use, addressing the issue of interpreting the annotated data.
these steps all inform each other
Ensuring a balanced corpus
digitization of any non-digitized textual material, spelling normalization and an automatic pre-processing
The manual annotation process generally consists of iterative cycles of annotation, evaluation and error correction.
combining it with machine learning
First, the linguistic unit which is to be annotated has to be identified. The classification task then deals with assigning an annotation label to the previously identified linguistic unit.
Without explicit treatment, representation problems often persist in post-hoc analyses of the annotations, rendering linguistic findings and computational models potentially unreliable or even misleading
Any manual annotation process represents a compromise between an accurate linguistic analysis and an annotation scheme which is generalizable enough to serve computational tools and the end user.
Image (p. 61)
Merten and Seemann (2018) have developed a novel interface which enables an annotator to capture different sources of uncertainty while annotating POS…(i) category A is more likely than B, (ii) A and B are equally likely, (iii) unsure
Likert-scale based approach
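A record that keeps both candidate labels, the three-way judgement and a self-rated confidence could look like the sketch below. It is not the interface by Merten and Seemann; combining the three options with a Likert rating in one record, the five-point scale, and all class and field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Preference(Enum):
    A_MORE_LIKELY = "A is more likely than B"
    EQUALLY_LIKELY = "A and B are equally likely"
    UNSURE = "unsure"


@dataclass(frozen=True)
class UncertainTag:
    token: str
    option_a: str
    option_b: str
    preference: Preference
    confidence: int  # self-rated, assumed 1 (low) to 5 (high)

    def __post_init__(self) -> None:
        if not 1 <= self.confidence <= 5:
            raise ValueError("confidence must be between 1 and 5")

    def committed_label(self) -> Optional[str]:
        """Return a single label only if the annotator preferred option A."""
        if self.preference is Preference.A_MORE_LIKELY:
            return self.option_a
        return None


# Hypothetical example: a token that could be a noun or a verb.
tag = UncertainTag("run", "NOUN", "VERB", Preference.A_MORE_LIKELY, 4)
print(tag.committed_label())
```

Storing both options alongside the judgement keeps the alternative reading available for later interpretation instead of discarding it at annotation time.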