Chapter 2. The Body and the Web: An Introduction to the Web as Corpus
The main questions this paper tries to answer are what a corpus really is, whether the web is a corpus or can be used as a source to create one, and what issues arise when using content from the web and/or querying it directly.
Read: 2023-04-28
📇 Index
- 39 | The web is not a corpus because it is infinite and not designed for linguistic study
- 40 | What is it good for vs is it even a corpus = two separate questions
- 41 | The discussion raises a more fundamental question: should “corpus” be redefined?
- 42 | Authenticity of texts extracted from the web
- CONTRA Content is not authentic
- 43 | Representativeness
- CONTRA The web can never be representative
- only a subset of people on earth have access to the internet
- only a subset of those people actually write something
- corporations dominate
- white people dominate
- men dominate
- language-use: not all languages equally represented, internet-speak
- disproportionate
- 44 | Data is not created for the corpus, it’s pre-existing
- I could imagine that picking examples from such a huge resource creates room for biases
Some thoughts from the lecture
Chapter 2. The Body and the Web: An Introduction to the Web as Corpus. In The Web As Corpus : Theory and Practice, 35–72. Bloomsbury Academic. 9781472542182.
- Keywords:: linguistics, Corpus linguistics
Imported: 2023-05-04 13:18
⭐ Main
Highlight ( p. 35)
On the one hand, the traditional notion of a linguistic corpus as a body of texts rests on some correlate issues, such as finite size, balance, part-whole relationship and permanence; on the other hand, the very idea of a web of texts brings about notions of non-finiteness, flexibility, de-centring/re-centring, and provisionality
Highlight ( p. 36)
‘why should anyone want to use other than carefully compiled corpora?’ (Hundt et al. 2007: 1)
Highlight ( p. 37)
Baroni and Bernardini (2006: 10–14) focused on four basic ways of conceiving of the web as/for corpus:
- The Web as a corpus surrogate
- The Web as a corpus shop
- The Web as corpus proper
- The mega-Corpus mini-Web…attempts to create a new object
Highlight ( p. 39)
The very idea of treating the web as a linguistic corpus presupposes a view of what a corpus is, and entails a redefinition of what a corpus can be.
Highlight ( p. 39)
A corpus is… well-designed and carefully constructed
Highlight ( p. 40)
McEnery and Wilson (following others before them) mix the question ‘What is a corpus?’ with ‘What is a good corpus (for certain kinds of linguistic study)?’, muddying the simple question ‘Is corpus x good for task y?’ with the semantic question ‘Is x a corpus at all?’.
Highlight ( p. 42)
it is therefore not far from the truth to state that the web itself, as a repository of huge amounts of authentic language in electronic format, freely available with little effort, has contributed to making the corpus linguistics approach so popular and accessible
✅ Definition
Highlight ( p. 41)
the web as a ‘spontaneous’, ‘self-generating’ collection of texts,
Highlight ( p. 44)
certain major areas of language use are under-represented
Highlight ( p. 44)
its status as a ‘representative sample’ remains non-existent: ‘It is a textual universe of unfathomed extent and variety, but it can in no way be considered a representative sample of language use in general’ (Leech 2007: 145).
⭕ Caveats/Lookup
Highlight ( p. 42)
It is a basic assumption of corpus linguistics that all the language included in a corpus is authentic, and certainly the most prominent feature of the web to have attracted the linguist’s attention is its undisputable nature as a reservoir of authentic, purposeful language behaviour.
Counterpoints: generated text, influence of algorithmic favouring, “content creation”/influencers, performativity
Highlight ( p. 42)
there would be no reason to turn to the web as an object of linguistic study were it not for its being comprised of authentic texts
Highlight ( p. 43)
Owing to its nature as an unplanned, unsupervised, unedited collection of texts, authenticity in the web is in fact often related to problems of ‘authoritativeness’.
Highlight ( p. 43)
in the words of Leech, it might seem that ‘the web as corpus makes the notion of a representative corpus redundant’ (2007: 144).