Chapter 2. The Body and the Web: An Introduction to the Web as Corpus

The main questions this paper tries to answer are what a corpus really is, whether the web is a corpus or can be used as a source to create one, and what issues arise from using web content and/or querying the web directly.

📇 Index

  • 39 | The web is not a corpus because it is infinite and not designed for linguistic study
  • 40 | What is it good for vs is it even a corpus = two separate questions
  • 41 | The discussion raises a more fundamental question: should “corpus” be redefined?
  • 42 | Authenticity of texts extracted from the web
    • CONTRA CONTENT is not authentic
  • 43 | Representativeness
    • CONTRA The web can never be representative
      • only a subset of people on earth have access to the internet
        • only a subset of those people actually write something
        • corporations dominate
        • white people dominate
        • men dominate
      • language-use: not all languages equally represented, internet-speak
      • disproportionate
  • 44 | Data is not created for the corpus, it’s pre-existing
    • I could imagine that picking examples from such a huge resource creates room for biases

  1. Chapter 2. The Body and the Web: An Introduction to the Web as Corpus. In The Web As Corpus: Theory and Practice, 35–72. Bloomsbury Academic. 9781472542182.

On the one hand, the traditional notion of a linguistic corpus as a body of texts rests on some correlate issues, such as finite size, balance, part-whole relationship and permanence; on the other hand, the very idea of a web of texts brings about notions of non-finiteness, flexibility, de-centring/re-centring, and provisionality.

‘why should anyone want to use other than carefully compiled corpora?’ (Hundt et al. 2007: 1)

Baroni and Bernardini (2006: 10–14) focused on four basic ways of conceiving of the web as/for corpus:

  • The Web as a corpus surrogate
  • The Web as a corpus shop
  • The Web as corpus proper
  • The mega-Corpus mini-Web…attempts to create a new object

The very idea of treating the web as a linguistic corpus presupposes a view of what a corpus is, and entails a redefinition of what a corpus can be.

A corpus is… well-designed and carefully constructed

McEnery and Wilson (following others before them) mix the question ‘What is a corpus?’ with ‘What is a good corpus (for certain kinds of linguistic study)?’, muddying the simple question ‘Is corpus x good for task y?’ with the semantic question ‘Is x a corpus at all?’.

it is therefore not far from the truth to state that the web itself, as a repository of huge amounts of authentic language in electronic format, freely available with little effort, has contributed to making the corpus linguistics approach so popular and accessible

✅ Definition

the web as a ‘spontaneous’, ‘self-generating’ collection of texts,

certain major areas of language use are under-represented

its status as a ‘representative sample’ remains non-existent: ‘It is a textual universe of unfathomed extent and variety, but it can in no way be considered a representative sample of language use in general’ (Leech 2007: 145).

⭕ Caveats/Lookup

It is a basic assumption of corpus linguistics that all the language included in a corpus is authentic, and certainly the most prominent feature of the web to have attracted the linguist’s attention is its undisputable nature as a reservoir of authentic, purposeful language behaviour.
Possible challenges to this authenticity: generated text, influences due to algorithmic favouring, “content creation”/influencers, performativity.

there would be no reason to turn to the web as an object of linguistic study were it not for its being comprised of authentic texts

Owing to its nature as an unplanned, unsupervised, unedited collection of texts, authenticity in the web is in fact often related to problems of ‘authoritativeness’.

in the words of Leech, it might seem that ‘the web as corpus makes the notion of a representative corpus redundant’ (2007: 144).