Chapter 2. The Body and the Web: An Introduction to the Web as Corpus

The main questions this paper tries to answer are what a corpus really is, whether the web is a corpus or can be used as a source to create one, and what issues arise from using web content and/or querying the web directly.

Read: 2023-04-28

📇 Index

  • 39 | The web is not a corpus because it is infinite and not designed for linguistic study
  • 40 | “What is it good for?” vs. “Is it even a corpus?” = two separate questions
  • 41 | The discussion raises a more fundamental question: should “corpus” be redefined?
  • 42 | Authenticity of texts extracted from the web
    • CONTRA CONTENT is not authentic
  • 43 | Representativeness
    • CONTRA The web can never be representative
      • only a subset of people on earth have access to the internet
        • only a subset of those people actually write something
        • corporations dominate
        • white people dominate
        • men dominate
      • language-use: not all languages equally represented, internet-speak
      • disproportionate
  • 44 | Data is not created for the corpus, it’s pre-existing
    • I could imagine that picking examples from such a huge resource creates room for biases

Metadata – PDF, PDF 2

  1. Chapter 2. The Body and the Web: An Introduction to the Web as Corpus. In The Web As Corpus : Theory and Practice, 35–72. Bloomsbury Academic. 9781472542182.

Imported: 2023-05-04 13:18

⭐ Main

Highlight ( p. 35)

On the one hand, the traditional notion of a linguistic corpus as a body of texts rests on some correlate issues, such as finite size, balance, part-whole relationship and permanence; on the other hand, the very idea of a web of texts brings about notions of non-finiteness, flexibility, de-centring/re-centring, and provisionality

Highlight ( p. 36)

‘why should anyone want to use other than carefully compiled corpora?’ (Hundt et al. 2007: 1)

Highlight ( p. 37)

Baroni and Bernardini (2006: 10–14) focused on four basic ways of conceiving of the web as/for corpus:

  • The Web as a corpus surrogate
  • The Web as a corpus shop
  • The Web as corpus proper
  • The mega-Corpus mini-Web…attempts to create a new object

Highlight ( p. 39)

The very idea of treating the web as a linguistic corpus presupposes a view of what a corpus is, and entails a redefinition of what a corpus can be.

Highlight ( p. 39)

A corpus is… well-designed and carefully constructed

Highlight ( p. 40)

McEnery and Wilson (following others before them) mix the question ‘What is a corpus?’ with ‘What is a good corpus (for certain kinds of linguistic study)?’, muddying the simple question ‘Is corpus x good for task y?’ with the semantic question ‘Is x a corpus at all?’.

Highlight ( p. 42)

it is therefore not far from the truth to state that the web itself, as a repository of huge amounts of authentic language in electronic format, freely available with little effort, has contributed to making the corpus linguistics approach so popular and accessible

✅ Definition

Highlight ( p. 41)

the web as a ‘spontaneous’, ‘self-generating’ collection of texts,

Highlight ( p. 44)

certain major areas of language use are under-represented

Highlight ( p. 44)

its status as a ‘representative sample’ remains non-existent: ‘It is a textual universe of unfathomed extent and variety, but it can in no way be considered a representative sample of language use in general’ (Leech 2007: 145).

⭕ Caveats/Lookup

Highlight ( p. 42)

It is a basic assumption of corpus linguistics that all the language included in a corpus is authentic, and certainly the most prominent feature of the web to have attracted the linguist’s attention is its undisputable nature as a reservoir of authentic, purposeful language behaviour.
generated text, influences due to algorithmic favouring, “content creation”/influencers, PERFORMATIVITY

Highlight ( p. 42)

there would be no reason to turn to the web as an object of linguistic study were it not for its being comprised of authentic texts

Highlight ( p. 43)

Owing to its nature as an unplanned, unsupervised, unedited collection of texts, authenticity in the web is in fact often related to problems of ‘authoritativeness’.

Highlight ( p. 43)

in the words of Leech, it might seem that ‘the web as corpus makes the notion of a representative corpus redundant’ (2007: 144).