CLUE Department of Computational Linguistics at the University of Erlangen

Computer Corpora and Corpus-Based Linguistics

On-line corpora are large quantities of natural text stored on computers to serve as the object of linguistic research. `Natural' means that anything that has actually been uttered, whether written or spoken, may be included.

Corpus-based linguistics avoids the use of invented examples in favour of real language, as collected in computer corpora.
Before using a corpus, a linguist must ensure that the corpus is representative with respect to the linguistic phenomenon under investigation.

I would like to avoid the term `corpus linguistics' and rather speak of two disciplines: on the one hand, computer corpora are resources established by natural language engineering; on the other hand, corpus-based linguistics uses these resources to establish (hopefully) empirically valid linguistic theories. This terminological division reflects the fact that the former discipline is technological and the latter scientific.

Corpus Holdings

Projects

Other Useful Links


Jochen Leidner, 1998-04-29