In linguistics, a corpus (plural corpora) is
a large and structured set of texts (now usually electronically stored and
processed). A corpus may contain texts in a single language (monolingual
corpus) or text data in multiple languages (multilingual corpus). Multilingual
corpora that have been specially formatted for side-by-side comparison are called
aligned parallel corpora (Webster’s Online Dictionary). A corpus can be defined
as a collection of texts assumed to be representative of a given language put
together so that it can be used for linguistic analysis. Usually the assumption
is that the language stored in a corpus is naturally occurring, that it is
gathered according to explicit design criteria, with a specific purpose in
mind, and with a claim to represent larger chunks of language selected
according to a specific typology. In general there is consensus that a corpus
deals with natural, authentic language. (Tognini-Bonelli, Corpus linguistics at
work, 2001:2)
A corpus is a collection of texts, designed
for some purpose, usually teaching or research. A corpus is not something that
a speaker does or knows, but something constructed by a researcher. It is a
record of performance, usually of many different users, and designed to be
studied, so that we can make inferences about typical language use. Because it
provides methods of observing patterns of a type which have long been sensed by
literary critics, but which have not been identified empirically, the computer-assisted
study of large corpora can perhaps suggest a way out of the paradoxes of
dualism. (Stubbs, Words and Phrases, 2002:239-40).
The expression Corpus Linguistics first
appeared in the early 1980s. Corpus linguistics is the study of linguistic
phenomena through the analysis of data obtained from a corpus, and is now
seen as the study of such phenomena through large collections of
machine-readable texts, which are called corpora. Since corpus linguistics
involves the use of large corpora that consist of millions or sometimes even
billions of words, it relies heavily on computers to determine what rules
govern the language and what patterns (grammatical or lexical, for instance)
occur. Thus it is not surprising that corpus linguistics emerged in its modern
form only after the computer revolution of the 1980s. The Brown Corpus, the
first modern, electronically readable corpus, however, was created by Henry
Kucera and W. Nelson Francis as early as the 1960s.
Corpus linguistics gives access to
naturalistic linguistic information. As mentioned before, corpora consist of
“real world” texts, which are mostly a product of real-life situations. This
makes corpora a valuable research source for dialectology, sociolinguistics
and stylistics. Corpus linguistics also facilitates linguistic research.
Electronically readable corpora have dramatically reduced the time needed to
find particular words or phrases: research that would take days or even years
to complete manually can be done in a matter of seconds with a high degree of
accuracy. It also enables the study of wider patterns and the collocation of
words. Before the advent of computers, corpus linguistics studied only single
words and their frequency; modern technology made it possible to study wider
patterns and collocations. Furthermore, it allows multiple parameters to be
analyzed at the same time, since various corpus linguistics software programs
and analytical tools let researchers examine a larger number of parameters
simultaneously. In addition, many corpora are enriched with linguistic
information in the form of annotation.
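The word searches, frequency counts and collocation studies described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the three sentences in TEXTS are invented for the example, the tokenizer is deliberately naive, and the function names are hypothetical rather than taken from any particular corpus tool.

```python
import re
from collections import Counter

# A tiny invented "corpus"; a real one would hold millions of words.
TEXTS = [
    "Strong tea is served with milk.",
    "She likes strong tea in the morning.",
    "He drinks strong coffee, not tea.",
]

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(texts):
    """Count how often each word form occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

def concordance(texts, keyword, width=2):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def collocates(texts, keyword, window=1):
    """Count the words found within `window` tokens of the keyword."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                counts.update(neighbours)
    return counts

if __name__ == "__main__":
    print(word_frequencies(TEXTS).most_common(3))
    for line in concordance(TEXTS, "tea"):
        print(line)
    print(collocates(TEXTS, "tea").most_common(3))
```

Raw co-occurrence counts like these are only a starting point. Corpus tools usually rank collocates with an association measure and work on annotated text, both of which this sketch leaves out.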
Corpus linguistics facilitates the study of
a second language. Studying a second language through naturally occurring
language allows students to get a better “feeling” for the language and to
learn it as it is used in real rather than “invented” situations.
Corpus linguistics studies language by using randomly or systematically
selected corpora. These typically consist of a large number of naturally
occurring texts; however, they do not represent the entire language. Linguistic
analyses that use the methods and tools of corpus linguistics thus do not
represent the entire language either.