Friday, 8 March 2013

Lecture 8 on Corpus Linguistics


In linguistics, a corpus (plural corpora) is a large and structured set of texts (now usually electronically stored and processed). A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora (Webster’s Online Dictionary). A corpus can be defined as a collection of texts assumed to be representative of a given language, put together so that it can be used for linguistic analysis. Usually the assumption is that the language stored in a corpus is naturally occurring, that it is gathered according to explicit design criteria, with a specific purpose in mind, and with a claim to represent larger chunks of language selected according to a specific typology. In general there is consensus that a corpus deals with natural, authentic language (Tognini-Bonelli, Corpus linguistics at work, 2001:2).
A corpus is a collection of texts, designed for some purpose, usually teaching or research. A corpus is not something that a speaker does or knows, but something constructed by a researcher. It is a record of performance, usually of many different users, and designed to be studied, so that we can make inferences about typical language use. Because it provides methods of observing patterns of a type which have long been sensed by literary critics, but which have not been identified empirically, the computer-assisted study of large corpora can perhaps suggest a way out of the paradoxes of dualism. (Stubbs, Words and Phrases, 2002:239-40).
The expression Corpus Linguistics first appeared in the early 1980s. Corpus Linguistics is the study of linguistic phenomena through the analysis of data obtained from a corpus. It is now seen as the study of linguistic phenomena through large collections of machine-readable texts, which are called corpora. Since corpus linguistics involves the use of large corpora that consist of millions or sometimes even billions of words, it relies heavily on computers to determine what rules govern the language and what patterns (grammatical or lexical, for instance) occur. Thus it is not surprising that corpus linguistics emerged in its modern form only after the computer revolution in the 1980s. The Brown Corpus, the first modern and electronically readable corpus, however, was created by Henry Kucera and W. Nelson Francis as early as the 1960s.
Corpus Linguistics gives access to naturalistic linguistic information. As mentioned before, corpora consist of “real world” texts which are mostly a product of real-life situations. This makes corpora a valuable research source for dialectology, sociolinguistics and stylistics. It also facilitates linguistic research. Electronically readable corpora have dramatically reduced the time needed to find particular words or phrases. Research that would take days or even years to complete manually can be done in a matter of seconds and with a high degree of accuracy. In addition, it enables the study of wider patterns and of the collocation of words. Before the advent of computers, corpus linguistics studied only single words and their frequency; modern technology has made it possible to study larger patterns and the words that habitually occur together. Furthermore, corpus linguistics software and analytical tools allow researchers to analyze a larger number of parameters simultaneously. Finally, many corpora are enriched with additional linguistic information in the form of annotation.
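To make the ideas of word frequency, collocation and searching for a word in context more concrete, here is a minimal sketch in Python. It is only an illustration and not how any particular corpus tool works: the three-sentence toy corpus and the function names (tokenize, word_frequencies, bigram_frequencies, concordance) are invented for this example, and counting adjacent word pairs is a very crude stand-in for the statistical collocation measures used in real studies.

```python
import re
from collections import Counter

# Hypothetical toy corpus; a real study would load millions of words from files.
CORPUS = [
    "The strong tea was served with strong coffee on the side.",
    "She made strong tea and he drank the tea quickly.",
    "Heavy rain fell while the strong tea cooled.",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts):
    """Count how often each word form occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def bigram_frequencies(texts):
    """Count adjacent word pairs within each text (a crude proxy for collocation)."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(zip(tokens, tokens[1:]))
    return counts


def concordance(texts, keyword, width=2):
    """Return keyword-in-context lines with `width` words on either side of each hit."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append(f"{left} [{tok}] {right}".strip())
    return lines


if __name__ == "__main__":
    print(word_frequencies(CORPUS).most_common(3))
    print(bigram_frequencies(CORPUS).most_common(3))
    for line in concordance(CORPUS, "tea"):
        print(line)
```

On a corpus of this size the counts could be done by hand, but the same functions run unchanged on a much larger list of texts, which is where the time savings described above come from.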
Corpus Linguistics also facilitates the study of a second language. Studying a second language through natural language allows students to get a better “feeling” for the language and to learn it as it is used in real rather than “invented” situations. Corpus linguistics studies language by using randomly or systematically selected corpora. These typically consist of a large number of naturally occurring texts; however, they do not represent the entire language. Linguistic analyses that use the methods and tools of corpus linguistics are therefore limited in the same way and cannot claim to describe the language as a whole.
