Computational Linguistics. Corpora. Data

ACL Anthology
Maintained by the Association for Computational Linguistics, the Anthology hosts thousands of papers on computational linguistics and natural language processing
BYU Law & Corpus Linguistics
Brigham Young University projects that offer a number of specialist corpora, primarily covering US legal material
British National Corpus
100-million-word collection of written and spoken modern British English representing a "unique snapshot of the English language". The data was collected between 1991 and 1994 and covers a wide variety of written (90%) and spoken (10%) language.
Corpus of Contemporary American English (COCA)
The corpus contains more than one billion words of text (1990 onwards) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, and (since the March 2020 update) TV and movie subtitles, blogs, and other web pages
English-Corpora.org
Useful links to some major English-language corpora
Essex Corpus Linguistics Collective
Provides information about corpora at Essex, and links to corpus tools and corpus resources.
Linguistic Data Consortium
Vast resource of corpora, data, software & research papers hosted by the University of Pennsylvania. Essex has purchased access to a very limited number of corpora
OPUS Open Parallel Corpora
EU collaborative project to promote availability of open parallel corpora resources
Positive Lexicography Project
An index of words from different languages relating to happiness
Re3data
Global registry of research data repositories, all subject areas, including linguistics
WebCorpLive
Produced by the University of Liverpool, this site offers a suite of tools which allows access to the World Wide Web as a corpus. It can aid research on how particular words and phrases are used, especially those which are too new or too rare to appear in any dictionary or standard corpus.

Data in Specific Languages

Chinese Text Project
Open-access digital library of pre-modern Chinese texts. The site makes use of the digital medium to explore new ways of interacting with these texts that are not possible in print. With over thirty thousand titles and more than five billion characters, the Chinese Text Project is the largest database of pre-modern Chinese texts