PICAE: Pearson International Corpus of Academic English

The Pearson International Corpus of Academic English (PICAE) is a language corpus made up of texts collected from the Internet. PICAE comprises over 37 million words, of which 13% is spoken and 87% is written material, covering American, Australian, British, Canadian and New Zealand English. The texts span a wide range of academic subjects across the four main academic disciplines: humanities, social science, natural & formal science, and professions & applied sciences. The corpus also includes lectures, seminars, textbooks and journal articles at undergraduate and postgraduate levels, as well as university administrative material, university magazines, TV and radio broadcasts, etc.

The data for the PICAE corpus were gathered from five different sources:

  • 19.6 million words from the World Wide Web
  • 12.1 million words from the Longman Higher Education textbooks
  • 0.7 million words from the Longman Spoken American Corpus
  • 4.4 million words from the British National Corpus
  • 0.4 million words of academic English from the American National Corpus
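As a quick consistency check, the source sizes listed above can be added up and converted to shares of the whole corpus. The sketch below uses only the figures from the list; the dictionary and function names are illustrative and not part of any Sketch Engine tooling.

```python
# Word counts in millions, copied from the list of sources above.
SOURCES = {
    "World Wide Web": 19.6,
    "Longman Higher Education textbooks": 12.1,
    "Longman Spoken American Corpus": 0.7,
    "British National Corpus": 4.4,
    "American National Corpus": 0.4,
}


def source_shares(sources):
    """Return the total size and each source's percentage of that total."""
    total = sum(sources.values())
    shares = {name: 100 * size / total for name, size in sources.items()}
    return total, shares


total, shares = source_shares(SOURCES)
print(f"total: {total:.1f} million words")
# List the sources from largest to smallest share.
for name, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {pct:.1f}%")
```

The total comes to roughly 37.2 million words, which agrees with the "over 37 million words" stated in the overview.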

The material from the British National Corpus was taken from its academic sections, comprising 56 articles from 13 different academic disciplines (e.g., literature, art, chemistry) published between 1975 and 1993.

The corpus was launched at IATEFL 2009. A full report is available at http://pearsonpte.com/wp-content/uploads/2014/07/RS_PICAE_2010.pdf

Part-of-speech tagset

The PICAE corpus is POS tagged by TreeTagger using the Penn Treebank tagset.
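To illustrate what Penn Treebank tags look like in practice, here is a minimal sketch that reads tagged text and counts the tags. It assumes a three-column, tab-separated layout (token, tag, lemma); the sample sentence and this layout are assumptions made for illustration, and the way PICAE is actually stored is not described here.

```python
from collections import Counter

# Hypothetical sample: one token per line as token<TAB>tag<TAB>lemma.
# Tags are Penn Treebank labels: DT determiner, NN/NNS noun, IN preposition.
SAMPLE = (
    "The\tDT\tthe\n"
    "results\tNNS\tresult\n"
    "of\tIN\tof\n"
    "the\tDT\tthe\n"
    "study\tNN\tstudy"
)


def parse_tagged(text):
    """Split tagged text into (token, tag, lemma) tuples, skipping blank lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        token, tag, lemma = line.split("\t")
        rows.append((token, tag, lemma))
    return rows


rows = parse_tagged(SAMPLE)
tag_counts = Counter(tag for _, tag, _ in rows)
# Penn Treebank noun tags all start with "NN".
nouns = [lemma for _, tag, lemma in rows if tag.startswith("NN")]
print(tag_counts)
print(nouns)
```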

Access policy

To obtain authorisation from Pearson to access the corpus:

  1. Contact Veronica Benigno (veronica.benigno@pearson.com), providing a brief description of your research and stating your academic affiliation.
  2. Then get in touch with Sketch Engine at support@sketchengine.co.uk, who will update your account permissions accordingly.

Tools to work with the PICAE corpus

A complete set of Sketch Engine tools is available for working with this academic English corpus. The tools generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
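For readers unfamiliar with these outputs, the sketch below shows in plain Python what an n-gram frequency list and a concordance contain. It is only an illustration on a made-up sentence and does not reflect how Sketch Engine computes them; the function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Made-up example text, split on whitespace for simplicity.
tokens = "the results of the study support the results of the survey".split()

bigrams = ngram_counts(tokens, 2)
print(bigrams.most_common(3))

hits = concordance(tokens, "results")
for left, keyword, right in hits:
    print(f"{left:>20} [{keyword}] {right}")
```

Real corpus tools work over tokenized, tagged and lemmatized text rather than whitespace-split strings, so counts can also be grouped by lemma or part of speech.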

Changelog

version 3 (March 2017)

  • corpus tagged by the RFTagger tool with the NKJP tagset
  • created lempos

version 2 (1 July 2013)

  • corpus tagged by the WCRFT tagger

version 1 (23 July 2012)

  • initial version – 7.7 billion words, untagged

Bibliography

Ackermann, K., De Jong, J. H. A. L., Kilgarriff, A., & Tugwell, D. (2011). The Pearson International Corpus of Academic English (PICAE). In Proceedings of Corpus Linguistics.
