SiBol: Corpus of English broadsheet newspapers 1993–2013

The English-language newspaper corpus (SiBol) is made up of articles collected from various English-language newspapers published between 1993 and 2013. The corpus contains around 650 million words in 1.5 million articles from 14 newspapers. The initial version of the corpus, containing UK broadsheets, was created in 2011. It was extended in 2017 to include UK tabloids and newspapers from other countries and regions, including India, the USA, Hong Kong, Nigeria and the Arab world. Corpus searches can be restricted to a specific year, newspaper, author or date.

Part-of-speech tagset

The SiBol corpus was annotated by the TreeTagger tool using the Penn Treebank tagset with Sketch Engine modifications.

Authors

The SiBol corpus was compiled by a small team of linguistics researchers at the Universities of Siena and Bologna.

Content

The graphs show the distribution of corpus texts by year of publication and by newspaper title.

Articles by year of publication

Articles by newspaper title

Tools to work with the SiBol corpus

A complete set of Sketch Engine tools is available to work with this broadsheet newspapers corpus to generate:

  • word sketch – English collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
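As a rough illustration of what three of these outputs contain, the sketch below builds a word list, an n-gram frequency list and a keyword-in-context concordance from a toy sentence. It is not how Sketch Engine computes them: the function names, the whitespace tokenizer and the sample sentence are all assumptions made for this example.

```python
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenization (an assumption for this sketch;
    # a real corpus pipeline also handles punctuation, lemmas and tags).
    return text.lower().split()


def ngram_counts(tokens, n):
    # Count every run of n consecutive tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    # Return (left context, keyword, right context) for each occurrence.
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text, not taken from the corpus.
sample = "the prime minister said the prime minister would resign"
tokens = tokenize(sample)

word_list = Counter(tokens)        # word list: frequency of each word form
bigrams = ngram_counts(tokens, 2)  # n-grams: frequency of 2-word units
kwic = concordance(tokens, "minister", width=2)  # concordance lines

for left, key, right in kwic:
    print(f"{left:>12} [{key}] {right}")
```

Word sketches and the thesaurus depend on grammatical annotation and statistical scoring, so they are left out of this sketch.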

Changelog

version 2.1 (10 July 2017)

  • data added – 768,687 articles from 13 newspapers, 9 of which are new to the corpus
  • the 9 new newspapers are: Daily Mirror, Daily Mail, The New York Times, Washington Post, This Day Lagos, Times of India, Gulf News, Daily News Egypt and South China Morning Post
  • corpus reprocessed with the new English processing pipeline; its format is now compatible with current user corpora

version 1.1 (1 Dec 2011)

  • recompiled and installed on the production server

version 1.1 (9 Nov 2011)

  • changed deduplication settings to “-n 7 -m” – 385 million tokens in 787,000 newspaper articles
  • set name to “SiBol/Port” to better reflect the data collections included

version 1 (31 October 2011)

  • initial version – 332 million tokens in 643,000 newspaper articles
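The size figures quoted on this page imply an average article length, which the short calculation below works out. It uses only the rounded totals stated here, so the results are approximate. Note that the changelog counts tokens while the corpus description counts words, so the rows are not strictly comparable. The dictionary labels and the function name are chosen for this example.

```python
# (total size, number of articles) as quoted on this page; all values rounded.
versions = {
    "version 1 (tokens)": (332_000_000, 643_000),
    "version 1.1 (tokens)": (385_000_000, 787_000),
    "current corpus (words)": (650_000_000, 1_500_000),
}


def average_length(total, articles):
    # Mean number of tokens (or words) per article.
    return total / articles


averages = {name: average_length(t, a) for name, (t, a) in versions.items()}

for name, avg in averages.items():
    print(f"{name}: about {avg:.0f} per article")
```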

Search the SiBol corpus

Sketch Engine offers a range of tools to work with the English broadsheet newspapers corpus.
