SiBol: Corpus of English broadsheet newspapers 1993–2013
The English-language newspapers corpus (SiBol) is a language corpus made up of articles collected from various English-language newspapers published between 1993 and 2013. The corpus contains around 650 million words in 1.5 million articles from 14 newspapers. The initial version of the corpus, containing UK broadsheets, was created in 2011. It was extended in 2017 to include UK tabloids and newspapers from other countries and regions, including India, the USA, Hong Kong, Nigeria and the Arab world. Corpus searches can be restricted to a specific year, newspaper, author or date.
The SiBol corpus was annotated with the TreeTagger tool, using the Penn Treebank tagset with Sketch Engine modifications.
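As a rough illustration of what part-of-speech annotation of this kind looks like, the sketch below parses text in the three-column, tab-separated layout that TreeTagger normally prints (token, tag, lemma) and counts the tags. The sample lines are invented for illustration and are not taken from SiBol, the function names are hypothetical, and the format in which the corpus itself is stored or exported is not described here, so treat this as an assumption rather than a description of the actual data.

```python
from collections import Counter


def parse_tagged_lines(lines):
    """Parse 'token<TAB>tag<TAB>lemma' lines into (token, tag, lemma) tuples."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and lines that look like structural markup (<s>, </s>).
        # Simplification: a literal "<" token would also be skipped here.
        if not line or line.startswith("<"):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # ignore lines that do not have exactly three columns
        rows.append(tuple(parts))
    return rows


def tag_counts(rows):
    """Count how often each part-of-speech tag occurs."""
    return Counter(tag for _, tag, _ in rows)


# Invented sample in the assumed layout (not real corpus data).
sample = [
    "<s>",
    "The\tDT\tthe",
    "daily\tJJ\tdaily",
    "newspapers\tNNS\tnewspaper",
    "</s>",
]

rows = parse_tagged_lines(sample)
counts = tag_counts(rows)
print(rows)
print(counts)
```

Here DT, JJ and NNS are standard Penn Treebank tags for determiner, adjective and plural noun. The Sketch Engine modifications to the tagset are not covered by this sketch.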