The SiBol/Port (Siena-Bologna, Portsmouth) corpus is a corpus of English broadsheet newspapers.
The corpus consists of 1,565,905 English newspaper articles from the years 1993, 2005, 2010 and 2013.
Newspapers included: The Times, The Guardian, The Daily Telegraph, The Sunday Times, The Sunday Telegraph, Daily Mirror, Daily Mail, The New York Times, Washington Post, This Day Lagos, Times of India, Gulf News, Daily News Egypt and South China Morning Post.
The authors of the text collection are Alan Scott Partington (Bologna University), John Morley (Siena University), Anna Marchi (Lancaster University) and Charlotte Taylor (University of Sussex). See the SiBol group on Facebook.
- raw textual data parsed into a document structure
- tokenized using unitok with the English model
- cleaned by removing duplicate documents using onion
- tagged by TreeTagger using the Penn Treebank tagset and the English parameter file (UTF-8)
- compiled in the Sketch Engine using the English sketch grammar for word sketches
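As a rough illustration of the deduplication step listed above, the Python sketch below drops documents whose word n-grams mostly repeat those of earlier documents. It is not onion's actual implementation: the function names, the n-gram length of 7 and the 0.5 threshold are assumptions chosen for the example only.

```python
def shingles(tokens, n=7):
    """Return the set of word n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(docs, n=7, threshold=0.5):
    """Keep a document unless too many of its n-grams were already seen.

    docs:      list of token lists, in corpus order.
    threshold: hypothetical cut-off; a document is dropped when the share
               of its n-grams seen in earlier kept documents reaches it.
    """
    seen = set()
    kept = []
    for tokens in docs:
        grams = shingles(tokens, n)
        if grams:
            dup_ratio = len(grams & seen) / len(grams)
        else:
            dup_ratio = 0.0  # shorter than n tokens: nothing to compare, keep it
        if dup_ratio < threshold:
            kept.append(tokens)
            seen |= grams
    return kept
```

For example, passing the same tokenized article twice followed by an unrelated one returns only the first copy and the unrelated article.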
(1 Dec 2011)
- recompiled and installed on the production server
v1.1 (9 Nov 2011)
- changed deduplication settings to “-n 7 -m” – 385 million tokens in 787,000 newspaper articles
- set name to “SiBol/Port” to better reflect the data collections included
v1.0 (31 October 2011)
- initial version – 332 million tokens in 643,000 newspaper articles
v2.1 (10 July 2017)
- data added – 768,687 articles from 13 newspapers, 9 of which are new to the corpus
- the 9 new newspapers are: Daily Mirror, Daily Mail, The New York Times, Washington Post, This Day Lagos, Times of India, Gulf News, Daily News Egypt and South China Morning Post
- corpus updated using the new English processing pipeline; the format is now compatible with current user corpora