The SiBol/Port (Siena-Bologna, Portsmouth) corpus is a corpus of English broadsheet newspapers.
The corpus consists of 787,000 English newspaper articles from the years 1993, 2005 and 2010. The newspapers included are The Times, The Guardian, The Daily Telegraph, The Sunday Times and The Sunday Telegraph.
The authors of the text collection are Alan Scott Partington (Bologna University), John Morley (Siena University), Anna Marchi (Lancaster University) and Charlotte Taylor (University of Sussex). See the SiBol group on Facebook.
- raw textual data parsed into a document structure
- tokenized using unitok with the English model
- cleaned by removing duplicate documents using onion
- tagged by TreeTagger using the Penn Treebank tagset and the English parameter file (UTF-8)
- compiled in the Sketch Engine using the English sketch grammar for word sketches
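
The deduplication step can be pictured with a small sketch. This is not onion's actual implementation; it is a simplified illustration of n-gram-based duplicate removal, assuming documents are already tokenized. The function names `shingles` and `deduplicate` and the 0.5 threshold are hypothetical, and the n-gram length of 7 mirrors the "-n 7" setting in the changelog on the assumption that "-n" denotes the n-gram length.

```python
from typing import List, Set, Tuple


def shingles(tokens: List[str], n: int = 7) -> Set[Tuple[str, ...]]:
    """Return the set of n-token sequences (n-grams) found in a document."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(docs: List[List[str]], n: int = 7,
                threshold: float = 0.5) -> List[List[str]]:
    """Keep a document unless more than `threshold` of its n-grams
    already occurred in documents kept earlier (illustrative only)."""
    seen: Set[Tuple[str, ...]] = set()
    kept: List[List[str]] = []
    for tokens in docs:
        grams = shingles(tokens, n)
        # Documents shorter than n tokens have no n-grams and are always kept.
        ratio = len(grams & seen) / len(grams) if grams else 0.0
        if ratio > threshold:
            continue  # treated as a duplicate of earlier material
        kept.append(tokens)
        seen |= grams
    return kept
```

With these assumptions, an article that repeats an earlier one verbatim is dropped, while an unrelated article is kept. A larger n makes matching stricter, so fewer documents count as duplicates.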
(1 Dec 2011)
- recompiled and installed on the production server
v1.1 (9 Nov 2011)
- changed deduplication settings to “-n 7 -m” – 385 million tokens in 787,000 newspaper articles
- set name to “SiBol/Port” to better reflect the data collections included
v1.0 (31 Oct 2011)
- initial version – 332 million tokens in 643,000 newspaper articles