COMPAS is a corpus of about 100 million words, compiled and completed in early 2014. It comprises immigration-related content published in daily newspapers from 2006 to 2013 (both years inclusive). The documents in the corpus contain the following meta fields:
- date – In the form of yyyy-mm-dd
- publication – Name of the publication from which the text is taken
- title – Title of the article
- month – Month in which the article was published
- language – English (this is the case for all articles)
- year – Year in which the article was published
- quarter – Quarter of the year in which the article was published, represented as q1, q2, q3 or q4
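The relationship between the date field and the year, month and quarter fields can be illustrated with a minimal sketch. It assumes that these three fields are derived from the yyyy-mm-dd date and that month is stored as a number; neither point is stated in the corpus description, and the function name `derive_meta` is hypothetical.

```python
from datetime import date


def derive_meta(date_str: str) -> dict:
    """Derive year, month and quarter values from a yyyy-mm-dd date string.

    Assumption: the corpus fields are computed from the date field, and
    month is numeric. Only the q1..q4 quarter labels come from the
    corpus description itself.
    """
    d = date.fromisoformat(date_str)  # raises ValueError on malformed input
    quarter = (d.month - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, and so on
    return {
        "date": date_str,
        "year": d.year,
        "month": d.month,
        "quarter": f"q{quarter}",
    }


if __name__ == "__main__":
    print(derive_meta("2013-11-05"))
```

A helper of this kind may be useful for checking that a subcorpus restricted by quarter covers the months that were intended.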
To support searches by lemma and part of speech, the corpus has been annotated with lemmas and POS-tags.
Tagset information – Penn Treebank tagset
Sketch Grammar – English PennTB-TreeTagger 2.5
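As an illustration of searching by lemma and part of speech, the sketch below builds a single-token query in CQL-style syntax. It assumes the corpus is queried through a CQL-compatible interface and that the positional attributes are named `lemma` and `tag`; the attribute names and the helper `cql_token` are assumptions, not taken from the corpus description.

```python
from typing import Optional


def cql_token(lemma: Optional[str] = None, tag: Optional[str] = None) -> str:
    """Build one CQL-style token expression from a lemma and/or a POS tag.

    Assumption: the attributes are called "lemma" and "tag". The tag value
    is treated as a regular expression over Penn Treebank tags,
    e.g. "NN.*" for all noun tags.
    """
    conditions = []
    if lemma is not None:
        conditions.append(f'lemma="{lemma}"')
    if tag is not None:
        conditions.append(f'tag="{tag}"')
    # With no conditions this yields "[]", which matches any token in CQL.
    return "[" + " & ".join(conditions) + "]"


if __name__ == "__main__":
    # All noun forms of the lemma "immigrant" (NN, NNS, ...)
    print(cql_token(lemma="immigrant", tag="NN.*"))
```

Under these assumptions, the example prints `[lemma="immigrant" & tag="NN.*"]`, which would match both "immigrant" and "immigrants" when they are tagged as nouns.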