OANC: Open American National Corpus
The OANC-MASC Corpus
The Open American National Corpus (OANC) and its subcorpus, the Manually Annotated Sub-Corpus (MASC), are text corpora of American English. Texts in the corpus include all genres and transcripts of spoken data produced from 1990 onward. The whole corpus comprises 11 million words.
The MASC subcorpus consists of 480k words with manually validated annotations for sentence boundaries, tokens, lemmas, POS, noun and verb chunks, named entities (person, location, organization, date), coreference, and discourse structure.
The OANC-MASC corpus contains merged data from the OANC and MASC corpora. Because MASC is a sub-corpus of OANC, the MASC part of OANC was replaced by the MASC data in the resulting OANC-MASC corpus to remove duplicated documents.
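The replacement step described above can be illustrated with a minimal Python sketch. It assumes that each document carries an identifier shared between OANC and MASC, which this page does not state; the function name merge_corpora and the document IDs are hypothetical and serve only as an illustration, not as the procedure actually used to build the corpus.

```python
def merge_corpora(oanc_docs, masc_docs):
    """Merge two {doc_id: text} mappings, preferring the MASC version.

    Assumption: both corpora use the same document identifiers.
    """
    merged = dict(oanc_docs)   # start with every OANC document
    merged.update(masc_docs)   # a MASC entry overwrites the OANC entry with the same ID
    return merged


# Hypothetical example data
oanc = {"doc-a": "OANC version of A", "doc-b": "OANC version of B"}
masc = {"doc-b": "MASC version of B (manually validated)"}

merged = merge_corpora(oanc, masc)
```

Keying the merge on the document identifier keeps exactly one copy of each text, so a document present in both sources appears only once, in its manually annotated form.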
The OANC-MASC corpus has two separate parts: OANC-MASC Written and OANC-MASC Spoken.
For more information, visit http://www.anc.org
The enTenTen English corpora were tagged by TreeTagger using the Penn Treebank tagset.