The OANC-MASC Corpus

The Open American National Corpus (OANC) has a sub-corpus, the Manually Annotated Sub-Corpus (MASC). Beyond what OANC provides, MASC includes manually validated annotations for sentence boundaries, tokens, lemmas, POS, noun and verb chunks, named entities (person, location, organization, date), Penn Treebank syntax, coreference and discourse structure.

The OANC-MASC corpus merges data from the OANC and MASC corpora. Because MASC is a sub-corpus of OANC, the MASC portion of OANC was replaced by the MASC data in the merged corpus to avoid duplicate documents.
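The replacement step can be pictured with a minimal Python sketch. It is an illustration only, not the procedure actually used to build the corpus: it assumes each document has an identifier shared by both corpora, and the function name, record layout and sample identifiers are hypothetical.

```python
def merge_corpora(oanc_docs, masc_docs):
    """Merge two {doc_id: record} mappings, preferring the MASC copy.

    Assumption: a document present in both corpora has the same id in each.
    """
    # Start from every OANC document, tagged with its source corpus.
    merged = {doc_id: dict(rec, corp="OANC") for doc_id, rec in oanc_docs.items()}
    # A MASC document overwrites the OANC copy with the same id, so no duplicate remains.
    for doc_id, rec in masc_docs.items():
        merged[doc_id] = dict(rec, corp="MASC")
    return merged


# Hypothetical sample data: "d2" exists in both corpora.
oanc = {"d1": {"text": "raw one"}, "d2": {"text": "raw two"}}
masc = {"d2": {"text": "annotated two"}}
merged = merge_corpora(oanc, masc)
```

In this sketch the document "d2" appears once in the result, taken from MASC, while "d1" keeps its OANC copy.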

The OANC-MASC corpus has two separate parts: the OANC-MASC Written part and the OANC-MASC Spoken part.

Using the doc.corp metadata attribute, the user can specify which corpus part (OANC or MASC) will be searched. The doc.domain attribute specifies the text domain.
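As a rough illustration of restricting a search by these attributes, the Python sketch below filters in-memory document records. It does not use any corpus query tool; the record layout, the function name and the domain values ("fiction", "letters") are hypothetical, while the attribute names doc.corp and doc.domain and the values OANC and MASC come from the description above.

```python
def select_docs(docs, corp=None, domain=None):
    """Return the documents whose metadata match the given filters.

    A filter left as None is not applied.
    """
    return [
        d for d in docs
        if (corp is None or d["doc.corp"] == corp)
        and (domain is None or d["doc.domain"] == domain)
    ]


# Hypothetical records; the domain values are invented for illustration.
docs = [
    {"id": "a", "doc.corp": "OANC", "doc.domain": "fiction"},
    {"id": "b", "doc.corp": "MASC", "doc.domain": "fiction"},
    {"id": "c", "doc.corp": "MASC", "doc.domain": "letters"},
]

masc_only = select_docs(docs, corp="MASC")
masc_fiction = select_docs(docs, corp="MASC", domain="fiction")
```

Here masc_only holds the two MASC records, and masc_fiction narrows them to the single record whose domain also matches.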

