NCI: New Corpus for Ireland

The New Corpus for Ireland (NCI) is a language corpus developed as part of the set-up phase of a project for a new English-to-Irish Dictionary (NEID). The project is under the direction of Foras na Gaeilge, a public body responsible for the promotion of the Irish language.

The corpus was collected in three main ways:

  • incorporating existing corpora
  • contacting publishers, authors, newspaper companies, etc. to request permission to use their texts
  • collecting data from the web.

In Sketch Engine, the project is composed of two separate corpora:

  • 30-million-word corpus of Irish
  • 200-million-word corpus of English including Hiberno-English (the variety of English that is spoken in Ireland)

Part-of-speech tagset

The Irish part of the NCI was processed with the morphological analyzer/generator for Irish (Uí Dhonnchadha) and annotated with its POS tagset. The English part of the NCI was tagged by TreeTagger using the Penn Treebank tagset.

Tools to work with the New Corpus for Ireland

A complete set of Sketch Engine tools is available for working with the NCI corpus. They generate:

  • word sketch – Irish collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Irish nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • keywords – terminology extraction of one-word units
  • text type analysis – statistics of metadata in the corpus
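
As a rough illustration of what an n-gram frequency list contains, the sketch below counts contiguous word sequences in a tokenized sentence. It is a minimal Python example, not Sketch Engine's implementation: the function name ngram_frequencies and the toy sentence are assumptions made for illustration, and the sentence is not drawn from the NCI.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous n-token sequence in a list of tokens."""
    # zip over n shifted copies of the token list yields each n-gram once.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Toy Irish sentence split on whitespace (hypothetical, not NCI data).
tokens = "tá an lá go maith agus tá an oíche go maith".split()

bigrams = ngram_frequencies(tokens, 2)

# Print the most frequent two-word units with their counts.
for gram, count in bigrams.most_common(3):
    print(count, gram)
```

Sorting the counts in descending order gives the kind of frequency-ranked list described above; a real corpus tool would also handle tokenization, lemmatization and filtering by part of speech.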

Kilgarriff, Adam, Michael Rundell, and Elaine Uí Dhonnchadha. Efficient corpus development for lexicography: building the New Corpus for Ireland. Language Resources and Evaluation 40.2 (2006): 127-152.

