ANW: Algemeen Nederlands Woordenboek corpus

The Algemeen Nederlands Woordenboek (ANW) corpus is a balanced Dutch corpus of just over 100 million words, made up of texts from various domains. It was compiled at the Institute for Dutch Lexicology (INL) and completed in 2004.

The ANW corpus comprises:

present-day literary texts (20%)
texts containing neologisms (5%)
texts of various domains in the Netherlands and Flanders (32%)
newspaper texts (40%)

The remaining 3% is the ‘Pluscorpus’, which consists of texts downloaded from the internet containing words that were present in an INL word list but absent from a first version of the corpus.

To support searches by lemma and part of speech, the corpus has been annotated with lemmas and POS tags using the technology originally developed for the Dutch PAROLE corpus (Does, Van der Voort van der Kleij 2002): a combination of statistical taggers including TnT3 and three taggers developed at the INL. Lemmatisation was a deterministic procedure, based on an extensive lexicon developed within INL.
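As a rough illustration of what deterministic, lexicon-based lemmatisation can look like, here is a minimal Python sketch. It is not the INL implementation: the lexicon entries, the tag names (NOUN, VERB), the choice to key the lookup on word form plus POS tag, and the fallback to the lowercased form are all assumptions made for the example.

```python
from typing import Dict, Tuple

# Hypothetical mini-lexicon: (lowercased word form, POS tag) -> lemma.
# The tag names are placeholders, not the actual ANW tagset.
LEXICON: Dict[Tuple[str, str], str] = {
    ("liep", "VERB"): "lopen",
    ("huizen", "NOUN"): "huis",
    ("boeken", "NOUN"): "boek",
    ("boeken", "VERB"): "boeken",
}


def lemmatise(form: str, pos: str) -> str:
    """Look up the lemma for a tagged token.

    The same input always gives the same output (no statistics involved).
    Unknown (form, pos) pairs fall back to the lowercased form, which is
    an assumption of this sketch.
    """
    return LEXICON.get((form.lower(), pos), form.lower())


# Tokens as they might come out of a POS tagger: (word form, tag).
tagged = [("Boeken", "NOUN"), ("liep", "VERB"), ("fiets", "NOUN")]
lemmas = [lemmatise(form, pos) for form, pos in tagged]
print(lemmas)
```

Keying on the POS tag as well as the form lets an ambiguous form such as "boeken" map to "boek" as a noun and "boeken" as a verb, which is one reason tagging would precede lemmatisation in a pipeline like this.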

More information about the corpus is available here (in Dutch).

Part-of-speech tagset

The ANW corpus was tagged using the following POS tagset.

Access policy

Access to this corpus is restricted to employees of the Institute for Dutch Lexicology (INL).

Available tools

A complete set of tools is available to work with this Dutch corpus to generate:

word sketch – Dutch collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
word lists – lists of Dutch nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context

Bibliography

Tiberius, Carole and Adam Kilgarriff (2009). The Sketch Engine for Dutch with the ANW corpus. In E. Beijk et al. (eds.). Fons Verborum: Feestbundel Fons Moerdijk. Amsterdam: Gopher BV., pp. 237–255.

Schoonheim, Tanneke and Rob Tempelaars (2010). Dutch Lexicography in Progress, The Algemeen Nederlands Woordenboek (ANW). In Anne Dykstra and Tanneke Schoonheim (eds.), Proceedings of the XIV Euralex International Congress. Ljouwert, Fryske Akademy/Afûk, abstract, pp. 718–725.
