The Algemeen Nederlands Woordenboek (ANW) corpus is a balanced corpus of just over 100 million words, compiled at the Institute of Dutch Lexicology (INL) and completed in 2004.

It comprises:

  • present-day literary texts (20%)
  • texts containing neologisms (5%)
  • texts of various domains in the Netherlands and Flanders (32%)
  • newspaper texts (40%)
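
As a quick check on the figures above, the four listed shares can be summed to see how much of the corpus they leave unaccounted for. The sketch below assumes a round total of 100 million words (the corpus is described only as "just over 100 million"), so the word counts are rough estimates rather than official figures.

```python
# Shares of the ANW corpus as listed above, in percent.
shares = {
    "present-day literary texts": 20,
    "texts containing neologisms": 5,
    "domain texts (Netherlands and Flanders)": 32,
    "newspaper texts": 40,
}

# Assumption: a round 100 million words; the actual total is slightly higher.
TOTAL_WORDS = 100_000_000

listed = sum(shares.values())   # percentage covered by the four categories
remainder = 100 - listed        # percentage left for the rest of the corpus

# Approximate word counts per category under the assumed total.
approx_words = {name: TOTAL_WORDS * pct // 100 for name, pct in shares.items()}
approx_words["remainder"] = TOTAL_WORDS * remainder // 100

for name, words in approx_words.items():
    print(f"{name}: ~{words:,} words")
```

The listed categories cover 97% of the corpus, which leaves roughly 3% (about 3 million words under the assumed total).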

The remainder (about 3%) is the ‘Pluscorpus’, which consists of texts downloaded from the internet that contain words present in an INL word list but absent from a first version of the corpus.

To support searches by lemma and part of speech, the corpus has been annotated with lemmas and POS tags using technology originally developed for the Dutch PAROLE corpus (Does, Van der Voort van der Kleij 2002): a combination of statistical taggers, including TnT and three taggers developed at the INL. Lemmatisation was a deterministic procedure based on an extensive lexicon developed within the INL.
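
To illustrate what a deterministic, lexicon-based lemmatisation step can look like, here is a minimal sketch. It is not the INL procedure: the lexicon entries, the tag names (`NOUN`, `VERB`), the `lemmatise` function, the use of the POS tag as part of the lookup key, and the fallback for unknown forms are all assumptions made for the example.

```python
# Hypothetical toy lexicon mapping (word form, POS tag) to a lemma.
# The tag names are illustrative only and are NOT the ANW tagset.
LEXICON = {
    ("liep", "VERB"): "lopen",
    ("lopen", "VERB"): "lopen",
    ("huizen", "NOUN"): "huis",
    ("boeken", "NOUN"): "boek",
    ("boeken", "VERB"): "boeken",
}


def lemmatise(form: str, pos: str) -> str:
    """Look up the lemma for a tagged word form.

    The lookup is deterministic: the same (form, tag) pair always gives
    the same lemma. Forms missing from the lexicon fall back to the
    lowercased form itself (an assumption of this sketch).
    """
    key = (form.lower(), pos)
    return LEXICON.get(key, form.lower())


# Tagged input, as it might come out of a POS tagger.
tagged = [("Huizen", "NOUN"), ("boeken", "VERB"), ("boeken", "NOUN"), ("fiets", "NOUN")]
lemmas = [lemmatise(form, pos) for form, pos in tagged]
print(lemmas)
```

In this sketch the POS tag disambiguates forms such as "boeken", which maps to the noun lemma "boek" or the verb lemma "boeken" depending on the tag.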

See Description of the tagset ANW.

More information is available in Dutch here.


Bibliography

Tiberius, Carole and Adam Kilgarriff (2009). The Sketch Engine for Dutch with the ANW corpus. In E. Beijk et al. (eds.), Fons Verborum: Feestbundel Fons Moerdijk. Amsterdam: Gopher BV, pp. 237–255.

Schoonheim, Tanneke and Rob Tempelaars (2010). Dutch Lexicography in Progress, The Algemeen Nederlands Woordenboek (ANW). In Anne Dykstra and Tanneke Schoonheim (eds.), Proceedings of the XIV Euralex International Congress. Ljouwert: Fryske Akademy/Afûk, pp. 718–725.