The Norwegian Web 2015 is a web corpus from the TenTen corpus family. The corpus contains almost 1.7 billion words. Danish-language text was removed from it, and the two Norwegian written standards, Bokmål and Nynorsk, were separated into two subcorpora.

The corpus is tagged with the Oslo-Bergen Tagger; a summary of the tagset is described here.

The tags are available in two corpus attributes:

  • tag – the part of speech
  • tag_attrs – the morphological details

For instance, where the original tag is “pron ent pers hum nom 1” (pronoun, singular, personal, human, nominative, first person):

  • tag = “pron”
  • tag_attrs = “ent pers hum nom 1”
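
The split shown above can be sketched in a few lines of Python. This is only an illustration and not the corpus's actual processing code: the function name split_obt_tag is hypothetical, and it assumes that the first whitespace-separated item of the original tag is always the part of speech and that everything after it is morphological detail, as in the example.

```python
from typing import Tuple


def split_obt_tag(original_tag: str) -> Tuple[str, str]:
    """Split an original tag string into (tag, tag_attrs).

    Assumption (not confirmed by the corpus documentation): the first
    whitespace-separated item is the part of speech and the remaining
    items are the morphological details.
    """
    parts = original_tag.split()
    if not parts:
        # Empty input: no part of speech and no morphological details.
        return "", ""
    tag = parts[0]
    tag_attrs = " ".join(parts[1:])
    return tag, tag_attrs


tag, tag_attrs = split_obt_tag("pron ent pers hum nom 1")
print(tag)        # pron
print(tag_attrs)  # ent pers hum nom 1
```

A tag that consists of a part of speech only would, under the same assumption, yield an empty tag_attrs value.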

Changelog

2016

  • Danish removed
  • Bokmål and Nynorsk separated

2015

  • new version crawled – 1.7 billion words

v. 1.0 (21 February 2012)

  • initial version – 770 million tokens