Historical Corpus of German Newspapers 1650–1800

The GerManC corpus is a representative historical corpus of German newspapers from the period 1650–1800, distributed by the University of Oxford Text Archive.

The corpus consists of short text samples of some 200 words each from German newspapers of the early modern period 1650–1800. The corpus metainformation contains full bibliographic details of the original texts, e.g. region, genre, year of publication, author and title. Texts are divided into three fifty-year subperiods (1650–1700, 1701–1750 and 1751–1800).
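As an illustration of the subperiod division, the following minimal Python sketch maps a year of publication to one of the three subperiods. The function name `subperiod` and the label strings are assumptions made for this example; they are not taken from the corpus metadata, which may encode the periods differently.

```python
# Subperiod boundaries as stated in the corpus description.
# The label strings are hypothetical; the corpus may use other values.
SUBPERIODS = [
    (1650, 1700, "1650-1700"),
    (1701, 1750, "1701-1750"),
    (1751, 1800, "1751-1800"),
]


def subperiod(year: int) -> str:
    """Return the fifty-year subperiod label for a year of publication."""
    for start, end, label in SUBPERIODS:
        if start <= year <= end:
            return label
    raise ValueError(f"year {year} is outside the corpus range 1650-1800")


if __name__ == "__main__":
    for y in (1650, 1701, 1800):
        print(y, subperiod(y))
```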

Conversion process

The GerManC corpus in Sketch Engine is based on its LING-GATE version, which contains both linguistic and structural annotations. All annotations were preserved, except those annotating subparts of certain tokens. Certain phrases (such as headings, acts and speakers), however, were not annotated within GerManC, so the values of the corresponding attributes were left blank.

Finally, the whole corpus was retagged with the standard TreeTagger analyzer to provide word sketches, which make it possible to explore the grammatical behavior of German in the early modern period.

Part-of-speech tagset

The GerManC POS tagging scheme is based on the STTS tagset for German, with a number of modifications to account for differences between modern and Early Modern German. The POS annotations in GerManC were produced by a retrained version of the TreeTagger tool. See the STTS tagset for German.

Attributes

Attributes available in the corpus

For all tokens:

  • word – original word form
  • tag – TreeTagger output (see the tagset summary)
  • lempos – lemma+part_of_speech (based on TreeTagger output)

Based on original tagging (partially unavailable):

  • lemma – base lemma (in its modern form)
  • norm – normalized word form
  • lc – lowercase normalized word form
  • morph – morphological information
  • tag2 – part-of-speech (original tagger output)
  • ptag – syntactic category (original tagger output)
  • kind – token type (word, number, punctuation, etc.)
  • pID – word id in sentence (used by parser)
  • pDepID – dependency relation (parser output)
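The attributes listed above can be combined in corpus queries. The following Python sketch builds a single-token query in the bracketed `[attribute="pattern" & ...]` form of CQL used by Sketch Engine. The helper `cql_token` is hypothetical and written only for this illustration, and the example values (`NN` is the STTS tag for common nouns) are assumptions about what the corpus contains; attributes from the original tagging may be blank for some tokens.

```python
# Attribute names as listed in this document.
ATTRIBUTES = {
    "word", "tag", "lempos",                 # available for all tokens
    "lemma", "norm", "lc", "morph", "tag2",  # based on original tagging
    "ptag", "kind", "pID", "pDepID",
}


def cql_token(**conditions: str) -> str:
    """Build a one-token CQL expression from attribute=pattern pairs.

    Hypothetical helper: it only assembles the query string and does not
    contact Sketch Engine or check that the pattern matches anything.
    """
    if not conditions:
        return "[]"  # matches any token
    parts = []
    for attr, pattern in conditions.items():
        if attr not in ATTRIBUTES:
            raise ValueError(f"unknown attribute: {attr}")
        if '"' in pattern:
            raise ValueError("double quotes are not supported in this sketch")
        parts.append(f'{attr}="{pattern}"')
    return "[" + " & ".join(parts) + "]"


if __name__ == "__main__":
    # Common nouns (TreeTagger output) with a given modern lemma.
    print(cql_token(tag="NN", lemma="Zeitung"))
```

Validating attribute names against a fixed set catches typos before a query is submitted; the set would need to be adjusted if the corpus configuration differs from the list above.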

Authors

The corpus was prepared by Martin Durrell, Paul Bennett, Silke Scheible and Richard J. Whitt.

Bibliography

Durrell, Martin; Ensslin, Astrid and Bennett, Paul (eds.). GerManC. A Historical Corpus of German Newspapers 1650-1800 [Electronic resource].
