GerManC is a historical corpus of written German texts.

(This page concerns the GerManC version for Sketch Engine. The original corpus is distributed by the University of Oxford Text Archive.)

Maintainer

Milos Husak, Lexical Computing (support@sketchengine.co.uk)

Conversion process

The GerManC corpus in Sketch Engine is based on its LING-GATE version, which contains both linguistic and structural annotations. All annotations were preserved except those that annotate subparts of certain tokens. Certain structural elements (such as headings, acts and speakers), however, were not annotated within GerManC, so the values of the corresponding attributes were left blank.

Finally, the whole corpus was retagged with the standard TreeTagger to provide word sketches.

Tagsets

Attributes

Available for all tokens:
   word   - original word form
   tag    - TreeTagger output
   lempos - lemma+part_of_speech (based on TreeTagger output)

Based on original tagging (partially unavailable):
   lemma  - base lemma (in its modern form)
   norm   - normalized word form
   lc     - lowercase normalized word form
   morph  - morphological information
   tag2   - part-of-speech (original tagger output)
   ptag   - syntactic category (original tagger output)
   kind   - token kind (word, number, punctuation, etc.)
   pID    - word id in sentence (used by parser)
   pDepID - dependency relation (parser output)
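
As a rough illustration of how these attributes might be consumed, the following Python sketch splits one token line of a tab-separated, one-token-per-line export into a dictionary keyed by the attribute names above. The column order, the function name parse_token_line and the sample values are assumptions made for this example, not a description of the actual corpus files. Empty or missing cells are mapped to None, mirroring the fact that the original annotations are only partially available.

```python
# Attribute names as listed above; the column ORDER is an assumption.
ATTRIBUTES = ["word", "tag", "lempos", "lemma", "norm", "lc",
              "morph", "tag2", "ptag", "kind", "pID", "pDepID"]


def parse_token_line(line):
    """Split one tab-separated token line into an attribute dict.

    Missing trailing columns and empty cells become None, reflecting
    that the original annotations are only partially available.
    """
    cells = line.rstrip("\n").split("\t")
    # Pad short lines so every attribute name gets a value.
    cells += [""] * (len(ATTRIBUTES) - len(cells))
    return {name: (cell or None) for name, cell in zip(ATTRIBUTES, cells)}


# Hypothetical example line (values invented for illustration only).
token = parse_token_line("Hauß\tNN\tHaus-n\tHaus\tHaus\thaus")
print(token["word"], token["lemma"], token["morph"])
```

Code that reads the attributes based on the original tagging should be prepared for None values in this way, rather than assuming every token carries every annotation.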

Statistics

   Number of words  : 667,310
   Number of tokens : 800,783
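
The relation between the two figures can be checked with a few lines of Python. This is only a sketch: it assumes that every word is also counted as a token, so that the difference corresponds to non-word tokens such as punctuation and numbers.

```python
# Figures taken from the statistics above.
words = 667_310
tokens = 800_783

# Assumption: every word is also a token, so the difference is the
# number of non-word tokens (punctuation, numbers, etc.).
non_word_tokens = tokens - words
share = non_word_tokens / tokens

print(non_word_tokens)   # 133473
print(f"{share:.1%}")    # 16.7%
```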

Original corpus

Authors

Martin Durrell; Paul Bennett; Silke Scheible; Richard J. Whitt

Data

Creation date : between 2008 and 2011.
Source        : http://www.ota.ox.ac.uk/desc/2544

Documents     : 1352 files
Size          : 163 MB

Description

Expanded and revised version of http://ota.ox.ac.uk/id/2537

The aim of the GerManC corpus project was to compile a representative historical corpus of written German for the years 1650-1800. A central initial objective was to provide a basis for comparative studies of the development of the grammar and vocabulary of English and German and of the way in which they were standardized. The structure and design of the GerManC corpus were therefore intended to parallel those of similar historical linguistic corpora of English, notably the ARCHER corpus and the Helsinki corpus of English texts. Consistent attention was also paid to maintaining compatibility with corpus projects in Germany covering earlier historical stages of German, initially within the framework of the DDD project (Deutsch Diachron Digital), and latterly with the various parts of the Historisches Referenzkorpus des Deutschen, which are currently being compiled at various centres in Germany. The idea for the project goes back to an initiative by Anita Auer (now at the University of Utrecht), who completed a doctorate in Manchester in 2005. Dr Auer’s work drew attention to the lack of corpus-based data for German during this period compared to English; she suggested undertaking the compilation of such a corpus for German and completed some preparatory work on it.

Following the model of the ARCHER corpus and given the aim of representativeness, the GerManC corpus consists of text samples of about 2000 words from eight genres: drama, newspapers, sermons and personal letters (to represent orally oriented registers), and narrative prose (fiction or non-fiction), scholarly (i.e. humanities), scientific and legal texts (to represent more print-oriented registers). To facilitate tracing historical developments, the whole period was divided into fifty-year sections (in this case 1650-1700, 1700-1750 and 1750-1800), and an equal number of texts from each genre was selected for each of these sub-periods.
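
The sampling design described above can be pictured as a grid of genres by sub-periods. The Python sketch below enumerates that grid; the genre labels are taken from the description, while the identifiers and the handling of boundary years (a text dated exactly 1700 or 1750 is assigned to the later sub-period) are assumptions made for illustration and are not documented properties of the corpus.

```python
from itertools import product

# Genres as named in the description above.
GENRES = ["drama", "newspapers", "sermons", "personal letters",
          "narrative prose", "scholarly", "scientific", "legal"]

# Fifty-year sub-periods as (start, end) pairs.
PERIODS = [(1650, 1700), (1700, 1750), (1750, 1800)]

# Every genre is sampled in every sub-period.
cells = list(product(GENRES, PERIODS))
print(len(cells))  # 24


def period_for(year):
    """Return the sub-period containing `year`, or None if out of range.

    Assumption: boundary years belong to the later sub-period, except
    the final year, which belongs to the last one.
    """
    for start, end in PERIODS:
        if start <= year < end or (year == end == PERIODS[-1][1]):
            return (start, end)
    return None


print(period_for(1723))  # (1700, 1750)
```

With eight genres and three sub-periods the grid has 24 cells, each holding the same number of text samples.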


Bibliography

Durrell, Martin; Ensslin, Astrid and Bennett, Paul (eds.). GerManC. A Historical Corpus of German Newspapers 1650-1800 [Electronic resource].

Attachments

See Documentation of GerManC (in pdf)

Appendix1 (in xlsx)

Appendix2 (in pdf)