PennHistEn is a collection of historical English texts ranging from Middle English to Modern British English (mid-12th to early 20th century).

(This page concerns the PennHistEn version for Sketch Engine. The original collection is distributed by the University of Pennsylvania.)

Maintainer

Milos Husak, Lexical Computing (support@sketchengine.co.uk)

Conversion process

The original corpus consists of three parts (the Penn-Helsinki Parsed Corpus of Middle English, second edition – PPCME2, the Penn-Helsinki Parsed Corpus of Early Modern English – PPCEME, and the Penn Parsed Corpus of Modern British English – PPCMBE) that differ slightly in the way they were annotated.

The Sketch Engine version of PennHistEn retains almost all of the metadata and tagging of the original corpus (which itself retains most, but not all, of the markup of the source corpora; e.g. line breaks and paragraphs were not preserved). It was further normalized to make it easier to treat the whole collection as one corpus.

Document metadata

Empty values (all “X”es and “n/a”) were replaced with the value “===NONE===”.
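
The replacement of empty values can be sketched roughly as follows. This is only an illustration, not the actual conversion script: the function name normalize_empty is hypothetical, and the sketch assumes that a placeholder is either “n/a” or a value consisting solely of “X” characters.

```python
NONE_VALUE = "===NONE==="


def normalize_empty(value: str) -> str:
    """Return NONE_VALUE for placeholder values, otherwise the value unchanged."""
    stripped = value.strip()
    # Assumption: a placeholder is "n/a" or a run of one or more "X" characters.
    if stripped.lower() == "n/a" or (stripped and set(stripped) == {"X"}):
        return NONE_VALUE
    return value


if __name__ == "__main__":
    for raw in ["X", "n/a", "Chaucer"]:
        print(repr(raw), "->", repr(normalize_empty(raw)))
```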

Structures

documents

The original corpora contained two sets of document meta-information; both are present, with ‘PPC_’ and ‘Helsinki_’ prefixes, as attributes of the <doc> structure. These values were only minimally edited (abbreviations expanded, letter case normalized, whitespace removed, etc.).

Three other attributes were derived from the corresponding values in both annotations:

  • Author – equal to ‘PPC_Author’ if available, otherwise ‘Helsinki_Author’ with only first letters in uppercase
  • Title – equal to ‘PPC_Text_name’ if available, otherwise ‘Helsinki_Text_name’ with only first letters in uppercase
  • Date – 50-year-wide intervals covering the period from the earliest to the latest year mentioned in the date-related attributes
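
The derived attributes above could be computed along these lines. This is a minimal sketch under stated assumptions, not the maintainer's code: the helper names are hypothetical, “available” is taken to mean “not ===NONE===”, and the 50-year intervals are assumed to be aligned to multiples of 50 and written as “start-end” strings, which the original description does not specify.

```python
from typing import Dict, List

NONE_VALUE = "===NONE==="


def capitalize_words(text: str) -> str:
    """Lowercase each word and uppercase only its first letter."""
    return " ".join(word.capitalize() for word in text.split())


def derive_field(doc: Dict[str, str], ppc_key: str, helsinki_key: str) -> str:
    """Prefer the PPC value; fall back to the re-cased Helsinki value."""
    ppc = doc.get(ppc_key, NONE_VALUE)
    if ppc != NONE_VALUE:
        return ppc
    helsinki = doc.get(helsinki_key, NONE_VALUE)
    if helsinki == NONE_VALUE:
        # Assumption: if neither annotation has a value, keep the placeholder.
        return NONE_VALUE
    return capitalize_words(helsinki)


def date_intervals(years: List[int], width: int = 50) -> List[str]:
    """List the aligned intervals that cover min(years) through max(years)."""
    # Assumption: interval boundaries fall on multiples of `width`.
    start = min(years) // width * width
    end = max(years) // width * width
    return [f"{lo}-{lo + width - 1}" for lo in range(start, end + 1, width)]


if __name__ == "__main__":
    doc = {"PPC_Author": NONE_VALUE, "Helsinki_Author": "CHAUCER GEOFFREY"}
    print(derive_field(doc, "PPC_Author", "Helsinki_Author"))
    print(date_intervals([1420, 1500]))
```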

comments

Various edits, comments, etc. are tagged using a unary tag <COM/>. The content of the comment is available in its ‘value’ attribute, and its ‘type’ attribute distinguishes the origin of the comment or the way it was tagged in the original corpus.
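
For users working with the vertical source file rather than the Sketch Engine interface, a <COM/> line could be read with a small parser like the one below. This is a sketch only: it assumes that each tag sits on its own line and that attribute values are double-quoted and contain no “>” character. The function name parse_com and the sample attribute values are hypothetical.

```python
import re
from typing import Dict, Optional

# A unary <COM .../> tag and its name="value" attribute pairs.
COM_TAG = re.compile(r"<COM\b([^>]*)/>")
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def parse_com(line: str) -> Optional[Dict[str, str]]:
    """Return the attributes of a <COM/> line, or None for any other line."""
    match = COM_TAG.fullmatch(line.strip())
    if match is None:
        return None
    return dict(ATTRIBUTE.findall(match.group(1)))


if __name__ == "__main__":
    # Hypothetical sample line; real 'type' values come from the corpus.
    print(parse_com('<COM type="editorial" value="illegible"/>'))
    print(parse_com("the"))
```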

Attributes

  • ascii – The original corpus used certain conventions to encode non-ASCII characters, ligatures, superscripts, etc. The ascii attribute contains the original ASCII-encoded form of the tokens (see http://icame.uib.no/hc/#con32 for details).
    • superscripts: =X=
    • accents: X’
    • non-ascii symbols:
      ascii form   upper case   symbol name     ascii form   lower case
      +A           Æ            ash             +a           æ
      +D           Ð            eth             +d           ð
      +G           Ȝ            yogh            +g           ȝ
      +TT, +Tt                  crossed thorn   +tt
      +T           Þ            thorn           +t           þ
                                e caudata       +e           ę
                                                +o           œ
      +L           £            pound sign
  • unicode – displays the text as close as possible to its original form; most signs, ligatures and superscripts were converted to their Unicode counterparts. The most prominent issue with the conversion to Unicode is that the original format does not distinguish between different accents, so all accents were replaced with “combining vertical line above” (a sort of neutral accent, which turns the apostrophe in they’re into they̍re), and all abbreviations, flourishes, tildes, etc. were replaced with “combining tilde”, as in Cobh̃m.
  • word – the default form of tokens used in Sketch Engine; for practical reasons, it does not distinguish between different kinds of superscripted forms, and all ligatures and historical letters were replaced with their closest Latin letter counterparts (following the ascii encoding, but omitting the ‘+’ sign).
  • lc – lowercase normalized word form
  • tag – POS tag provided along with the original corpus
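
The relationship between the ascii, unicode and word attributes can be illustrated with the sketch below. It covers only the ‘+’ codes from the symbol table; superscripts and accents are left out, and the crossed-thorn codes are passed through unchanged in to_unicode because the table gives no symbol for them. The names SYMBOLS, to_unicode and to_word are hypothetical, and this is not the conversion code actually used.

```python
import re

# '+' codes and their Unicode counterparts, taken from the symbol table.
SYMBOLS = {
    "+A": "Æ", "+a": "æ",
    "+D": "Ð", "+d": "ð",
    "+G": "Ȝ", "+g": "ȝ",
    "+T": "Þ", "+t": "þ",
    "+e": "ę", "+o": "œ", "+L": "£",
}

# Longer codes come first so "+tt" is not read as "+t" followed by "t".
CODE = re.compile(r"\+(?:TT|Tt|tt|[ADGTLadgteo])")


def to_unicode(ascii_form: str) -> str:
    """Replace known '+' codes with their Unicode symbols."""
    # Codes without a symbol in the table (crossed thorn) are left as they are.
    return CODE.sub(lambda m: SYMBOLS.get(m.group(0), m.group(0)), ascii_form)


def to_word(ascii_form: str) -> str:
    """Drop the '+' sign, keeping the plain Latin letters of each code."""
    return CODE.sub(lambda m: m.group(0)[1:], ascii_form)


if __name__ == "__main__":
    token = "+t+at"
    print(to_unicode(token), to_word(token), to_word(token).lower())
```

Under these assumptions the lc attribute would simply be the lowercased result of to_word.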

Tagsets

Statistics

Number of words  : 3,800,639 	
Number of tokens : 4,404,931

Documents        : 605 files
Size             : 140 MB (vertical uncompressed)
                   125 MB (compiled corpus)
                    42 MB (compressed source)

Original corpus

Authors

  • Anthony Kroch and Ann Taylor. 2000. The Penn-Helsinki Parsed Corpus of Middle English (PPCME2). Department of Linguistics, University of Pennsylvania. CD-ROM, second edition, (http://www.ling.upenn.edu/hist-corpora/).
  • Anthony Kroch, Beatrice Santorini, and Lauren Delfs. 2004. The Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME). Department of Linguistics, University of Pennsylvania. CD-ROM, first edition, (http://www.ling.upenn.edu/hist-corpora/).
  • Anthony Kroch, Beatrice Santorini, and Ariel Diertani. 2010. The Penn-Helsinki Parsed Corpus of Modern British English (PPCMBE). Department of Linguistics, University of Pennsylvania. CD-ROM, first edition, (http://www.ling.upenn.edu/hist-corpora/).

Description

(from http://www.ling.upenn.edu/hist-corpora/)

The Penn Corpora of Historical English, including the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed Corpus of Modern British English (PPCMBE), are running texts and text samples of British English prose across its history – from the earliest Middle English documents up to the First World War. The texts come in three forms: simple text, part-of-speech tagged text and syntactically annotated text. The syntactic annotation (parsing) permits searching not only for words and word sequences, but also for syntactic structure. All of the annotation has been carefully checked by expert human annotators for accuracy and consistency. The corpora are designed for the use of students and scholars of the history of English, especially the historical syntax of the language, and they are publicly available to individuals, research groups and libraries.


How to cite

The Penn Parsed Corpora of Historical English should be cited individually rather than as a single bibliographic entry. The citation should include the website of the corpus, its edition, and its date of release. The proper citations as of June 1, 2011 are the three entries listed under Authors above.