A medical-domain web corpus was collected using WebBootCaT with DANTE seeds. The corpus contains 35 million words from the medical domain.

For more details about the approach, see Avinesh PVS, Diana McCarthy, Dominic Glennon & Jan Pomikálek, Domain Specific Corpora from the Web.

The data was prepared for the Sketch Engine using a lemmatiser and was part-of-speech tagged using TreeTagger with the UTF-8 English parameter file.

For information about the POS tagset used, see the English Penn TreeBank part-of-speech tagset.
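
As a rough illustration of what tagged and lemmatised data of this kind looks like, the sketch below reads a few tab-separated lines and collects the noun lemmas. The column order (word, tag, lemma) is assumed here; the sample tokens are hypothetical and not taken from the corpus, and `Token` and `parse_tagged_line` are names introduced only for this example.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_tagged_line(line: str) -> Token:
    """Split one 'word<TAB>tag<TAB>lemma' line into a Token."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 tab-separated fields, got {len(parts)}: {line!r}"
        )
    return Token(*parts)


# Hypothetical sample in the assumed word/tag/lemma layout (not corpus data).
SAMPLE = "The\tDT\tthe\nchronic\tJJ\tchronic\ninfections\tNNS\tinfection\n"

# Parse every non-empty line into a Token.
tokens = [parse_tagged_line(line) for line in SAMPLE.splitlines() if line.strip()]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
noun_lemmas = [t.lemma for t in tokens if t.tag.startswith("NN")]
print(noun_lemmas)
```

Filtering on the tag prefix rather than on an exact tag keeps singular, plural and proper nouns together, which is usually what a lemma-based query needs.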