• <doc> is a document, which usually corresponds to a single web page. It can have several attributes, such as URL/source (the source of the document), author, and date/crawl_date (the date the document was created or collected from the web).
  • <p> is a paragraph. It can have the attribute heading (the value “1” means the paragraph is a heading/caption).
  • <s> is a sentence.
  • <g> is a “glue” tag. We use it to mark a word boundary that has no space, so its main purpose is to display concordances correctly.
  • <gap> denotes a gap created by one of our tools, mostly through de-duplication or boilerplate removal (cleaning HTML pages of navigation, short ads, etc.).
  • <a> is a URL or a piece of text that was a link to the original document.

Other, corpus-specific tags may be used as well; they are described on the information pages of the particular corpora.
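
To illustrate how these tags fit together, here is a minimal Python sketch that rebuilds plain sentences from tagged text. It rests on several assumptions not stated above: the input has one token or one tag per line, <g> and <gap> are written in the self-closing forms <g/> and <gap/>, and the sample data (including the attribute name url) is invented for the example. The function name sentences is likewise hypothetical, not part of any tool.

```python
import re

# Invented sample; the one-item-per-line layout is an assumption.
SAMPLE = """\
<doc url="http://example.com/" author="unknown">
<p heading="1">
<s>
Hello
<g/>
,
world
<g/>
!
</s>
</p>
<gap/>
<p>
<s>
Second
sentence
<g/>
.
</s>
</p>
</doc>
"""

# Matches a line that consists of exactly one opening, closing or empty tag.
TAG_RE = re.compile(r'^<(/?)([A-Za-z_]\w*)((?:\s+\w+="[^"]*")*)\s*(/?)>$')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def sentences(vertical):
    """Yield (is_heading, text) for every <s>, honouring <g/> glue tags."""
    heading = False   # True while inside <p heading="1">
    tokens = None     # pieces of the current sentence, None outside <s>
    glue = False      # True right after <g/>: next token gets no space
    for line in vertical.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TAG_RE.match(line)
        if match is None:
            # An ordinary token: join it to the current sentence.
            if tokens is not None:
                if tokens and not glue:
                    tokens.append(" ")
                tokens.append(line)
            glue = False
            continue
        closing, name, attrs, _ = match.groups()
        if name == "g":
            glue = True
        elif name == "p":
            attr_map = dict(ATTR_RE.findall(attrs))
            heading = (not closing) and attr_map.get("heading") == "1"
        elif name == "s":
            if closing:
                if tokens is not None:
                    yield heading, "".join(tokens)
                tokens = None
            else:
                tokens = []
                glue = False
        # <doc>, <gap/>, <a> and corpus-specific tags are skipped here.


for is_heading, text in sentences(SAMPLE):
    print("HEADING" if is_heading else "TEXT   ", text)
```

The glue flag is reset after every token, so a <g/> only suppresses the space before the token that immediately follows it. Tags the sketch does not recognise are ignored, which keeps it tolerant of per-corpus additions.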