- <doc> is a document, which usually corresponds to a single web page. It can have multiple attributes, such as URL/source (the source document), author, and date/crawl_date (the date of creation or the date the page was collected from the web).
- <p> is a paragraph. It can have the attribute heading (the value “1” means the paragraph is a heading or caption).
- <s> is a sentence.
- <g> is a “glue” tag. We use it to mark a word boundary that has no space, so its main purpose is to display concordances correctly.
- <gap> denotes a gap created by one of our tools, mostly as a result of de-duplication or boilerplate removal (cleaning HTML pages of navigation, short ads, etc.).
- <a> is a URL or text that was a link to the original document.
Other tags, defined per corpus, may be used as well. They can be looked up on the information pages of the particular corpora.
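
To illustrate how these structures fit together, here is a minimal Python sketch that reads a small annotated sample and rebuilds the text of each sentence, joining tokens wherever a glue tag appears. Several things in it are assumptions made for the example only and are not taken from this page: the one-token-per-line layout, the self-closing forms `<g/>` and `<gap/>`, the sample attribute values, and the helper name `sentences`.

```python
import re

# Hypothetical sample. The one-item-per-line layout and the self-closing
# <g/> and <gap/> forms are assumptions made for this illustration.
SAMPLE = """\
<doc url="https://example.com/page" author="unknown" crawl_date="2020-01-01">
<p heading="1">
<s>
Corpus
structures
</s>
</p>
<p>
<s>
Hello
<g/>
,
world
<g/>
!
</s>
<gap/>
</p>
</doc>
"""

# Matches an opening, closing, or self-closing structure tag on its own line.
TAG = re.compile(r"^<(/?)(\w+)([^>]*?)(/?)>$")
# Matches name="value" attribute pairs inside a tag.
ATTR = re.compile(r'(\w+)="([^"]*)"')


def sentences(vertical):
    """Yield (paragraph_attributes, sentence_text) for every <s> structure."""
    p_attrs = {}
    tokens = None   # list of tokens while inside <s>, otherwise None
    glue = False    # True when the next token attaches to the previous one
    for line in vertical.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TAG.match(line)
        if match is None:
            # An ordinary token line.
            if tokens is not None:
                if glue and tokens:
                    tokens[-1] += line      # no space before this token
                else:
                    tokens.append(line)
            glue = False
            continue
        closing, name, raw_attrs, _ = match.groups()
        if name == "g":
            glue = True
        elif name == "p":
            p_attrs = {} if closing else dict(ATTR.findall(raw_attrs))
        elif name == "s":
            if closing and tokens is not None:
                yield p_attrs, " ".join(tokens)
                tokens = None
            elif not closing:
                tokens = []
                glue = False
        # <doc>, <gap>, <a> and any per-corpus tags are ignored in this sketch.


for attrs, text in sentences(SAMPLE):
    label = "heading" if attrs.get("heading") == "1" else "text"
    print(f"{label}: {text}")
```

In this sketch the first paragraph is reported as a heading because of its heading="1" attribute, and the glue tags make the punctuation attach to the preceding word instead of being separated by a space.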