CQL – search structures

Structures refer to sentences, paragraphs, documents or any other parts or sections into which a corpus might be divided. Another way of saying this is that certain parts of a corpus can be labelled, e.g. direct speech might be labelled using structures to make it possible to limit the search to only direct speech.

Introduction to structures and values

If a corpus has structures (most corpora have them), the structures may or may not have values. Values are used to categorize instances of the same structure.

A corpus with only structures

Structures can differ between corpora, but most corpora use at least s, p and doc (sentences, paragraphs and documents respectively), as seen in the examples below. To see the list of all structures used in a corpus, go to the Corpus info page. The page is also useful for checking how the common structures are labelled. For example, a sentence may be labelled s, sent, sen or snt, depending on the corpus.

<doc>
  <p>
    <s>Baa, baa, black sheep,</s>
    <s>Have you any wool?</s>
  </p>
  <p>
    <s>Yes, sir, yes, sir,</s>
    <s>Three bags full;</s>
  </p>
</doc>

Each structure can have values, e.g. a paragraph might be labelled with a date when it was written or the name of the author, and these values can also be used as search criteria, for example to find the occurrences of the word BMW but only in texts (documents) written in 1970.

A corpus with structures and values

<doc id="20110658A" pub="1970" lang="en">
	<p style="formal">
		<s><pers gender="female">Rebecca</pers> has worked with a full range of clients including <brand sect="automotive">BMW</brand> and <brand sect="air">Airbus</brand>.</s>
		<s>She showed her attention to detail and overall competence.</s>
	</p>
	<p style="informal">
		<s>I remember her first day at work, when she knocked her coffee over and spilt it over documents.</s>
		<s>Such fun!</s>
	</p>
</doc>
<doc pub="1977">
	<p style="informal">
		<s>first sentence</s>
		<s>second sentence</s>
	</p>
	<p style="informal">
		<s>third sentence</s>
		<s>fourth sentence</s>
	</p>
</doc>

The author of the above corpus decided to label brand names with a structure and assign each brand name a value indicating the industry. Now these searches and statistics are possible:

  • find the occurrences of the word BMW but only in texts (documents) written after 1970
  • calculate the frequency of brand names in informal texts
  • compare the frequency of brand names from each industry in texts published before and after 1970
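
As a rough sketch, the first two searches could be expressed with the within operator as shown below. The queries assume the structure and attribute names of the sample corpus above (doc with pub, p with style, and brand) and, for simplicity, match a single publication year rather than a range of years. A real corpus may use different names.

```cql
[word="BMW"] within <doc pub="1977" />
<brand/> within <p style="informal" />
```

The first query returns BMW only in documents whose pub value is 1977. The second returns each brand structure found inside an informal paragraph, so the number of hits gives the frequency.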

Searching for structures

Structures are especially useful in combination with the within and containing operators.

Referring to structures

Structures can be referred to in three ways:

the beginning

To refer to the beginning of the structure, e.g. to find sentences starting with…, paragraphs starting with… etc., use:

<s> <p> <doc> etc.
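
For illustration, the two queries below place a token condition directly after the opening tag. They assume a tagset in which noun tags begin with N (as in the example searches on this page) and the structure names s and p. The first matches a noun that is the first token of a sentence, the second matches the word Yes at the start of a paragraph.

```cql
<s> [tag="N.*"]
<p> [word="Yes"]
```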

the end

To refer to the end of the structure, e.g. to find paragraphs ending with…, documents ending with…, words that appear at the end of a sentence/paragraph/document etc., use:

</s> </p> </doc> etc.
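
For illustration, the queries below place a token condition directly before the closing tag. They assume that punctuation is tokenized as a separate token, so the last token of a sentence is usually a punctuation mark, and that noun tags begin with N. The first matches a question mark that closes a sentence, the second a noun followed by a sentence-final question mark.

```cql
[word="\?"] </s>
[tag="N.*"] [word="\?"] </s>
```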

the whole structure

To refer to the whole structure, i.e. all tokens inside a sentence, paragraph, document etc., for example to find sentences, paragraphs, documents etc. containing or not containing a word or phrase, use:

<s/> <p/> <doc/> etc.
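
As a short sketch, the whole-structure form is typically combined with the containing operator or its negation. The queries below assume the structure name s and a lemma attribute. The first returns whole sentences that contain the lemma sheep, the second returns whole sentences that do not.

```cql
<s/> containing [lemma="sheep"]
<s/> !containing [lemma="sheep"]
```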

Example searches

to find all documents written in informal style that start with the word Rebecca

<doc style="informal">[lemma="Rebecca"]

to find all documents whose ID is 2011 and which start with a noun followed by a verb at a distance of up to 5 words, use:

<doc id="2011"> [tag="N.*"] []{0,5} [tag="V.*"]

or, combining several values of the same structure in one query:

<doc id="2011" & type="written"> [tag="N.*"] []{0,5} [tag="V.*"]

to find all verbs written in informal style, i.e. verbs found inside documents annotated with ‘informal’ as text type:

[tag="V.*"] within <doc style="informal" />

For more examples exploiting structures, see CQL – within & containing