Writing term grammar

Information

Writing a term grammar is intended for advanced users.
Before you start, make sure that you know the syntax of Sketch grammar and are able to write a sketch grammar.

This page is a short guide to creating term grammars. A term grammar tells Sketch Engine which words and phrases should be identified as terms. For example, a combination of preposition + verb + preposition will not be considered a valid term structure in most languages, while adjective + (optional) adjective + noun will.

Generally, there is one term grammar for each language. However, additional grammars for domains requiring specific term descriptions can be easily produced.

The term grammar is used in the automatic terminology extraction function.

The term grammar file

The term grammar file is an input for the program which compiles terms (compilecorp or genws, see compiling corpora). It is a text (ASCII) file containing CQL queries.

  • a hash # at the beginning of the line indicates a comment
  • a line with “=terms” must introduce term grammar relation(s)
  • there should be one CQL query per line, use a backslash at the end of the line to split a CQL query into multiple lines
  • each position in a term grammar must be labelled, e.g. “1:noun”
  • lines beginning with an asterisk * are called processing directives

*STRUCTLIMIT s ensures that all tokens of a query result lie inside the same structure (e.g. a sentence), so the tokens making up a term are never taken from two different sentences. This directive is active until the end of the file or until the next *STRUCTLIMIT directive.

*DEFAULTATTR tag sets the default attribute for query evaluation. This directive is active until the end of the file or until the next *DEFAULTATTR directive.

*COLLOC introduces each term grammar relation: it is placed on the line before the query and defines the form of the resulting term according to the pattern *COLLOC "%(n.attr)", where n is the numeric label used in the query and attr is the attribute name, e.g. *COLLOC "%(1.gender_lemma)"
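
The rules and directives above can be put together into a very small grammar file. The following is only a sketch: the tag values "N.*" and "A.*", the lc attribute and the structure s are assumptions that must be replaced by the tags, attributes and structures of your own corpus. The "-x" ending follows the examples further down this page and should be dropped if word sketches in the corpus are based on lemmas instead of lemposes.

```m4
# Minimal term grammar (illustrative sketch, not a production grammar)
# Assumed: the corpus has the attributes "tag" and "lc" and a structure "s".

*STRUCTLIMIT s

=terms

# one-word term: a single noun
*COLLOC "%(1.lc)-x"
1:[tag="N.*"]

# two-word term: adjective + noun
# the backslash splits one CQL query over two lines
*COLLOC "%(2.lc)_%(1.lc)-x"
2:[tag="A.*"] \
1:[tag="N.*"]
```

Every position carries a numeric label, and each *COLLOC line refers only to labels that occur in the query below it.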

Structure of term grammar

Writing a term grammar is similar to writing a sketch grammar. Generally, a term grammar file consists of a heading and the term grammar itself.

Heading

Start the definition with a heading that gives basic information such as the author, date, version and POS tagset:
# Term Definition for Russian, RFTagger Multex East tagset
# by John Smith
# version 1.0
# Tagset doc: http://example.com/tagset.html
#
# Changelog
# - 17 January 2014, John Smith
#   Created

Term grammar

Similar to sketch grammars, a term grammar can be written using the m4 macro language. Macros help to keep the grammar simple and easy to manage because the syntax can be abbreviated. For examples with explanations see the Macros in m4 section on the Writing a Sketch grammar page.

Always use a .m4 filename extension when supplying a term definition in m4.

The following example shows macro term definitions. Macros are optional but recommended.

divert(-1) 
define(`noun',`[pos="N"]') 
define(`adj',`[pos="A"]') 
define(`noun_genitive',`[pos="N" & case="g"]') 
define(`adj_genitive',`[pos="A" & case="g"]') 
define(`agree',`$1.gender=$2.gender & $1.number=$2.number & $1.case=$2.case')
#macro definition of agreement in grammatical categories of tokens (the line above)
divert
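
Because m4 macros accept arguments, closely related definitions such as noun_genitive and adj_genitive can also be generated from one parameterised macro. The sketch below assumes the same pos and case attributes as the example above; the macro name in_case is hypothetical and only serves as an illustration.

```m4
divert(-1)
# in_case(POS, CASE) builds one token restricted by part of speech and case
# (hypothetical helper, not part of any shipped grammar)
define(`in_case',`[pos="$1" & case="$2"]')
define(`noun_genitive',`in_case(N,g)')
define(`adj_genitive',`in_case(A,g)')
divert
```

After expansion, noun_genitive again yields [pos="N" & case="g"], so rules that already use it do not have to change.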

Term grammar syntax

The example below identifies phrases such as “protected natural reserve”.

=terms
*COLLOC "%(2.gender_lemma)_%(3.gender_lemma)_%(1.lemma_lc)-x" 
2:adj 3:adj 1:noun & agree(1,2) & agree(1,3)
  1. line: the “=terms” line, which introduces the term grammar relations
  2. line: the *COLLOC directive, which defines the form of the whole phrase. The phrase contains 3 words, each in a particular form (attribute): the gender-respecting lemma of the word labelled “2”, the gender-respecting lemma of the word labelled “3”, and the lowercased lemma of the word labelled “1”. Each word is written in round brackets preceded by a percent sign “%”. The whole form of the phrase is enclosed in quotation marks. Only attributes that exist in the corpus may be used.
    (do not use the ending "-x" if word sketches in the corpus are based on lemmas instead of lemposes)
  3. line: a CQL query matching the phrases that this rule should cover. The query uses the macros defined above and expands to 2:[pos="A"] 3:[pos="A"] 1:[pos="N"] & 1.gender=2.gender & 1.number=2.number & 1.case=2.case & 1.gender=3.gender & 1.number=3.number & 1.case=3.case The labels defined here are the ones referenced on the 2nd line. The label “1” marks the main word of the phrase (called the headword) and is usually assigned to the most important noun in the phrase. You can check the correctness of the query by running it in the concordance search of Sketch Engine.

It is good practice to add a comment with an example of the terms covered by each rule. We recommend writing comments in English, or bilingually in English and the language of the term grammar.
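
Following these recommendations, a shorter rule for two-word phrases such as “natural reserve” might look as follows. It is a sketch that assumes the same attributes (pos, gender, number, case, gender_lemma, lemma_lc) as the macro example above; adapt them to your corpus and tagset.

```m4
divert(-1)
define(`noun',`[pos="N"]')
define(`adj',`[pos="A"]')
define(`agree',`$1.gender=$2.gender & $1.number=$2.number & $1.case=$2.case')
divert

=terms

# adjective + noun agreeing in gender, number and case
# example: natural reserve
*COLLOC "%(2.gender_lemma)_%(1.lemma_lc)-x"
2:adj 1:noun & agree(1,2)
# expanded query:
# 2:[pos="A"] 1:[pos="N"] & 1.gender=2.gender & 1.number=2.number & 1.case=2.case
```

m4 does not expand macro names inside # comments, so words such as noun can be used freely in the explanatory lines.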

Examples of term grammars

This example of English term definitions can be a good starting point for writing term definitions for analytic or isolating languages.

# == Term extraction grammar for English ==
# version 2.4
# Based on WIPO from January 2013: (N|Adj)* N (of (N|Adj) N)*)
# 2015-12-02 MJ adopted for Susanne corpus
# 2013-03-27 VS created
# 2013-04-26 VS negative ending
# 2013-07-29 Revised according to final WIPO grammar (Vojta)
# 2013-08-01 added "-x" because of implicit WSSTRIP 2 (Vojta + VitS)

*STRUCTLIMIT p

divert(-1)
define(`noun',`[tag="NN.*"]')
define(`modif',`[tag="NN.*" | tag="JJ" | tag="VVG.*"]')
define(`wof',`[lc="of"]')
define(`not_noun',`[tag!="N.*"]')
divert

=terms

*COLLOC "%(1.lc)-x"
1:noun

*COLLOC "%(2.lc)_%(1.lc)-x"
2:modif 1:noun

*COLLOC "%(3.lc)_%(2.lc)_%(1.lc)-x"
3:modif 2:modif 1:noun

*COLLOC "%(4.lc)_%(3.lc)_%(2.lc)_%(1.lc)-x"
4:modif 3:modif 2:modif 1:noun

*COLLOC "%(1.lc)_%(2.lc)_%(3.lc)_%(4.lc)-x"
1:noun 2:wof 3:modif 4:noun

*COLLOC "%(5.lc)_%(4.lc)_%(3.lc)_%(2.lc)_%(1.lc)-x"
5:modif 4:modif 3:modif 2:modif 1:noun

*COLLOC "%(2.lc)_%(3.lc)_%(1.lc)_%(4.lc)_%(5.lc)-x"
2:modif 3:modif 1:noun 4:wof 5:noun

*COLLOC "%(2.lc)_%(1.lc)_%(3.lc)_%(4.lc)_%(5.lc)-x"
2:modif 1:noun 3:wof 4:modif 5:noun

*COLLOC "%(1.lc)_%(2.lc)_%(3.lc)_%(4.lc)_%(5.lc)-x"
1:noun 2:wof 3:modif 4:modif 5:noun

This example of Slovenian term definitions is a good sample of how to define terms for inflected languages.

# Term Definition for Slovene, Multext-East tagset
# Original Russian definition by Maria Khokhlova
# transformed for Czech by Vit Suchomel
# transformed for Slovene by Darja Fišer
# Tagset doc: http://nl.ijs.si/ME/V5/msd/html/msd-sl.html
# version 1.1
#
# Changelog
# - 21 Oct 2016, Darja Fišer
# - 04 Jan 2017, Darja Fišer
# - 15 Jan 2017, Darja Fišer, rule N attributes

*STRUCTLIMIT s
*DEFAULTATTR tag

divert(-1)
define(`noun',`[tag="^So.*"]')
define(`adj',`[tag="^P.*"]')
define(`pre',`[tag="^D.*"]')
define(`conj',`[tag="^V.*"]')
define(`adv',`[tag="^R.*"]')
define(`verb',`[tag="^Gp.*" & lemma!="biti"]')
define(`noun_genitive',`[tag="^So..r.*"]')
define(`noun_dative',`[tag="^So..d.*"]')
define(`noun_accusative',`[tag="^So..t.*"]')
define(`noun_instrumental',`[tag="^So..o.*"]')
define(`adj_genitive',`[tag="^P....r.*"]')
define(`agree',`$1.g=$2.g & $1.n=$2.n & $1.c=$2.c') #agreement of gender, number and case
define(`agree_case',`$1.c=$2.c') #agreement of case
divert

=4terms

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)_%(4.lc)-x"
1:noun 2:noun_genitive 3:noun_genitive 4:noun_genitive
#izboljšanje kvalitete življenja bolnikov
#"Nc.*" "Nc.*g.*" "Nc.*g.*" "Nc.*g.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)_%(4.lc)-x"
1:noun 2:pre 3:noun 4:noun_genitive & agree_case(2,3)
#adheziv na osnovi topil
#"Nc.*" "S.*" "Nc.*" "Nc.*g.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)_%(4.lc)-x"
1:noun 2:pre 3:adj 4:noun & agree_case(2,4) & agree(3,4)
#tisk na papirne podlage
#"Nc.*" "S.*" "A.*" "Nc.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)_%(4.lc)-x"
1:noun 2:adj_genitive 3:noun_genitive 4:noun_genitive & agree(2,3)
#gostota prostih nosilcev naboja
#"Nc.*" "A.*g.*" "Nc.*g.*" "Nc.*g.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)_%(4.lc)-x"
1:noun 2:adj_genitive 3:adj_genitive 4:noun_genitive & agree(2,4) & agree(3,4)
#metoda magnetronskega ionskega naprševanja
#"Nc.*" "A.*g.*" "A.*g.*" "Nc.*g.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)_%(4.lc)-x"
1:noun 2:noun_genitive 3:pre 4:noun & agree_case(3,4)
#odvod napetosti po času
#"Nc.*" "Nc.*g.*" "S.*" "Nc.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)_%(4.lc)-x"
1:noun 2:noun_genitive 3:conj 4:noun_genitive
#preiskava materialov in konstrukcij
#"Nc.*" "Nc.*g.*" "C.*" "Nc.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lemma_lc)_%(4.lc)-x"
1:noun 2:conj 3:noun 4:noun_genitive
#horizontala in vertikala izobraževanja
#"Nc.*" "C.*" "Nc.*" "Nc.*g"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.gender_lemma)_%(4.lemma_lc)-x"
1:noun 2:conj 3:adj 4:noun & agree(3,4)
#molekula in molekulska skupina
#"Nc.*" "C.*" "A.*" "Nc.*"

*COLLOC "%(1.gender_lemma)_%(2.lemma_lc)_%(3.lc)_%(4.lc)-x"
1:adj 2:noun 3:noun_genitive 4:noun_genitive & agree(1,2)
#terciarno področje uporabe računalnika
#"A.*" "Nc.*" "Nc.*g.*" "Nc.*g.*"

*COLLOC "%(1.gender_lemma)_%(2.lemma_lc)_%(3.lc)_%(4.lc)-x"
1:adj 2:noun 3:pre 4:noun & agree(1,2) & agree_case(3,4)
#kovinsko držalo za slojnik
#"A.*" "Nc.*" "S.*" "Nc.*"

*COLLOC "%(1.gender_lemma)_%(2.lemma_lc)_%(3.lc)_%(4.lc)-x"
1:adj 2:noun 3:adj_genitive 4:noun_genitive & agree(1,2) & agree(3,4)
#shematični prikaz triplastnega odtisa
#"A.*" "Nc.*" "A.*g.*" "Nc.*g.*"

*COLLOC "%(1.gender_lemma)_%(2.gender_lemma)_%(3.lemma_lc)_%(4.lc)-x"
1:adj 2:adj 3:noun 4:noun_genitive & agree(1,3) & agree(2,3)
#Keesomov dipolni efekt usmerjanja
#"A.*" "A.*" "Nc.*n.*" "Nc.*g.*"

*COLLOC "%(1.gender_lemma)_%(2.lc)_%(3.lc)_%(4.lc)-x"
1:adj 2:pre 3:adj 4:noun & agree_case(2,3) & agree_case(2,4) & agree(3,4)
#merjen v protiurni smeri
#"A.*" "S.*" "A.*" "Nc.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)_%(4.lc)-x"
1:adj 2:pre 3:noun 4:noun & agree_case(2,3)
#odvisen od časa izpostavljenosti
#"A.*" "S.*" "Nc.*" "Nc.*g.*"

*COLLOC "%(1.gender_lemma)_%(2.lc)_%(3.gender_lemma)_%(4.lemma_lc)-x"
1:adj 2:conj 3:adj 4:noun & agree(1,4) & agree(3,4)
#prevodni in polprevodni polimer
#"A.*" "C.*" "A.*" "Nc.*"

*COLLOC "%(1.gender_lemma)_%(2.lemma_lc)_%(3.lc)_%(4.lemma_lc)-x"
1:adj 2:noun 3:conj 4:noun & agree(1,2)
#električen prevodnik in dielektrik
#"A.*" "Nc.*" "C.*" "Nc.*"

*COLLOC "%(1.lemma_lc)_%(2.gender_lemma)_%(3.lemma_lc)_%(4.lc)-x"
1:adv 2:adj 3:noun 4:noun_genitive & agree(2,3)
#računalniško generirani model molekul
#"R.*" "A.*" "Nc.*" "Nc.*g.*"

*COLLOC "%(1.lc)_%(2.lemma_lc)_%(3.lc)_%(4.lc)-x"
1:adv 2:adj 3:pre 4:noun & agree_case(3,4)
#vnaprej obsojen na neuspeh
#"R.*" "A.*" "S.*" "Nc.*"

*COLLOC "%(1.lc)_%(2.gender_lemma)_%(3.gender_lemma)_%(4.lemma_lc)-x"
1:adv 2:adj 3:adj 4:noun & agree(2,4) & agree(3,4)
#električno funkcionalna tiskarska barva
#"R.*" "A.*" "A.*" "Nc.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.gender_lemma)_%(4.lemma_lc)-x"
1:verb 2:conj 3:adj 4:noun & agree(3,4)
#delovati kot sprožitveni faktor
#"V.*" "C.*" "A.*" "Nc.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)_%(4.lemma_lc)-x"
1:verb 2:conj 3:adv 4:adj
#pokazati kot statistično značilen
#"V.*" "C.*" "R.*" "A.*"

=3terms

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)-x"
1:noun 2:noun_genitive 3:noun_genitive
#mesto vnosa sile
#"Nc.*" "Nc.*g.*" "Nc.*g.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)-x"
1:noun 2:adj_genitive 3:noun_genitive & agree(2,3)
#pojav kavnega obroča
#"Nc.*" "A.*g.*" "Nc.*g.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)-x"
1:noun 2:pre 3:noun & agree_case(2,3)
#enota za prevodnost
#"Nc.*" "S.*" "Nc.*"

*COLLOC "%(1.gender_lemma)_%(2.lemma_lc)_%(3.lc)-x"
1:adj 2:noun 3:noun_genitive & agree(1,2)
#prosti nosilec naboja
#"A.*" "Nc.*" "Nc.*g.*"

*COLLOC "%(1.gender_lemma)_%(2.lemma_lc)_%(3.lc)-x"
1:adj 2:noun 3:noun_dative & agree(1,2)
#ogromna škoda podjetjem
#"A.*" "Nc.*" "Nc.*d.*"

*COLLOC "%(1.gender_lemma)_%(2.gender_lemma)_%(3.lemma_lc)-x"
1:adj 2:adj 3:noun & agree(1,3) & agree(2,3)
#organska svetleča dioda
#"A.*" "A.*" "Nc.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)-x"
1:adj 2:pre 3:noun_accusative & agree_case(2,3)
#odporen na korozijo
#"A.*" "S.*" "Nc.*"

*COLLOC "%(1.lemma_lc)_%(2.gender_lemma)_%(3.lemma_lc)-x"
1:adv 2:adj 3:noun & agree(2,3)
#lahko prevodni polimer
#"R.*" "A.*" "Nc.*"

*COLLOC "%(1.lemma_lc)_%(2.gender_lemma)_%(3.lemma_lc)-x"
1:verb 2:adj 3:noun & agree(2,3)
#prevajati električni tok
#"V.*" "A.*" "Nc.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)-x"
1:verb 2:pre 3:noun & agree_case(2,3)
#reagirati z monomerom
#"V.*" "S.*" "Nc.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)-x"
1:verb 2:noun_accusative 3:noun_genitive
#preprečiti flokulacijo nanodelcev
#"V.*" "Nc..a.*" "Nc..g.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)-x"
1:verb 2:noun_accusative 3:noun_genitive
#doseči temperaturo sintranja
#"V.*" "Nc..a.*" "Nc.*g.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)-x"
1:verb 2:conj 3:noun
#delovati kot regulator
#"V.*" "C.*" "Nc.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)_%(3.lc)-x"
1:pre 2:noun 3:verb
#v tisku uveljavljati
#"S.*" "Nc.*" "V.*"

*COLLOC "%(1.lc)_%(2.lc)_%(3.lc)-x"
1:pre 2:adv 3:verb
#za etično šteti
#"S.*" "R.*" "V.*"

=2terms

*COLLOC "%(1.gender_lemma)_%(2.lemma_lc)-x"
1:adj 2:noun & agree(1,2)
#tiskarsko sito
#"A.*" "Nc.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)-x"
1:noun 2:noun_genitive
#intenziteta nihanja
#"Nc.*" "Nc..g.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)-x"
1:verb 2:noun_accusative
#preprečiti flokulacijo
#"V.*" "Nc..a.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)-x"
1:verb 2:noun_instrumental
#ustreza priporočilu
#"V.*" "Nc..i.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)-x"
1:verb 2:adv
#škropiti takoj
#"V.*" "R.*"

*COLLOC "%(1.lemma_lc)_%(2.lc)-x"
1:adv 2:verb
#skrbno nadzirati
#"R.*" "V.*"

=1terms

*COLLOC "%(1.lemma_lc)-x"
1:noun

How to upload a term grammar?

Please send your term definition to support@sketchengine.eu and we will upload it to your corpus.

Testing a new term grammar

Recompile your corpus with the new term definition (see the compile corpus documentation), then extract terms as described on the Term extraction page.