Regular expressions are a convention for using special characters to stand for unspecified letters or numbers. They are used to set criteria for strings of characters, e.g. words or tags, which share a common pattern, e.g. start the same way, end the same way or contain certain characters.

Regular expressions are used mainly inside CQL, in word lists and n-grams.

This page gives only a few basic examples. For more, please refer to Wikipedia, try our regular expressions exercises or this interactive course.

Wild cards

Wild cards are not regular expressions but users know them from other software. They are only supported in the simple concordance search.

Using wild cards in simple concordance search

Only in simple concordance search, the asterisk (*) and question mark (?) can be used like this:

asterisk (*) stands for zero or more characters
test* will find
test, tests, tested, testing

c*t will find
CT, cut, cat, craft, construct

question mark (?) stands for exactly 1 character
test? will find
tests, Testa, testy
but will not find
test

c?t will find these lemmas
cat, cut, cot
Note: simple search always treats each search word as a lemma, so c?t will search for the lemmas cut, cat and cot. These lemmas produce results which include all their word forms. The final concordance will thus show: cut, cutting, cat, cats, cot, cots, etc.

To search for a literal asterisk or question mark, use \* and \?
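
The two wild cards behave much like shell-style patterns, so they can be imitated with Python's standard fnmatch module. This is only a rough sketch: the word list is invented for illustration, and the lemma expansion performed by the simple search is not modelled.

```python
from fnmatch import fnmatchcase

# Invented sample words, not corpus data
words = ["test", "tests", "testy", "tested", "cat", "cut", "craft"]

# * stands for zero or more characters
star_hits = [w for w in words if fnmatchcase(w, "test*")]

# ? stands for exactly one character
qmark_hits = [w for w in words if fnmatchcase(w, "test?")]

print(star_hits)   # ['test', 'tests', 'testy', 'tested']
print(qmark_hits)  # ['tests', 'testy']
```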

Regular expressions

In all the other concordance searches and in word lists, regular expressions are used in their standard form.

Video manual

This video introduces searching words with an apostrophe and using operators * and ?.

Note: to find “cannot”, type “can not”, because “not” is separated during text pre-processing (the video does not cover this search completely).

dot ‘ . ‘

A dot stands for a single unspecified character.

regular expression    matching result(s)
w.n                   win won wen wun wan
ca.                   cat car cap cab can
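
The dot can be tried out in Python's re module. The sketch below assumes that a pattern has to cover the whole word, as the table suggests, which is why it uses re.fullmatch; the word list is made up.

```python
import re

# Invented sample words
words = ["win", "won", "wen", "wn", "wean", "Win"]

# The dot must be filled by exactly one character
dot_hits = [w for w in words if re.fullmatch(r"w.n", w)]

print(dot_hits)  # ['win', 'won', 'wen']
```

"wn" and "wean" are rejected because the dot stands for exactly one character, and "Win" is rejected because matching is case sensitive.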

question mark ‘ ? ‘

The question mark stands for zero or one occurrence of the preceding character, i.e. it makes that character optional.

regular expression    matching result(s)
be?t                  bt bet
                      (but will not find beet beeet beeeet)
bet?                  be bet
                      (but will not find bets betting)
.?at                  at hat bat cat mat
                      (zero or one unspecified character at the beginning)
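
A small check of the optional character in Python, again assuming whole-word matching via re.fullmatch and using invented word lists:

```python
import re

# be?t: the e may be present once or be missing
optional_e = [w for w in ["bt", "bet", "beet", "beeet"]
              if re.fullmatch(r"be?t", w)]

# .?at: at most one arbitrary character before "at"
optional_any = [w for w in ["at", "hat", "bat", "chat"]
                if re.fullmatch(r".?at", w)]

print(optional_e)    # ['bt', 'bet']
print(optional_any)  # ['at', 'hat', 'bat']
```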

asterisk ‘ * ‘

The asterisk stands for zero or more occurrences of the preceding character.

regular expression    matching result(s)
co*l                  cl col cool coool cooool
hallo*                hall hallo halloo hallooo halloooo
c.*ing                words starting with c- and ending with -ing (i.e. having any number of unspecified characters between c and ing)
                      cycling camping cutting cooking contemplating
*ool                  produces an error, no character precedes the asterisk
c.*                   word beginning with the letter c
                      (c is followed by any number of any characters)
.*ed                  word ending with -ed
                      (the word starts with any number of any characters)
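
The rows above can be reproduced with a short Python sketch. The helper name matching and the word list are hypothetical, and re.fullmatch is used on the assumption that the pattern must cover the whole word.

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Invented sample words
words = ["cl", "col", "cool", "cycling", "camping", "cooked", "walked"]

col_hits = matching(r"co*l", words)    # ['cl', 'col', 'cool']
ing_hits = matching(r"c.*ing", words)  # ['cycling', 'camping']
ed_hits = matching(r".*ed", words)     # ['cooked', 'walked']

# A leading asterisk has nothing to repeat, so Python rejects the pattern
try:
    re.compile(r"*ool")
    leading_star_ok = True
except re.error:
    leading_star_ok = False

print(col_hits, ing_hits, ed_hits, leading_star_ok)
```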

digits ‘ \d ‘ and \D

\d stands for a digit, i.e. one of the characters 0-9; \D stands for any non-digit character.

regular expression    matching result(s)
b\d                   b1 b2 b3 b4
b\d*                  b b1 b12 b89 b43958
                      (zero or more digits after b)
\d\db                 58b 46b 89b
                      (b preceded by two digits)
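
The digit classes can be checked the same way in Python. This is a sketch under the assumption of whole-word matching; the helper matching and the sample strings are made up for the example.

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Invented sample strings
words = ["b", "b1", "b12", "bb", "58b", "5b", "b43958"]

one_digit = matching(r"b\d", words)     # ['b1']
any_digits = matching(r"b\d*", words)   # ['b', 'b1', 'b12', 'b43958']
two_before = matching(r"\d\db", words)  # ['58b']
non_digit = matching(r"b\D", words)     # ['bb']

print(one_digit, any_digits, two_before, non_digit)
```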

range ‘ [ ] ‘

use square brackets to specify a list or range; the brackets stand for exactly one character taken from that list or range
[bmpg] stands for b OR m OR p OR g
[a-d] stands for one letter from a to d
[3-5] stands for one digit from 3 to 5

regular expression    matching result(s)
[mpgb]et              met pet get bet
m[2-5]                m2 m3 m4 m5
m[2-5]*               m m22 m52 m3425 m23453234 m222345
                      (m followed by zero or more digits between 2 and 5)
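
A brief Python sketch of lists and ranges, assuming whole-word matching with re.fullmatch. The helper matching and the sample strings are hypothetical.

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Invented sample strings
words = ["met", "pet", "set", "m2", "m6", "m", "m3425", "m27"]

list_hits = matching(r"[mpgb]et", words)   # ['met', 'pet']
range_hits = matching(r"m[2-5]", words)    # ['m2']
repeat_hits = matching(r"m[2-5]*", words)  # ['m2', 'm', 'm3425']

print(list_hits, range_hits, repeat_hits)
```

"m6" and "m27" are rejected because 6 and 7 lie outside the range 2-5.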

not ‘ ^ ‘

use ^ to indicate that the character(s) should not be included; the ^ and the characters have to be enclosed in square brackets

regular expression    matching result(s)
[^m]et                pet get bet let
                      (but will not find met)
[^mpg]et              set let
                      (but will not find met pet get)
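
Negated brackets still stand for exactly one character, which the following Python sketch illustrates (whole-word matching assumed, word list invented):

```python
import re

# Invented sample words
words = ["met", "pet", "get", "set", "let", "et"]

not_m = [w for w in words if re.fullmatch(r"[^m]et", w)]
not_mpg = [w for w in words if re.fullmatch(r"[^mpg]et", w)]

print(not_m)    # ['pet', 'get', 'set', 'let']
print(not_mpg)  # ['set', 'let']
```

"et" is not found by either pattern: the bracket must be filled by some character, just not one of the excluded ones.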

or ‘ | ‘

the pipe | is used to indicate OR

regular expression    matching result(s)
get|met               will find lines which contain the word get OR the word met
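
In Python the alternation can be sketched as follows, assuming each alternative has to match a whole word; the word list is invented.

```python
import re

# Invented sample words
words = ["get", "met", "pet", "gemet"]

either = [w for w in words if re.fullmatch(r"get|met", w)]

print(either)  # ['get', 'met']
```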

plus ‘+’

the plus stands for ‘one or more repetitions of the preceding character’

regular expression    matching result(s)
hallo+                hallo halloo hallooo hallooooooooo
                      (but not hall)
.+at                  bat, great, format, cat
                      (but not ‘at’; to include ‘at’, use .*at)
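
The difference between + and * shows up clearly in a small Python comparison. This is a sketch assuming whole-word matching; the helper matching and the words are hypothetical.

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Invented sample words
words = ["hall", "hallo", "halloo", "at", "bat", "great"]

plus_o = matching(r"hallo+", words)  # ['hallo', 'halloo']
plus_any = matching(r".+at", words)  # ['bat', 'great']
star_any = matching(r".*at", words)  # ['at', 'bat', 'great']

print(plus_o, plus_any, star_any)
```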

case sensitivity (?i)

regular expressions are always case sensitive, i.e. Bill is different from bill. To make an expression case insensitive, put a question mark and the letter i in brackets in front of it, like this:

regular expression    matching result(s)
(?i)bill              bill Bill
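
Python's re module spells its inline case-insensitivity flag (?i). The sketch below assumes the flag behaves the same way in the corpus engine and that the whole word must match; the word list is invented.

```python
import re

# Invented sample words
words = ["bill", "Bill", "BILL", "will"]

# Without the flag only the exact lower-case form matches
sensitive = [w for w in words if re.fullmatch(r"bill", w)]

# With (?i) at the start, every letter of the pattern ignores case
insensitive = [w for w in words if re.fullmatch(r"(?i)bill", w)]

print(sensitive)    # ['bill']
print(insensitive)  # ['bill', 'Bill', 'BILL']
```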

repetition { }

use curly brackets to indicate repetition of the preceding character

regular expression    matching result(s)
halo{3}               halooo
                      (exactly 3 repetitions of the letter o)
hallo{2,4}            halloo hallooo halloooo
                      (from 2 to 4 repetitions of the letter o)
.{6}                  anyone bottle
                      (words consisting of any 6 characters; it is equivalent to typing 6 dots ......)
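
Counted repetition can be verified with a few lines of Python. As before, this is a sketch assuming whole-word matching, with a hypothetical helper and invented words.

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Invented sample words with different numbers of o
o_words = ["haloo", "halooo", "halloo", "hallooo", "halloooo", "hallooooo"]

exactly_three = matching(r"halo{3}", o_words)    # ['halooo']
two_to_four = matching(r"hallo{2,4}", o_words)   # ['halloo', 'hallooo', 'halloooo']

# .{6} accepts any six characters, so the seven-letter word is rejected
six_chars = matching(r".{6}", ["anyone", "bottle", "cat", "playmat"])

print(exactly_three, two_to_four, six_chars)
```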

grouping ( )

any part of a regular expression can be surrounded by parentheses to make it a single unit onto which other regular expressions can be applied

regular expression    matching result(s)
(dis)?connect         connect disconnect
                      (question mark makes the preceding element ‘(dis)’ optional)
(bla){3,4}            blablabla
                      blablablabla
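
A Python sketch of grouping, where the operator after the closing parenthesis applies to the whole group rather than to a single character. Whole-word matching is assumed and the word lists are invented.

```python
import re

# The group (dis) is optional as a unit
optional_prefix = [w for w in ["connect", "disconnect", "reconnect"]
                   if re.fullmatch(r"(dis)?connect", w)]

# The group (bla) must be repeated three or four times
bla_words = ["blabla", "blablabla", "blablablabla", "blablablablabla"]
repeated_group = [w for w in bla_words if re.fullmatch(r"(bla){3,4}", w)]

print(optional_prefix)  # ['connect', 'disconnect']
print(repeated_group)   # ['blablabla', 'blablablabla']
```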

escaping

to search for the characters . ? * which already have a special function in regular expressions, you have to put a backslash in front of them. This is called escaping (e.g. you have to escape a question mark). The characters $ and # in part-of-speech tags also have to be escaped.

regular expression    matching result
.                     a b c d e f g h etc. (any single character)
\.                    .
ok?                   o ok (question mark makes the preceding character optional)
ok\?                  ok?
\                     produces error, backslash escapes the following character but no such character exists
\\                    \
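
Escaping works the same way in Python's re module, which also offers re.escape to add the backslashes automatically. The sketch assumes whole-word matching; whether the corpus engine rejects a lone backslash in the same manner as Python is an assumption.

```python
import re

# Unescaped, the question mark makes the k optional
plain_matches_o = bool(re.fullmatch(r"ok?", "o"))           # True
plain_matches_literal = bool(re.fullmatch(r"ok?", "ok?"))   # False

# Escaped, the question mark is an ordinary character
escaped_matches_literal = bool(re.fullmatch(r"ok\?", "ok?"))  # True

# re.escape inserts the backslash for us
auto_escaped = re.escape("ok?")

# A lone backslash has nothing to escape, so the pattern is invalid
try:
    re.compile("\\")
    lone_backslash_ok = True
except re.error:
    lone_backslash_ok = False

print(plain_matches_o, plain_matches_literal, escaped_matches_literal)
print(auto_escaped, lone_backslash_ok)
```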

backreferences

Since manatee 2.65, it is possible to place brackets around one or several parts of a regular expression and refer to those parts later. The first part in brackets is referred to with the number 2 (not 1!), the second with the number 3, etc.

regular expression    matching result(s)
(abra)kad\2           abrakadabra
                      (the number must be escaped)
(a)(b)(c)\4\3\2       abccba
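
In Python's re module, bracketed parts are numbered from 1, so the same patterns are normally written with \1. The sketch below also wraps each pattern in one extra outer group purely to reproduce the numbering described above; that manatee does something similar internally is an assumption, not a documented fact.

```python
import re

# Standard Python numbering: the first group is \1
python_style = bool(re.fullmatch(r"(abra)kad\1", "abrakadabra"))

# With a hypothetical outer group, the first inner group becomes \2
shifted_style = bool(re.fullmatch(r"((abra)kad\2)", "abrakadabra"))

# Three groups referenced in reverse order produce a mirrored string
mirrored = bool(re.fullmatch(r"((a)(b)(c)\4\3\2)", "abccba"))
not_mirrored = bool(re.fullmatch(r"((a)(b)(c)\4\3\2)", "abccab"))

print(python_style, shifted_style, mirrored, not_mirrored)
```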