Users with a local installation of Sketch Engine can run the following commands on Linux.

Overview of all command line tools

lexonomyCreateEntries dumpattrrev lsclex parencodevert genhist tokens2dict
lexonomyMakeDict dumpattrtext lscngr ske hashws virtws
ocd-mkcoll dumpstructrng manateesrv vertfork lslexarf dumpbits
ocd-mkdefs dumpwmap mkalign wm2thes lsslex dumpdrev
ocd-mkdict dumpwmrev mkdynattr addwcattr mkbidict dumpdtext
ocd-mkgdex encodevert mknormattr compilecorp mklcm lsbgr
ocd-mkhwds-plain extrms mknorms concinfo mkstats lslex
ocd-mkhwds-terms filterwm mkregexattr corpcheck mksubc mkbgr
ocd-mkthes genbgr mksizes corpquery mktrends mkdrev
ocd-mkwsi genfreq mkthes decodevert par2tokens mkdtext
setupbonito genngr mkvirt dumpalign parse2wmap mklex
biterms genws mkwmap dumpthes parws
corpinfo lsalsize mkwmrank dumpws sconll2sketch
devirt lscbgr ngrsave freqs sconll2wmap

Command line tools for n-grams

There are a number of utilities in Finlib/Manatee that make it easy to generate and store n-grams from corpora efficiently. The utilities can be divided into three groups according to their features:

Generating bigrams from a compiled corpus (genbgr, mkbgr, lsbgr, lscbgr)

Features:

  • bigram generation, storing and viewing from a compiled corpus
  • no corpus size limit

Usage:

The genbgr and mkbgr tools are used for generating and storing bigrams, respectively:

genbgr CORPUS ATTR MINFREQ | mkbgr BGRFILE

where CORPUS is the registry name/path of the corpus, ATTR is the attribute from which the bigrams should be generated, MINFREQ is the minimum frequency of a bigram, and BGRFILE is the prefix for the bigram files, usually ATTR.bgr.

To view stored bigrams, use the lsbgr tool:

lsbgr BGRFILE [FIRST_ID]

where BGRFILE is the same path as given above and the optional FIRST_ID argument selects the first bigram ID to be shown (otherwise all bigrams are listed).

Example:

>genbgr susanne word 1 | mkbgr word.bgr
mkbgr word.bgr[1]: stream sorted, #parts: 1
mkbgr word.bgr[2]: temporary files renamed

>ls | grep word.bgr
word.bgr.cnt
word.bgr.idx

>lsbgr word.bgr | head -10
0       1       1
0       14      1
0       16      2
0       23      3
0       25      6
0       33      2
0       40      2
0       49      1
0       52      1
0       66      3

The three columns are the attribute IDs of the two tokens forming the bigram and the frequency of this bigram. To convert an attribute ID into the corresponding string, use the lsclex tool:

>echo -e '14\n1' | lsclex -n susanne word
14      election
1       Fulton
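The bigram data model behind these tools can be illustrated with a short Python sketch. The lexicon entries and the ID sequence below are invented for illustration; the real data lives in the compiled corpus files and is produced by genbgr/mkbgr and queried with lsbgr/lsclex:

```python
from collections import Counter

# Hypothetical lexicon mapping attribute IDs to strings (invented here;
# real mappings come from the compiled lexicon, queried via lsclex).
lexicon = {0: "The", 1: "Fulton", 14: "election", 16: "County"}

# A toy sequence of attribute IDs standing in for a corpus text.
ids = [0, 1, 0, 14, 0, 14, 0, 16]

# Count consecutive ID pairs, which is conceptually what genbgr does
# for a positional attribute.
bigrams = Counter(zip(ids, ids[1:]))

for (a, b), freq in sorted(bigrams.items()):
    # Same three columns as lsbgr: first ID, second ID, frequency.
    print(a, b, freq)
    # lscbgr-style view: resolve the IDs back to strings.
    print(lexicon[a], lexicon[b], freq)
```

The sorted ID-pair table corresponds to the lsbgr listing above; resolving IDs through the lexicon corresponds to the lsclex step.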

The lscbgr tool prints the bigram strings directly and offers more options:

lscbgr
Lists corpus bigrams
usage: lscbgr [OPTIONS] CORPUS_NAME [FIRST_ID]
     -p ATTR_NAME   corpus positional attribute [default word]
     -n BGR_FILE_PATH     path to data files
                          [default CORPPATH/ATTR_NAME.bgr]
     -f                   lists frequencies of both tokens
     -s t|mi|mi3|ll|ms|d  compute statistics:
             t     T score
             mi    MI score
             mi3   MI^3 score
             ll    log likelihood
             ms    minimum sensitivity
             d     logDice

Example:

>lscbgr -f -n word.bgr susanne | head
The     Fulton  1074    14      1
The     election        1074    36      1
The     "       1074    2311    2
The     place   1074    73      3
The     jury    1074    27      6
The     City    1074    29      2
The     charge  1074    17      2
The     September       1074    4       1
The     charged 1074    18      1
The     Mayor   1074    19      3
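The statistics offered by -s can be computed from the frequencies shown in the -f output. Below is a minimal Python sketch of three of the scores, using the commonly cited formulas (f_xy is the bigram frequency, f_x and f_y the token frequencies, N the corpus size); the frequencies and corpus size in the example call are invented, and whether lscbgr uses exactly these definitions should be checked against the source:

```python
from math import log2, sqrt

def t_score(fxy, fx, fy, n):
    # T-score: observed minus expected co-occurrence, scaled by sqrt of observed.
    return (fxy - fx * fy / n) / sqrt(fxy)

def mi_score(fxy, fx, fy, n):
    # Mutual information: log ratio of observed to expected frequency.
    return log2(fxy * n / (fx * fy))

def log_dice(fxy, fx, fy):
    # logDice: 14 + log2 of the Dice coefficient; independent of corpus size.
    return 14 + log2(2 * fxy / (fx + fy))

# Invented example: a bigram with f_xy=6, f_x=1074, f_y=27 in a
# hypothetical corpus of 150,000 tokens.
print(t_score(6, 1074, 27, 150_000))
print(mi_score(6, 1074, 27, 150_000))
print(log_dice(6, 1074, 27))
```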

Generating n-grams from a compiled corpus (genngr, lscngr)

Features:

  • concurrent n-gram generation (for any n), storing and viewing from a compiled corpus
  • corpus size up to 2 billion tokens (larger corpora may be processed, but only first 2 billion tokens will be used)

Usage:

The genngr tool is used for generating and storing n-grams, lscngr for viewing them:

genngr CORPUS ATTR MINFREQ NGRFILE

The parameters of genngr have the same semantics as those of genbgr/mkbgr above; the prefix path is usually ATTR.ngr.

lscngr [OPTIONS] CORPUS_NAME

Options can be set as follows:

     -p ATTR_NAME       corpus positional attribute (default: word)
     -n NGR_FILE_PATH   n-grams data file path
     -f                 lists frequencies
     -d STRUCT.ATTR     print STRUCT duplicates according to ATTR
     -m MIN_NGRAM       minimum n-gram size (default: 3)

Example:

>genngr susanne word 1 word.ngr
Preparing text
Creating suffix array
Creating LCP array
Saving LDIs

>ls | grep word.ngr
word.ngr.freq
word.ngr.lex
word.ngr.lex.idx
word.ngr.mm
word.ngr.rev
word.ngr.rev.cnt
word.ngr.rev.cnt64
word.ngr.rev.idx

>lscngr -f -n word.ngr susanne | head -10
2       3,4      The jury said | it     2       3       7
2       2,3      The grand | jury       2       6       9
2       3,3      The other ,    8       7       195
3       3,3      The fact that  5       27      53
2       3,3      The fact is    5       2       53
2       2,3      The purpose | of       2       7       18
2       3,3      The man was    5       6       169
2       4,4      The Charles Men ,      5       2       5
5       2,3      The Charles | Men      5       5       25
2       3,3      The New York   3       24      69

The semantics of the columns in the output listed above are as follows:

  1. n-gram frequency
  2. minimum, maximum length of the n-gram
  3. first 20 tokens of the n-gram; a vertical bar (“|”) is placed after the minimum-th word of the n-gram

The following columns are listed only with the -f option. Writing the n-gram as a concatenation x·y·z, where x is its first token, z its last token and y the middle part:

  1. frequency of the xy (n-1)-gram
  2. frequency of the yz (n-1)-gram
  3. frequency of the y (n-2)-gram
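These extra frequency columns can be mimicked with a short Python sketch: count all n-grams of the relevant lengths in a token stream and look up the prefix, suffix and middle of a given n-gram. The token stream below is invented:

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Frequency table of all n-grams of a given length.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy token stream (invented for illustration).
tokens = "the fact that the fact is that the fact that".split()

ngram = ("the", "fact", "that")   # the n-gram x.y.z, here n = 3
tri = ngram_counts(tokens, 3)
bi = ngram_counts(tokens, 2)
uni = ngram_counts(tokens, 1)

print(tri[ngram])        # frequency of the n-gram itself
print(bi[ngram[:-1]])    # frequency of the prefix xy (n-1)-gram
print(bi[ngram[1:]])     # frequency of the suffix yz (n-1)-gram
print(uni[ngram[1:-1]])  # frequency of the middle y (n-2)-gram
```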

If the optional -d STRUCT.ATTR option is given, a list of these structure attributes is printed in addition to the output above, indicating which structures share a common n-gram (n is 40 by default but can be set to a larger value using -m).

E.g.

lscngr -m 100 -f -d bncdoc.id bnc2

prints

>646#624>HHM HHK

at the end, saying that documents 646 and 624 (with IDs “HHM” and “HHK”) share a common 100-gram.
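The duplicate detection behind -d can be sketched in Python as an intersection of the n-gram sets of two documents; the documents and the (short) n here are invented, while the real tool works over the compiled n-gram index:

```python
def ngrams(tokens, n):
    # The set of all n-grams occurring in a token sequence.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Two invented documents sharing a long verbatim passage.
doc_a = "a b c d e f g h i j unique_a".split()
doc_b = "x y a b c d e f g h i j".split()

n = 10
shared = ngrams(doc_a, n) & ngrams(doc_b, n)
print(sorted(shared))
```

A non-empty intersection means the two documents share a common n-gram, which is what the `>646#624>HHM HHK` line reports.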

Generating n-grams from a vertical file (ngrsave)

Features:

  • concurrent n-gram generation (for any n up to the given maximum) from a vertical file
  • direct storing in a text file
  • no corpus size limit

Usage:

The ngrsave utility generates n-grams from a vertical file and stores them in a single text file:

usage: ngrsave VERT_FILE SAVE_FILE STOPLIST_FILE [DOC_SEPARATOR NGRAM_SIZE IGNORE_PUNC]
       or
       ngrsave -c CORPUS ATTR SAVE_FILE STOPLIST_FILE [DOC_STRUCTURE NGRAM_SIZE IGNORE_PUNC]
       Prints all n-grams that occurred at least twice in the input VERT_FILE

STOPLIST_FILE    textfile with one stopword per line, n-grams will not contain any stopwords
                 (use - as STOPLIST_FILE for omitting it)
VERT_FILE        input vertical file to be processed, use - for standard input
CORPUS           corpus registry filename
ATTR             attribute name
SAVE_FILE        textfile where the output will be written
DOC_SEPARATOR    line prefix, e.g. '

Example:

>cut -f1 susanne.vert | ngrsave - susanne.ngrsave -

>head susanne.ngrsave.out
that    there   be      a       line    through P       which   meets   g       2       130 130 
the     case    in      which   g       is      a       curve   on      a       2       130 130 
was     stored  at      °       in      a       tube    equipped        with    a       2       123 123 
be      a       line    through P       which   meets   g       in      points  2       130 130 
at      °       in      a       tube    equipped        with    a       break   seal    2       123 123 
there   be      a       line    through P       which   meets   g       in      2       130 130 
He      handed  the     bayonet to      Dean    and     kept    the     pistol  2       136 136 
were    allowed to      stand   at      room    temperature     for     1       hr      2       126 126 
case    in      which   g       is      a       curve   on      a       quadric 2       130 130 
requires        that    there   be      a       line    through P       which   meets   2       130 130

The output contains all n-grams that occurred at least twice.
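The core behaviour (count n-grams, drop those containing a stopword, keep only those occurring at least twice) can be sketched in Python; the token stream, n-gram size and stopword set below are invented:

```python
from collections import Counter

def frequent_ngrams(tokens, n, min_freq=2, stopwords=frozenset()):
    # Count all n-grams, then keep those that meet the frequency
    # threshold and contain no stopword, as ngrsave does conceptually.
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {
        g: f for g, f in counts.items()
        if f >= min_freq and not (set(g) & stopwords)
    }

# Toy token stream (invented).
tokens = "to be or not to be that is the question to be or not".split()
result = frequent_ngrams(tokens, 3)
for gram, freq in sorted(result.items()):
    print(" ".join(gram), freq)
```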

Selected command line tools in more detail:

corpinfo

Prints basic information about a given corpus.

Usage: corpinfo [OPTIONS] CORPNAME

-d dump whole configuration

-p print corpus directory path

-s print corpus size

-w print corpus wordform counts

-g OPT print configuration value of option OPT

corpquery

Prints concordance of a given query

Usage: corpquery CORPUSNAME QUERY [ OPTIONS ]

Options:

-r ATTR reference attribute (default: None)

-c LEFT,RIGHT|BOTH left and right or both context length (default: 15)

-h LIMIT maximum number of results (default: -1)

-a ATTR1,ATTR2,... comma-separated list of attributes to be shown (default: word,lemma,tag)

-s STR1,STR2,... comma-separated list of structures to be shown (use struct.attr or struct.* to show structure attributes; default: s,p,doc)

-g GDEX_CONF use GDEX with a given GDEX_CONF configuration file (default: None; use - for the default configuration); use -h to set the result size (default: 100)

-m GDEX_MODULE_DIR GDEX module path (directory with gdex.py or gdex_old.py)
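What a concordance is can be illustrated with a minimal keyword-in-context sketch in Python. The tokens and the plain string match below are invented; the real tool evaluates CQL queries over a compiled corpus and supports the options listed above:

```python
def concordance(tokens, target, context=3, limit=-1):
    # Collect (left, match, right) windows for each occurrence of target,
    # mimicking corpquery's -c context and -h result limit.
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            hits.append((tokens[max(0, i - context):i],
                         tok,
                         tokens[i + 1:i + 1 + context]))
            if limit != -1 and len(hits) >= limit:
                break
    return hits

# Toy token stream (invented).
tokens = "the jury said the election was won by the mayor".split()
for left, match, right in concordance(tokens, "the", limit=2):
    print(" ".join(left), "<", match, ">", " ".join(right))
```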

lsclex

Lists lexicon of given corpus attribute

usage: lsclex [-snf] CORPUS ATTR

-s str2id -- strings from stdin translate to IDs

-n id2str -- IDs from stdin translate to strings

-f print frequencies

lsslex

Lists number of tokens for all structure attribute values

usage: lsslex CORPNAME STRUCTNAME STRUCTATTR

example: lsslex bnc bncdoc alltyp

freqs

Prints frequencies of words in a given context of a given query

usage: freqs CORPUSNAME 'QUERY' 'CONTEXT' LIMIT

default CONTEXT is 'word -1'; default LIMIT is 1

examples: freqs susanne '[lemma="house"]' 'word -1'

freqs susanne '[lemma="run"]' 'word/i 0 tag 0 lemma 1' 2

freqs susanne '[lemma="test"] []? [tag="NN.*"]' 'word/i -1>0' 0
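The default 'word -1' context can be sketched in Python: for every match, count the word immediately to its left and report the most frequent ones up to LIMIT. The token stream and the plain string match below are invented; the real tool evaluates CQL queries over a compiled corpus:

```python
from collections import Counter

def context_freqs(tokens, match, offset=-1, limit=1):
    # Count the token at the given offset relative to each match position,
    # mimicking freqs with a 'word -1'-style context.
    counts = Counter(
        tokens[i + offset]
        for i, tok in enumerate(tokens)
        if tok == match and 0 <= i + offset < len(tokens)
    )
    # LIMIT 0 is treated here as "list everything" (an assumption).
    return counts.most_common(limit if limit > 0 else None)

# Toy token stream (invented).
tokens = "my house the house a house my house".split()
print(context_freqs(tokens, "house"))
```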

corpcheck

Checks the validity of various corpus attributes and the correctness of the compiled corpus data. Any issues found are reported in a clear, human-readable format on standard error.

Usage: corpcheck CORPNAME