Finlib/Manatee provides a number of utilities for efficiently generating and storing n-grams from corpora. The utilities fall into three groups, depending on their features:

Generating bigrams from a compiled corpus (genbgr, mkbgr, lsbgr, lscbgr)

Features:

  • bigram generation, storing and viewing from a compiled corpus
  • no corpus size limit

Usage:

The genbgr and mkbgr tools are used for generating and storing bigrams, respectively:

genbgr CORPUS ATTR MINFREQ | mkbgr BGRFILE

where CORPUS is the registry name or path of the corpus, ATTR is the attribute used for generating the bigrams, MINFREQ is the minimum frequency of a bigram, and BGRFILE is the prefix for the bigram files (usually ATTR.bgr).
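To make the role of MINFREQ concrete, the following minimal Python sketch counts adjacent token pairs and keeps only those that reach a minimum frequency. It only illustrates the idea and is not how genbgr is implemented; the function name count_bigrams is made up for this example, and treating MINFREQ as an inclusive threshold is an assumption.

```python
from collections import Counter


def count_bigrams(tokens, min_freq=1):
    """Count adjacent token pairs; keep pairs seen at least min_freq times.

    Illustration only: count_bigrams is a hypothetical helper, and the
    inclusive threshold is an assumption about what MINFREQ means.
    """
    counts = Counter(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_freq}


tokens = "the jury said the jury was right".split()
for (first, second), freq in count_bigrams(tokens, min_freq=2).items():
    print(first, second, freq, sep="\t")
```

With a threshold of 1, every bigram that occurs in the input is kept, which corresponds to the MINFREQ value used in the example below.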

To view the stored bigrams, use the lsbgr tool:

lsbgr BGRFILE [FIRST_ID]

where BGRFILE is the same path as given above and the optional FIRST_ID argument selects the first bigram ID to be shown (otherwise all bigrams are listed).

Example:

>genbgr susanne word 1 | mkbgr word.bgr
mkbgr word.bgr[1]: stream sorted, #parts: 1
mkbgr word.bgr[2]: temporary files renamed

>ls | grep word.bgr
word.bgr.cnt
word.bgr.idx

>lsbgr word.bgr | head -10
0       1       1
0       14      1
0       16      2
0       23      3
0       25      6
0       33      2
0       40      2
0       49      1
0       52      1
0       66      3

The three columns are the attribute IDs of the two tokens forming the bigram, followed by the frequency of the bigram. To convert an attribute ID into the corresponding string, use the lsclex tool:

>echo -e '14\n1' | lsclex -n susanne word
14      election
1       Fulton
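The two outputs can be joined in a few lines of Python. The sketch below assumes that the lsbgr and lsclex -n outputs have been captured as text and that their columns are separated by whitespace, as in the listings above; the function names parse_lexicon and resolve_bigrams are hypothetical.

```python
def parse_lexicon(text):
    """Map attribute ID -> string from lines of the form 'ID<whitespace>STRING'."""
    lexicon = {}
    for line in text.splitlines():
        if line.strip():
            ident, value = line.split(None, 1)
            lexicon[int(ident)] = value.strip()
    return lexicon


def resolve_bigrams(bgr_text, lexicon):
    """Yield (first, second, freq); IDs missing from the lexicon stay numeric."""
    for line in bgr_text.splitlines():
        if not line.strip():
            continue
        first, second, freq = (int(x) for x in line.split())
        yield lexicon.get(first, first), lexicon.get(second, second), freq


# Sample data copied from the listings above.
lexicon = parse_lexicon("14      election\n1       Fulton\n")
sample = "0       1       1\n0       14      1\n"
for row in resolve_bigrams(sample, lexicon):
    print(*row, sep="\t")
```

Only IDs 14 and 1 were looked up here, so ID 0 is printed unchanged.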

The lscbgr tool prints the bigram strings directly and offers more options:

lscbgr
Lists corpus bigrams
usage: lscbgr [OPTIONS] CORPUS_NAME [FIRST_ID]
     -p ATTR_NAME   corpus positional attribute [default word]
     -n BGR_FILE_PATH     path to data files
                          [default CORPPATH/ATTR_NAME.bgr]
     -f                   lists frequencies of both tokens
     -s t|mi|mi3|ll|ms|s  compute statistics:
             t     T score
             mi    MI score
             mi3   MI^3 score
             ll    log likelihood
             ms    minimum sensitivity
             d     logDice

Example:

>lscbgr -f -n word.bgr susanne | head
The     Fulton  1074    14      1
The     election        1074    36      1
The     "       1074    2311    2
The     place   1074    73      3
The     jury    1074    27      6
The     City    1074    29      2
The     charge  1074    17      2
The     September       1074    4       1
The     charged 1074    18      1
The     Mayor   1074    19      3
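The three frequencies printed with -f (first token, second token, bigram), together with the corpus size, are the usual inputs to association scores such as those selectable with -s. As a rough illustration, the sketch below uses the commonly cited definitions of T-score, MI, MI^3 and logDice. Whether lscbgr computes exactly these formulas is an assumption, and the corpus size N in the example is a made-up value, not the real size of the susanne corpus.

```python
import math


def t_score(f_ab, f_a, f_b, n):
    """T-score: (observed - expected) / sqrt(observed)."""
    return (f_ab - f_a * f_b / n) / math.sqrt(f_ab)


def mi_score(f_ab, f_a, f_b, n):
    """Mutual information: log2 of observed over expected frequency."""
    return math.log2(f_ab * n / (f_a * f_b))


def mi3_score(f_ab, f_a, f_b, n):
    """MI^3: like MI, but with the bigram frequency cubed."""
    return math.log2(f_ab ** 3 * n / (f_a * f_b))


def log_dice(f_ab, f_a, f_b):
    """logDice: 14 + log2 of the Dice coefficient; independent of corpus size."""
    return 14 + math.log2(2 * f_ab / (f_a + f_b))


# Frequencies taken from the line "The jury 1074 27 6" above.
N = 150_000  # hypothetical corpus size, for illustration only
f_a, f_b, f_ab = 1074, 27, 6
print("T:", t_score(f_ab, f_a, f_b, N))
print("MI:", mi_score(f_ab, f_a, f_b, N))
print("MI3:", mi3_score(f_ab, f_a, f_b, N))
print("logDice:", log_dice(f_ab, f_a, f_b))
```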

Generating n-grams from a compiled corpus (genngr, lscngr)

Features:

  • concurrent n-gram generation (for any n), storing and viewing from a compiled corpus
  • corpus size up to 2 billion tokens (larger corpora may be processed, but only first 2 billion tokens will be used)

Usage:

The genngr tool is used for generating and storing n-grams, and the lscngr tool for viewing them:

genngr CORPUS ATTR MINFREQ NGRFILE

The parameters of genngr have the same semantics as those of genbgr/mkbgr above; the prefix path NGRFILE is usually ATTR.ngr.

lscngr [OPTIONS] CORPUS_NAME

Options can be set as follows:

     -p ATTR_NAME       corpus positional attribute (default: word)
     -n NGR_FILE_PATH   n-grams data file path
     -f                 lists frequencies
     -d STRUCT.ATTR     print STRUCT duplicates according to ATTR
     -m MIN_NGRAM       minimum n-gram size (default: 3)

Example:

>genngr susanne word 1 word.ngr
Preparing text
Creating suffix array
Creating LCP array
Saving LDIs

>ls | grep word.ngr
word.ngr.freq
word.ngr.lex
word.ngr.lex.idx
word.ngr.mm
word.ngr.rev
word.ngr.rev.cnt
word.ngr.rev.cnt64
word.ngr.rev.idx

>lscngr -f -n word.ngr susanne | head -10
2       3,4      The jury said | it     2       3       7
2       2,3      The grand | jury       2       6       9
2       3,3      The other ,    8       7       195
3       3,3      The fact that  5       27      53
2       3,3      The fact is    5       2       53
2       2,3      The purpose | of       2       7       18
2       3,3      The man was    5       6       169
2       4,4      The Charles Men ,      5       2       5
5       2,3      The Charles | Men      5       5       25
2       3,3      The New York   3       24      69

The meaning of the columns in the output listed above is as follows:

  1. n-gram frequency
  2. minimum, maximum length of the n-gram
  3. the first 20 tokens of the n-gram, with a vertical bar (“|”) after the minimum-th token of the n-gram

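The following columns are listed only with the -f option. Given an n-gram written as the concatenation xyiz, where x is the first token, z is the last token and yi is the sequence of tokens between them: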

  1. frequency of the xyi (n-1)-gram
  2. frequency of the yiz (n-1)-gram
  3. frequency of the yi (n-2)-gram
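The column layout described above can be turned into a small parser. The following Python sketch assumes that the lscngr -f output is whitespace-separated exactly as in the listing above; the names NgramLine and parse_lscngr_line are hypothetical. A corpus token that is itself a vertical bar would be indistinguishable from the marker, so only the first bar is treated as the marker.

```python
from typing import List, NamedTuple


class NgramLine(NamedTuple):
    freq: int                # n-gram frequency
    min_len: int             # minimum length of the n-gram
    max_len: int             # maximum length of the n-gram
    tokens: List[str]        # listed tokens, with the "|" marker removed
    bar_after: int           # number of tokens before the "|" marker
    freq_without_last: int   # frequency of the xyi (n-1)-gram
    freq_without_first: int  # frequency of the yiz (n-1)-gram
    freq_middle: int         # frequency of the yi (n-2)-gram


def parse_lscngr_line(line):
    """Parse one line of 'lscngr -f' output (assumed whitespace-separated)."""
    fields = line.split()
    freq = int(fields[0])
    min_len, max_len = (int(x) for x in fields[1].split(","))
    body = fields[2:-3]  # everything between the length pair and the 3 counts
    if "|" in body:
        bar_after = body.index("|")
        tokens = body[:bar_after] + body[bar_after + 1:]
    else:
        bar_after = len(body)
        tokens = body
    tail = [int(x) for x in fields[-3:]]
    return NgramLine(freq, min_len, max_len, tokens, bar_after, *tail)


# Line copied from the example output above.
parsed = parse_lscngr_line("2       3,4      The jury said | it     2       3       7")
print(parsed)
```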

If the optional -d STRUCT.ATTR option is given, a list of these structure attributes is printed in addition to the above output, showing which structures share a common n-gram (n is 40 by default and can be set to a larger value using -m).

For example,

lscngr -m 100 -f -d bncdoc.id bnc2

prints

>646#624>HHM HHK

at the end, saying that documents 646 and 624 (with IDs “HHM” and “HHK”) share a common 100-gram.
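Such a line can be split into its parts with a regular expression. The sketch below infers the format (two structure numbers separated by “#”, followed by the two attribute values separated by whitespace) from the single example above, so the pattern is an assumption and may not cover every line that lscngr prints; the names DUP_LINE and parse_duplicate_line are hypothetical.

```python
import re

# Pattern inferred from the one example line above; this is an assumption.
DUP_LINE = re.compile(r"^>(\d+)#(\d+)>(\S+)\s+(\S+)$")


def parse_duplicate_line(line):
    """Return ((num1, id1), (num2, id2)), or None if the line does not match."""
    match = DUP_LINE.match(line.strip())
    if match is None:
        return None
    num1, num2, id1, id2 = match.groups()
    return (int(num1), id1), (int(num2), id2)


print(parse_duplicate_line(">646#624>HHM HHK"))
```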

Generating n-grams from a vertical file (ngrsave)

Features:

  • concurrent n-gram generation (for any n up to the given maximum) from a vertical file
  • direct storing in a text file
  • no corpus size limit

Usage:

The ngrsave utility generates n-grams from a vertical file and stores them in a single text file:

usage: ngrsave VERT_FILE SAVE_FILE STOPLIST_FILE [DOC_SEPARATOR NGRAM_SIZE IGNORE_PUNC]
       or
       ngrsave -c CORPUS ATTR SAVE_FILE STOPLIST_FILE [DOC_STRUCTURE NGRAM_SIZE IGNORE_PUNC]
       Prints all n-grams that occurred at least twice in the input VERT_FILE

STOPLIST_FILE    textfile with one stopword per line, n-grams will not contain any stopwords
                 (use - as STOPLIST_FILE for omitting it)
VERT_FILE        input vertical file to be processed, use - for standard input
CORPUS           corpus registry filename
ATTR             attribute name
SAVE_FILE        textfile where the output will be written
DOC_SEPARATOR    line prefix, e.g. '<doc', that will be used for separating documents
                 If given, each n-gram is followed by its frequency together with the IDs
                 of the documents where it occurred
DOC_STRUCTURE    Same as above, but name of the structure, e.g. 'doc'
NGRAM_SIZE       maximum size of the n-gram (the n), defaults to 10
IGNORE_PUNC      disables ignoring punctuation by providing a 0 value
                 (any positive number means enable, the default)

Example:

>cut -f1 susanne.vert | ngrsave - - susanne.ngrsave "<doc"
Round: 0
   Preparing text
   Creating suffix array
   Saving n-grams

>head susanne.ngrsave.out 
that    there   be      a       line    through P       which   meets   g       2       130 130 
the     case    in      which   g       is      a       curve   on      a       2       130 130 
was     stored  at      °       in      a       tube    equipped        with    a       2       123 123 
be      a       line    through P       which   meets   g       in      points  2       130 130 
at      °       in      a       tube    equipped        with    a       break   seal    2       123 123 
there   be      a       line    through P       which   meets   g       in      2       130 130 
He      handed  the     bayonet to      Dean    and     kept    the     pistol  2       136 136 
were    allowed to      stand   at      room    temperature     for     1       hr      2       126 126 
case    in      which   g       is      a       curve   on      a       quadric 2       130 130 
requires        that    there   be      a       line    through P       which   meets   2       130 130

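The output contains all n-grams that occurred at least twice. Each line lists the tokens of the n-gram, followed by its frequency and, because a document separator was given, the IDs of the documents where it occurred.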
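For further processing, each output line can be split into tokens, frequency and document IDs. The Python sketch below assumes that the fields are tab-separated and that the document IDs share the last field, separated by spaces, which is what the listing above suggests but is not confirmed here. It also assumes that a document separator was given, so that the document ID field is present. The function name parse_ngrsave_line is hypothetical.

```python
def parse_ngrsave_line(line):
    """Split one ngrsave output line into (tokens, frequency, document IDs).

    Assumes tab-separated fields: the tokens, then the frequency, then one
    field holding the space-separated document IDs.
    """
    *tokens, freq, docs = line.rstrip("\n").split("\t")
    return tokens, int(freq), docs.split()


# Sample line rebuilt from the listing above, with tabs assumed as separators.
sample = "\t".join(["temperature", "for", "1", "hr", "2", "126 126 "])
tokens, freq, doc_ids = parse_ngrsave_line(sample)
print(tokens, freq, doc_ids)
```

Splitting on tabs rather than on arbitrary whitespace keeps numeric tokens such as “1” apart from the frequency column.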