2016/03/31

  • encodevert: call mknormattr according to MAPTO directive
  • added support for normalization attribute
  • ANTLR CQL grammar supports description definition

2.135.5

2016/02/28

  • tstquery: added queries on parallel corpora
  • tstquery: print executed queries
  • do not label aligned corpus query in WITHIN!/!WITHIN queries

2.135.4

2016/02/21

  • compilecorp: always move logfile into corpus path directory
  • compilecorp: improved error reporting to indicate actual line numbers

2.135

2016/01/30

  • encodevert: better handling of the cache of items added to the lexicon

2.134

2016/01/20

  • encodevert: dynamic lexicons cache sizes
  • reformat mkwmrank.cc
  • added bgr_abs_freq_coll association score
    • returns frequency of the first word of the collocation pair

2.133.4

2015/12/12

  • mktrends: finalize output files properly

2.133.3

2015/12/10

  • corpcheck: tolerate local path in INFOHREF

2.133.3

2015/12/10

  • mktrends: finalize output files properly

2.133.2

2015/12/07

  • fix handling of aligned corpora labels in Concordance

2.133.1

2015/12/03

  • KWICLines skip aligned corpora collocations

2.133

2015/12/02

  • CQL: added support for term queries using the term() operator
  • compilecorp: added --no-ske option, which is the default for NoSkE

2.132.1

2015/11/30

  • tstregexopt: takes attribute as another optional argument

2.132

2015/11/24

  • speed up RQinNode and RQcontainNode

2.131.3

2015/11/24

  • mknorms: speed up computation for subcorpora

2.131

2015/11/12

  • removed findPosAttr() functions
  • reformat corpinfo.cc

2.130.6

2015/11/12

  • fix !WITHIN <alignedcorpus>

2.130.5

2015/11/08

  • compilecorp: call mktrends with EPOCH_LIMIT set to 1
  • fix MAXKWIC set to 0 not being treated as unlimited

2.130.3

2015/11/04

  • mktrends: save subcorp data properly

2.130.2

2015/10/31

  • added NonEmptyRS for filtering empty RangeStream ranges

2.130

2015/10/25

  • KWICLines has new method is_defined() and short-circuits processing of undefined lines
  • added Concordance::filter_aligned() for filtering by aligned corpus

2.129

2015/09/21

  • mktrends: ca. 15x speed-up through heavier use of numpy

2.128.4

2015/09/10

  • updated CQL testsuite with current WS results on susanne

2.127

2015/08/04

  • compilecorp: added support for longest commonest match

2.126

2015/07/28

  • compilecorp: added support for trends computations
  • added mktrends script prepared by Ondřej Herman

2.125.2

2015/07/20

  • mkwmrank: computing scores for each gramrel is independent of other gramrels

2.124

2015/05/02

  • concordance automatically detects all collocations

2015/04/19

Bugfixes:

  • fix CQL inequality comparisons on dynamic attributes

2.121.2

2015/04/08

  • disable MULTIVALUE freqdist for positional attributes

2.121

2015/04/03

  • mkdynattr: no need to manually delete lexicon with new write_lexicon
  • added new DYNTYPE “freq” for dynamic attributes
  • compilecorp and parws pass WSMINHITS to mkwmap
  • mkwmap: added all options to usage
  • mkwmap: added -f option allowing filtering for minimum frequency
  • write_lexicon allows overwriting datafiles
  • compilecorp: hashws terms automatically
  • compilecorp: write manatee version to log

Bugfixes:

  • fix empty KWICLines structure context for empty KWIC

2.120.1

2015/03/29

Bugfixes:

  • genws: fix SEPARATEPAGE index for grammars using DUAL

2.120

2015/03/28

  • freqs: allow filtering by subcorpus
  • new freq_dist() attribute modifier “/n” for getting IDs instead of strings

Bugfixes:

  • fix regexp2ids/regexp2poss for patterns with escaped metacharacters
  • compilecorp: ‘skipping biterms’ message fixed

2.119

2015/03/23

  • genngr: allow setting min and max n-gram length from cmdline
  • genngr: limit maximum n-gram length to 30 by default

2.118

2015/03/22

Bugfixes:

  • fix build with gcc 4.4 (RHEL/CentOS 6)
  • fix ConcStream::find_beg()/find_end()

2.117

2015/02/24

  • create_subcorpus() takes an optional Structure argument

2.116

2015/02/23

  • dumpalign supports 1:1

2.115.3

2015/02/23

  • mkwmrank: fix segfault when datafiles cannot be opened
  • updated package specfiles to contain lsalsize

2.115.2

2015/02/10

  • updated tstquery gold results after word sketch format change
  • compilecorp: compute sizes after alignment
  • added lsalsize binary for listing alignment size of two corpora
  • mksizes: use lsalsize to compute alignment size

Bugfixes:

  • fix showing GDEX scores when references are up
  • Fix GDEX score display in concordance view
  • manatee: fix installing binaries on DEB
  • corpquery: fix parallel queries garbled by fake collocates

2.115.1

2015/02/10

  • manatee: script for bilingual term extraction

2.115

2015/01/21

  • CorpInfo may be modified and is exported into SWIG API
  • added dumpalign script for dumping parallel corpora

2.114

2015/01/18

  • CQL supports regular expressions in word sketch gramrels
  • added regexp2ids() for word sketch gramrels
  • added mklex for creating lexicons

2.113

2015/01/14

  • mkwmrank: added parameter for commonest match input
  • WSATTR defaults to lempos_lc -> lempos -> lemma_lc -> lemma -> DEFAULTATTR
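
The WSATTR fallback chain above can be read as "take the first attribute in the list that the corpus actually has". A minimal Python sketch of that reading follows; the function name and its arguments are hypothetical illustrations, not part of the Manatee API.

```python
# Hypothetical helper illustrating the fallback order; not Manatee code.
WSATTR_CANDIDATES = ("lempos_lc", "lempos", "lemma_lc", "lemma")


def pick_wsattr(available_attrs, default_attr):
    """Return the first candidate present in the corpus, else DEFAULTATTR."""
    available = set(available_attrs)
    for name in WSATTR_CANDIDATES:
        if name in available:
            return name
    # None of the preferred attributes exists: fall back to DEFAULTATTR.
    return default_attr
```

For a corpus with attributes word, lemma and lempos, this sketch would select lempos; for a corpus with only word, it would return whatever DEFAULTATTR is.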

2.111.8

2014/11/23

  • updated tstquery gold results after word sketch format change

Bugfixes:

  • genws: fix handling invalid STRUCTLIMIT

2.111.6

2014/11/17

  • mkwmap works with empty input

Bugfixes:

  • skell: fixed typo in jQuery

2.111.3

2014/10/21

  • 2x faster commonest_match.py

2.110

2014/09/21

  • added defaults for SIMPLEQUERY corpus directive; it is [A="%s" | B="%s"]
  • CQL supports different attributes in global conditions
  • CQL supports !within and !containing operators
  • genws: STRUCTLIMIT may be arbitrary CQL query
  • added mkregexattr for compiling regex dynamic attribute
  • new version of word sketch data files

2.110

2014/08/25

  • added jQuery UI javascript, css and images
  • added create_subcorpus() for arbitrary CQL query
  • create_subcorpus() takes directly RangeStream instead of query
  • mksubc supports creating subcorpora from CQL query

Bugfixes:

  • fix parws lexicon verification for new style TRINARY templates

2.109.8

2014/08/13

Bugfixes:

  • fix build with gcc 4.4

2.109.7

2014/07/28

  • parws: use single batch for TRINARY and COLLOC gramrels
  • compilecorp honours TMPDIR environment variable

Bugfixes:

  • mkvirt: fix freqs computation overflowing at int size
  • genngr: fix maximum allowed corpus size to 2^31-2

2.109.6

2014/07/01

  • genws: set COLLOC lexicon hash size to 500k items
  • printer icon shall be part of NoSkE

Bugfixes:

  • corpquery: fix marking KWIC in output

2.109.2

2014/06/18

  • compilecorp does not assume “word” attribute existence
  • corpquery does not assume “word” attribute

2.109

2014/06/16

  • MAXKWIC restriction placed into Concordance

Bugfixes:

  • fixed a bug in selecting gramrels

2.108

2014/06/13

  • added new dynamic function ascii for transliteration
  • mkwmap reserves file descriptors for joined set of files
  • Corpcheck checks if file “sizes” exists in PATH
  • changed support mail

2.107

2014/04/16

  • compilecorp support for bilingual dictionaries
  • added MAXKWIC size for KWICLines, defaults to 100

2.106

2014/02/27

  • added corpcheck utility for checking corpora sanity
  • added wsdump script for dumping of word sketches

2.103

2014/02/09

  • added sconll2sketch and sconll2wmap
  • compilecorp support for sketches from (S)CONLL

2.97

2013/12/28

  • mkdynattr: fix dynamic structure attributes of virtual corpora
  • mkstats support for n-grams on subcorpora

2.96

2013/11/10

  • added dumpthes — simple dumping of thesaurus
  • CQL support for similarity search in thesaurus

2.95

2013/11/03

  • added new dynamic function utf8capital
  • added new dynamic function utf8uppercase

2.94

2013/11/01

  • added new dynamic function getnbysep
  • fix mkvirt failing if virtdef contains single corpus

2.92

2013/10/23

  • encodevert compiles dynamic structure attributes
  • support for complement subcorpora

2.87

2013/09/29

  • faster implementation of frq and docf computation
  • choose first non-dynamic attribute as default DEFAULTATTR
  • mkvirt accepts attribute list via -a option
  • added devirt script for corpus devirtualization
  • added parencodevert script for parallel corpus encoding
  • redesign of mksubc and (sub)corpora statistics creation
  • corpus configuration file need not end with a newline
  • faster computation of ARF + ALDF

2.86

2013/08/14

  • full support for attributes of structures in virtual corpora
  • genws reports progress with -p option

2.85

2013/08/07

  • fix segfault when opening a virtual corpus with unavailable virtdef
  • mkvirt automatically creates dynamic attributes
  • virtdef file may use ‘$’ as a segment end to denote the corpus end position
  • fix corpinfo so that it dumps valid configuration file format
  • added mksizes script for compiling sizes
  • compilecorp support for creating word sketch hashes

2.84

2013/06/06

  • compilecorp accepts --parallel=N option (number of parallel jobs)
  • compilecorp support for virtual corpora
  • mksubc writes detailed progress only with --debug
  • added CQL for range of positions, e.g. #20-50
  • CQL frequency function accepts values over 2^31
  • implemented CQL for word sketch seeks
  • added CQL support for querying word sketches by triples
  • CQL supports new positional functions “swap” and “ccoll”

2.83.3

2013/06/05

  • FIX: fix missing throw statements for create_subcorpus() in SWIG API
  • FIX: fix evaluating empty concordance collocation

2.83.2

2013/05/26

  • FIX: fix SEPARATEPAGE name being trimmed on first white space
  • FIX: Fix mksubc compiling only the 1st subc in subcdef

2.83.1

2013/05/10

  • FIX: collocation computation for window crossing beg/end of corpus

2.83

2013/05/10

  • enable multiple subsequent shuffling

2.82

2013/04/20

  • mksubc support for n-grams, may take .subc file, may take attribute list

2.81

2013/04/12

  • added url2domain dynamic attribute

2.80.1

2013/04/03

  • FIX: utf8_tolower failing for empty strings and unallocated buffer

2.80

2013/04/02

  • faster sample generation
  • ngrsave supports encoded corpus as input

2.79

2013/03/21

  • added utf8getlastn() dynamic attribute function
  • FIX: SEPARATEPAGE with DUAL TRINARY

2.78

2013/03/07

  • Concordance exports corpus object into SWIG API

2.77

2013/03/06

  • lscbr and ngrsave are more user friendly

2.76.1

2013/02/27

  • FIX: building with gcc >= 4.7

2.76

2013/02/26

  • added support for structures in virtual corpora

2.75

2013/02/24

  • Frequency distribution does not need Concordance to be computed

2.74

2013/02/18

  • support DUAL TRINARY word sketch grammatical relations
  • added getfirstbysep internal function for dynamic attributes
  • added Setswana locale settings
  • added dumpwmrev for dumping ws delta rev files

2.73

2013/02/04

requires finlib 2.21

  • implemented exact KWIC matching in filtering

2.72

2013/01/29

  • support for aligned segment contexts

2.71

2013/01/11

  • genhist enhancements

2.70

2013/01/08

  • compilecorp compiles subcorpora right after the main corpus

2.69

2012/12/10

  • export Corpus::get_confpath() into SWIG API

2.68

2012/11/29

  • parallel corpora API modifications
  • FIX: a number of fixes for processing parallel corpora

2.67.2

2012/11/26

  • FIX: a number of fixes for processing parallel corpora

2.67.1

2012/11/26

  • FIX: set default ALIGNSTRUCT to “align”

2.67

2012/11/17

  • compilecorp compiles alignment for parallel corpora
  • added a number of helper scripts for processing alignment
  • FIX: a number of fixes for processing parallel corpora

2.66

2012/11/15

  • updated licensing information
  • FIX: a number of fixes for processing parallel corpora

2.65

2012/11/09

  • enhanced support for processing of parallel corpora
  • FIX: sync() concordances if necessary before next operations

2.64

2012/11/08

  • NGram API changes
  • FIX: genngr failing to process corpora over 2G

2.63

2012/08/31

  • FIX: estimating word sketch multiword collocations positions

2.62.1

2012/08/30

  • FIX: allow LEXICONSIZE to increase memory usage

2.62

2012/08/27

  • encodevert accepts -d to prevent compiling dynamic attributes
  • FIX: filling default value for attributes of TYPE “UNIQUE”
  • FIX: mkdynattr takes LEXICONSIZE from corpus configuration

2.61

2012/08/17

  • support for asynchronous multi-threaded concordance computations
  • FIX: setting default attribute when querying parallel corpora

2.60.1

2012/07/18

  • FIX: fix race conditions in parallel computation of sketches with *TRINARY gramrels involved
  • parws can check gramrel lexicon consistency

2.60

2012/07/10

  • support labels in the second argument (right-hand side) of within/containing, e.g. (<s/> containing 1:[] 2:[]) & 1.tag=2.tag
  • FIX: build with ruby 1.9

2.59.1

2012/10/24

  • bugfix release for the stable branch
  • FIX: build with ruby 1.9
  • parws can check gramrel lexicon consistency
  • FIX: fix race conditions in parallel computation of sketches with *TRINARY gramrels involved
  • FIX: fix filling default value in unique attribute
  • parws supports Python >= 2.4
  • documentation included in the distribution tarball
  • FIX: CQL: fix default attr setting for parallel corpus
  • FIX: fix static build with finlib
  • FIX: fix overflow on appending to a .text file larger than 4 GB (2^32 bytes)
  • FIX: finlib: fix build with gcc down to 4.1.2 at least

2.59

2012/06/29

  • new internal function for dynamic attributes “getlastn” for extracting last n characters
  • WMap support for access to the dictionary created by *COLLOC directives

2.58

2012/06/25

  • compatibility with ANTLR 3.4 C runtime
  • hashws support for subcorpora
  • more verbose output of encodevert by default
  • FIX: closing structures at the end of compilation

2.57

2012/06/08

  • WMAP support for collocation index operations incl. COLLOC directives

2.56

2012/06/06

  • added fixcorp script for fixing corrupted indices
  • support for extracting terms lexicon of word sketches

2.55

2012/05/29

  • support filtering multiword sketches by gramrels

2.54.1

2012/04/30

  • FIX: minor fixes for nested structures

2.54

2012/04/20

  • faster evaluation of non-regex matching using == and !== operators
  • FIX: utf8 lowercasing may have failed under specific circumstances
  • FIX: dynamic attributes are cleared before recompilation

2.53

2012/04/16

  • enhanced frequency distribution of nested structures

2.52

2012/04/05

  • maximum allowed nested structures set to 100

2.51

2012/03/14

requires finlib >= 2.17

  • support for handling of unique attributes

2.50

2012/03/05

requires finlib >= 2.16

  • first support for multiword sketches

2.49

2012/02/29

  • FIX: fix mishandling default encoding value in wmap API
  • support extracting terms from word sketches in API

2.48

2012/02/22

requires finlib >= 2.15

  • support for attribute values occurring more than 4G (2^32) times
  • support for extracting terms from word sketches

2.47.1

2012/02/18

  • FIX: fix encodevert segfaulting when run with -x

2.47

2012/02/08

requires finlib >= 2.14

  • support for lexicon size up to 4G (2^32 bytes)
  • FIX: concordance first-letter pagination in case of multibyte characters
  • FIX: mksubc does not fail on invalid attributes and empty subcorpora

2.46.1

2012/02/01

  • FIX: case-insensitive frequency distribution of utf8 corpora
  • FIX: yet more tolerant handling of Unicode conversion failures

2.46

2012/01/25

  • added indices of lexicon by sorted frequency
  • FIX: encodevert handles absent structure attributes properly
  • FIX: subcorpora contained first document range duplicated under specific circumstances

2.45.2

2011/12/08

  • FIX: parallelization of sketches with m4 definitions or dual gramrels
  • FIX: mkwmap correctly handles empty streams when joining, does not write zero counts

2.45.1

2011/10/20

  • FIX: more tolerant handling of Unicode conversion failures

2.45

2011/10/07

requires finlib >= 2.13

  • more descriptive CQL error messages
  • support for Unicode input/output using manatee.setEncoding()
  • automatic memory handling of Python objects
  • encodevert, genws and mkwmap log a timestamp with each message
  • prevent writing structures overflowing 32bit integer
  • 32to64.py correctly handles multiple overflows and overflows between begin and end
  • parallel computation of word sketches

2.44.1

2011/09/17

  • FIX mkwmap: fixed join phase if partial join is bigger than 4GB

2.44

2011/09/13

  • MAXDETAIL defaults to MAXCONTEXT if not set in the configuration file

2.43

2011/09/09

  • MAXCONTEXT set to 100 by default

2.42.1

2011/09/07

  • FIX: CQL evaluation in case concatenation subquery is empty

2.42

2011/08/31

  • mksubc prints progress on standard output
  • mksubc does not fail if DOCSTRUCTURE does not exist

2.41

2011/08/05

  • compilecorp automatically runs mknorms to perform proper normalization per structure attribute
  • mknorms supports corpora over 2G

2.40.2

2011/08/04

requires finlib >= 2.12.4

  • fix ordering of nested structures in concordance

2.40.1

2011/07/29

  • FIX: extending concordance KWIC fixed for |kwic|>1 or KWIC interleaved with colloc

2.40

2011/07/28

  • intelligent autodetection of attribute locale

2.39

2011/06/28

  • support for excluding KWIC from collocations
  • FIX: CQL evaluation: [attr="non-existing"]? [attr="existing"] returned empty result instead of “existing” occurrences
  • FIX: mksubc command failed to compute document frequencies on new subcorpus

2.38.2

2011/06/10

  • FIX: encodevert support for memory-only corpora over 2GB

2.38.1

2011/06/02

  • FIX: frequency distribution failing if case-insensitive/retrograde

2.38

2011/05/12

  • CQL allows ‘<struct #N>’ and ‘<struct !#N>’ for matching N-th struct
  • corpquery can sort results using GDEX and set default attribute
  • improved display of concordance reference
  • support for storing corpora over 2GB in memory only
  • FIX: UTF-8 character counting and lower-casing

2.37.1

2011/05/05

  • FIX: count collocations only once per context

2.37

2011/04/30

  • maximum nesting of structures limited to 10 by default

2.36.1

2011/04/21

  • FIX: fix encodevert warning on nested structures printing corpus position instead of file line

2.36

2011/04/06

  • added parse2wmap for creating sketches from dependency input
  • fixed dirty cache after rebuilding sketches
  • fixed multiple memory leaks in SWIG API
  • fixed mkvirt failing if corpus directory is missing
  • changed default MANATEE_REGISTRY to /corpora/registry
  • mksubc needs much less memory

2.35

2011/03/15

  • fix locating of nested structures
  • support attribute-based pagination of concordances
  • prevent collisions of wmap and manatee in SWIG API
  • faster docf computation implemented in C++
  • support for virtual corpora

2.34.1

2011/03/13

  • faster docf computation (ca. 20 x)
  • show Manatee exception messages in Python

2.34

2011/03/05

requires finlib >= 2.12

  • compilecorp support for creating subcorpora
  • encodevert automatically closes too many nested structures
  • mksubc computes frequency in documents into .docf files
  • changed format of word sketch .rev file — added support for collocations
  • export exceptions into SWIG API
  • regexp2ids takes an optional filter pattern argument

2.33.2

2011/02/28

  • FIX: compilecorp computes sizes for corpora without structures
  • FIX: encodevert creates data dir with mode 755 instead of 751

2.33.1

2011/01/20

  • FIX: ngrsave: added NGRAM_SIZE and IGNORE_PUNC parameters

2.33

2011/01/11

  • compilecorp precomputes file with token, word, doc, paragraph and sentence counts

2.32.2

2010/11/24

  • FIX: encodevert looping on input containing NULL byte

2.32.1

2010/10/31

  • FIX: “STRUCTLIMIT s” generates <s/> instead of deprecated <s>

2.32

2010/10/27

requires finlib >= 2.11

New Features:

  • enhanced corpquery script which makes it possible to specify (via command-line options) reference attribute, context, limit for the number of results and structures and attributes to be printed
  • new parse2wmap tool for generating sketches (data for wmap) from a positional attribute
  • ngrsave can now print document IDs of duplicate n-grams instead of n-grams and number of documents
  • after the compilation, compilecorp checks for temporary files that indicate an error
  • enhancements to the CQL:
    • new “==” and “!==” operators that perform a match against fixed string (i.e. not a regular expression)
      Note that, with the two exceptions of "\"" and "\\", no expansions are performed on the string.
      Examples:
      ".", "$", "~" match a single dot, dollar sign and tilde, respectively,
      "\n" matches a backslash followed by the character n,
      "\"\\" matches a double-quote character followed by a single backslash
    • meet/union queries can occur at any position in the query and are no longer introduced by the “MU” keyword, which is deprecated and raises an error
    • the old within <str> syntax, already deprecated in favor of the consistent within <str/>, now raises an error as well
    • support for inequality matching using new operators: “<=”, “!<=”, “>=”, “!>=”. The comparison on a string is performed in a way that compares numeric parts numerically and alphabetical parts alphabetically. Examples:
      [word>="cake"] matches “cake” as well as “came”,
      <doc id<="145UA01"> matches e.g. 145UA01, 143UA01, 145TA00 etc.
    • meet/union queries can use numeric labels and be subject to global conditions as any other query parts — e.g. (meet 1:[] 2:[]) & 1.tag = 2.tag;
    • a frequency function (denoted simply as f) can be used as part of the query together with numeric labels — e.g. 1:[] & f(1.word) >= 1000;
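
The inequality operators above compare numeric parts numerically and alphabetical parts alphabetically. A minimal Python sketch of such a comparison is given below; it is an illustration only, not Manatee's implementation. The function names are hypothetical, and the choice that a numeric chunk sorts before a text chunk at the same position is an assumption the changelog does not specify.

```python
import re


def split_key(value):
    """Split a string into digit runs and non-digit runs for comparison."""
    key = []
    for num, text in re.findall(r"([0-9]+)|([^0-9]+)", value):
        if num:
            # Numeric chunk: compared by integer value.
            key.append((0, int(num), ""))
        else:
            # Text chunk: compared as a plain string.
            key.append((1, 0, text))
    return key


def less_or_equal(a, b):
    """True if a sorts before or equal to b under the mixed comparison."""
    return split_key(a) <= split_key(b)
```

Under this sketch, less_or_equal("143UA01", "145UA01") and less_or_equal("145TA00", "145UA01") both hold, in line with the <doc id<="145UA01"> example, and less_or_equal("9", "10") holds even though a plain string comparison would say otherwise.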

Bugfixes:

  • encodevert -v works again
  • encodevert can again read piped input data (“| <command>” in VERTICAL in corpus configuration file)
  • CQL queries using parallel corpora notation work again
  • UTF-8 support in regular expressions
  • encodevert doesn’t crash if no attributes are given in the configuration file or on the command line

2.31.3

2010/10/27

  • FIX: Computing frequency distribution of multivalue attributes
  • FIX: encodevert warns if there are open structures at the end of the compilation. This always indicates an error and, in case of nested structures, leads to significant performance loss.

2.31.2

2010/08/04

  • FIX: compilecorp fails because of genhist.py which should be genhist
  • FIX: strip spaces in all attribute values
  • FIX: make dist* targets work again

2.31.1

2010/04/26

  • FIX: crash when MANATEE_REGISTRY="" or config path is a directory

2.31

2010/04/23

requires finlib >= 2.10

New Features:

  • support for nested structures

Bugfixes:

  • fixed displaying of empty collocations

2.30

2010/04/15

New Features:

  • “===NONE===” used as attribute default DEFAULTVALUE

Bugfixes:

  • fixed displaying concordance with empty nodes

2.29.1

2010/04/10

  • FIX: typo in CQL parser causing the build to fail with C locale

2.29

2010/04/07

New Features:

  • compilecorp script for complex handling of corpus and sketch compilation

Bugfixes:

  • unfinished corpus data reports size 0, does not crash

2.28.1

2010/03/11

  • FIX: encodevert limits its memory usage to available physical memory

2.28

2010/01/19

requires ANTLR3.2 or higher

New Features:

  • allow ${attribute} substitution in DISPLAYBEGIN/DISPLAYEND
  • CQL enhancements:
    • support for “<query> within <query>”
    • “containing” as dual option to “within”
    • enable meet/union query after within/containing
    • support for “within NUMBER”
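
The ${attribute} substitution in DISPLAYBEGIN/DISPLAYEND mentioned above uses the same placeholder syntax as Python's string.Template, so the idea can be sketched with it. This is an illustration under assumptions: the pattern, the attribute names and the helper function are hypothetical, and leaving unknown placeholders untouched is a choice made here, not documented Manatee behaviour.

```python
from string import Template


def render_display(pattern, attrs):
    """Expand ${attribute} placeholders from a dict of structure attributes.

    Placeholders with no matching attribute are left as they are.
    """
    return Template(pattern).safe_substitute(attrs)
```

For example, render_display("<doc ${id}>", {"id": "145UA01"}) yields "<doc 145UA01>".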

Bugfixes:

  • fixed mkwmrank on empty wmaps

2.27

2010/01/11

New Features:

  • gcc 4.3 and 4.4 compatibility
  • ANTLR 2.7.2 compatibility
  • Python API scripts now part of the distribution

[…]

2.14

  • corpus size more than 2 billion tokens

1.99

  • bug fixes in query evaluation, build

1.94

  • first public version