Automatic Collocation Dictionaries

English, Hungarian, Czech, Bulgarian, Croatian, French, Maltese, Polish, Serbian, Slovak, Spanish


Onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. [1]
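
The general idea of removing duplicate parts can be illustrated with a small sketch. It is not Onion's actual implementation: the function names, the n-gram length and the threshold below are assumptions chosen for illustration. A paragraph is dropped when most of its word n-grams have already appeared in earlier paragraphs.

```python
def ngrams(tokens, n):
    """Return the set of all n-token shingles in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=5, threshold=0.5):
    """Keep a paragraph only if fewer than `threshold` of its n-grams were seen before.

    Hypothetical helper: n and threshold are illustrative defaults.
    """
    seen = set()
    kept = []
    for text in paragraphs:
        grams = ngrams(text.split(), n)
        # Paragraphs shorter than n tokens have no n-grams; keep them.
        dup_ratio = len(grams & seen) / len(grams) if grams else 0.0
        if dup_ratio < threshold:
            kept.append(text)
            seen |= grams
    return kept
```

In this sketch only the n-grams of kept paragraphs are remembered, so the first copy of a repeated passage survives and later copies are removed.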


jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages. It is designed to preserve mainly text containing full sentences and it is therefore well suited for creating linguistic resources such as Web corpora. [1]
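
As a rough illustration of why full sentences can be told apart from boilerplate, the sketch below keeps a text block only if it is reasonably long and contains a fair share of function words. This is a simplified heuristic, not jusText's actual algorithm: the stopword list, the function names and both thresholds are assumptions made for the example.

```python
# Tiny illustrative English stopword list (an assumption, not jusText's list).
STOPWORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is", "it",
    "that", "for", "on", "with", "as", "are", "was",
}


def looks_like_sentence_text(block, min_words=8, min_stopword_ratio=0.25):
    """Guess whether a text block is running text rather than boilerplate."""
    words = block.lower().split()
    if len(words) < min_words:
        return False  # menus, copyright lines and similar are usually short
    stop = sum(1 for w in words if w.strip(".,;:!?\"'()") in STOPWORDS)
    return stop / len(words) >= min_stopword_ratio


def strip_boilerplate(blocks):
    """Return only the blocks that look like full-sentence text."""
    return [b for b in blocks if looks_like_sentence_text(b)]
```

Navigation menus and footers tend to be short and poor in function words, so a filter of this kind removes them while keeping ordinary prose.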


chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms that place no constraint on the language.
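
The benefit of knowing the language can be shown with a minimal sketch. It does not reproduce chared's models or interface: the alphabet, the candidate encodings and the function name are assumptions for illustration. Each candidate encoding is tried, and the one whose decoded text best matches the letters expected for the language wins.

```python
# Lowercase Czech alphabet, used here as a stand-in "language model" (assumption).
CZECH_LETTERS = set("aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž")


def detect_encoding(data, expected_letters,
                    candidates=("utf-8", "iso-8859-2", "windows-1250")):
    """Return the candidate encoding whose decoding looks most like the language."""
    best, best_score = None, -1.0
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        chars = [ch for ch in text.lower() if not ch.isspace()]
        if not chars:
            continue
        # Share of non-space characters that belong to the expected alphabet.
        score = sum(ch in expected_letters for ch in chars) / len(chars)
        if score > best_score:
            best, best_score = enc, score
    return best
```

For example, `detect_encoding(raw_bytes, CZECH_LETTERS)` could be called on the bytes of a Czech text file. Without the language constraint, ISO-8859-2 and Windows-1250 are hard to tell apart because both decode almost any byte sequence without error.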


unitok is a universal tokeniser. It works for most languages that use spaces to delimit words, and it includes special models for more precise tokenisation of selected languages.
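
For languages that separate words with spaces, a basic tokeniser can be sketched with a single regular expression. This is only an illustration and not unitok's rule set: the pattern and the function name are assumptions, and real language-specific models handle many more cases (abbreviations, URLs, clitics and so on).

```python
import re

# A word (optionally with internal hyphens or apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenise(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)
```

With this pattern, punctuation becomes a separate token while forms such as "well-known" stay in one piece.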


[1] Pomikalek, Jan. Removing boilerplate and duplicate content from web corpora. PhD thesis, Masaryk University, Faculty of Informatics (2011).

Kilgarriff, Adam; Milos Husak and Milos Jakubicek (2013). Automatic Collocation Dictionaries. In elex2013: Electronic lexicography in the 21st century, Tallinn, October 2013, pp.