Automatic Collocation Dictionaries
Onion (ONe Instance ONly) is a tool for removing duplicate parts from large collections of texts. 
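As a rough illustration of what removing duplicate parts can involve, the Python sketch below drops paragraphs whose word n-grams have mostly been seen earlier in the collection. It is a simplified toy, not Onion's actual implementation; the function names, the n-gram length, and the threshold are assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=5, threshold=0.5):
    """Keep a paragraph unless more than `threshold` of its n-grams
    already occurred in previously kept paragraphs.

    Hypothetical helper for illustration only. Paragraphs shorter than
    n tokens have no n-grams and are therefore always kept.
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = ngrams(paragraph.split(), n)
        duplicate_ratio = len(grams & seen) / len(grams) if grams else 0.0
        if duplicate_ratio <= threshold:
            kept.append(paragraph)
            seen |= grams  # remember the n-grams of kept text only
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "a completely different sentence about corpus building tools",
    "the quick brown fox jumps over the lazy dog",
]
for paragraph in deduplicate(docs, n=3):
    print(paragraph)
```

Working on n-grams rather than whole paragraphs lets a sketch like this also catch partially repeated text, at the cost of keeping a growing set of seen n-grams in memory.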
jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora. 
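To make the idea of preferring text with full sentences concrete, here is a minimal Python sketch that classifies an already extracted text block by its length and by the share of common function words in it. This is an assumed heuristic chosen for illustration, not jusText's code or API; the stopword list, the thresholds, and the function name are all hypothetical.

```python
# Tiny illustrative stopword list; a real list would be much larger
# and specific to the language of the page.
STOPWORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "it",
    "that", "for", "on", "with", "as", "are", "this",
}


def looks_like_content(text, min_words=10, min_stopword_ratio=0.3):
    """Guess whether a text block is running prose rather than boilerplate.

    Short blocks and blocks with few function words (typical of menus
    and link lists) are treated as boilerplate.
    """
    words = text.lower().split()
    if len(words) < min_words:
        return False
    hits = sum(1 for word in words if word.strip(".,;:!?\"'()") in STOPWORDS)
    return hits / len(words) >= min_stopword_ratio


blocks = [
    "Home | About | Contact | Login",
    "It is designed to preserve text that is made of full sentences "
    "and it is well suited for this task.",
]
for block in blocks:
    print(looks_like_content(block), block)
```

Full sentences in natural language tend to contain many function words, while navigation menus and footers mostly consist of short noun phrases, which is why even a simple ratio like this separates the two examples.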
chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms with no language constraints.
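The benefit of knowing the language can be shown with a small Python sketch: the bytes are decoded with each candidate encoding, and the result that contains the most letters expected in that language wins. This is a toy stand-in for a trained language model and is not chared's implementation or interface; the candidate list, the letter set, and the function name are assumptions made for this example.

```python
# Hypothetical "model" for Czech: just the accented letters of its alphabet.
CZECH_LETTERS = set("áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ")
CANDIDATE_ENCODINGS = ["utf-8", "iso-8859-2", "windows-1250"]


def detect_encoding(data, expected_letters, encodings):
    """Return the candidate encoding whose decoding of `data` contains
    the most characters from `expected_letters`, or None if no
    candidate can decode the bytes at all."""
    best_encoding, best_score = None, -1
    for encoding in encodings:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        score = sum(1 for ch in text if ch in expected_letters)
        if score > best_score:
            best_encoding, best_score = encoding, score
    return best_encoding


sample = "Příliš žluťoučký kůň úpěl ďábelské ódy"
raw = sample.encode("windows-1250")
print(detect_encoding(raw, CZECH_LETTERS, CANDIDATE_ENCODINGS))
```

Without the language constraint, ISO-8859-2 and Windows-1250 are hard to tell apart because both decode these bytes without error; the expected letters of the language are what break the tie.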
Pomikálek, Jan. Removing boilerplate and duplicate content from web corpora. PhD thesis, Masaryk University, Faculty of Informatics (2011).