Neologisms and diachronic analysis of word usage

Trends is a feature for detecting words in a corpus whose frequency of use changes over time (diachronic analysis). Trends identifies and ranks words by their growing use (new words, or neologisms) or decreasing use over the given period of time.

Lexicologists can use Trends to identify new words (neologisms), and historians can use Trends to identify the point in time when a word started to be used, stopped being used, or saw an unusually high level of use.

For a detailed description of the algorithms and comparison of their performance, see Ondřej Herman’s bachelor thesis.

Which corpora can be used?

Trends only works with a corpus annotated for a time period. Currently, Trends can be used with, for example, the following corpora:

Using Trends

If the diachronic analysis is available for a corpus, the left menu will contain the Trends link. This will take you to the Trends form.



the time period used for the analysis, such as month, year or decade; the available values depend on the corpus annotation, and some corpora only contain one value

(optional) select a subcorpus, display information about it or create a new one

select the attribute to be used in the analysis; the available values depend on the corpus annotation, and some corpora only contain one value

the method used for calculating the results

Maximum p-value
the highest p-value (statistical significance) a result may have and still be included

Sort by
sets how the results are ordered; the order can also be changed by clicking column headers on the result screen

Minimum overall frequency
filters out results with the frequency lower than the set value

Regular expression filter
only includes results that match the regular expression

Filter non-words
excludes tokens which are not words such as punctuation

Filter capitalized
excludes tokens which start with an uppercase letter, e.g. Google, Wi-Fi or GPS, but keeps tokens such as iTunes or gps
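The filters above can be pictured as a chain of simple checks. The following is only a minimal Python sketch of that idea, not the code Sketch Engine runs: the function name `keep_token`, the sample frequencies, the rule that a "word" must contain at least one letter, and the choice to match the regular expression against the whole token are all assumptions made for illustration.

```python
import re

def keep_token(token, freq, min_freq=0, regex=None,
               filter_nonwords=True, filter_capitalized=True):
    """Return True if the token passes every enabled filter."""
    # Minimum overall frequency: drop items rarer than the threshold.
    if freq < min_freq:
        return False
    # Regular expression filter (assumption: the whole token must match).
    if regex is not None and re.fullmatch(regex, token) is None:
        return False
    # Filter non-words (assumption: a word contains at least one letter).
    if filter_nonwords and not any(ch.isalpha() for ch in token):
        return False
    # Filter capitalized: drop tokens whose first character is uppercase.
    if filter_capitalized and token[:1].isupper():
        return False
    return True

# Made-up tokens and frequencies, for illustration only.
candidates = {"Google": 900, "Wi-Fi": 300, "GPS": 450,
              "iTunes": 200, "gps": 120, ",": 5000, "selfie": 3}
kept = [t for t, f in candidates.items() if keep_token(t, f, min_freq=5)]
print(kept)
```

With these sample values only iTunes and gps survive: the capitalized tokens and the comma are filtered out, and selfie falls below the minimum frequency.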

You do not have to change any settings; just click Compute trends.

Trends - diachronic corpus analysis

By default, the results are sorted by the absolute value of the change. This means that the words with the biggest change will be at the top irrespective of whether the change was positive (=growing usage) or negative (=decreasing usage).
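The default ordering can be illustrated with a few lines of Python. This is a sketch only: the words and trend values are invented, and the tuple layout is an assumption rather than the format Trends uses internally.

```python
# Hypothetical (word, trend) pairs; the numbers are made up for illustration.
results = [("selfie", 3.2), ("fax", -4.1), ("blog", 0.7), ("walkman", -1.5)]

# Largest change first, regardless of whether it is positive or negative.
by_change = sorted(results, key=lambda item: abs(item[1]), reverse=True)

for word, trend in by_change:
    direction = "growing" if trend > 0 else "decreasing"
    print(f"{word:10} {trend:+.1f}  {direction}")
```

Here fax comes first even though its usage is falling, because its change is the largest in absolute terms.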

Result screen description

(Words in bold refer to the screenshot above.)

clicking the word brings up a frequency list for that word, broken down by time interval (see details on the right)

Trends - frequency list

Frequency list broken down by year. The time interval available depends on how the corpus is annotated; corpora can be annotated for smaller or bigger units such as months or decades.

P – positive filter – will show a concordance from the given period
N – negative filter – will show a concordance from all periods but not the one that was clicked
Frequency – absolute frequency
Rel [%] – relative text type frequency

the value indicating the degree of change

expresses the degree of change, see p-value

the frequency in the corpus (number of hits), clicking the number will bring up a concordance

Trends in detail

Configuring a corpus for use with Trends

Data have to be annotated with time stamps. The name of the time stamp attribute is user defined. The same corpus can contain several time stamp attributes. Analysis will then be performed on each of the attributes independently.

<doc pub="1977">
<p style="informal">
<s>some text </s>
<s>some text </s>
</p>
<p style="informal">
<s>some text </s>
<s>some text </s>
</p>
</doc>

The top of the corpus configuration file must contain a definition of the attribute which contains the time stamp, for example:


The number of attribute values should not exceed 500. All values must have the same number of characters, with the longest time unit first and the shortest last. Non-numerical characters are allowed but are ignored when the values are interpreted.

Examples of valid values

All values within the same corpus must have the same format.


Invalid values

2004Mar14 – the month will be lost
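Before compiling a corpus, the rules above can be checked with a small script. The sketch below is an illustration under stated assumptions, not part of Sketch Engine: the function names are hypothetical, the limit of 500 is taken to apply to distinct values, and "ignored when interpreted" is read as "only the digits are kept".

```python
import re

MAX_VALUES = 500  # upper bound recommended in the text

def interpreted(value):
    """Digits that remain once non-numerical characters are ignored."""
    return re.sub(r"\D", "", value)

def check_timestamps(values):
    """Return human-readable problems for a collection of time stamp values."""
    distinct = sorted(set(values))
    problems = []
    if len(distinct) > MAX_VALUES:
        problems.append(
            f"{len(distinct)} distinct values exceed the limit of {MAX_VALUES}")
    if len({len(v) for v in distinct}) > 1:
        problems.append("values do not all have the same number of characters")
    for v in distinct:
        # Letters carry no information once non-numerical characters are dropped.
        if any(ch.isalpha() for ch in v):
            problems.append(
                f"{v}: letters are ignored, only {interpreted(v)} is read")
    return problems

print(check_timestamps(["1977", "1978", "1979"]))
print(check_timestamps(["2004Mar14", "2004Apr02"]))
```

The second call reports both values, because a month written as letters does not survive the interpretation step.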

If you have problems with setting things up, please ask for assistance.


p-value (statistical significance) – a function of the observed sample results that is used for testing a statistical hypothesis

slope (trend direction) – a number that describes the direction and the steepness of a line

Theil-Sen – a non-parametric method that finds the median slope among all lines determined by pairs of sample points

Linear regression – a method used for modeling the linear relationship between a dependent variable and one or more explanatory variables
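To make the two slope estimators concrete, here is a short Python sketch of their textbook definitions. It is not the mktrends implementation, and the yearly frequencies are invented; the last value is a deliberate outlier to show how differently the two methods react.

```python
from itertools import combinations
from statistics import mean, median

def theil_sen_slope(xs, ys):
    """Median of the slopes of the lines through every pair of points."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
              if x2 != x1]
    return median(slopes)

def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line with one explanatory variable."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

years = [2010, 2011, 2012, 2013, 2014]
freqs = [10, 12, 14, 16, 100]  # made-up frequencies with one outlier

print(theil_sen_slope(years, freqs))
print(least_squares_slope(years, freqs))
```

The Theil-Sen estimate stays at the steady growth of 2 per year, while the least-squares slope is pulled up to about 18.4 by the single outlier.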


Adam Kilgarriff, Ondřej Herman, Jan Bušta, Pavel Rychlý and Miloš Jakubíček. DIACRAN: a framework for diachronic analysis (presentation). In Corpus Linguistics (CL2015), United Kingdom, July 2015.

Ondřej Herman and Vojtěch Kovář. Methods for Detection of Word Usage over Time. In Seventh Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2013. Brno: Tribun EU, 2013, pp. 79–85. ISBN 978-80-263-0520-0.

Ondřej Herman (2013). Automatic methods for detection of word usage in time. Bachelor thesis. Masaryk University, Faculty of Informatics.

How to compute trends

A script called “mktrends” is part of Sketch Engine. If you have a local installation, you can call the script directly from the command line. Without parameters, it prints the following usage information:

	EPOCH_LIMIT: attribute values with smaller norm are discarded
	LIMIT: words with fewer occurrences are skipped
	METHODS: comma-separated list of methods to compute
		possible values are: linreg_nonzero, mkts_all, mkts_nonzero, linreg_all
example: /usr/bin/mktrends bnc bncdoc.year word mkts_all 100 100000
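If the script is called from another program, the argument list can be assembled first and checked against the method names listed above. The Python sketch below only mirrors the shape of the example invocation. The function name and the parameter names `corpus`, `structattr` and `attr` are hypothetical labels, and the two trailing numbers are passed through in the order shown in the example without assuming which one is EPOCH_LIMIT and which is LIMIT.

```python
import shlex

# Method names as listed in the usage text.
KNOWN_METHODS = {"linreg_nonzero", "mkts_all", "mkts_nonzero", "linreg_all"}

def mktrends_command(corpus, structattr, attr, methods, *limits,
                     binary="/usr/bin/mktrends"):
    """Build an argument list shaped like the example invocation."""
    unknown = set(methods) - KNOWN_METHODS
    if unknown:
        raise ValueError(f"unknown methods: {sorted(unknown)}")
    # Methods are joined with commas; numeric limits are appended unchanged.
    return [binary, corpus, structattr, attr,
            ",".join(methods), *map(str, limits)]

cmd = mktrends_command("bnc", "bncdoc.year", "word", ["mkts_all"], 100, 100000)
print(shlex.join(cmd))
# On a machine with a local installation the list could then be passed to
# subprocess.run(cmd, check=True).
```

Building the list separately from running it makes it easy to log or review the exact command before a long computation starts.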