The project continuously processes 75,000 RSS feeds, which bring in between 100,000 and 150,000 articles every day. For English, for example, this amounts to approximately 1 billion words per month.
The Timestamped JSI web corpus was tagged for parts of speech, and the timestamps were used to augment the corpus with diachronic annotation. Currently, the corpus covers the period from 2014 to 2018. By combining this data with other web corpora, a total period of between 2009 and 2015 can be covered. Regular monthly updates from Jozef Stefan Institute are now used to amend the corpus with the latest data.
The diachronic annotation is especially valuable in combination with the Sketch Engine trends feature. The trends feature analyses how the frequency of a word changes over time by comparing its frequency across a series of comparable time periods.
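The kind of comparison described here can be illustrated with a minimal Python sketch. It is not Sketch Engine's implementation, whose exact method is not described on this page. It assumes documents arrive as (period, text) pairs, uses naive whitespace tokenisation, and summarises the trend as a least-squares slope; the function names `relative_frequencies` and `trend_slope` are hypothetical.

```python
from collections import Counter


def relative_frequencies(docs, word):
    """Return {period: hits per million tokens} for `word`.

    `docs` is an iterable of (period, text) pairs. Tokenisation is a
    naive lowercase whitespace split (an assumption for illustration).
    """
    hits, totals = Counter(), Counter()
    target = word.lower()
    for period, text in docs:
        tokens = text.lower().split()
        totals[period] += len(tokens)
        hits[period] += tokens.count(target)
    # Normalise by period size so periods of different length are comparable.
    return {p: 1_000_000 * hits[p] / totals[p] for p in sorted(totals) if totals[p]}


def trend_slope(freqs):
    """Least-squares slope of frequency against period index (0, 1, 2, ...)."""
    ys = [freqs[p] for p in sorted(freqs)]
    n = len(ys)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


# Toy data: three monthly periods with made-up texts.
sample_docs = [
    ("2014-01", "a b c d"),
    ("2014-02", "a a c d"),
    ("2014-03", "a a a d"),
]
sample_freqs = relative_frequencies(sample_docs, "a")
sample_slope = trend_slope(sample_freqs)
print(sample_freqs, sample_slope)
```

A positive slope indicates that the word is becoming more frequent across the periods, and a negative slope that it is becoming less frequent. Normalising to hits per million tokens matters because the amount of text collected differs from one period to the next.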
The corpus version with texts from the years 2014–2016 is accessible to all users including trial users.
The corpus version with up-to-date texts from 2014 to the present (updated monthly) is accessible to users with a regular subscription.