Timestamped JSI web corpus

created from Jozef Stefan Institute Newsfeed

The Timestamped JSI web corpus is a new family of web corpora built from the IJS newsfeed, which is produced by the Jozef Stefan Institute, Slovenia (Trampus et al 2004).

The newsfeed is a clean, continuous, real-time aggregated stream of semantically enriched news articles from RSS-enabled sites across the world. It is available in many languages.

The project continuously processes 75,000 RSS feeds, which deliver between 100,000 and 150,000 articles every day.

The Timestamped JSI web corpus was tagged for parts of speech, and the time stamps were used to add diachronic annotation to the corpus. Currently the corpus covers the period 2014 to 2016. By combining this data with other web corpora, the period between 2009 and 2015 can be covered. There are plans to receive daily updates from the Jozef Stefan Institute and to extend the corpus regularly with the latest data.

The diachronic annotation is especially valuable in combination with the Sketch Engine trends feature. The trends feature analyses how the use of a word changes over time by comparing its frequency across a series of comparable time periods.
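The idea of comparing frequencies across time periods can be illustrated with a minimal sketch. This is not Sketch Engine's implementation: the function names (`per_million`, `slope`, `trend`), the yearly periods, the whitespace tokenisation, the least-squares slope, and the sample documents are all assumptions chosen for illustration.

```python
from collections import Counter


def per_million(hits, tokens):
    """Relative frequency per period: hits per million tokens."""
    return {p: hits.get(p, 0) * 1_000_000 / tokens[p] for p in sorted(tokens)}


def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0  # a single period has no trend
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def trend(docs, word):
    """Group timestamped documents by year and measure how the
    relative frequency of `word` changes from year to year."""
    tokens, hits = Counter(), Counter()
    for stamp, text in docs:
        period = stamp[:4]            # assumption: ISO dates, one period per year
        words = text.lower().split()  # assumption: naive whitespace tokenisation
        tokens[period] += len(words)
        hits[period] += words.count(word)
    freqs = per_million(hits, tokens)
    return freqs, slope(list(freqs.values()))


# Hypothetical timestamped documents, invented for this example.
docs = [
    ("2014-02-10", "drone tested over the city"),
    ("2014-09-01", "markets fell sharply on monday"),
    ("2015-05-20", "drone rules and drone pilots"),
    ("2015-11-02", "new law passed by parliament"),
    ("2016-01-15", "drone sales drone races drone"),
    ("2016-07-30", "the drone market keeps growing"),
]

freqs, s = trend(docs, "drone")
print(freqs, s)
```

Normalising to hits per million tokens keeps periods comparable even when they contain different amounts of text. A positive slope suggests rising use of the word and a negative slope suggests declining use.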

Availability

The corpus is accessible to all users with a subscription plan and to site licence members, but not to trial users.

Timestamped JSI web corpus

Arabic (250 million words)

Catalan (40 million)

Czech (146 million)

German (936 million)

English (8 billion) – already available in Sketch Engine

Finnish (75 million)

French (782 million)

Croatian (150 million)

Hungarian (77 million)

Italian (335 million)

Korean (130 million)

Dutch (176 million)

Polish

Russian (500 million)

Spanish (1.5 billion)

Serbian (38 million)

Swedish (147 million)