Timestamped JSI web corpus

created from Jozef Stefan Institute Newsfeed

The Timestamped JSI web corpus is a new family of web corpora built from the IJS newsfeed of the Jozef Stefan Institute, Slovenia (Trampus et al 2004).

The JSI newsfeed is a clean, continuous, real-time aggregated stream of semantically enriched news articles from RSS-enabled sites across the world. It is available in many languages (see the info box).

The project continuously processes 75,000 RSS feeds which bring between 100,000 and 150,000 articles every day.

The Timestamped JSI web corpus was tagged for parts of speech, and the time stamps were used to add diachronic annotation. The corpus currently covers the period from 2014 to 2016. By combining this data with other web corpora, a total period of between 2009 and 2015 can be covered. There are plans to receive daily updates from the Jozef Stefan Institute and to extend the corpus regularly with the latest data.

The diachronic annotation is particularly useful with the Sketch Engine trends feature, which analyses how the use of a word changes over time by comparing its frequency across a series of comparable time periods.
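The idea of comparing frequencies across time periods can be illustrated with a minimal Python sketch. It is not Sketch Engine's implementation: the function names, the whitespace tokenization, the monthly periods, and the choice of a least-squares slope as the trend measure are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def relative_frequencies(docs, word):
    """Return [(period, hits per million tokens)] for `word`.

    `docs` is an iterable of (date, text) pairs; a period is a
    (year, month) tuple. Tokenization is naive whitespace splitting,
    which is an assumption of this sketch.
    """
    hits = Counter()
    totals = Counter()
    for day, text in docs:
        period = (day.year, day.month)
        tokens = text.lower().split()
        totals[period] += len(tokens)
        hits[period] += tokens.count(word)
    # Normalizing by period size keeps large and small months comparable.
    return [
        (period, 1_000_000 * hits[period] / totals[period])
        for period in sorted(totals)
        if totals[period] > 0
    ]


def trend_slope(values):
    """Least-squares slope of `values` over equally spaced periods.

    A positive slope suggests growing use, a negative one declining use.
    """
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


if __name__ == "__main__":
    # Toy data, invented for this example only.
    docs = [
        (date(2014, 1, 1), "a b a b"),
        (date(2014, 2, 1), "a a a b"),
    ]
    series = relative_frequencies(docs, "a")
    print(series)
    print(trend_slope([freq for _, freq in series]))
```

Normalizing to hits per million tokens matters because the number of articles collected differs from period to period; comparing raw counts would mostly measure how much text was gathered.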

Availability

The corpus is accessible to all users including trial users.

Tools to work with the Timestamped JSI web corpora

A complete set of Sketch Engine tools is available to work with the Timestamped JSI web corpora to generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • trends – diachronic analysis automatically identifies neologisms and changes in use

Search the Timestamped JSI web corpus

Sketch Engine offers a range of tools to work with the Timestamped JSI web corpus.

Timestamped JSI web Corpus

Arabic (976 million words)

Catalan (99 million)

Croatian (150 million)

Czech (289 million)

Dutch (401 million)

English (18 billion)

Finnish (119 million)

French (1.87 billion)

German (1.98 billion)

Hebrew (111 million)

Hungarian (180 million)

Italian (1.33 million)

Korean (438 million)

Polish (157 million)

Portuguese (1.1 billion)

Russian (1.12 billion)

Serbian (86 million)

Spanish (4 billion)

Swedish (335 million)

Other text corpora in Sketch Engine

Sketch Engine offers 350+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.