Text analytics with Sketch Engine

Sketch Engine is a comprehensive suite of text analysis tools designed to handle texts of up to billions of words in many languages and scripts. The analysis takes into account the linguistic features of each language, such as morphology and grammar, and supports a range of text analysis techniques.

Text analysis API

All Sketch Engine accounts come with a text analysis API that supports the complete Sketch Engine functionality. To test the different functions, register for a free trial account.

Topic modelling

Keyword frequency, term extraction and term frequency are useful for topic modelling because they identify the words and phrases typical of the content of a text. These functions are also available through the API.

Figure: keywords extracted by processing texts about digital photography.
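To make the idea of "words typical for the content of the text" concrete, the following minimal sketch compares normalised frequencies in a focus text against a reference text. The scoring formula (frequency per million plus a smoothing constant, divided by the same quantity in the reference) is a common keyness measure; that it matches what Sketch Engine computes internally is an assumption here, and the function name `keyness` and the sample texts are purely illustrative.

```python
from collections import Counter


def keyness(focus_tokens, ref_tokens, smoothing=1.0):
    """Rank words by how typical they are of the focus text versus a reference text."""
    focus, ref = Counter(focus_tokens), Counter(ref_tokens)
    focus_size, ref_size = len(focus_tokens), len(ref_tokens)
    scores = {}
    for word, freq in focus.items():
        # Normalise to frequency per million tokens so texts of different sizes compare.
        fpm_focus = freq * 1_000_000 / focus_size
        fpm_ref = ref[word] * 1_000_000 / ref_size
        # The smoothing constant avoids division by zero for words absent from the reference.
        scores[word] = (fpm_focus + smoothing) / (fpm_ref + smoothing)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample data, not taken from any real corpus.
focus_text = "the lens aperture controls exposure and the lens controls focus".split()
ref_text = "the cat sat on the mat and the dog sat on the rug".split()
ranked = keyness(focus_text, ref_text)
```

Words that are frequent everywhere, such as "the", end up with a score below 1, while domain words that are missing from the reference text rise to the top of the ranking.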

Frequency

Calculating word frequency is a frequent task in text analytics. Sketch Engine contains tools to calculate frequencies of words, phrases, n-grams as well as grammatical or lexical structures, e.g. the frequency of verbs in the past tense as compared to the present tense. Word frequency is included in our API.

Word frequency

The wordlist tool calculates word frequency with a wide range of filtering options, such as words that start with, contain or end in a particular string, or lists restricted to nouns, verbs or other parts of speech. Criteria can be combined, and regular expressions are supported.
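As a rough local illustration of frequency counting with regular-expression filters, the sketch below counts word forms in a string and optionally keeps only those matching a pattern. It is not the wordlist tool itself: the tokenizer is a simplistic assumption, the function name `word_frequencies` is hypothetical, and filtering by part of speech is omitted because it would require a tagger.

```python
import re
from collections import Counter


def word_frequencies(text, pattern=None):
    """Count lowercase word forms, optionally keeping only those that fully match a regex."""
    # Naive tokenizer (an assumption): runs of ASCII letters with an optional apostrophe part.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if pattern is not None:
        matcher = re.compile(pattern)
        tokens = [t for t in tokens if matcher.fullmatch(t)]
    return Counter(tokens)


sample = "Photographers photograph photos; the photographer printed the photo."
all_counts = word_frequencies(sample)
starting_with_photo = word_frequencies(sample, r"photo.*")
ending_in_er = word_frequencies(sample, r".*ers?")
```

Here `starting_with_photo` keeps the five distinct forms beginning with "photo", and `ending_in_er` keeps only "photographers" and "photographer".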

Phrase frequency

Phrase frequency can be calculated using the concordance tool, which finds all instances of words or phrases through simple or advanced search options. CQL (Corpus Query Language) and regular expressions can be used for complex queries involving word patterns and structures.

N-gram frequency

To analyse texts by looking at multiword expressions, Sketch Engine will compute the frequency of n-grams of different sizes. Texts with a size of billions of words are supported.
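The counting behind n-gram frequency can be sketched in a few lines. This is a generic illustration on a pre-tokenized list, not Sketch Engine's implementation; the function name `ngram_frequencies` and the sample sentence are assumptions made for the example.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous run of n tokens in the sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = "depth of field and depth of focus".split()
bigrams = ngram_frequencies(tokens, 2)
trigrams = ngram_frequencies(tokens, 3)
```

In this sample the bigram ("depth", "of") occurs twice, while every trigram occurs once. A sequence shorter than n simply yields an empty count.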

Co-occurrence analysis (web or API)

Co-occurrence analysis reveals information about the context in which words appear and helps us understand how the core meaning of the word is modified. Co-occurrence analysis is supported by our text analytics API. This type of text analysis can be done by using the following tools:

Word sketch

A word sketch gives an at-a-glance one-page overview of the context in which the word appears. The context can be clearly understood from the collocations the word keeps.
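One way to picture how collocations are ranked is to score each word pair with an association measure. The sketch below uses logDice, a score used in corpus lexicography, and treats "the word immediately before the node" as a crude stand-in for a grammatical relation. Both choices are assumptions for illustration: real word sketches rely on grammatical relations identified in tagged text, and the function name `left_collocates` is hypothetical.

```python
import math
from collections import Counter


def left_collocates(tokens, node):
    """Rank the words found immediately before `node` by their logDice score."""
    word_freq = Counter(tokens)
    # Count each word that directly precedes an occurrence of the node word.
    pair_freq = Counter(prev for prev, cur in zip(tokens, tokens[1:]) if cur == node)
    scored = {
        word: 14 + math.log2(2 * freq / (word_freq[word] + word_freq[node]))
        for word, freq in pair_freq.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


tokens = "black coffee black tea strong coffee black coffee".split()
ranked = left_collocates(tokens, "coffee")
```

In this toy input "black" precedes "coffee" twice and "strong" once, so "black" receives the higher score. Because the score depends only on the frequencies of the two words and of the pair, it does not change with the overall size of the corpus.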

Clustering

Word sketches support the clustering of collocations, which groups similar collocations together and reveals the topics they represent.

Thesaurus

Automatic synonym identification produces a thesaurus entry for every word in the language. The algorithm builds on distributional semantics, the theory that words similar in meaning tend to appear in similar contexts.
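The distributional idea can be illustrated with a toy example: represent each word by the counts of its neighbouring words, then compare words by the cosine similarity of those count vectors. This is a generic sketch of distributional similarity, not the algorithm Sketch Engine uses. The window size, the whitespace tokenizer, the function names and the sample sentences are all assumptions.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similar_words(vectors, word, top=3):
    """Return the `top` words whose contexts are most similar to those of `word`."""
    target = vectors.get(word, Counter())
    scores = {other: cosine(target, vec) for other, vec in vectors.items() if other != word}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Hypothetical sample sentences.
sentences = [
    "the camera captures sharp images",
    "the lens captures sharp images",
    "the tripod supports the camera",
]
vectors = context_vectors(sentences)
best = similar_words(vectors, "camera", top=1)
```

With these three sentences, "lens" comes out as the word most similar to "camera" because the two share the neighbours "the", "captures" and "sharp". A real thesaurus needs far more text for the similarities to be meaningful.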

Preparing data for text analysis and text analytics

The texts uploaded to Sketch Engine for text analysis and data mining are processed fully automatically into a corpus. The processing includes part-of-speech tagging and lemmatization, which simplify the analysis of texts in morphologically rich languages and improve the quality of the analysis by applying linguistic criteria.

  • Upload text: Sketch Engine supports the upload of many text formats. The texts can contain metadata, which is recognized and can be used in the analysis.
  • Text from the web: The built-in automated tool for downloading relevant texts from the web helps build a multimillion-word corpus in a few minutes. The same procedure can be used to expand an existing corpus.

Availability of the tools

After the texts are processed into a text corpus, the complete set of analytic tools (analyzers) becomes available.