Text analysis API

All Sketch Engine accounts come with an API for text analysis that supports the complete Sketch Engine functionality.

Text analysis with Sketch Engine

Sketch Engine is a comprehensive suite of text analysis tools designed to handle texts of up to billions of words in many languages and scripts. The analysis takes into account the linguistic features of each language, such as morphology and grammatical and lexical patterns.

Preparing data for text analysis

Texts uploaded to Sketch Engine for text analysis are first processed into a corpus, fully automatically. The processing includes part-of-speech tagging and lemmatization, which simplify the analysis of texts in morphologically rich languages and improve the quality of the analysis by applying linguistic criteria.

  • Upload text
    Sketch Engine supports the upload of many text formats. The texts can contain metadata, which will be recognized and can be used in the analysis.
  • Text from the web
    The built-in automated tool for downloading relevant texts from the web will help build a multimillion-word corpus in a few minutes. The same procedure can be used for expanding the corpus.

Tools

After the texts are processed into a text corpus, the complete set of analytic tools becomes available.

Topic modelling

Keyword frequency, term extraction and term frequency support topic modelling by identifying the words and phrases typical of the content of the text.
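To make the idea of "typical" words concrete, here is a minimal sketch of one common keyness score: the ratio of smoothed per-million frequencies in a focus corpus and a reference corpus. It runs locally on token lists and does not call the Sketch Engine API; the function name `keyness_scores`, the `smoothing` parameter and the toy corpora are assumptions made for illustration only.

```python
from collections import Counter


def keyness_scores(focus_tokens, reference_tokens, smoothing=1.0):
    """Score each focus-corpus word by a smoothed relative-frequency ratio."""
    focus_counts = Counter(focus_tokens)
    ref_counts = Counter(reference_tokens)
    focus_size = len(focus_tokens)
    ref_size = len(reference_tokens)

    scores = {}
    for word, count in focus_counts.items():
        # Frequencies per million tokens make corpora of different sizes comparable.
        focus_fpm = count / focus_size * 1_000_000
        ref_fpm = ref_counts[word] / ref_size * 1_000_000
        # The smoothing constant avoids division by zero for words
        # that never occur in the reference corpus.
        scores[word] = (focus_fpm + smoothing) / (ref_fpm + smoothing)
    return scores


# Toy corpora (hypothetical data, far too small for real use).
focus = "the camera lens controls the aperture and the shutter of the camera".split()
reference = "the cat sat on the mat and the dog sat on the rug".split()

scores = keyness_scores(focus, reference)
top_keywords = sorted(scores, key=scores.get, reverse=True)[:3]
```

Words that are frequent in both corpora, such as "the", score close to 1, while words concentrated in the focus corpus score much higher. A larger smoothing value shifts the ranking towards more frequent words.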

The result of processing texts about digital photography.

Frequency

Calculating word frequency is a common task in text analysis. Sketch Engine contains tools to calculate the frequencies of words, phrases and n-grams as well as of grammatical or lexical structures, e.g. the frequency of verbs in the past tense compared to the present tense.

Word frequency

The wordlist tool calculates word frequency and offers many filtering options, such as words starting with, containing or ending in a particular string, or lists of nouns, verbs and other parts of speech. The criteria can be combined, and regular expressions are supported.
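The kind of filtering described here can be illustrated with a short local sketch that counts word forms and keeps only those matching a regular expression. This is not the wordlist tool itself; `word_frequencies` and its `pattern` parameter are hypothetical names, and the tokenization is deliberately naive.

```python
import re
from collections import Counter


def word_frequencies(text, pattern=None):
    """Count lowercase word forms, optionally keeping only those matching a regex."""
    tokens = re.findall(r"\w+", text.lower())
    if pattern is not None:
        matcher = re.compile(pattern)
        # fullmatch requires the whole word to match, so "photo.*"
        # means "words starting with photo".
        tokens = [t for t in tokens if matcher.fullmatch(t)]
    return Counter(tokens)


sample = "Photographers photograph photos. A photo needs light; lighting matters."

all_words = word_frequencies(sample)
starting_with_photo = word_frequencies(sample, r"photo.*")
ending_in_ing = word_frequencies(sample, r".*ing")
```

Filtering by part of speech would additionally need tagged input, which this sketch does not attempt.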

Phrase frequency

Phrase frequency can be calculated with the concordance tool, which finds all instances of a word or phrase using simple or advanced search options. CQL (Corpus Query Language) and/or regular expressions can be used for complex queries involving word patterns and structures.
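As a rough illustration of what a concordance search returns, the sketch below finds every occurrence of a phrase in a string and keeps some context on either side. It uses a plain regular expression rather than CQL, and the function name `concordance` and the `width` parameter are assumptions for this example.

```python
import re


def concordance(text, phrase, width=30):
    """Return (left context, match, right context) for each hit of a phrase."""
    # \b anchors stop "low light" from matching inside "low lighting".
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    return hits


text = ("Low light makes focusing hard. In low light, raise the ISO. "
        "Good lighting is not the same as low light.")

hits = concordance(text, "low light")
phrase_frequency = len(hits)  # the number of hits is the phrase frequency
```

Counting the hits gives the phrase frequency; the surrounding context is what makes the result a concordance rather than a bare count.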

N-gram frequency

For analysing multiword expressions, Sketch Engine computes the frequency of n-grams of different sizes. Corpora of billions of words are supported.
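Counting n-grams amounts to sliding a window of n tokens over the text. A minimal local sketch, with the hypothetical helper `ngram_frequencies` and a toy token list, might look like this:

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous run of n tokens."""
    # zip over n shifted copies of the token list yields each window once;
    # zip stops at the shortest copy, so no padding is needed.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)


tokens = "depth of field and depth of focus".split()

bigrams = ngram_frequencies(tokens, 2)
trigrams = ngram_frequencies(tokens, 3)
most_common_bigram = bigrams.most_common(1)[0]
```

This in-memory approach only suits small inputs; very large corpora need indexed or streaming processing.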

Co-occurrence analysis

Co-occurrence analysis reveals the contexts in which a word appears and helps us understand how its core meaning is modified. This type of text analysis can be done with the following tools:

Word sketch

A word sketch gives an at-a-glance one-page overview of the context in which the word appears. The context can be clearly understood from the collocations the word keeps.
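Collocations are typically ranked by an association score rather than by raw co-occurrence counts. The sketch below uses logDice, defined as 14 + log2(2·f(x,y) / (f(x) + f(y))), over a simple token window. It is an illustration under stated assumptions: a real word sketch groups collocates by grammatical relation, which this window-based version does not do, and the names `logdice_collocates` and `window` are hypothetical.

```python
import math
from collections import Counter


def logdice_collocates(tokens, node, window=1):
    """Score words found within +/- `window` tokens of `node` using logDice."""
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        # Count each collocate at most once per occurrence of the node.
        pair_counts.update(set(neighbours) - {node})
    return {
        collocate: 14 + math.log2(
            2 * f_xy / (word_counts[node] + word_counts[collocate])
        )
        for collocate, f_xy in pair_counts.items()
    }


tokens = "zoom lens and prime lens and zoom lens".split()
scores = logdice_collocates(tokens, "lens")
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The score has a theoretical maximum of 14, reached when two words occur only together. On a corpus this small, function words such as "and" score as highly as content words; meaningful rankings need far more data.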

Clustering

Word sketches can cluster similar collocations into groups, revealing the topics that these collocations represent.

Thesaurus

Automatic synonym identification produces a thesaurus entry for every word. The algorithm exploits the theory of distributional semantics, which says that words similar in meaning tend to appear in similar contexts.
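The distributional idea can be shown with a toy example: represent each word by the counts of its neighbouring words, then compare two words by the cosine of their context vectors. This is a generic sketch of distributional similarity, not Sketch Engine's actual thesaurus algorithm; `context_vectors`, `cosine` and the sample sentences are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def context_vectors(tokens, window=2):
    """Map each word to a Counter of the words seen within +/- `window` tokens."""
    vectors = defaultdict(Counter)
    for i, token in enumerate(tokens):
        start = max(0, i - window)
        for neighbour in tokens[start:i] + tokens[i + 1:i + 1 + window]:
            vectors[token][neighbour] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


tokens = ("she bought a new camera yesterday . "
          "he bought a new lens yesterday . "
          "they walked home slowly .").split()

vectors = context_vectors(tokens)
similar = cosine(vectors["camera"], vectors["lens"])
dissimilar = cosine(vectors["camera"], vectors["home"])
```

Here "camera" and "lens" occur in identical contexts and therefore get a higher similarity than "camera" and "home", which share only the sentence-final full stop.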