Sketch Engine can handle corpora made up of texts in languages which are not supported directly. The number of available features depends on the script the language uses. Scripts divide into whitespace scripts and non-whitespace ones.

A whitespace script separates words with a space, a paragraph break or a similar character that appears as whitespace on the screen or in print. Typical examples are languages written in Latin, Cyrillic or Arabic scripts. Many scripts of India also belong to this category.

A non-whitespace script is a script that does not use whitespace to separate words. Typical examples are Chinese and Japanese. Texts in these scripts that have been transliterated into a whitespace script can make use of the same functionality as whitespace scripts.
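The difference can be illustrated with a naive whitespace split. The Python snippet below is only a sketch of the idea and not the universal tokenizer that Sketch Engine uses; the function name `whitespace_tokens` and the sample sentences are made up for illustration.

```python
# Naive whitespace tokenization: an illustration only,
# not the tokenizer used by Sketch Engine.
def whitespace_tokens(text: str) -> list[str]:
    """Split text on any run of whitespace (spaces, tabs, line breaks)."""
    return text.split()


english = "The cat sat on the mat"   # whitespace script
chinese = "我喜欢学习语言"              # non-whitespace script

# The English sentence splits into separate words.
print(whitespace_tokens(english))
# The Chinese sentence contains no whitespace, so it comes back as one
# unsplit string: no word boundaries can be found this way.
print(whitespace_tokens(chinese))
```

This is why word-level features are unavailable for non-whitespace scripts unless the text is segmented into words first.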

Available features

| Feature | Whitespace script | Non-whitespace script |
| --- | --- | --- |
| tokenization | YES, with a universal tokenizer | NO *) |
| POS tagging | NO | NO |
| concordance search | YES at word level or character level, regex allowed; NO lemma search or POS search | YES but only at character level, regex allowed; a concordance for a string of characters can be generated, no other searches are available *) |
| collocations | can be calculated from a concordance or via word sketches | NO *) |
| word lists | YES | NO *) |
| Word Sketch | YES, a universal word sketch grammar will be used; users can write their own word sketch grammar to suit their needs | NO *) |
| thesaurus | YES | NO *) |
| Create corpus from the web | YES **) | YES **) |

*) Texts in non-whitespace scripts that are tokenized with an external tool and then uploaded to Sketch Engine as a vertical file can make use of the same set of features as whitespace scripts.

**) More information about creating a corpus in an unsupported language from the web is available on the WebBootCaT page.
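As a rough sketch of the external-tokenization route described in the first footnote, the Python snippet below writes text that has already been segmented into a one-token-per-line vertical layout with `<doc>` and `<s>` structure tags. The function name `to_vertical`, the `id` attribute and the sample tokens are assumptions made for illustration. The segmentation itself is expected to come from whatever external tool suits the language, and the Sketch Engine documentation on vertical files should be checked for the exact format expected on upload.

```python
from xml.sax.saxutils import quoteattr


def to_vertical(doc_id, sentences):
    """Render pre-tokenized sentences as vertical text.

    doc_id    -- hypothetical document identifier, stored as an attribute
    sentences -- list of sentences, each a list of token strings produced
                 by an external tokenizer (not included here)
    """
    lines = [f"<doc id={quoteattr(doc_id)}>"]
    for tokens in sentences:
        lines.append("<s>")
        # One token per line; skip empty or whitespace-only tokens.
        lines.extend(tok for tok in tokens if tok.strip())
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"


# Sample segmentation, assumed to come from an external tool.
sentences = [["我", "喜欢", "学习", "语言", "。"]]
vertical = to_vertical("sample1", sentences)
print(vertical)
```

The resulting string can be saved to a file with UTF-8 encoding before upload.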

Is Sketch Engine suitable for my language?

The best way to find out is to set up free trial access, upload your texts and see how searching your corpus works. Please get in touch if you have any questions related to unsupported languages. Our support team will do their best to accommodate your requirements.