This documentation is in the form of Python examples. If you stitch the code snippets from this page together and replace placeholders in them, it should work like a charm. Report problems to support@sketchengine.co.uk


If you have your own files, you can create a new corpus using our API within just a few steps:

  1. authenticate yourself,
  2. create a new corpus for a given language,
  3. upload files and then
  4. wait for processing.

After these steps, you will be able to access your corpus through the API as usual (see what you can do). The variety of available queries depends on the language and on the size and content of the files. So let’s start. You will need a few Python modules and your API key, which you can get here.

#!/usr/bin/python
import json
import requests
import time

auth = ('%username%', '%api_key%')
URL = 'https://the.sketchengine.co.uk/api'

Before creating a corpus, you need to know what language you will be using. Let’s stick with English for now.

r = requests.post(URL + '/corpora', auth=auth, data=json.dumps({
    'language': 'en',
    'name': 'api_test'
}))

You need only two parameters: the language of the corpus and its name. Use ISO 639-1 language codes. The API also provides a list of all languages supported by Sketch Engine.
We recommend using only ASCII (uppercase and lowercase Latin) characters in corpus names.
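
The naming recommendation above can be checked before the corpus is created. The following is a minimal sketch, not part of the API: the helper name is_safe_corpus_name is hypothetical, and allowing digits and underscores is an assumption based on the example name 'api_test'.

```python
import re

# Assumption: ASCII letters, digits and underscores are acceptable,
# because the example name 'api_test' contains an underscore.
# This is not an official Sketch Engine rule.
_SAFE_NAME = re.compile(r'[A-Za-z0-9_]+')


def is_safe_corpus_name(name):
    """Return True if name consists only of ASCII letters, digits and underscores."""
    return _SAFE_NAME.fullmatch(name) is not None
```

Calling is_safe_corpus_name('api_test') before sending the request lets you reject a problematic name locally.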

All responses are in JSON. You will need the corpus ID for future calls; this is how you get it:

corpus_id = r.json()['data']['id']
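
If the request fails, the response may not contain the 'data' and 'id' keys, and the line above would raise a bare KeyError. A minimal sketch of a more defensive lookup follows; the helper name extract_corpus_id is hypothetical, and nothing is assumed about the shape of an error response beyond the keys being absent.

```python
def extract_corpus_id(payload):
    """Return payload['data']['id'] or raise ValueError with the payload shown."""
    try:
        return payload['data']['id']
    except (KeyError, TypeError):
        # The structure of error responses is not documented here,
        # so the whole payload is reported for inspection.
        raise ValueError('unexpected response: %r' % (payload,))
```

It would be used as corpus_id = extract_corpus_id(r.json()).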

Now let’s upload some files. You need to provide their names, actual content and MIME type. Here’s an example.

files = {'file': ('testing.txt', open('/path/to/your/file/testing.txt', 'rb'), 'text/plain')}
r = requests.post(URL + '/corpora/' + str(corpus_id) + '/documents', auth=auth, files=files)
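
When you upload several files, typing the name and MIME type by hand gets tedious. The sketch below derives both from the path using Python's standard mimetypes module. The helper name build_upload_entry is hypothetical, and which MIME types the API accepts is not stated here, so the function returns None for unknown extensions instead of guessing.

```python
import mimetypes
import os


def build_upload_entry(path):
    """Return (filename, mime_type) for a path; mime_type is None if unknown."""
    # guess_type looks only at the file extension, not at the content.
    mime_type, _encoding = mimetypes.guess_type(path)
    return os.path.basename(path), mime_type


# Possible use with the upload call above (not executed here):
# name, mime = build_upload_entry(path)
# with open(path, 'rb') as handle:
#     files = {'file': (name, handle, mime)}
#     r = requests.post(URL + '/corpora/' + str(corpus_id) + '/documents',
#                       auth=auth, files=files)
```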

When you send files to the corpus, they are processed automatically, which takes some time. Sketch Engine returns the current status in the response, and you need to wait until processing is done. The immediate status is available right in the response:

print(r.json()['data']['status'])

Statuses are:

  • TAGGING (initial processing of uploaded files)
  • COMPILING (indexing, computing word sketches, thesauri, wordlists, …)
  • EMPTY (if the files don’t yield any data; you will need to upload some other data)
  • READY (the corpus is ready to be compiled)
  • COMPILED (the corpus is ready to be queried)

So now you need to wait:

# Check the status, then poll every 5 seconds until tagging is finished.
r = requests.get(URL + '/corpora/' + str(corpus_id) + '/compilation', auth=auth)
status = r.json()['data']['status']
while status != 'READY':
    time.sleep(5)  # wait before asking again
    r = requests.get(URL + '/corpora/' + str(corpus_id) + '/compilation', auth=auth)
    status = r.json()['data']['status']
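
The loop above never ends if the corpus becomes EMPTY or if processing stalls. The following is a minimal sketch of a polling helper with a timeout that also stops on EMPTY. The name wait_for_status and its parameters are hypothetical, and the one-hour default timeout is an arbitrary assumption, not a documented limit.

```python
import time


def wait_for_status(fetch_status, target, failure=('EMPTY',), interval=5,
                    timeout=3600, sleep=time.sleep, clock=time.time):
    """Call fetch_status() repeatedly until it returns target.

    Raises RuntimeError if a failure status is seen or the timeout expires.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status == target:
            return status
        if status in failure:
            raise RuntimeError('corpus reached status %s' % status)
        if clock() >= deadline:
            raise RuntimeError('timed out waiting for %s (last status: %s)'
                               % (target, status))
        sleep(interval)


# Possible use with the names defined earlier on this page (not executed here):
# def fetch_status():
#     r = requests.get(URL + '/corpora/' + str(corpus_id) + '/compilation', auth=auth)
#     return r.json()['data']['status']
# wait_for_status(fetch_status, 'READY')
```

Passing the status lookup as a function keeps the helper independent of the HTTP library and makes it easy to try out without a network connection.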

Once the files are converted and tagged, the status of the corpus will be READY. Now it is time to run the compilation so that you can query the corpus later. The compilation also takes some time, so you need to wait again.

# Check the status, then poll every 5 seconds until the corpus is compiled.
r = requests.get(URL + '/corpora/' + str(corpus_id) + '/compilation', auth=auth)
status = r.json()['data']['status']
while status != 'COMPILED':
    time.sleep(5)  # wait before asking again
    r = requests.get(URL + '/corpora/' + str(corpus_id) + '/compilation', auth=auth)
    status = r.json()['data']['status']

Here you go! The status should now be COMPILED and you are free to use the corpus. Remember that you will use corpname="user/%username%/%corpname%" in your queries.
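
The corpname value can be assembled from the same placeholders. This is a small sketch; the helper name user_corpname is hypothetical, and it assumes that %corpname% stands for the name you gave the corpus when creating it.

```python
def user_corpname(username, corpname):
    """Build the corpname value in the form user/<username>/<corpname>."""
    return 'user/%s/%s' % (username, corpname)


# Example query parameters for later API calls:
params = {'corpname': user_corpname('%username%', 'api_test')}
```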

If the status is READY after running a compilation, it means that the compilation probably failed.
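
One way to notice this while polling is to remember the statuses you have seen: READY appearing after COMPILING suggests that the compilation did not succeed. The sketch below works on a list of observed statuses. The helper name compilation_failed is hypothetical, and the rule is only the heuristic described above, not a documented guarantee.

```python
def compilation_failed(history):
    """Return True if READY appears after COMPILING in the observed statuses."""
    seen_compiling = False
    for status in history:
        if status == 'COMPILING':
            seen_compiling = True
        elif status == 'READY' and seen_compiling:
            # The corpus fell back to READY after compiling had started.
            return True
    return False
```

Appending each polled status to a list and calling compilation_failed on it lets the waiting loop stop instead of polling forever.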

If you have any questions or need to report a problem, contact us at support@sketchengine.co.uk

Happy hacking!