This page describes the Corpus Architect API for extracting keywords from a given text. The method compares the text against a reference corpus, so you must also specify the language of the text. Several parameters control the extraction process.

The response is a JSON object: a dictionary with the keys keywords, error, length and ref_corp. If a keyword list is returned, the value of keywords is an array of triplets (arrays) in the format [word, frequency, keywordness_score], length contains the number of tokens in the text, and ref_corp contains the id of the corpus used for extracting the keywords. If an error is encountered during processing, the keywords array is empty and error contains the error message. Frequency stands for the number of occurrences in the reference corpus. Keywordness score = (<frequency in the text> + <simple maths parameter>) / (<frequency in the reference corpus> + <simple maths parameter>). It roughly expresses how relevant the word is in the text compared to general text in the same language. The keyword list is sorted by keywordness in descending order.
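
As a rough illustration of the response structure described above, the following Python sketch unpacks a response body into named fields. The function name parse_keywords_response is hypothetical, and the sample body is hand-written in the documented shape (values borrowed from the first example on this page); the exact wire format is an assumption.

```python
import json


def parse_keywords_response(raw):
    """Parse a JSON response body; raise ValueError if the API reported an error."""
    obj = json.loads(raw)
    if obj.get("error"):
        raise ValueError(obj["error"])
    # Each entry is a triplet [word, frequency, keywordness_score].
    keywords = [
        {"word": word, "frequency": freq, "score": score}
        for word, freq, score in obj.get("keywords", [])
    ]
    return keywords, obj.get("length", 0), obj.get("ref_corp", "")


# Hand-written sample body in the documented shape (an assumption, not captured output).
sample = json.dumps({
    "keywords": [["software", 157, 225.188706], ["hardware", 20, 50.452344]],
    "error": "",
    "length": 3204,
    "ref_corp": "ententen",
})

keywords, length, ref_corp = parse_keywords_response(sample)
for kw in keywords:
    print(kw["word"], kw["frequency"], kw["score"])
```

Raising an exception on a non-empty error value keeps the success path free of error checks; returning the error string instead would work just as well.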

! Only POST requests are allowed.

 

API parameters

Parameters are passed in the body of the POST request as a JSON dictionary: parameter names are the keys and their values are the dictionary values.

Text (text)

One of the three obligatory parameters. The text must be encoded in UTF-8. Since only POST requests are supported, there is no limit on its length.

There is no default value for this parameter.

Passphrase (passphrase)

Authentication is currently done by supplying a passphrase, which must be assigned to the customer who wants to use this API. This way, no new user account needs to be added to the Sketch Engine.

There is no default value for this parameter.

Language (language)

You must also provide the language of the text. We currently support several major languages for which we have reference corpora. The language is also needed to tokenize the text. Allowed values are:

  • english
  • spanish
  • german
  • czech
  • arabic
  • chinese-simplified
  • french
  • hindi
  • indonesian
  • italian
  • japanese
  • persian
  • portuguese
  • russian

Default value = english.

Simple maths parameter (simple_maths_n)

A low value, e.g. 1, returns lower-frequency keywords, whereas a higher N returns higher-frequency keywords (for further details see Simple Maths). Values are natural numbers; 1, 10, 100 and 1000 are the usual choices.

Default value = 100.
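
To see why N shifts the list towards rarer or more frequent words, the keywordness formula given at the top of this page can be tried on two made-up words. This is only a sketch: the function name keywordness and the frequencies are invented for illustration, and the page does not say whether the API works with raw or normalised frequencies, so the numbers are treated simply as comparable values.

```python
def keywordness(freq_text, freq_ref, n=100):
    """Keywordness score as given on this page: (f_text + N) / (f_ref + N)."""
    return (freq_text + n) / (freq_ref + n)


# Two made-up words: (frequency in the text, frequency in the reference corpus).
rare = (10, 1)        # infrequent, but ten times more common in the text
common = (1000, 500)  # frequent, but only twice as common in the text

for n in (1, 10, 100, 1000):
    print(n, round(keywordness(*rare, n), 3), round(keywordness(*common, n), 3))
```

With N = 1 the rare word scores higher (5.5 against roughly 2.0); with N = 1000 the order is reversed, because adding a large constant to both frequencies flattens the ratio for low-frequency words.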

Attribute (attribute)

Currently only the word attribute is supported; other attributes would require the input text to be lemmatized and PoS-tagged, which we cannot do for all supported languages. The only permitted value is word.

Default value = word.

Exclude stop words from keyword list (exclude_stop_words)

To exclude stop words such as a, about, like, the, in, three, during, I, is, it, much, she, there and other very frequent words, use the value true.

Default value = true.

Only alphanumeric characters (alphanumeric)

If keywords in the resulting list should contain only alphanumeric characters, use the value true.

Default value = true.

At least one alphabetic character (one_alphabetic)

If keywords should contain at least one alphabetic character, use the value true.

Default value = true.
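
The difference between the alphanumeric and one_alphabetic filters can be illustrated on the client side. The sketch below uses Python's Unicode-aware string methods; the function names are hypothetical, and the server's exact definition of "alphanumeric" and "alphabetic" is not documented here, so this shows the idea only and does not reproduce the API's implementation.

```python
def only_alphanumeric(word):
    """True if every character is a letter or a digit."""
    return word.isalnum()


def has_alphabetic(word):
    """True if at least one character is a letter."""
    return any(ch.isalpha() for ch in word)


for word in ("software", "2010", "e-mail", "mp3"):
    print(word, only_alphanumeric(word), has_alphabetic(word))
```

Under these assumptions "2010" passes the alphanumeric test but has no alphabetic character, while "e-mail" contains letters but fails the alphanumeric test because of the hyphen.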

Minimal length of keywords (min_length)

This parameter specifies the minimal length of keywords in the resulting list.

Default value = 1.

Minimal frequency of keywords (min_frequency)

You may limit the keywords in the list by their frequency in the reference corpus.

Default value = 1.

Maximal number of keywords (max_keywords)

You may limit the length of the keyword list: only the first N keywords are output.

Default value = 100.
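
The parameters above can be collected into a request body with a small helper. The following is a minimal sketch, assuming you want to send every parameter explicitly; build_payload, DEFAULTS and SUPPORTED_LANGUAGES are hypothetical names, and the values are copied from the parameter descriptions on this page.

```python
import json

# Language values listed on this page.
SUPPORTED_LANGUAGES = {
    "english", "spanish", "german", "czech", "arabic", "chinese-simplified",
    "french", "hindi", "indonesian", "italian", "japanese", "persian",
    "portuguese", "russian",
}

# Default values as documented on this page.
DEFAULTS = {
    "language": "english",
    "simple_maths_n": 100,
    "attribute": "word",
    "exclude_stop_words": True,
    "alphanumeric": True,
    "one_alphabetic": True,
    "min_length": 1,
    "min_frequency": 1,
    "max_keywords": 100,
}


def build_payload(text, passphrase, **options):
    """Return the UTF-8 encoded JSON body for the POST request (hypothetical helper)."""
    unknown = set(options) - set(DEFAULTS)
    if unknown:
        raise ValueError("unknown parameter(s): " + ", ".join(sorted(unknown)))
    params = dict(DEFAULTS, **options)
    if params["language"] not in SUPPORTED_LANGUAGES:
        raise ValueError("unsupported language: " + params["language"])
    params["text"] = text
    params["passphrase"] = passphrase
    return json.dumps(params).encode("utf-8")


body = build_payload("Some long text here...", "...passphrase...",
                     language="czech", min_length=3)
print(body.decode("utf-8"))
```

Rejecting unknown keys on the client catches misspelled parameter names before the request is sent; whether the server itself reports or silently ignores them is not stated on this page.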

Python example using json and urllib.request modules

#!/usr/bin/env python3

import json
import urllib.request

# Request parameters, sent as a UTF-8 encoded JSON dictionary in the POST body.
data = json.dumps({
    'text': '''Some long text here...''',
    'language': 'english',
    'passphrase': '...passphrase...',
    'simple_maths_n': 10,
    'attribute': 'word',
    'exclude_stop_words': True,
    'alphanumeric': True,
    'one_alphabetic': True,
    'min_length': 3,
    'max_keywords': 10,
    'min_frequency': 5
    }).encode('utf-8')

# Supplying a body makes urllib send a POST request.
req = urllib.request.Request("https://beta.sketchengine.co.uk/get_keywords/", data)
with urllib.request.urlopen(req) as f:
    obj = json.load(f)

# An empty (or missing) error value means the extraction succeeded.
if not obj.get('error'):
    print('Length:', obj.get('length', 0))
    print('Reference corpus:', obj.get('ref_corp', ''))
    for k in obj.get('keywords', []):
        print('%s\t%d\t%f' % tuple(k))
else:
    print('Error encountered:', obj.get('error'))

AJAX example (using jQuery library)

! Modern browsers do not allow loading content from foreign domains (cross-domain AJAX), so you may run into the following problem when calling the API from a browser:

XMLHttpRequest cannot load [URL]. Origin null is not allowed by Access-Control-Allow-Origin

For more info about this issue, read HTTP access control.

function get_keywords() {
    $.ajax({
        url: 'https://beta.sketchengine.co.uk/get_keywords/',
        async: true,
        beforeSend: function () { $('#output').text('Loading...'); },
        type: 'POST',
        data: JSON.stringify({
            'text': $('#text').val(),
            'passphrase': '...passphrase...',
            'language': $('#language option:selected').val(),
            'simple_maths_n': $('#simple_maths_n').val(),
            'attribute': $('#attribute option:selected').val(),
            'exclude_stop_words': $('#exclude_stop_words').is(':checked'),
            'alphanumeric': $('#alphanumeric').is(':checked'),
            'one_alphabetic': $('#one_alphabetic').is(':checked'),
            'min_length': $('#min_length').val(),
            'min_frequency': $('#min_frequency').val(),
            'max_keywords': $('#max_keywords').val()
        }),
        success: function (data) {
            $('#output').text(data);
        },
        error: function (data, textStatus, errorThrown) {
            $('#output').text(textStatus);
        }
    });
}

Put this function inside a <script> tag; you can then use the following HTML form to call it:

<form>
  <textarea id="text"></textarea>
  <table>
    <tr>
       <td><label>Language:</label></td>
       <td><select id="language">
          <option value="english" selected>English</option>
          <option value="german">German</option>
          <option value="czech">Czech</option>
       </select></td>
    </tr>
    <tr>
       <td><label>Attribute:</label></td>
       <td><select id="attribute">
          <option value="word" selected>word</option>
          <option value="lemma">lemma</option>
          <option value="lempos">lempos</option>
       </select></td>
    </tr>
    <tr>
       <td><label>Simple math N</label></td>
       <td><input type="text" size="2" value="100" id="simple_maths_n" /></td>
    </tr>
    <tr>
       <td><label>Exclude stop words</label></td>
       <td><input type="checkbox" checked id="exclude_stop_words" /></td>
    </tr>
    <tr>
       <td><label>Only alphanumeric</label></td>
       <td><input type="checkbox" checked id="alphanumeric" /></td>
    </tr>
    <tr>
       <td><label>At least one alphabetic</label></td>
       <td><input type="checkbox" checked id="one_alphabetic" /></td>
    </tr>
    <tr>
       <td><label>Min. length</label></td>
       <td><input type="text" size="2" id="min_length" value="1" /></td>
    </tr>
    <tr>
       <td><label>Max. keywords</label></td>
       <td><input type="text" size="2" id="max_keywords" value="100" /></td>
    </tr>
    <tr>
       <td><label>Min. frequency:</label></td>
       <td><input type="text" size="2" id="min_frequency" value="1" /></td>
    </tr>
  </table>
  <input type="button" onclick="get_keywords()" value="Get Keywords" />
  <input type="reset" value="Clear form" />
</form>
    
<h3>Output:</h3>
<div id="output"></div>

First example (article “Software” from Wikipedia)

Keywords extracted from the plain text of this wiki page.

Length: 3204 Reference corpus: ententen

word frequency keywordness_score
software 157 225.188706
hardware 20 50.452344
microsoft 15 47.498688
programming 11 28.355465
application 21 27.617109
operating 13 27.261984
applications 13 22.428778
programs 16 21.760401
apis 6 19.720981
licence 8 19.362902
documentation 7 18.498769
windows 7 18.153837
data 23 17.982899
systems 14 17.858611
languages 7 17.659631
instructions 7 17.436944
testing 8 17.086467
user 10 16.932754
main 12 16.477188
bundled 5 16.199756
designing 5 15.081348
platform 6 14.930669
code 8 13.946362
usually 9 13.363117
libraries 5 13.104471
companies 9 12.934569
machine 6 12.726932
language 9 12.126843
article 9 12.073843
industry 8 11.075764
standard 7 10.992700
tools 5 10.370262
library 5 10.257384
instance 5 10.083171
operations 5 9.827445
include 9 9.136637
general 8 8.505278
specific 6 8.440993
users 5 8.343546
term 5 7.344945
design 5 6.714775
development 7 6.383602
word 5 5.993911
called 7 5.692213
different 8 5.488476
program 6 5.233687
used 9 4.436685
using 5 4.224718
use 9 3.588931
like 13 3.368841
time 5 1.190129

Second example: Carroll’s Alice in Wonderland

The API extracts these keywords from the well-known work of fiction, obtained from Project Gutenberg:

Length: 37592 Reference corpus: ententen

word frequency keywordness_score
alice 403 108.070754
queen 75 19.313309
turtle 58 16.052738
hatter 56 15.879250
gryphon 55 15.626327
mock 56 15.195047
herself 83 15.059740
rabbit 47 12.957416
duchess 42 12.123300
king 63 11.674443
dormouse 40 11.633408
gutenberg 33 9.776340
mouse 39 9.682684
said 458 9.388244
tone 40 9.314341
hare 31 9.130406
cat 37 9.110194
march 34 8.817291
caterpillar 28 8.383761
project 87 7.818975
voice 47 6.950982
began 58 6.877441
went 83 6.655782
little 127 6.639094
round 41 6.589788
dear 29 6.501213
replied 29 6.320019
looked 45 5.883069
foundation 25 5.784868
thought 74 5.480465
soup 18 5.404122
quite 55 5.359216
electronic 27 5.351227
cried 20 5.080358
hastily 16 5.059736
curious 19 5.017770
head 48 4.766795
dinah 14 4.724124
anxiously 14 4.615736
till 21 4.584180
minute 21 4.559626
donations 15 4.510763
dodo 13 4.447773
door 28 4.425246
jury 16 4.408554
moment 31 4.269236
pigeon 12 4.136826
white 30 4.039087
majesty 12 4.025759
mad 14 4.023274
cats 13 3.960416
cook 13 3.938081
archive 13 3.910604
footman 11 3.904012
works 33 3.874752
sat 17 3.839269
garden 15 3.832139

Support

If you encounter problems with the API or it malfunctions, please contact us at support@sketchengine.co.uk.