Sketch Engine has a dedicated interface for working with learner corpora. The interface allows the user to search by the error itself, by the type of error, by the correction, or by any combination of these criteria.

In addition, any metadata included in the corpus can be used in the search and analysed to get information about how learner mistakes are distributed across age groups, proficiency levels, mother tongue, types of test tasks etc.

A correctly constructed learner corpus can provide answers to global questions such as:

  • what is the most frequent type of error?
  • which age group makes the most mistakes?

as well as very specific questions:

  • are mistakes related to verb tenses more frequent at B2 or C1 level?

Search options available in the learner corpus search interface

Text types are generated from the metadata included in the corpus.

Setting up a learner corpus

A learner corpus can be created from an error-annotated text or an error-annotated vertical file.

Please contact us if the data are in a different format. Our team will inspect your data and will advise or assist in converting and/or uploading your data.

Learner corpus from common text formats

For the learner corpus search interface to work correctly, annotate the errors and corrections with err and corr structures as shown below, then upload the data in the usual way.

We attended a <err type="typo">cnoference</err><corr type="typo">conference</corr> in Rio last week. The weather <err type="tense">has been</err><corr type="tense">was</corr> very nice.
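Before uploading, it can help to check that every error is followed by a correction of the same type. The following is a minimal Python sketch, not part of Sketch Engine: the names PAIR_RE and extract_pairs are hypothetical, and it assumes flat (non-nested) annotations written exactly as in the example above.

```python
import re

# Matches an <err type="..."> element immediately followed by its
# <corr type="..."> element. Assumes non-nested annotations.
PAIR_RE = re.compile(
    r'<err type="(?P<etype>[^"]*)">(?P<err>.*?)</err>\s*'
    r'<corr type="(?P<ctype>[^"]*)">(?P<corr>.*?)</corr>',
    re.DOTALL,
)


def extract_pairs(text):
    """Return (type, error, correction) tuples; raise if the types differ."""
    pairs = []
    for m in PAIR_RE.finditer(text):
        if m.group("etype") != m.group("ctype"):
            raise ValueError(
                f"type mismatch: {m.group('etype')!r} vs {m.group('ctype')!r}"
            )
        pairs.append((m.group("etype"), m.group("err"), m.group("corr")))
    return pairs


sample = (
    'We attended a <err type="typo">cnoference</err>'
    '<corr type="typo">conference</corr> in Rio last week. '
    'The weather <err type="tense">has been</err>'
    '<corr type="tense">was</corr> very nice.'
)

for err_type, error, correction in extract_pairs(sample):
    print(err_type, error, correction, sep=" | ")
```

Listing the pairs this way also gives a quick overview of which error types occur in a text before it is turned into a corpus.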

Learner corpus from a vertical file

If the corpus is to be uploaded as a vertical file, it should follow these specifications:

 <err type="Typo">
cnoference      NN      cnoference-n
</err>
<corr type="Typo">
conference      NN      conference-n
</corr> 

This means that “cnoference” was corrected to “conference”.

Common requirements

The following structures are mandatory in error corpora and must be properly closed, because they can be nested:

 <err> 

and

 <corr> 

The ‘type’ attribute must have the same value in both the error and its correction.

Either the error or the correction can be empty, indicating that a word was inserted or deleted by the corrector. In that case, a special ===NONE=== token must be inserted in place of the missing content. For example:

 <err type="DeletedWord">
cnoference      NN      cnoference-n
</err>
<corr type="DeletedWord">
===NONE===      ===NONE===      ===NONE===
</corr> 

This means that the word “cnoference” was deleted by the corrector.
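If the vertical file is generated by a script, the placeholder can be filled in automatically. The sketch below is illustrative only: vertical_pair and NONE_TOKEN are hypothetical names, and it assumes tab-separated columns with three attributes per token (word, tag, lemma), as in the examples on this page.

```python
NONE_TOKEN = "===NONE==="


def vertical_pair(err_type, err_tokens, corr_tokens, columns=3):
    """Build vertical-file lines for one error/correction pair.

    err_tokens and corr_tokens are lists of tuples, one tuple per token,
    each holding `columns` attribute values (assumed: word, tag, lemma).
    An empty list is replaced by a single ===NONE=== placeholder row.
    """
    placeholder = [(NONE_TOKEN,) * columns]
    lines = [f'<err type="{err_type}">']
    lines += ["\t".join(tok) for tok in (err_tokens or placeholder)]
    lines.append("</err>")
    # The correction reuses the same type as the error.
    lines.append(f'<corr type="{err_type}">')
    lines += ["\t".join(tok) for tok in (corr_tokens or placeholder)]
    lines.append("</corr>")
    return lines


# A word deleted by the corrector: the correction side is empty.
deleted = vertical_pair(
    "DeletedWord", [("cnoference", "NN", "cnoference-n")], []
)
print("\n".join(deleted))
```

Passing an empty list for err_tokens instead would produce the mirror case, a word inserted by the corrector.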

‘Double errors’ can be indicated by nesting the structures. For example:

 international   JJ      international-j
conference      NN      conference-n
<err type="Repetition">
<err type="Typo">
cnoference      NN      cnoference-n
</err>
<corr type="Typo">
conference      NN      conference-n
</corr>
</err>
<corr type="Repetition">
===NONE===      ===NONE===      ===NONE===
</corr> 

In the example above, the word “conference” was misspelt as “cnoference” and also repeated, so the corrector first corrected the spelling and then marked the repeated word for deletion.
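Because nested structures are easy to get wrong, a small pre-upload check can be useful. The following Python sketch is not a Sketch Engine tool: check_vertical and TAG_RE are hypothetical names, and it assumes each structure tag sits on its own line with at most a type attribute. It reports structures that are not closed, err structures without a following corr at the same nesting level, and mismatched types.

```python
import re

# Recognises <err ...>, </err>, <corr ...> and </corr> lines only.
TAG_RE = re.compile(r'^<(/?)(err|corr)(?:\s+type="([^"]*)")?\s*>$')


def check_vertical(lines):
    """Return a list of problems found in the err/corr structures."""
    problems = []
    stack = []         # open structures as (name, type, line number)
    awaiting = [None]  # per nesting level: type of a closed <err> needing a <corr>
    for no, raw in enumerate(lines, start=1):
        m = TAG_RE.match(raw.strip())
        if not m:
            continue  # token line or unrelated structure; ignored here
        closing, name, typ = m.group(1) == "/", m.group(2), m.group(3)
        if not closing:
            expected = awaiting[-1]
            if name == "corr":
                if expected is None:
                    problems.append(f"line {no}: <corr> without a preceding <err>")
                elif typ != expected:
                    problems.append(
                        f"line {no}: corr type {typ!r} does not match err type {expected!r}"
                    )
            elif expected is not None:
                problems.append(f"line {no}: <err type={expected!r}> has no <corr>")
            awaiting[-1] = None
            stack.append((name, typ, no))
            awaiting.append(None)
        else:
            if not stack or stack[-1][0] != name:
                problems.append(f"line {no}: unexpected </{name}>")
                continue
            _, open_type, _ = stack.pop()
            inner = awaiting.pop()
            if inner is not None:
                problems.append(f"line {no}: <err type={inner!r}> has no <corr>")
            if name == "err":
                awaiting[-1] = open_type
    for name, _, no in stack:
        problems.append(f"line {no}: <{name}> is never closed")
    for typ in awaiting:
        if typ is not None:
            problems.append(f"end of input: <err type={typ!r}> has no <corr>")
    return problems


nested = """international\tJJ\tinternational-j
conference\tNN\tconference-n
<err type="Repetition">
<err type="Typo">
cnoference\tNN\tcnoference-n
</err>
<corr type="Typo">
conference\tNN\tconference-n
</corr>
</err>
<corr type="Repetition">
===NONE===\t===NONE===\t===NONE===
</corr>""".splitlines()

print(check_vertical(nested))
```

An empty result means that no problems were found by these particular checks; it does not guarantee that the file meets every requirement of the vertical format.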

Visual style for error and correction

When you search in a learner corpus, errors are usually rendered in red and corrections in green. You can change this by defining DISPLAYCLASS in the configuration file for the two corresponding structure definitions. For example:

STRUCTURE err {
    DISPLAYCLASS "errclass"
}
STRUCTURE corr {
    DISPLAYCLASS "corrclass"
}

Then, in the CSS file (view.css), define the classes and add the styles you want:

.errclass {
    background-color: red;
    color: white;
    font-weight: bold;
}
.corrclass {
    background-color: green;
    color: white;
    font-weight: bold;
}