Sketch Engine has a dedicated interface for working with learner corpora. The interface allows the user to search by the error itself, by the type of error, by the correction, or by any combination of these criteria.

In addition, any metadata included in the corpus can be used in searches and analysed to show how learner mistakes are distributed across age groups, proficiency levels, mother tongues, types of test task, etc.

A correctly constructed learner corpus can provide answers to global questions such as:

  • what is the most frequent type of error?
  • which age group makes the most mistakes?

as well as very specific questions:

  • are mistakes related to verb tenses more frequent at B2 or C1 level?

[Figure: Search options available in the learner corpus search interface]

Text types are dependent on the metadata included in the corpus.

Setting up a learner corpus for Sketch Engine

It is highly recommended that the corpus be uploaded as a vertical file according to the specifications on this page.

Please contact us if the data are in a different format. Our team will inspect your data and will advise or assist in converting and/or uploading your data.

Setting up a vertical file for a learner corpus

Each error and its correction are marked as two consecutive segments, e.g.

 <err type="Typo">
cnoference      NN      cnoference-n
</err>
<corr type="Typo">
conference      NN      conference-n
</corr> 

means that “cnoference” was corrected to “conference”. The following structures are mandatory in error corpora, and each must be properly closed (this is required because the structures can be nested):

 <err> 

and

 <corr> 

The ‘type’ attribute must have the same value in the error and in its correction.
Either the error or the correction can be empty, but an empty structure must contain the special ‘===NONE===’ token. For example:

 <err type="DeletedWord">
cnoference      NN      cnoference-n
</err>
<corr type="DeletedWord">
===NONE===      ===NONE===      ===NONE===
</corr> 

This means that the word “cnoference” was deleted by the corrector. The nesting works in a natural way. For example:

 international   JJ      international-j
conference      NN      conference-n
<err type="Repetition">
<err type="Typo">
cnoference      NN      cnoference-n
</err>
<corr type="Typo">
conference      NN      conference-n
</corr>
</err>
<corr type="Repetition">
===NONE===      ===NONE===      ===NONE===
</corr> 

This means that the word “cnoference” was first corrected to “conference” and then deleted because it is actually a repetition.
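
Before uploading, it can help to check a vertical file against the rules described above. The following Python sketch is not part of Sketch Engine; the function name check_learner_markup is hypothetical. It assumes that every <corr> directly follows the closing tag of its <err>, that the optional ‘type’ attribute is written with double quotes, and that any other structure line (such as <s> or <doc>) can be ignored. It reports unbalanced tags, mismatched types, errors without a correction, and empty structures that lack the ===NONE=== token.

```python
import re

# <err ...> / <corr ...> with an optional double-quoted type attribute.
OPEN_TAG = re.compile(r'^<(err|corr)(?:\s+type="([^"]*)")?\s*>$')
CLOSE_TAG = re.compile(r'^</(err|corr)>$')


def check_learner_markup(lines):
    """Return a list of (line_number, message), one per problem found.

    `lines` is a list of vertical-file lines. Hypothetical helper.
    """
    problems = []
    stack = []      # open structures as [tag, type, content_count]
    pending = None  # (type, line_number) of a closed <err> awaiting its <corr>

    def flag_missing_corr():
        nonlocal pending
        if pending is not None:
            problems.append(
                (pending[1], f"<err type={pending[0]!r}> has no <corr> after it"))
            pending = None

    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        opened = OPEN_TAG.match(line)
        closed = CLOSE_TAG.match(line)
        if opened:
            tag, etype = opened.groups()
            if tag == "corr":
                if pending is None:
                    problems.append((lineno, "<corr> without a preceding <err>"))
                elif pending[0] != etype:
                    problems.append(
                        (lineno, f"type {etype!r} does not match error type {pending[0]!r}"))
                pending = None
            else:
                flag_missing_corr()
            stack.append([tag, etype, 0])
        elif closed:
            flag_missing_corr()
            tag = closed.group(1)
            if not stack or stack[-1][0] != tag:
                problems.append((lineno, f"unexpected </{tag}>"))
                continue
            _, etype, count = stack.pop()
            if count == 0:
                problems.append((lineno, f"empty <{tag}> needs a ===NONE=== token"))
            if stack:
                stack[-1][2] += 1  # a nested structure counts as content
            if tag == "err":
                pending = (etype, lineno)
        elif line.startswith("<") and line.endswith(">"):
            continue  # assumption: other structures are not checked here
        else:
            flag_missing_corr()
            if stack:
                stack[-1][2] += 1  # ordinary token line, or ===NONE===

    flag_missing_corr()
    for tag, etype, _ in stack:
        problems.append((len(lines), f"<{tag} type={etype!r}> is never closed"))
    return problems


# The nested example from this page, with tab-separated columns.
sample = [
    "international\tJJ\tinternational-j",
    "conference\tNN\tconference-n",
    '<err type="Repetition">',
    '<err type="Typo">',
    "cnoference\tNN\tcnoference-n",
    "</err>",
    '<corr type="Typo">',
    "conference\tNN\tconference-n",
    "</corr>",
    "</err>",
    '<corr type="Repetition">',
    "===NONE===\t===NONE===\t===NONE===",
    "</corr>",
]
print(check_learner_markup(sample))  # prints []
```

An empty list means no problems were found. A single "pending" slot is enough even for nested errors, because under the assumption above a correction always comes straight after the error it belongs to.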