In Sketch Engine, users create subcorpora in their own namespace: each user has their own subcorpora and cannot access the subcorpora of other users.

To share common subcorpora, it is possible to create a list of subcorpora that are accessible to all users (so-called “global subcorpora”). The list of global subcorpora is defined in a subcorpus definition file. An example follows; the comments at the start of the file describe the format:

###############################################################################
# Subcorpus definition file
###############################################################################
#
# Subcorpora can be created by users in their own name space,
# each user has their own subcorpora and cannot access subcorpora of
# other users.
# To share common subcorpora, it is possible to create a list of
# subcorpora which are accessible by all users.
# This file defines subcorpora names and respective subqueries.
#
#
# Subcorpus definition format
# ----------------------------
# *FREQLISTATTRS attr1 attr2
#
# =subcorpus_id
#   structure
#   sub-query
#
# =subcorpus_id
#   -CQL-
#   full-cql-query
#
# FREQLISTATTRS specifies a list of attributes for which frequency
# lists should be precomputed.
#
# Sub-query is a part of a corpus query which can be used in
# "within <structure>" clause.  It can consist of and/or combination
# of attribute-value pairs.
#
# Full-cql-query is any CQL query whose result (KWIC) is taken as subcorpus
# definition.
#
# All strings starting with # are comments and are ignored to the end of line.
#
###############################################################################

*FREQLISTATTRS word lemma lempos

=spoken
  bncdoc
  alltyp="Spoken context-governed" | alltyp="Spoken demographic"


=book60
  bncdoc
  alltim="1960-1974" & wrimed="Book"


=first1000
  -CQL-
  [#0-1000]


=same_as_book60
  -CQL-
  <bncdoc alltim="1960-1974" & wrimed="Book"/>
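
The format described in the file header can be read with a few lines of Python. The sketch below is not Sketch Engine's own parser; SubcorpusDef, parse_subcdef and as_cql are hypothetical names used only for illustration. It assumes that only lines whose first non-blank character is # are comments (the first1000 example has # inside its query, so cutting at every # would damage it) and that a definition body is the structure name or -CQL- followed by one or more query lines. The as_cql method mirrors the equivalence shown by book60 and same_as_book60 above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SubcorpusDef:
    name: str
    structure: Optional[str]  # None when the entry uses -CQL-
    query: str

    def as_cql(self) -> str:
        """Return a full CQL query, following the same_as_book60 example."""
        if self.structure is None:
            return self.query
        return f"<{self.structure} {self.query}/>"


def parse_subcdef(text: str) -> Tuple[List[str], List[SubcorpusDef]]:
    """Parse a subcorpus definition file into (FREQLISTATTRS, definitions)."""
    freq_attrs: List[str] = []
    entries: List[Tuple[str, List[str]]] = []

    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: only whole-line comments are skipped.
        if not line or line.startswith("#"):
            continue
        if line.startswith("*FREQLISTATTRS"):
            freq_attrs = line.split()[1:]
        elif line.startswith("="):
            entries.append((line[1:].strip(), []))
        elif entries:
            entries[-1][1].append(line)
        else:
            raise ValueError(f"line outside any definition: {line!r}")

    defs: List[SubcorpusDef] = []
    for name, body in entries:
        if len(body) < 2:
            raise ValueError(f"incomplete definition for {name!r}")
        head = body[0]
        # Assumption: a query split over several lines is joined with spaces.
        query = " ".join(body[1:])
        structure = None if head == "-CQL-" else head
        defs.append(SubcorpusDef(name, structure, query))
    return freq_attrs, defs
```

Running parse_subcdef on the example file above would yield four definitions, which makes it easy to check a definition file for incomplete entries before uploading or compiling it.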

To compile the shared (global) subcorpora, use either the Corpus Architect (CA) interface or the mksubc.py script.

1) via the Corpus Architect interface

  • Once you have created your subcorpus definition file, it is necessary to:

– upload the definition

  • go to the home page (corpora overview)
  • start by pressing Subcorpus definitions in the left-hand side menu
  • click on Add new subcorpus definition file at the bottom right
  • find and upload the definition file on your computer
  • fill in the name by which it should be referred to within Sketch Engine and click OK

Note that your uploaded definition files can be shared with other users. This allows the other users to compile subcorpora using your definition file or to view the file itself. Sharing the definition file is *not* necessary for sharing the actual subcorpora you have compiled for a given corpus with other users.

– recompile the corpus

  • if you have uploaded a subcorpus definition file to the server or someone has shared their definition with you, open the corpus by clicking on its name (this works only on user corpora, not the preloaded ones)
  • select Set subcorpus definitions in the left-hand side menu (if the label is greyed out, make sure the corpus is already compiled)
  • choose a definition file you want to use
  • tick the Recompile subcorpora checkbox and click OK
  • if the compilation finishes without errors, all users who have access to the corpus will also see the newly created subcorpora

2) using the mksubc.py script

Usage: mksubc.py CORPNAME SUBCORP_DIR SUBCORP_DEF_FILE

SUBCORP_DIR is the directory where the subcorpora will be created; its location depends on the Sketch Engine installation. The global subcorpora (accessible by all users) should be stored in the directory set in the SUBCBASE attribute of the corpus config file, which is by default PATH/subcorp/.
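
When the compilation is scripted, the call can be wrapped in Python. The following is a minimal sketch, not part of Sketch Engine: mksubc_command and compile_global_subcorpora are hypothetical helper names. It assumes that mksubc.py is executable and on the PATH, and that it exits with a non-zero status on failure (the exit behaviour is not documented here).

```python
import subprocess
from typing import List


def mksubc_command(corpname: str, subcorp_dir: str, def_file: str) -> List[str]:
    """Build the argument list in the documented order:
    mksubc.py CORPNAME SUBCORP_DIR SUBCORP_DEF_FILE
    """
    return ["mksubc.py", corpname, subcorp_dir, def_file]


def compile_global_subcorpora(corpname: str, subcorp_dir: str, def_file: str) -> None:
    """Run mksubc.py for one corpus.

    Assumption: mksubc.py is on the PATH. check=True raises
    subprocess.CalledProcessError if the script exits with a non-zero status.
    """
    subprocess.run(mksubc_command(corpname, subcorp_dir, def_file), check=True)
```

A call might look like compile_global_subcorpora("bnc", "/corpora/bnc/subcorp", "bnc_subcdef.txt"), where the corpus name and both paths are made-up values; the directory should be the one set in SUBCBASE for that corpus.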

Note that mksubc.py is run by compilecorp (see Compiling Corpus).