Future directions and research agenda of Sketch Engine, LDA’15

We would like to invite you to attend the Sketch Engine presentation at the LDA’15 Language Data Analytics Workshop, held in EC Stoner Building 8.60, University of Leeds, on Wednesday 16 September, 10:00–14:00. Miloš Jakubíček will speak at 12:00.

Better Danish

Brexit Corpus

news: parallel corpora
free Sketch Engine for Learner Corpus Association members

N'ko corpus

XLIFF support in Sketch Engine
Sketch Engine CQL calendar

Calendar 2017

Sketch Engine and Colibri
Audio recordings for the British National Corpus (BNC)

BNC audio

improved functionality for Bulgarian text
improved Thai support

Difference in size per million when using Text Types vs. a subcorpus

Why is there a different frequency per million between making…

Prices for Academic Individual Users

map_period = {"year" : 12, "quarter"…

Example 3 Python

This is a Python example for basic HTTP authentication on local…

Example 3 Java

This example will show you how to access the Sketch Engine API…

Dutch Web Corpus

This corpus was created within the Corpus Factory project as…

Croatian Web Corpus

(version 1.1) Tagset MULTEXT-East Morphosyntactic Specifications,…

Chinese Tagset

A preview of a Chinese tagset. 普通名词 n common…

CLAWS tagset - mapping file

C8 to C7 mapping file. NS 2011-5-14. APPGE -> APPGE: possessive…

Feed Corpus Project

The FCP corpus aims to be a million-words-per-day collection of POS-tagged…

Concordance Query: Error Query

When Error query is selected, you can search on the error code…

My jobs

The My jobs (job runner) feature shows your long-running tasks and…

The New Corpus for Ireland | Nua-Chorpas na hÉireann

The New Corpus for Ireland – user’s guide Welcome…

TatarWaC corpus

The Tatar sample corpus contains ca. 200 thousand words crawled from the…

API Documentation

Sketch Engine JSON API, methods and attributes The communication…

Icelandic sample corpus

This is a small corpus of Icelandic texts prepared for the Sketch…

General instructions on corpus data directory structure

The aim of these instructions is to ensure that for every corpus,…

Renaming Sketch Grammar relations

CD to directory which contains the compiled corpus files. cd…

Adding sentence boundaries to a compiled corpus

This document explains how structures, such as documents, paragraphs,…

Sketch Grammar development corpora

This page describes how to use a sketch grammar in your corpus. In…

Compatibility Matrix

This page provides a compatibility matrix of Sketch Engine components…

Uploading multiple files to Sketch Engine

Sketch Engine allows users to build corpora from their own documents.…

Sketch Engine API for IntelliWebSearch

Sketch Engine is a corpus manager tool offering many corpus linguistics…

Preloaded Configuration Templates

When you create a corpus from the Sketch Engine interface (see…

Building sketches from parsed corpora

Introduction Sketch Engine generates word sketches usually using…

Word Sketches definition files

The following files can be used for building word sketches in…

Word Sketch Index Format

This page is a brief overview of the development of the word…

Highlight Only Part of a Complex Query

I want to align a concordance according to a part of the query.…

Search Punctuation

To search for punctuation as well as words: Insert the punctuation…

Compare corpora using word lists

To compare two preloaded corpora: Open the focus corpus and…

Distinguish Between Lemmas

To look at different lemmas with the same spelling but different…

How do I…?

This page lists possible tasks that a Sketch Engine user might…

Sketch Engine Localisation

The Sketch Engine interface can be translated into any other…

JSON API Documentation

Using JSON JSON (JavaScript Object Notation, http://www.json.org/)…

JSON API - creating query

Sketch Engine uses an HTTP REST API. All API methods (unless stated…

JSON API - authentication

Authentication Authentication is an optional feature that can…

API Documentation examples

This page provides links to various API scripts that show how…

Full Administration

This feature is available only for local installations (see the…

Text Types, Headers and Subcorpora

Overview When studying a word, phrase, or grammatical construction,…

Preparing Corpus Text

The input format is "vertical" or "word-per-line (WPL)" text,…

CZES corpus

CZES is a Czech corpus consisting of newspaper articles and magazine…

TalkBank Persian

The TalkBank Persian corpus contains blog posts to various Farsi…

TED_en corpus

A corpus of transcripts of TED talks. Prepared by Akshay Min…

jpTenTen11 LUW corpus

Japanese TenTen corpus gathered from the web in December 2011.…

Scottish Gaelic Wiki corpus

Scottish Gaelic Wikipedia corpus. Downloaded in February 2015.…

pukWaC

The same as ukWaC, but with a further layer of annotation added,…

Romanian WaC (RoWaC) corpus

This Romanian web as corpus was gathered by Monica Macoveiciuc,…

Polish Web Corpus (PolishWaC)

Polish web as corpus has 103 million words and the encoding is…

Parallel Corpora Registry Info

General Attribute Set ATTRIBUTE word STRUCTURE s{ ATTRIBUTE…

Islam – UK

A special English newspaper corpus by Costas Gabrielatos at…

Internet-ZH corpus

Internet-ZH is a Chinese web corpus collected by Serge Sharoff.…

Project Gutenberg Corpus

downloaded with wget: getting Gutenberg cleaned with…

Fryske Akademy Parallel Corpus

Frisian and Dutch not POS tagged aligned sentences Dutch…

French Web Corpus (WaC)

This corpus (web as corpus) was gathered using a list of URLs…

Estonian Reference Corpus

Estonian Reference Corpus is a morphologically annotated corpus…

ChineseTaiwanWaC corpus

Chinese Taiwan web as corpus has almost 260 million words encoded…

MalaysianWaC corpus

The corpus was prepared by the Corpus Factory method. Full details…

NepaliWaC corpus

Nepali web corpus downloaded by LCL on Dec 10, 2014. ~1200…

SamoanWaC corpus

Web corpus of Samoan. Created by Bharat Ram Ambati using corpus…

SetswanaWaC corpus

(version 2) The corpus was prepared by the Corpus Factory method.…

SpanishWaC corpus

This corpus was gathered using a list of URLs provided by Serge…

SwedishWaC corpus

The corpus was prepared by the Corpus Factory method. Full details…

SDeWaC corpus

SDeWaC is a subset of DeWaC. The creation of sDeWaC is described…

WelshWaC corpus

The corpus was prepared with the Corpus Factory method by Anil in October…

ThaiWaC corpus

The corpus was prepared by the Corpus Factory method. Full details…

TurkishWaC corpus

The TurkishWaC corpus is a 32 million word collection of samples…

UKWaCsst corpus

UKWaC tagged with SuperSenseTagger (sst-light) described in…

DANTE: A Detailed, Accurate, Extensive, Available English Lexical Database

Here we present some sample queries on the database and corresponding…

GujarathiWaC corpus

GujarathiWaC is a web corpus of the Gujarati language (Indo-Aryan…

Patakis corpus

Patakis is a 100 million word collection of POS-tagged texts…

GeorgianWaC corpus

Original file owner: bharat.

FinnishWaC corpus

Finnish web as corpus.

FrisianWaC corpus

Frisian web as corpus was crawled in August 2013. It is a corpus…

DanishWaC corpus

The corpus was prepared by the Corpus Factory method. It has 288 million…

Domain Specific Corpora

These corpora are prepared from specific domains, e.g. science,…

ScienceBlog corpus

The ScienceBlogs corpus is a selection of posts and comments…

e-flux corpus

The e-flux corpus is a web corpus of English art news digests.…

Environment corpus

English environment related web corpus. Crawled by SpiderLing…

Filipino web corpus (FilipinoWaC)

The corpus was created by Anil in October 2013. It has almost…

Arabic web corpus (WaC)

Arabic web corpus was created by Serge Sharoff and was tagged…

Nineteenthcentury corpus

Currently, the 19th-century corpus is only available to Osnabrück…

Penn Historical Corpora

Penn Historical Corpora is a collection of historical English…

A Corpus of English Dialogues 1560–1760

‘Released in Spring 2006, A Corpus of English Dialogues 1560–1760…

Clustering

Clustering can be performed in Sketch Engine on the similar…

Manual for GDEX

To quickly start using Good Dictionary EXamples, see the GDEX…

Syntax of GDEX configuration files

GDEX configuration files are written in YAML (Wikipedia.org).…

COMPAS corpus

COMPAS is a corpus of about 100 million words which was…

BulgarianNC corpus

Bulgarian National Corpus (see the website of the Institute for Bulgarian…

Argamon corpus

The current Argamon corpus contains blog posts to various Farsi…

Algemeen Nederlands Woordenboek (ANW) corpus

The Algemeen Nederlands Woordenboek (ANW) corpus is a balanced…

Dynamic Attributes

To make use of dynamic attributes they have to be set up in …

Allowed language names in corpus configuration

A Afar, Abkhazian, Adyghe, Afrikaans, Aghem, Akan, Amharic,…

Corpus Factory Method

A method for developing large general language corpora which…

New Model Corpus

The New Model Corpus is a ~100-million-word domain corpus built…

LEXMCI

The 1.7 billion word LEXMCI corpus of English was created by…

London English corpora

The corpus consists of transcripts of informal conversation-like…

Corpus configuration example

If your vertical text contains only words and no annotation,…

Corpus Configuration File: Overview

For the software to be able to use a corpus, there are a number…

Preparing a Text Corpus for Sketch Engine: Overview

This page describes how to prepare a text corpus for indexation…

API documentation home

Learn how to work with the Sketch Engine HTTP REST API. Here…

API documentation for keyword extraction

On this page we describe the API of Corpus Architect, which can be…

Sketch Engine Video Tutorials

All videos are also accessible on our YouTube channel. Please…

Compiling corpus

You need to prepare a vertical and registry file before compiling…

Discrepancies between API and interface results

When you query a corpus in the web interface you may notice that…

Common corpus structures

It is generally practical to divide a corpus into smaller parts…

Scripts for adding header fields

Adding attributes is based on mapping existing structure attributes…

Variation in hit counts

It often seems like you have got a different hit count for the…

caTenTen corpus

Catalan TenTen web corpus crawled in February and March 2014. Structural…

Adam Kilgarriff: Structured bibliography

(note: written by himself to 27th April 2015; see also the Wikipedia…

SkE research

Research Agenda

Lexical Computing's research interests lie at the intersection…

Word Sketch highlights

If a noun is usually in the plural, or a verb is usually in the…

Adam's blog

Happy New Year!