Our new Spanish Word Sketches give much better coverage of Spanish-specific phenomena such as compound verb tenses, verb constructions, ser/estar and the subjunctive. Spanish collocation information has never been so rich.

Verbs with clitics, such as decirnos, descargárselo and comérselo, pose a problem when searching. Sketch Engine can now handle these much better: searching for decir will find instances with and without the pronouns, e.g. decir and decirle, diciendo and diciéndole.

This is available in the European Spanish Web 2011 (eseuTenTen11) or any newly created user corpora.

New Spanish Word Sketches

The new word sketch grammar now analyses Spanish-specific phenomena, for example:

  • adjectives preceded by ser or estar
  • statistics of verb constructions (perífrasis verbales) in which a verb tends to appear
  • noun phrases using de
  • statistics of the subjunctive compared to the indicative

and many others.

See an example of the word sketch for apoyar (v) and claro (adj).

Availability

Available in the European Spanish Web 2011 (eseuTenTen11) or any newly created user corpora.

Upgrading your corpus

Previously created user corpora need to be upgraded and re-compiled to bring in the new functionality. Start the re-compilation and you will be invited to upgrade the corpus during the process – watch out for a yellow message.

New clitics handling

Using the simple search option and typing a verb (dar, poner etc.) or a pronoun in its object form (me, le, nos, se…) will find instances of the verb with and without the pronouns (dar and also darse, dárselo…) or pronouns on their own as well as attached to a verb (se and also ponerse, ponérselo…). This is the default behaviour in simple search.

In the CQL search, use:

[lemma="dar"]
[morphemes="se"]

to replicate the former and the latter example respectively.

To find verbs with attached pronouns se and lo, use:

[morphemes="se" & morphemes="lo"]

To find verbs with any attached pronouns, use:

[tags="V.*" & tags="PP.*"]

Note the use of the multi-value attribute tags, not the single-value attribute tag.
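The queries above follow a regular pattern, so they can be generated programmatically. The following is a minimal sketch that only builds CQL strings and does not contact Sketch Engine. The function names and the pronoun list are hypothetical, introduced for illustration, and the rule "anything that is not a listed pronoun is treated as a verb lemma" is a simplifying assumption.

```python
# Hypothetical helpers: they only assemble CQL strings as plain text.

# Illustrative list of Spanish object pronouns (an assumption, not taken from Sketch Engine).
OBJECT_PRONOUNS = {"me", "te", "se", "lo", "la", "le", "nos", "os", "los", "las", "les"}


def simple_search_to_cql(term: str) -> str:
    """Return a CQL token equivalent to a one-word simple search.

    A listed object pronoun is searched in the morphemes attribute;
    anything else is assumed to be a verb and searched as a lemma.
    """
    if term in OBJECT_PRONOUNS:
        return f'[morphemes="{term}"]'
    return f'[lemma="{term}"]'


def attached_pronouns_cql(*pronouns: str) -> str:
    """Return a CQL token for verbs carrying all the given attached pronouns.

    Called without arguments, it returns the query for a verb with
    any attached pronoun, which relies on the tags attribute instead.
    """
    if not pronouns:
        return '[tags="V.*" & tags="PP.*"]'
    conditions = " & ".join(f'morphemes="{p}"' for p in pronouns)
    return f"[{conditions}]"


print(simple_search_to_cql("dar"))
print(simple_search_to_cql("se"))
print(attached_pronouns_cql("se", "lo"))
print(attached_pronouns_cql())
```

The printed strings can be pasted into the CQL search box as they are.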

New attributes for clitics handling

To enable this functionality, Sketch Engine uses two new multi-value attributes for Spanish:

morphemes – lists the morphemes which make up the token

tags – lists the tags related to the morphemes within the token

| word form | lemma | tag | morphemes | tags | notes |
|---|---|---|---|---|---|
| digo | decir | VMIP1S0 | decir | VMIP1S0 | 1 token, 1 morpheme, 1 tag |
| decirle | decir | MN0000 | decir, le | MN0000, PP3CSD0 | 1 token, 2 morphemes, 2 tags |
| diselo | decir | VMM02S0 | decir, se, lo | VMM02S0, PP3CN00, PP3MSA0 | 1 token, 3 morphemes, 3 tags |
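To illustrate how multi-value attributes relate to the queries, here is a small sketch that models two of the tokens listed above and checks them against the conditions of [tags="V.*" & tags="PP.*"]. The Token class and both functions are hypothetical. The sketch assumes that a condition on a multi-value attribute holds when at least one of its values matches the whole regular expression; it is not Sketch Engine's actual implementation.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Hypothetical model of one corpus token with the attributes described above."""
    word: str
    lemma: str
    tag: str
    morphemes: List[str]  # multi-value: morphemes which make up the token
    tags: List[str]       # multi-value: one tag per morpheme


def any_value_matches(values: List[str], pattern: str) -> bool:
    """True if at least one value matches the whole pattern (assumed semantics)."""
    return any(re.fullmatch(pattern, value) for value in values)


def is_verb_with_attached_pronoun(token: Token) -> bool:
    """Mirror the two conditions of [tags="V.*" & tags="PP.*"]."""
    return any_value_matches(token.tags, r"V.*") and any_value_matches(token.tags, r"PP.*")


# Values copied from the listing of word forms above.
digo = Token("digo", "decir", "VMIP1S0", ["decir"], ["VMIP1S0"])
diselo = Token(
    "diselo", "decir", "VMM02S0",
    ["decir", "se", "lo"],
    ["VMM02S0", "PP3CN00", "PP3MSA0"],
)

for token in (digo, diselo):
    print(token.word, is_verb_with_attached_pronoun(token))
```

Under these assumptions, digo has a verb tag but no pronoun tag, so only diselo satisfies both conditions.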

