The CoEDL corpus platform

author: Tom Honeyman
date: 2017-10-05
tags: ANNIS, corpora, transcripts

CoEDL has a “proof-of-concept” corpus platform, currently restricted to corpus contributors. This is a project aimed at making textual materials more readily available to researchers.

This is a basic guide to the platform.

CoEDL corpus platform

The preliminary corpus platform is available, but currently limited to those with a password.

The platform is built on ANNIS, an open-source corpus platform. A generic ANNIS user guide is available; this page is a simple guide to just the features available in the CoEDL version.

Entering the corpus platform

The corpus platform can be accessed at http://go.coedl.net/corpora. Access is currently limited to contributors and requires a password. Click the ‘Login’ button in the top-right corner of the main page, then enter your username and password to continue:

../../_images/02-login.png password prompt

The main page looks like this:

../../_images/03-front-page.png main annis page

Note the logout button in the top right.

The basic layout

The page is made up of three components: the query panel and the corpus list on the left, and the results page(s) on the right.

To begin with, the query panel will be empty:

../../_images/04-query-panel.png empty query panel

The (sub-)corpus panel will list one or more sub-corpora:

../../_images/04-corpus-list.png sub-corpus list

The “Visible:” drop-down menu filters the sub-corpus list:

../../_images/04-visible-corpus.png visible list

Corpus contributors can browse other corpora, which is encouraged so that you can see what types of annotations others are contributing. View all available corpora by choosing “All”:

../../_images/04-all-subcorpora.png All sub-corpora

Searching

To search the corpora, select one or more sub-corpora in the list. In the example, we are searching in the Gurindji-Kriol corpus:

../../_images/04-select-corpus.gif select a sub-corpus

Searching words/tokens

Every corpus has a baseline layer called “tok” (for “token”). This is usually a word-level representation of the primary text. It is the default layer to search on, so a basic query can be a search for a word (in quotes):

../../_images/05-query-tok.png search for a word

Alternatively, a basic query can be a regular expression between forward slashes (//):

../../_images/05-query-tok-regex.png search for a-initial words
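Because the query text itself only appears in the screenshots, it may help to see the two forms written out. The following is a minimal sketch in Python, not part of ANNIS: the helper names and the sample tokens are hypothetical, and Python's `re` module stands in for the regular expression engine that ANNIS uses, which may differ in detail. It builds the text you would type into the query panel and emulates which sample tokens each query would select on the “tok” layer.

```python
import re

def word_query(word):
    """Build the query text for an exact word search: the word in double quotes."""
    return f'"{word}"'

def regex_query(pattern):
    """Build the query text for a regex search: the pattern between forward slashes."""
    return f"/{pattern}/"

def matching_tokens(query, tokens):
    """Emulate a search on the default "tok" layer over a list of sample tokens.

    This is an illustration only; it is not how ANNIS evaluates queries.
    """
    body = query[1:-1]  # strip the surrounding quotes or slashes
    if query.startswith('"'):
        return [t for t in tokens if t == body]
    return [t for t in tokens if re.fullmatch(body, t)]

# Hypothetical sample tokens, invented for this example.
sample = ["an", "apple", "banana", "a"]

for query in (word_query("apple"), regex_query("a.*")):
    print(query, "->", matching_tokens(query, sample))
```

Here `"apple"` selects only the token “apple”, while `/a.*/` selects every a-initial token in the sample (“an”, “apple” and “a”) but not “banana”.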

More complex searches can be built with the query builder. Using the query builder is also a good way to learn the full syntax of the ANNIS Query Language (AQL).

Query builder

The easiest way to build a complex search is to use the query builder in the top left. Click on “query builder”:

../../_images/04-query-panel.png query builder

After clicking “initialise”, we can begin to construct a query.

Queries can be sequences of one or more “tokens” (i.e., annotations on a specific layer or tier). They can fall under the scope of a “span” (e.g., limited to a specific speaker). Metadata for a file can also be used to constrain the search.

For a query to be scoped to a span, that span must already exist in the corpus, and not all corpora have spans. If you provided segmented text (e.g., utterances) with extra information such as speaker turns or translations, then you should have annotations for these categories of information. Spans can be any grouping that interests you: for instance, spans of reported speech or of syntactic units.
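To make the idea of a token sequence under the scope of a span more concrete, here is a small conceptual sketch in Python. It is only a mental model and makes several assumptions: the `Span` class, the `find_sequence` function and the sample utterances are all hypothetical, and ANNIS does not necessarily store or search corpora this way. Each span carries a speaker annotation and covers a run of tokens; a query is a list of whole-token patterns that must match consecutive tokens, optionally limited to one speaker.

```python
import re
from dataclasses import dataclass

@dataclass
class Span:
    speaker: str   # annotation attached to the span (hypothetical layer)
    tokens: list   # the tokens that the span covers

def find_sequence(spans, patterns, speaker=None):
    """Return (span index, token offset) for every place where the
    patterns match consecutive tokens inside a single span."""
    hits = []
    for i, span in enumerate(spans):
        # Scope the search to one speaker's spans if requested.
        if speaker is not None and span.speaker != speaker:
            continue
        for start in range(len(span.tokens) - len(patterns) + 1):
            window = span.tokens[start:start + len(patterns)]
            if all(re.fullmatch(p, t) for p, t in zip(patterns, window)):
                hits.append((i, start))
    return hits

# Invented sample data: two utterances by two speakers.
spans = [
    Span("A", ["the", "dog", "ran"]),
    Span("B", ["the", "dog", "sat"]),
]

print(find_sequence(spans, ["the", "d.*"]))               # matches in both spans
print(find_sequence(spans, ["the", "d.*"], speaker="B"))  # only speaker B's span
```

If the corpus had no speaker spans at all, there would be nothing for the `speaker` constraint to select, which is why the spans must exist before a query can be scoped to them.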

Linguistic sequences

Begin by choosing the “word sequences and meta information” search, and then click “initialise”. Next, add an element/token to a linguistic sequence. First choose which layer you want to match:

../../_images/05-query-builder-ling-seq.png choose a layer

If you choose the default “tok” layer, type in the word/token that you’d like to match. Regular expressions can be used, but note that the expression must match the whole token, not just part of it.

For any other layer, a list of possible values will appear.

../../_images/05-query-builder-ling-seq-token.png choose a token
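The whole-token rule for regular expressions on the “tok” layer is easy to trip over, so here is a short illustration in Python. It is a sketch under assumptions: the sample tokens and function names are hypothetical, and Python's `re.fullmatch` is used to stand in for whole-token matching, while `re.search` shows the partial matching that you might expect but do not get.

```python
import re

# Hypothetical sample tokens, invented for this example.
tokens = ["walking", "ring", "ingot", "walk"]

def whole_token_matches(pattern, tokens):
    """Keep tokens that the pattern matches from start to end."""
    return [t for t in tokens if re.fullmatch(pattern, t)]

def partial_matches(pattern, tokens):
    """Keep tokens that contain a match anywhere (for contrast only)."""
    return [t for t in tokens if re.search(pattern, t)]

print(partial_matches("ing", tokens))        # tokens that merely contain "ing"
print(whole_token_matches("ing", tokens))    # no token is exactly "ing"
print(whole_token_matches(".*ing", tokens))  # tokens that end in "ing"
```

In other words, to find tokens ending in “ing” the pattern needs to account for the rest of the token, as in `.*ing`; the bare pattern `ing` would only match a token that is exactly “ing”.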