Quick automatic glossing in CLAN
author: Sasha Wilmoth, Simon Hammond, Hannah Sarvasy
date: 2017-05-30
tags: [Glossing, CLAN, Python, Appen]
categories: [Scripts]
Introduction
Hannah Sarvasy is conducting a longitudinal study of five children acquiring the Papuan language Nungon in a remote village of Papua New Guinea. This entails monthly recordings of the target children that are conducted and transcribed by four Nungon-speaking assistants in the village. The corpus will total 104 hours of transcribed natural speech recordings. Hannah’s task is to add interlinear morpheme-by-morpheme parses and glossing and free English translations to the Nungon transcriptions, which are in the CHAT format. The glossing is to be done
CLAN’s built-in MOR program is an effective tool for glossing if you’ve already transcribed your data with morpheme boundaries, and you have a lexicon in .cut format.
However, Hannah had some slightly different needs, and I figured we could write a simple script that could correct spelling on the transcription tier, add a pre-determined parse, and add a gloss.
This script, glossFile.py, is not an automatic parser: it simply takes the words you feed it and makes no guesses about their structure.
It might be useful for you if you want to do a rough automatic pass before checking through your glossing manually, and you don’t already have a large lexicon or parser set up. It was also designed to suit our particular circumstances, as the data had a lot of variation in spelling and spacing, and many repeated words (so we could cover a lot of ground without a parser).
Instructions
Requirements
glossFile.py can be found here. You will need a GitLab account to request access.
glossFile.py requires Python 2.7, and does not currently work with Python 3. It works on Mac, and has not been tested on Windows.
Step 1: Create a lexicon
Compile a list of words in your data, sorted by frequency. You can use CLAN’s FREQ command to do this.
I also made a pre-populated lexicon based on some files that had already been manually glossed.
I then sent all the remaining words to Hannah, sorted by frequency, in the following format:
Word | Spelling Correction | %gls | %xcod | Frequency
---|---|---|---|---
no | no | no | | 1500
ma | ma | ma | | 772
oro | oro | oro | | 752
dek | dek | dek | | 543
ho | ho | ho | | 520
diyogo | diyogo | diyogo | | 230
nano | nano | nano | | 200
awa | awa | awa | | 187
hewam | hewam | hewam | | 175
arap | arap | arap | | 174
Lisa | Lisa | Lisa | | 169
dec | dec | dec | | 167
bop | bop | bop | | 165
gon | gon | gon | | 162
to | to | to | | 155
mai | mai | mai | | 151
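A template like the one above can also be produced outside CLAN. The following is only a minimal sketch, not part of glossFile.py and not a replacement for FREQ: it assumes that words on speaker tiers are separated by whitespace, that the only punctuation to strip is an utterance terminator or comma, and that a tab-separated file is an acceptable layout. The function name `frequency_rows` is hypothetical.

```python
from collections import Counter


def frequency_rows(chat_lines):
    """Return tab-separated template rows, most frequent word first."""
    counts = Counter()
    for line in chat_lines:
        if not line.startswith("*"):
            continue  # skip headers and dependent tiers such as %com
        # Drop the speaker code (e.g. "*CHI:") and keep the utterance.
        _, _, utterance = line.partition(":")
        for token in utterance.split():
            word = token.strip(".?!,")  # assumption: simple punctuation only
            if word:
                counts[word] += 1
    # Sort by descending frequency, then alphabetically for ties.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Columns: Word, Spelling Correction, %gls, %xcod (left empty), Frequency.
    return ["{0}\t{0}\t{0}\t\t{1}".format(word, freq) for word, freq in ordered]


sample = [
    "*CHI:\tbiksa yandinga itdok.",
    "%com:\tdogu digu yandinga itdok.",
    "*MOT:\tngo ngo.",
]
for row in frequency_rows(sample):
    print(row)
```

Pre-filling the second and third columns with the word itself means that only the entries needing a correction, a morpheme boundary, or a gloss have to be edited by hand.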
Hannah then checked through this list, and for each entry:
Made a correction in the second column if necessary
In order to reduce the manual effort of entering glossing for misspelled words multiple times, the third and fourth columns are ignored if there is a spelling correction.
Corrected words are then glossed according to the correctly-spelled entry (which already existed in most cases).
Added the appropriate morpheme boundaries
This example shows the most frequent items, which happened to include a lot of suffixes and clitics. If a - or = is added in the third column, the transcription tier is corrected to reflect this, and a space is removed.
Added a gloss for the whole word.
Note:
Multiple entries for homophones are fine, and will be separated with a # symbol in both the %gls and %xcod tiers for manual disambiguation.
If a word is not found in the lexicon, it is just printed as is on all tiers in the output.
If a ‘word’ is not a morpheme at all, but should just be connected to the previous word, a full stop can be added in the third column. E.g. .kak tells the script that the word kak is actually a part of the previous word.
Word | Spelling Correction | %gls | %xcod | Comments
---|---|---|---|---
no | no | -no | -3sg.poss | This means that example no will become exampleno / example-no / example-3sg.poss on the three tiers.
ma | ma | =ma | =rel|
oro | oro | oro | adv^well|
dek | dek | =dek | =loc|
ho | ho | =ho | =foc|
diyogo | diyogo | diyogo | q^how|
nano | nanno | nano | |The script only looks at the second column and ignores the rest.
awa | aawa | awa | |
hewam | hewam | hewam | n^tree.kangaroo|
arap | arap | arap | n^mammal|
Lisa | Lisa | Lisa | n^name|
dec | dec | =dec | |
bop | boop | bop | |
gon | gon | =gon | =restr|
to | to | =to | =foc|
mai | mai | =ma-i | =rel-top|
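The lookup rules described above can be sketched in a few lines of Python. This is an illustration of the behaviour as described in this post, not the code of glossFile.py: the function names (`load_entries`, `look_up`, `gloss_utterance`) are hypothetical, the lexicon is assumed to have already been read into 4-tuples, and the full-stop convention and mixed bound/free homophones are left out.

```python
def load_entries(rows):
    """Group (word, correction, gls, xcod) rows by word.

    Homophones simply end up with more than one entry.
    """
    lexicon = {}
    for word, correction, gls, xcod in rows:
        lexicon.setdefault(word, []).append((correction, gls, xcod))
    return lexicon


def look_up(word, lexicon):
    """Return (transcription, gls, xcod) for a single word."""
    entries = lexicon.get(word)
    if not entries:
        return word, word, word  # unknown words are printed as is
    correction = entries[0][0]
    if correction != word:
        # Spelling correction: ignore this row's gloss columns and use
        # the entry for the correctly spelled word instead.
        entries = lexicon.get(correction)
        if not entries:
            return correction, correction, correction
    # Homophones are joined with '#' for manual disambiguation.
    gls = "#".join(entry[1] for entry in entries)
    xcod = "#".join(entry[2] for entry in entries)
    return correction, gls, xcod


def gloss_utterance(words, lexicon):
    """Build the transcription, %gls and %xcod strings for one utterance."""
    tiers = ([], [], [])
    for word in words:
        text, gls, xcod = look_up(word, lexicon)
        if gls[:1] in ("-", "=") and tiers[0]:
            # Suffix or clitic: attach to the previous word on every tier,
            # which also removes the space on the transcription tier.
            tiers[0][-1] += text
            tiers[1][-1] += gls
            tiers[2][-1] += xcod
        else:
            for tier, form in zip(tiers, (text, gls, xcod)):
                tier.append(form)
    return tuple(" ".join(tier) for tier in tiers)


lexicon = load_entries([("no", "no", "-no", "-3sg.poss")])
print(gloss_utterance(["example", "no"], lexicon))
# ('exampleno', 'example-no', 'example-3sg.poss')
```

Keeping a list of entries per word is what makes the homophone rule cheap to support: the lookup never has to choose, it just joins every candidate and leaves the decision to the manual checking pass.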
Step 2: Run the script
Once you have a 4-column lexicon file (we can discard the frequency column), the command is:
python glossFile.py [-h] lexicon input.cha > output.cha
This will turn a passage like this (from 09_septemba_niumen.cha):
*CHI: biksa yandinga itdok.
%com: dogu digu yandinga itdok.
*MOT: opmu menon ngo nungon to duwonga itdok ngo.
*MOT: ngo ngo.
*CHI: nnn nandumau.
into this:
*CHI: biksa yandinga itdok.
%gls: biksa yandinga it-do-k
%xcod: tpn^picture yandinga v^be-rp-3sg
%eng:
%com: dogu digu yandinga itdok.
*MOT: opmu menon ngo nungonto duwonga itdok ngo.
%gls: opmou menon ngo nungon=to duwo-nga it-do-k ngo
%xcod: adj^small menon d^prox q^what=foc v^sleep-mv.ss v^be-rp-3sg d^prox
%eng:
*MOT: ngo ngo.
%gls: ngo ngo
%xcod: d^prox d^prox
%eng:
*CHI: nnn nandumau.
%gls: nn nandu=ma au#nandu=ma-u
%xcod: ij n^non.specific=rel adj^other#n^non.specific=rel-top
%eng:
Note:
nungon to is corrected to nungonto,
yandinga and menon are just printed as is (they weren’t in the lexicon),
nandumau has a couple of options to choose from, nandu=ma au (n^non.specific=rel adj^other) or nandu=ma-u (n^non.specific=rel-top)
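The layout of the output above, with %gls, %xcod and an empty %eng tier placed directly after each speaker line, can be sketched as follows. Again this is an assumption-laden illustration rather than the actual script: `add_tiers` and `ngo_lookup` are hypothetical names, the glossing step is passed in as a function, tiers are assumed to use a tab after the colon, and each utterance is assumed to end in at most one terminator.

```python
def add_tiers(chat_lines, gloss_words):
    """Insert %gls, %xcod and %eng tiers after every speaker line.

    gloss_words takes a list of words and returns three strings:
    the corrected transcription, the %gls line and the %xcod line.
    """
    output = []
    for line in chat_lines:
        if not line.startswith("*"):
            output.append(line)  # headers and tiers such as %com pass through
            continue
        speaker, _, utterance = line.partition(":")
        utterance = utterance.strip()
        # Assumption: at most one utterance terminator (. ? !) at the end.
        terminator = utterance[-1] if utterance[-1:] in (".", "?", "!") else ""
        words = utterance[:len(utterance) - len(terminator)].split()
        text, gls, xcod = gloss_words(words)
        output.append("{0}:\t{1}{2}".format(speaker, text, terminator))
        output.append("%gls:\t" + gls)
        output.append("%xcod:\t" + xcod)
        output.append("%eng:\t")  # left empty for the free translation
    return output


def ngo_lookup(words):
    """Stand-in glosser that only knows the word 'ngo'."""
    glosses = {"ngo": "d^prox"}
    text = " ".join(words)
    return text, text, " ".join(glosses.get(word, word) for word in words)


for line in add_tiers(["*MOT:\tngo ngo."], ngo_lookup):
    print(line)
```

Passing the glosser in as a function keeps the file handling separate from the lexicon logic, so either part can be checked on its own.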
Acknowledgements
glossFile.py was written by Simon Hammond thanks to Appen’s partnership with CoEDL.
All Nungon data shown here was collected by Hannah Sarvasy, and transcribed by Nungon-speaking assistants.