University of Notre Dame

Computational Tools for Endangered Language Documentation

thesis
posted on 2019-04-06, 00:00 authored by Antonios Anastasopoulos

The traditional method for documenting a language involves collecting audio or video sources, which trained linguists then annotate at multiple levels of granularity. This is a painstaking and time-consuming process that could benefit from machine learning techniques at almost every stage. However, most existing machine learning methods are developed for high-resource languages and rely on abundant data, which makes them unsuitable for language documentation.

At the same time, for many low-resource and endangered languages, speech data is easier to obtain than textual data, particularly since most of the world's languages are unwritten. Even so, written or spoken translations of audio sources are relatively easy to obtain, as speakers of a minority language are often bilingual and literate in a high-resource language.

This work aims to solve problems that arise when documenting an endangered language, a stage at which only minimal annotated resources are available. The dissertation focuses mainly on spoken corpora of endangered and low-resource languages with limited translation annotations, and tackles problems that cover every layer of linguistic annotation:

  • speech-to-translation alignment: we present an unsupervised method for discovering word or phrase boundaries in the audio signal and aligning the discovered segments with translation words.
  • speech transcription: we develop two novel neural methods for creating a phoneme-level or grapheme-level transcription of the audio, which also make use of any available translations.
  • speech translation: we present a novel multitask neural model that jointly produces a transcription and a (free) translation of an audio segment.
  • morphological analysis: we produce a layer of annotation that provides word-level or morpheme-level information. In this work we focus on grammatical (part-of-speech) tagging for an endangered language.

Building on limited or no annotations, our methods are capable of producing helpful suggestions for word or phrase boundaries, as well as transcriptions, translations, or grammatical tags. As a result, our work provides machine learning methods that could form the backbone of a modern linguistic annotation toolkit, one with the potential to significantly accelerate the language documentation process.

History

Date Modified

2019-06-27

Defense Date

2019-01-21

CIP Code

  • 40.0501

Research Director(s)

David Chiang

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Language

  • English

Alternate Identifier

1105810887

Library Record

5113939

OCLC Number

1105810887

Program Name

  • Computer Science and Engineering
