Multiset and DAG Automata for Abstract Meaning Representation

DeBenedetto, Justin Donald

doi:10.7274/cc08hd79z5w

DeBenedettoJD042021D.pdf (688.36 kB)

Multiset and DAG Automata for Abstract Meaning Representation

thesis

posted on 2021-04-19, 00:00 authored by Justin Donald DeBenedetto

Natural Language Processing (NLP) utilizes computers to process human language for various purposes. NLP applications affect people on a daily basis, from digital assistants to automated customer service to automatic translation. These systems traditionally take input primarily in the form of speech or text. When using these systems, however, we often may find that the systems are sensitive to the way we express our thoughts. For example, when using a search engine such as Google or Bing, if the results we get are not what we are looking for we may rephrase our search and get potentially very different results. Google recognizes the need for incorporating NLP advances into their search algorithms and has recently improved their search algorithm by using BERT \cite{devlin-etal-2019-bert}, a large pre-trained neural network. Adding BERT is meant to provide some additional language knowledge and reduce reliance on exact wording. Ultimately, a system which can accurately encode the meaning of a search is likely to produce higher quality results.

Semantic graphs for NLP seek to encode the meaning of their input sentences. By working from the semantics of a sentence when performing machine translation, for example, we may better preserve the meaning of the sentence. In this work we look specifically at Abstract Meaning Representation (AMR) as our semantic graph formalism. AMR allows many sentences to map to the same graph. This reflects the fact that we are able to formulate the same thought in many different ways. The tasks of converting between an AMR graph and a string of text are non-trivial.

Throughout this dissertation we build up the theory underlying some potential methods for tackling the problem of producing a sentence from an AMR graph. It is worth noting that while we use AMR as our motivation and testing formalism, our methods generalize to other semantic representations as well. These methods build upon existing work, utilizing in particular multiset automata and directed acyclic graph (DAG) processing algorithms. Automata have been successful when applied in many other areas, but most existing work on multiset and DAG automata are prohibitively costly to implement. Our contributions include theoretical improvements to the time and space complexity of existing algorithms, novel training methods for learning from data, implementation of these systems, and combining them into a system which can produce a sentence from an input AMR graph. Specifically, we:

Define a new translation from weighted multiset regular expressions to weighted multiset automata more direct and compact than previous work as well as a new composable representation of partial runs of multiset automata more efficient than previous work
Show that the Transformer's sinusoidal positional encodings can be viewed as a multiset automaton
Prove that complex-weighted multiset automata with only self-loops can approximate real-weighted multiset automata and extend DeepSets to compute the forward weights of complex-weighted multiset automata enabling it to handle a new task
Modify and implement an extended DAG recognition algorithm to use complex diagonalized multiset automata in place of positional encodings in a Transformer network for AMR-to-text generation
Demonstrate that these improvements now allow such a system to train on a GPU, opening new opportunities for future systems based on DAG and multiset automata

History

Date Modified

2021-05-21

Defense Date

2021-03-31

CIP Code

40.0501

Research Director(s)

David Chiang

Degree

Doctor of Philosophy

Degree Level

Doctoral Dissertation

Alternate Identifier

1251804421

Library Record

6024129

OCLC Number

1251804421

Program Name

Computer Science and Engineering

Usage metrics

Keywords

Not Assigned

Licence

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

Multiset and DAG Automata for Abstract Meaning Representation

History

Date Modified

Defense Date

CIP Code

Research Director(s)

Degree

Degree Level

Alternate Identifier

Library Record

OCLC Number

Program Name

Usage metrics

Categories

Keywords

Licence

Exports