Multiset and DAG Automata for Abstract Meaning Representation

Doctoral Dissertation

Abstract

Natural Language Processing (NLP) utilizes computers to process human language for various purposes. NLP applications affect people on a daily basis, from digital assistants to automated customer service to automatic translation. These systems traditionally take input primarily in the form of speech or text. When using these systems, however, we often may find that the systems are sensitive to the way we express our thoughts. For example, when using a search engine such as Google or Bing, if the results we get are not what we are looking for we may rephrase our search and get potentially very different results. Google recognizes the need for incorporating NLP advances into their search algorithms and has recently improved their search algorithm by using BERT \cite{devlin-etal-2019-bert}, a large pre-trained neural network. Adding BERT is meant to provide some additional language knowledge and reduce reliance on exact wording. Ultimately, a system which can accurately encode the meaning of a search is likely to produce higher quality results.

Semantic graphs for NLP seek to encode the meaning of their input sentences. By working from the semantics of a sentence when performing machine translation, for example, we may better preserve the meaning of the sentence. In this work we look specifically at Abstract Meaning Representation (AMR) as our semantic graph formalism. AMR allows many sentences to map to the same graph. This reflects the fact that we are able to formulate the same thought in many different ways. The tasks of converting between an AMR graph and a string of text are non-trivial.

Throughout this dissertation we build up the theory underlying some potential methods for tackling the problem of producing a sentence from an AMR graph. It is worth noting that while we use AMR as our motivation and testing formalism, our methods generalize to other semantic representations as well. These methods build upon existing work, utilizing in particular multiset automata and directed acyclic graph (DAG) processing algorithms. Automata have been successful when applied in many other areas, but most existing work on multiset and DAG automata are prohibitively costly to implement. Our contributions include theoretical improvements to the time and space complexity of existing algorithms, novel training methods for learning from data, implementation of these systems, and combining them into a system which can produce a sentence from an input AMR graph. Specifically, we:

  • Define a new translation from weighted multiset regular expressions to weighted multiset automata more direct and compact than previous work as well as a new composable representation of partial runs of multiset automata more efficient than previous work
  • Show that the Transformer’s sinusoidal positional encodings can be viewed as a multiset automaton
  • Prove that complex-weighted multiset automata with only self-loops can approximate real-weighted multiset automata and extend DeepSets to compute the forward weights of complex-weighted multiset automata enabling it to handle a new task
  • Modify and implement an extended DAG recognition algorithm to use complex diagonalized multiset automata in place of positional encodings in a Transformer network for AMR-to-text generation
  • Demonstrate that these improvements now allow such a system to train on a GPU, opening new opportunities for future systems based on DAG and multiset automata

Attributes

Attribute NameValues
Author Justin Donald DeBenedetto
Contributor David Chiang, Research Director
Degree Level Doctoral Dissertation
Degree Discipline Computer Science and Engineering
Degree Name Doctor of Philosophy
Banner Code
  • PHD-CSE

Defense Date
  • 2021-03-31

Submission Date 2021-04-19
Record Visibility Public
Content License
  • All rights reserved

Departments and Units
Catalog Record

Files

Please Note: You may encounter a delay before a download begins. Large or infrequently accessed files can take several minutes to retrieve from our archival storage system.