University of Notre Dame

File(s) under embargo

Audio Representation Learning with Deep Neural Networks

thesis
posted on 2024-02-12, 19:35 authored by Mohammad Rasool Izadi

In this dissertation, we examined three sequence-to-sequence representation challenges: source detection and separation, sound event detection, and disentanglement. For each challenge, we introduced distinct models and assessed their performance by conducting experiments on several datasets and by comparing these results with those of other established models. Our study spanned areas such as deep learning and representation learning, and it touched on bioacoustics, urban sounds, singing voices, and speech across a range of specific tasks.

First, we developed a source segmentation model to isolate an undetermined number of bat echolocation calls from mixed sounds. This design used two interconnected models working together. The primary model identified potential sources, while the subsequent model isolated individual sources within the time-frequency domain.

Next, inspired by attention and graph neural networks, we presented a method that incorporates time-level similarities along the time domain. We blended features across layers with our adaptive affinity mixup technique. This enhancement improved the event-F1 score of our sound event detection model by 8.2% on urban sounds.
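The abstract does not give the formulation of adaptive affinity mixup, so the following is only a minimal sketch of the general idea of blending frame-level features by their pairwise similarity. Everything in it is an assumption made for illustration: the function name `affinity_mixup`, the cosine-similarity affinity, the row-wise softmax, and the fixed mixing weight `lam` (the dissertation's technique is described as adaptive, which this sketch is not).

```python
import numpy as np

def affinity_mixup(features, lam=0.5):
    """Blend each time frame with an affinity-weighted average of all frames.

    features: array of shape (T, D), one D-dimensional vector per time frame.
    lam: hypothetical fixed mixing weight in [0, 1]; 0 returns the input unchanged.
    """
    # Cosine similarity between every pair of time frames gives a (T, T) affinity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-8)
    sim = unit @ unit.T
    # Row-wise softmax so each frame's weights over all frames sum to 1.
    sim = sim - sim.max(axis=1, keepdims=True)
    weights = np.exp(sim)
    weights /= weights.sum(axis=1, keepdims=True)
    # Convex combination of the original features and their weighted average.
    return (1.0 - lam) * features + lam * (weights @ features)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))          # 6 time frames, 4 features each
mixed = affinity_mixup(x, lam=0.3)   # same shape as x
```

With `lam=0` the features pass through unchanged, and larger values pull each frame toward frames it resembles.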

Finally, we investigated weakly supervised disentanglement using a multi-rate latent space. We proposed a framework that represents and generates variable-length sequences from paired samples. The method combines a simple swapping mechanism with variational transformers. We showed theoretically that swapping can attain optimal disentanglement under weak supervision. Experimental results on singing voices, speech, and images showed that our technique consistently outperformed other methods.
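As a rough illustration of what a swapping step on paired samples can look like, the sketch below exchanges a subset of latent dimensions between two codes. It is not the dissertation's framework: the function name `swap_latents`, the flat latent vectors, and the choice of which indices count as shared are all hypothetical, and the multi-rate latent space and variational transformers are left out entirely.

```python
import numpy as np

def swap_latents(z_a, z_b, shared_idx):
    """Exchange the dimensions listed in shared_idx between two latent codes.

    z_a, z_b: latent codes of a paired sample, arrays with the same shape.
    shared_idx: hypothetical indices of the dimensions assumed to encode
        the factor that the pair has in common.
    """
    # Copy first so the inputs are left untouched.
    z_a_swapped = z_a.copy()
    z_b_swapped = z_b.copy()
    z_a_swapped[..., shared_idx] = z_b[..., shared_idx]
    z_b_swapped[..., shared_idx] = z_a[..., shared_idx]
    return z_a_swapped, z_b_swapped

z_a = np.array([1.0, 2.0, 3.0, 4.0])
z_b = np.array([5.0, 6.0, 7.0, 8.0])
a2, b2 = swap_latents(z_a, z_b, shared_idx=[0, 1])
```

If the swapped dimensions really carry only the factor the pair shares, decoding `a2` and `b2` should reproduce the original samples, which is the kind of constraint a swapping-based training objective can exploit.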

In conclusion, this dissertation offers innovative approaches to sequence-to-sequence representation challenges, emphasizing the blend of cutting-edge techniques and practical applications. Our findings not only advance the current understanding of sound source detection, event detection, and sequential disentanglement but also set a precedent for future research in these areas. The consistent improvements observed across various tasks underscore the potential of our proposed methods in diverse audio domains, hinting at broader applications and further explorations in representation learning.

History

Defense Date

2022-05-31

CIP Code

  • 14.1001

Research Director(s)

Robert L. Stevenson

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

OCLC Number

1406770862

Program Name

  • Electrical Engineering
