In this dissertation, we examined three sequence-to-sequence representation challenges: source detection and separation, sound event detection, and disentanglement. For each challenge, we introduced a dedicated model and evaluated it on several datasets against established baselines. Our work spans deep learning and representation learning, with applications to bioacoustics, urban sounds, singing voice, and speech across a range of tasks.
First, we developed a source separation model that isolates an unknown number of bat echolocation calls from mixtures. The design couples two models: a detection model that identifies candidate sources, and a separation model that isolates each detected source in the time-frequency domain.
Next, inspired by attention and graph neural networks, we presented a method that incorporates frame-level similarities across the time domain, blending features from different layers with our adaptive affinity mixup technique. This enhancement raised the event-based F1 score of our sound event detection model by 8.2\% on urban sounds.
Finally, we investigated weakly supervised disentanglement with a multi-rate latent space, proposing a novel framework that represents and generates variable-length sequences from paired samples. The method combines a simple latent-swapping mechanism with variational transformers, and we proved that swapping can attain optimal disentanglement under weak supervision. Experiments on singing voice, speech, and images confirm that our technique consistently outperforms competing methods.
In conclusion, this dissertation contributes new approaches to sequence-to-sequence representation challenges, combining recent techniques with practical applications. Our findings advance the understanding of sound source detection and separation, sound event detection, and sequential disentanglement, and they lay groundwork for future research in these areas. The consistent improvements observed across tasks underscore the potential of the proposed methods in diverse audio domains and point to broader applications in representation learning.