Conventionally, most research on speech processing has focused on automatic speech recognition (ASR), i.e., transcribing speech to text. However, natural speech carries not only textual content but also much other information, such as the speaker's emotion and even health status. This means we can extract more than text from speech and use that information for novel applications. In particular, we can develop speech processing systems for healthcare, such as convenient and low-cost diagnosis, screening, or monitoring solutions. In the first part of this thesis, I investigate how to build speech processing systems for healthcare applications. Specifically, I explore the use of speech systems for monitoring and early diagnosis of autism spectrum disorder, emotional and behavioral disorders, and major depressive disorder.
On the other hand, with the fast-growing number of users and usage scenarios, the security of speech processing systems (e.g., Amazon Alexa) has become a new concern. Recent work has found that speech processing systems are vulnerable to multiple types of attacks, but it remains unclear how dangerous these attacks are in realistic settings. Therefore, in the second part of the thesis, I first systematically explore the vulnerabilities of speech processing systems. I then conduct a focused study on adversarial attacks against deep neural network-based models, since deep neural networks have become the mainstream technique in a variety of speech applications such as speech recognition and speaker identification. Finally, I investigate effective defense strategies to protect speech processing systems against malicious attacks in realistic settings.
Overall, this thesis addresses two orthogonal problems in speech processing, with the goal of broadening the applications and improving the robustness of machine learning-based speech processing systems.