Speech augmentation via speaker-specific noise in unseen environment

Ya'nan Guo, Ziping Zhao, Yide Ma, Björn Schuller

Research output: Contribution to journalConference articlepeer-review

Abstract

Speech augmentation is a common and effective strategy to avoid overfitting and improve on the robustness of an emotion recognition model. In this paper, we investigate for the first time the intrinsic attributes in a speech signal using the multi-resolution analysis theory and the Hilbert-Huang Spectrum, with the goal of developing a robust speech augmentation approach from raw speech data. Specifically, speech decomposition in a double tree complex wavelet transform domain is realized, to obtain sub-speech signals; then, the Hilbert Spectrum using Hilbert-Huang Transform is calculated for each sub-band to capture the noise content in unseen environments with the voice restriction to 100−4000 Hz; finally, the speech-specific noise that varies with the speaker individual, scenarios, environment, and voice recording equipment, can be reconstructed from the top two high-frequency sub-bands to enhance the raw signal. Our proposed speech augmentation is demonstrated using five robust machine learning architectures based on the RAVDESS database, achieving up to 9.3 % higher accuracy compared to the performance on raw data for an emotion recognition task.

Original languageEnglish
Pages (from-to)1781-1785
Number of pages5
JournalProceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume2019-September
DOIs
StatePublished - 2019
Externally publishedYes
Event20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria
Duration: 15 Sep 201919 Sep 2019

Keywords

  • Bidirectional LSTM−Attention
  • Emotion Recognition
  • Speech Augmentation
  • Speech decomposition

Fingerprint

Dive into the research topics of 'Speech augmentation via speaker-specific noise in unseen environment'. Together they form a unique fingerprint.

Cite this