Abstract
Speech augmentation is a common and effective strategy for avoiding overfitting and improving the robustness of an emotion recognition model. In this paper, we investigate for the first time the intrinsic attributes of a speech signal using multi-resolution analysis theory and the Hilbert-Huang spectrum, with the goal of developing a robust speech augmentation approach from raw speech data. Specifically, the speech is decomposed in the dual-tree complex wavelet transform domain to obtain sub-band speech signals; then, the Hilbert spectrum is calculated for each sub-band using the Hilbert-Huang transform to capture the noise content in unseen environments, with the voice band restricted to 100–4000 Hz; finally, the speech-specific noise, which varies with the individual speaker, scenario, environment, and recording equipment, is reconstructed from the top two high-frequency sub-bands and used to enhance the raw signal. The proposed speech augmentation is demonstrated with five robust machine learning architectures on the RAVDESS database, achieving up to 9.3% higher accuracy on an emotion recognition task compared with the performance on raw data.
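The decompose, select, and reconstruct idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a plain Haar wavelet transform stands in for the dual-tree complex wavelet transform, the Hilbert spectrum analysis is omitted entirely, and the additive mixing with a `gain` parameter is an assumption, since the abstract does not say how the reconstructed noise is combined with the raw signal. All function names are hypothetical.

```python
import math


def haar_step(x):
    """One Haar analysis level: returns (approximation, detail), each half as long as x."""
    s = math.sqrt(2.0)
    half = len(x) // 2
    approx = [(x[2 * i] + x[2 * i + 1]) / s for i in range(half)]
    detail = [(x[2 * i] - x[2 * i + 1]) / s for i in range(half)]
    return approx, detail


def haar_inverse_step(approx, detail):
    """Inverse of haar_step: interleaves the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


def decompose(x, levels):
    """Multi-level Haar decomposition.

    Returns (coarsest approximation, details), where details[0] is the
    highest-frequency sub-band and details[-1] the lowest-frequency one.
    """
    if len(x) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx = list(x)
    details = []
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details


def reconstruct(approx, details):
    """Inverse of decompose: rebuilds the signal from coarse to fine."""
    for d in reversed(details):
        approx = haar_inverse_step(approx, d)
    return approx


def high_band_component(x, levels=4, keep=2):
    """Signal rebuilt from only the `keep` highest-frequency sub-bands."""
    approx, details = decompose(x, levels)
    zero_approx = [0.0] * len(approx)
    kept = [d if i < keep else [0.0] * len(d) for i, d in enumerate(details)]
    return reconstruct(zero_approx, kept)


def augment(x, levels=4, keep=2, gain=1.0):
    """Assumed augmentation rule: add the high-band component back, scaled by gain."""
    noise = high_band_component(x, levels, keep)
    return [xi + gain * ni for xi, ni in zip(x, noise)]
```

Calling `augment(samples, levels=4, keep=2, gain=0.5)` on a list of float samples whose length is a multiple of 16 returns a signal of the same length. For real speech, a dual-tree complex wavelet library and a Hilbert-spectrum step would replace the Haar stand-in.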
Original language | English |
---|---|
Pages (from-to) | 1781-1785 |
Number of pages | 5 |
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
Volume | 2019-September |
State | Published - 2019 |
Externally published | Yes |
Event | 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019, Graz, Austria (15 Sep 2019 → 19 Sep 2019) |
Keywords
- Bidirectional LSTM-Attention
- Emotion Recognition
- Speech Augmentation
- Speech Decomposition