TY - JOUR
T1 - Multistage linguistic conditioning of convolutional layers for speech emotion recognition
AU - Triantafyllopoulos, Andreas
AU - Reichel, Uwe
AU - Liu, Shuo
AU - Huber, Stephan
AU - Eyben, Florian
AU - Schuller, Björn W.
N1 - Publisher Copyright:
Copyright © 2023 Triantafyllopoulos, Reichel, Liu, Huber, Eyben and Schuller.
PY - 2023/2/9
Y1 - 2023/2/9
AB - Introduction: The effective fusion of text and audio information for categorical and dimensional speech emotion recognition (SER) remains an open issue, especially given the vast potential of deep neural networks (DNNs) to provide a tighter integration of the two. Methods: In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional SER. We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a DNN, and contrast it with a single-stage one where the streams are merged at a single point. Both methods depend on extracting summary linguistic embeddings from a pre-trained BERT model, and conditioning one or more intermediate representations of a convolutional model operating on log-Mel spectrograms. Results: Experiments on the MSP-Podcast and IEMOCAP datasets demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline and their unimodal constituents, both in terms of quantitative performance and qualitative behavior. Discussion: Overall, our multistage fusion shows better quantitative performance, surpassing alternatives on most of our evaluations. This illustrates the potential of multistage fusion in better assimilating text and audio information.
KW - machine learning
KW - multimodal fusion
KW - natural language processing
KW - speech emotion recognition
KW - speech processing
UR - http://www.scopus.com/inward/record.url?scp=85148588644&partnerID=8YFLogxK
U2 - 10.3389/fcomp.2023.1072479
DO - 10.3389/fcomp.2023.1072479
M3 - Article
AN - SCOPUS:85148588644
SN - 2624-9898
VL - 5
JO - Frontiers in Computer Science
JF - Frontiers in Computer Science
M1 - 1072479
ER -