TY - JOUR
T1 - An Evaluation of Speech-Based Recognition of Emotional and Physiological Markers of Stress
AU - Baird, Alice
AU - Triantafyllopoulos, Andreas
AU - Zänkert, Sandra
AU - Ottl, Sandra
AU - Christ, Lukas
AU - Stappen, Lukas
AU - Konzok, Julian
AU - Sturmbauer, Sarah
AU - Meßner, Eva Maria
AU - Kudielka, Brigitte M.
AU - Rohleder, Nicolas
AU - Baumeister, Harald
AU - Schuller, Björn W.
N1 - Publisher Copyright:
Copyright © 2021 Baird, Triantafyllopoulos, Zänkert, Ottl, Christ, Stappen, Konzok, Sturmbauer, Meßner, Kudielka, Rohleder, Baumeister and Schuller.
PY - 2021/12/6
Y1 - 2021/12/6
AB - Life in modern societies is fast-paced and full of stress-inducing demands. The development of stress-monitoring methods is a growing area of research due to the personal and economic advantages that timely detection provides. Studies have shown that speech-based features can be utilised to robustly predict several physiological markers of stress, including emotional state, continuous heart rate, and the stress hormone cortisol. In this contribution, we extend previous works by the authors, utilising three German-language corpora comprising more than 100 subjects undergoing a Trier Social Stress Test protocol. We present cross-corpus and transfer-learning results which explore the efficacy of the speech signal for predicting three physiological markers of stress: sequentially measured saliva-based cortisol, continuous heart rate as beats per minute (BPM), and continuous respiration. For this, we extract several features from audio as well as video and apply various machine learning architectures, including a temporal context-based Long Short-Term Memory Recurrent Neural Network (LSTM-RNN). For the task of predicting cortisol levels from speech, deep learning improves on results obtained by conventional support vector regression, yielding a Spearman correlation coefficient (ρ) of 0.770 and 0.698 for cortisol measurements taken 10 and 20 min after the stress period for the two applicable corpora. This shows that audio features alone are sufficient for predicting cortisol, with audiovisual fusion improving these results to an extent. We also obtain a Root Mean Square Error (RMSE) of 38 and 22 BPM for continuous heart rate prediction on the two corpora where this information is available, and a normalised RMSE (NRMSE) of 0.120 for respiration prediction (−10:10). Both of these continuous physiological signals prove to be highly effective markers of stress (based on a cortisol grouping analysis), both when available as ground truth and when predicted using speech. This contribution opens up new avenues for future exploration of these signals as proxies for stress in naturalistic settings.
KW - affective computing
KW - computer audition
KW - multimodal
KW - paralinguistics
KW - stress
UR - http://www.scopus.com/inward/record.url?scp=85121867630&partnerID=8YFLogxK
U2 - 10.3389/fcomp.2021.750284
DO - 10.3389/fcomp.2021.750284
M3 - Article
AN - SCOPUS:85121867630
SN - 2624-9898
VL - 3
JO - Frontiers in Computer Science
JF - Frontiers in Computer Science
M1 - 750284
ER -