TY - GEN
T1 - Deep Modelling Strategies for Human Confidence Classification using Audio-visual Data
AU - Gudipalli, Yagna
AU - Deshpande, Gauri
AU - Patel, Sachin
AU - Schuller, Björn W.
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Human behavioural expressions, such as confidence, are time-varying entities. Both the vocal and facial cues that convey human confidence vary throughout the duration of analysis. Although the cues from these two modalities are not always in synchrony, they influence each other and the fused outcome. In this paper, we present a deep fusion technique that combines the two modalities and derives a single outcome to infer human confidence. The fused outcome improves the classification performance by capturing the temporal information from both modalities. An analysis of the time-varying nature of expressions in conversations captured in an interview setup is also presented. We collected data from 51 speakers who participated in interview sessions. The average area under the curve (AUC) of uni-modal models using speech and facial expressions is 70.6% and 69.4%, respectively, for classifying confident videos from non-confident ones in a 5-fold cross-validation analysis. Our deep fusion model improves the performance, giving an average AUC of 76.8%.
AB - Human behavioural expressions, such as confidence, are time-varying entities. Both the vocal and facial cues that convey human confidence vary throughout the duration of analysis. Although the cues from these two modalities are not always in synchrony, they influence each other and the fused outcome. In this paper, we present a deep fusion technique that combines the two modalities and derives a single outcome to infer human confidence. The fused outcome improves the classification performance by capturing the temporal information from both modalities. An analysis of the time-varying nature of expressions in conversations captured in an interview setup is also presented. We collected data from 51 speakers who participated in interview sessions. The average area under the curve (AUC) of uni-modal models using speech and facial expressions is 70.6% and 69.4%, respectively, for classifying confident videos from non-confident ones in a 5-fold cross-validation analysis. Our deep fusion model improves the performance, giving an average AUC of 76.8%.
UR - http://www.scopus.com/inward/record.url?scp=85179639427&partnerID=8YFLogxK
U2 - 10.1109/EMBC40787.2023.10340488
DO - 10.1109/EMBC40787.2023.10340488
M3 - Conference contribution
C2 - 38083410
AN - SCOPUS:85179639427
T3 - Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS
BT - 2023 45th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2023 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 45th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2023
Y2 - 24 July 2023 through 27 July 2023
ER -