Abstract
In the search for a standard unit for the recognition of emotion in speech, the whole turn, i.e. the complete stretch of speech by one person in a conversation, is commonly used. For applications such turns often seem favorable, yet sub-turn units are known to be highly effective. We therefore investigate a two-stage approach that provides higher temporal resolution: speech turns are first chunked according to acoustic properties, and after individual chunk analysis, multi-instance learning maps the chunk-level results back to the turn level. Chunking is performed by fast pre-segmentation into emotionally quasi-stationary segments using a one-pass Viterbi beam search with token passing based on MFCC features. Chunk analysis is realized by brute-force construction of a large feature space with subsequent subset selection, SVM classification, and speaker normalization. Extensive tests reveal differences compared to one-stage processing. As an alternative, syllables are used as chunking units.
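The two-stage pipeline can be illustrated compactly. Below is a minimal Python sketch, assuming chunk-level acoustic feature vectors (e.g. MFCC-derived statistics) have already been extracted; the per-speaker z-normalization and the mean-posterior turn mapping are simple illustrative choices and not necessarily the paper's exact schemes.

```python
# Sketch of the two-stage chunk->turn pipeline: speaker-normalized
# chunk features, an SVM chunk classifier, and a multi-instance
# mapping of chunk posteriors to one turn-level emotion label.
# All names and the mapping rule are illustrative assumptions.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def normalize_per_speaker(X, speaker_ids):
    """Z-normalize features separately per speaker (speaker normalization)."""
    Xn = np.empty_like(X, dtype=float)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        Xn[mask] = StandardScaler().fit_transform(X[mask])
    return Xn

def train_chunk_classifier(X_chunks, y_chunks):
    """Stage 1: SVM on individual chunks, with probability outputs
    so the chunk posteriors can be combined at the turn level."""
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(X_chunks, y_chunks)
    return clf

def map_chunks_to_turn(clf, turn_chunks):
    """Stage 2: multi-instance mapping -- average the chunk posteriors
    and pick the emotion class with the highest mean probability."""
    posteriors = clf.predict_proba(turn_chunks)  # one row per chunk
    return clf.classes_[np.argmax(posteriors.mean(axis=0))]

# Toy usage: 40 chunks from 2 speakers, chunk labels inherited from turns.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12))                  # 40 chunks, 12-dim features
speakers = np.repeat([0, 1], 20)
y = rng.integers(0, 2, size=40)                # binary emotion labels
Xn = normalize_per_speaker(X, speakers)
clf = train_chunk_classifier(Xn, y)
turn_label = map_chunks_to_turn(clf, Xn[:5])   # first 5 chunks form one "turn"
print("predicted turn emotion:", turn_label)
```

Averaging chunk posteriors is just one common multi-instance rule; majority voting over chunk decisions or taking the maximum posterior are equally plausible alternatives under the same two-stage structure.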
Original language | English |
---|---|
Pages | 596-600 |
Number of pages | 5 |
DOIs | |
Publication status | Published - 2007 |
Event | 2007 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2007 - Kyoto, Japan, Duration: 9 Dec 2007 → 13 Dec 2007 |
Conference
Conference | 2007 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2007 |
---|---|
Country/Territory | Japan |
City | Kyoto |
Period | 9/12/07 → 13/12/07 |