Robust multi-stream keyword and non-linguistic vocalization detection for computationally intelligent virtual agents

Martin Wöllmer, Erik Marchi, Stefano Squartini, Björn Schuller

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

7 Scopus citations

Abstract

Systems for keyword and non-linguistic vocalization detection in conversational agent applications need to be robust with respect to background noise and different speaking styles. Focussing on the Sensitive Artificial Listener (SAL) scenario, which involves spontaneous, emotionally colored speech, this paper proposes a multi-stream model that applies the principle of Long Short-Term Memory to generate context-sensitive phoneme predictions, which can be used for keyword detection. Further, we investigate the incorporation of noisy training material in order to create noise-robust acoustic models. We show that both strategies can improve recognition performance when evaluated on spontaneous human-machine conversations as contained in the SEMAINE database.

Original language: English
Title of host publication: Advances in Neural Networks - 8th International Symposium on Neural Networks, ISNN 2011
Pages: 496-505
Number of pages: 10
Edition: PART 2
State: Published - 2011
Event: 8th International Symposium on Neural Networks, ISNN 2011 - Guilin, China
Duration: 29 May 2011 - 1 Jun 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 2
Volume: 6676 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 8th International Symposium on Neural Networks, ISNN 2011
Country/Territory: China
City: Guilin
Period: 29/05/11 - 1/06/11

Keywords

  • Conversational agents
  • Keyword spotting
  • Long short-term memory
  • Multi-condition training
