TY - GEN
T1 - Multi-attentive detection of the spider monkey whinny in the (actual) wild
AU - Rizos, Georgios
AU - Lawson, Jenna
AU - Han, Zhuoda
AU - Butler, Duncan
AU - Rosindell, James
AU - Mikolajczyk, Krystian
AU - Banks-Leite, Cristina
AU - Schuller, Björn W.
N1 - Publisher Copyright:
Copyright © 2021 ISCA.
PY - 2021
Y1 - 2021
N2 - We study deep bioacoustic event detection through multi-head attention based pooling, exemplified by wildlife monitoring. In the multiple instance learning framework, a core deep neural network learns a projection of the input acoustic signal into a sequence of embeddings, each representing a segment of the input. Sequence pooling is then required to aggregate the information present in the sequence such that we have a single clip-wise representation. We propose an improvement based on Squeeze-and-Excitation mechanisms upon a recently proposed audio tagging ResNet, and show that it performs significantly better than the baseline, as well as a collection of other recent audio models. We then further enhance our model, by performing an extensive comparative study of recent sequence pooling mechanisms, and achieve our best result using multi-head self-attention followed by concatenation of the head-specific pooled embeddings - better than prediction pooling methods, as well as compared to other recent sequence pooling tricks. We perform these experiments on a novel dataset of spider monkey whinny calls we introduce here, recorded in a rainforest in the South-Pacific coast of Costa Rica, with a promising outlook pertaining to minimally invasive wildlife monitoring.
AB - We study deep bioacoustic event detection through multi-head attention based pooling, exemplified by wildlife monitoring. In the multiple instance learning framework, a core deep neural network learns a projection of the input acoustic signal into a sequence of embeddings, each representing a segment of the input. Sequence pooling is then required to aggregate the information present in the sequence such that we have a single clip-wise representation. We propose an improvement based on Squeeze-and-Excitation mechanisms upon a recently proposed audio tagging ResNet, and show that it performs significantly better than the baseline, as well as a collection of other recent audio models. We then further enhance our model, by performing an extensive comparative study of recent sequence pooling mechanisms, and achieve our best result using multi-head self-attention followed by concatenation of the head-specific pooled embeddings - better than prediction pooling methods, as well as compared to other recent sequence pooling tricks. We perform these experiments on a novel dataset of spider monkey whinny calls we introduce here, recorded in a rainforest in the South-Pacific coast of Costa Rica, with a promising outlook pertaining to minimally invasive wildlife monitoring.
KW - Acoustic event detection
KW - Bioacoustics
KW - Deep attention models
KW - Multiple instance learning
KW - Wildlife monitoring
UR - https://www.scopus.com/pages/publications/85119268381
U2 - 10.21437/Interspeech.2021-1969
DO - 10.21437/Interspeech.2021-1969
M3 - Conference contribution
AN - SCOPUS:85119268381
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 4306
EP - 4310
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Y2 - 30 August 2021 through 3 September 2021
ER -