TY - JOUR
T1 - The perception and analysis of the likeability and human likeness of synthesized speech
AU - Baird, Alice
AU - Parada-Cabaleiro, Emilia
AU - Hantke, Simone
AU - Burkhardt, Felix
AU - Cummins, Nicholas
AU - Schuller, Björn
N1 - Publisher Copyright: © 2018 International Speech Communication Association. All rights reserved.
PY - 2018
Y1 - 2018
N2 - The synthesized voice has become an ever present aspect of daily life. Heard through our smart-devices and from public announcements, engineers continue in an endeavour to achieve naturalness in such voices. Yet, the degree to which these methods can produce likeable, human like voices, has not been fully evaluated. With recent advancements in synthetic speech technology suggesting that human like imitation is more obtainable, this study asked 25 listeners to evaluate both the likeability and human likeness of a corpus of 13 German male voices, produced via 5 synthesis approaches (from formant to hybrid unit selection, deep neural network systems), and 1 Human control. Results show that unlike visual artificially intelligent elements - as posed by the concept of the Uncanny Valley - likeability consistently improves along with human likeness for the synthesized voice, with recent methods achieving substantially closer results to human speech than older methods. A small scale acoustic analysis shows that the F0 of hybrid systems correlates less closely to human speech with a higher standard deviation for F0. This analysis suggests that limited variance in F0 is linked to a reduction in human likeness, resulting in lower likeability for conventional synthetic speech methods.
AB - The synthesized voice has become an ever present aspect of daily life. Heard through our smart-devices and from public announcements, engineers continue in an endeavour to achieve naturalness in such voices. Yet, the degree to which these methods can produce likeable, human like voices, has not been fully evaluated. With recent advancements in synthetic speech technology suggesting that human like imitation is more obtainable, this study asked 25 listeners to evaluate both the likeability and human likeness of a corpus of 13 German male voices, produced via 5 synthesis approaches (from formant to hybrid unit selection, deep neural network systems), and 1 Human control. Results show that unlike visual artificially intelligent elements - as posed by the concept of the Uncanny Valley - likeability consistently improves along with human likeness for the synthesized voice, with recent methods achieving substantially closer results to human speech than older methods. A small scale acoustic analysis shows that the F0 of hybrid systems correlates less closely to human speech with a higher standard deviation for F0. This analysis suggests that limited variance in F0 is linked to a reduction in human likeness, resulting in lower likeability for conventional synthetic speech methods.
KW - Human likeness
KW - Likeability
KW - Synthesized voices
UR - http://www.scopus.com/inward/record.url?scp=85055001255&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2018-1093
DO - 10.21437/Interspeech.2018-1093
M3 - Conference article
AN - SCOPUS:85055001255
SN - 2308-457X
VL - 2018-September
SP - 2863
EP - 2867
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018
Y2 - 2 September 2018 through 6 September 2018
ER -