TY - GEN
T1 - Augmented Datasheets for Speech Datasets and Ethical Decision-Making
AU - Papakyriakopoulos, Orestis
AU - Choi, Anna Seo Gyeong
AU - Thong, William
AU - Zhao, Dora
AU - Andrews, Jerone
AU - Bourke, Rebecca
AU - Xiang, Alice
AU - Koenecke, Allison
N1 - Publisher Copyright: © 2023 Owner/Author.
PY - 2023/6/12
Y1 - 2023/6/12
N2 - Speech datasets are crucial for training Speech Language Technologies (SLT); however, the lack of diversity of the underlying training data can lead to serious limitations in building equitable and robust SLT products, especially along dimensions of language, accent, dialect, variety, and speech impairment - and the intersectionality of speech features with socioeconomic and demographic features. Furthermore, there is often a lack of oversight on the underlying training data - commonly built on massive web-crawling and/or publicly available speech - with regard to the ethics of such data collection. To encourage standardized documentation of such speech data components, we introduce an augmented datasheet for speech datasets, which can be used in addition to "Datasheets for Datasets" [78]. We then exemplify the importance of each question in our augmented datasheet based on in-depth literature reviews of speech data used in domains such as machine learning, linguistics, and health. Finally, we encourage practitioners - ranging from dataset creators to researchers - to use our augmented datasheet to better define the scope, properties, and limits of speech datasets, while also encouraging consideration of data-subject protection and user community empowerment. Ethical dataset creation is not a one-size-fits-all process, but dataset creators can use our augmented datasheet to reflexively consider the social context of related SLT applications and data sources in order to foster more inclusive SLT products downstream.
AB - Speech datasets are crucial for training Speech Language Technologies (SLT); however, the lack of diversity of the underlying training data can lead to serious limitations in building equitable and robust SLT products, especially along dimensions of language, accent, dialect, variety, and speech impairment - and the intersectionality of speech features with socioeconomic and demographic features. Furthermore, there is often a lack of oversight on the underlying training data - commonly built on massive web-crawling and/or publicly available speech - with regard to the ethics of such data collection. To encourage standardized documentation of such speech data components, we introduce an augmented datasheet for speech datasets, which can be used in addition to "Datasheets for Datasets" [78]. We then exemplify the importance of each question in our augmented datasheet based on in-depth literature reviews of speech data used in domains such as machine learning, linguistics, and health. Finally, we encourage practitioners - ranging from dataset creators to researchers - to use our augmented datasheet to better define the scope, properties, and limits of speech datasets, while also encouraging consideration of data-subject protection and user community empowerment. Ethical dataset creation is not a one-size-fits-all process, but dataset creators can use our augmented datasheet to reflexively consider the social context of related SLT applications and data sources in order to foster more inclusive SLT products downstream.
KW - datasets
KW - datasheets
KW - ethics
KW - speech
KW - transparency
UR - http://www.scopus.com/inward/record.url?scp=85163580778&partnerID=8YFLogxK
U2 - 10.1145/3593013.3594049
DO - 10.1145/3593013.3594049
M3 - Conference contribution
AN - SCOPUS:85163580778
T3 - ACM International Conference Proceeding Series
SP - 881
EP - 904
BT - Proceedings of the 6th ACM Conference on Fairness, Accountability, and Transparency, FAccT 2023
PB - Association for Computing Machinery
T2 - 6th ACM Conference on Fairness, Accountability, and Transparency, FAccT 2023
Y2 - 12 June 2023 through 15 June 2023
ER -