TY - JOUR
T1 - Statistical biases due to anonymization evaluated in an open clinical dataset from COVID-19 patients
AU - NAPKON Study Group
AU - NAPKON Use & Access Committee
AU - NAPKON Steering Committee
AU - NAPKON Study Site Group
AU - NAPKON Infrastructure Group
AU - Koll, Carolin E.M.
AU - Hopff, Sina M.
AU - Meurers, Thierry
AU - Lee, Chin Huang
AU - Kohls, Mirjam
AU - Stellbrink, Christoph
AU - Thibeault, Charlotte
AU - Reinke, Lennart
AU - Steinbrecher, Sarah
AU - Schreiber, Stefan
AU - Mitrov, Lazar
AU - Frank, Sandra
AU - Miljukov, Olga
AU - Erber, Johanna
AU - Hellmuth, Johannes C.
AU - Reese, Jens Peter
AU - Steinbeis, Fridolin
AU - Bahmer, Thomas
AU - Hagen, Marina
AU - Meybohm, Patrick
AU - Hansch, Stefan
AU - Vadász, István
AU - Krist, Lilian
AU - Jiru-Hillmann, Steffi
AU - Prasser, Fabian
AU - Vehreschild, Jörg Janne
AU - Witzke, O.
AU - Schmidt, G.
AU - Milger, K.
AU - Friedrichs, A.
AU - Ellert, C.
AU - von Lilienfeld-Toal, M.
AU - Schreiber, S.
AU - Neuhauser, H.
AU - Heyder, R.
AU - Herold, S.
AU - Brochhagen, L.
AU - Otte, M.
AU - Madel, R. J.
AU - Krawczyk, A.
AU - Elsner, C.
AU - Dolff, S.
AU - Zeh, S.
AU - Santibanez-Santana, M.
AU - Papenbrock, J.
AU - Nussbeck, S.
AU - Moerer, O.
AU - Kettwig, M.
AU - Hermanns, G.
AU - Hafke, A.
N1 - Publisher Copyright: © The Author(s) 2022.
PY - 2022/12
Y1 - 2022/12
N2 - Anonymization has the potential to foster the sharing of medical data. State-of-the-art methods use mathematical models to modify data to reduce privacy risks. However, the degree of protection must be balanced against the impact on statistical properties. We studied an extreme case of this trade-off: the statistical validity of an open medical dataset based on the German National Pandemic Cohort Network (NAPKON), which was prepared for publication using a strong anonymization procedure. Descriptive statistics and results of regression analyses were compared before and after anonymization of multiple variants of the original dataset. Despite significant differences in value distributions, the statistical bias was found to be small in all cases. In the regression analyses, the median absolute deviations of the estimated adjusted odds ratios for different sample sizes ranged from 0.01 [minimum = 0, maximum = 0.58] to 0.52 [minimum = 0.25, maximum = 0.91]. Disproportionate impact on the statistical properties of data is a common argument against the use of anonymization. Our analysis demonstrates that anonymization can actually preserve validity of statistical results in relatively low-dimensional data.
UR - http://www.scopus.com/inward/record.url?scp=85144597072&partnerID=8YFLogxK
U2 - 10.1038/s41597-022-01669-9
DO - 10.1038/s41597-022-01669-9
M3 - Article
C2 - 36543828
AN - SCOPUS:85144597072
SN - 2052-4463
VL - 9
JO - Scientific Data
JF - Scientific Data
IS - 1
M1 - 776
ER -