TY - GEN
T1 - Estimating the number and sizes of fuzzy-duplicate clusters
AU - Heise, Arvid
AU - Kasneci, Gjergji
AU - Naumann, Felix
N1 - Publisher Copyright: Copyright 2014 ACM.
PY - 2014/11/3
Y1 - 2014/11/3
N2 - Duplicates in a dataset are multiple representations of the same real-world entity and constitute a major data quality problem. This paper investigates the problem of estimating the number and sizes of duplicate record clusters in advance and describes a sampling-based method for solving this problem. In extensive experiments, on multiple datasets, we show that the proposed method reliably estimates the number of duplicate clusters, while being highly efficient. Our method can be used a) to measure the dirtiness of a dataset, b) to assess the quality of duplicate detection configurations, such as similarity measures, and c) to gather approximate statistics about the true number of entities represented in the dataset.
KW - Cluster
KW - Data integration
KW - Duplicate
KW - Estimation
KW - Pair
UR - http://www.scopus.com/inward/record.url?scp=84937598052&partnerID=8YFLogxK
U2 - 10.1145/2661829.2661885
DO - 10.1145/2661829.2661885
M3 - Conference contribution
AN - SCOPUS:84937598052
T3 - CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management
SP - 959
EP - 968
BT - CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management
PB - Association for Computing Machinery
T2 - 23rd ACM International Conference on Information and Knowledge Management, CIKM 2014
Y2 - 3 November 2014 through 7 November 2014
ER -