Estimating the number and sizes of fuzzy-duplicate clusters

Arvid Heise, Gjergji Kasneci, Felix Naumann

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

10 Scopus citations

Abstract

Duplicates in a dataset are multiple representations of the same real-world entity and constitute a major data quality problem. This paper investigates the problem of estimating the number and sizes of duplicate record clusters in advance and describes a sampling-based method for solving this problem. In extensive experiments on multiple datasets, we show that the proposed method reliably estimates the number of duplicate clusters while remaining highly efficient. Our method can be used (a) to measure the dirtiness of a dataset, (b) to assess the quality of duplicate detection configurations, such as similarity measures, and (c) to gather approximate statistics about the true number of entities represented in the dataset.

Original language: English
Title of host publication: CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management
Publisher: Association for Computing Machinery
Pages: 959-968
Number of pages: 10
ISBN (Electronic): 9781450325981
State: Published - 3 Nov 2014
Externally published: Yes
Event: 23rd ACM International Conference on Information and Knowledge Management, CIKM 2014 - Shanghai, China
Duration: 3 Nov 2014 → 7 Nov 2014

Publication series

Name: CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management

Conference

Conference: 23rd ACM International Conference on Information and Knowledge Management, CIKM 2014
Country/Territory: China
City: Shanghai
Period: 3/11/14 → 7/11/14

Keywords

  • Cluster
  • Data integration
  • Duplicate
  • Estimation
  • Pair
