Modeling and Querying Possible Repairs in Duplicate Detection

Beskales, George; Soliman, Mohamed; Ilyas, Ihab; Ben-David, Shai
Ilyas, I
Beskales, G
Ben-David, S
Soliman, M
Citations range: 
10 - 49
vldb09-370.pdf492.21 KB

One of the most prominent data quality problems is the existence
of duplicate records. Current duplicate elimination procedures usually
produce one clean instance (repair) of the input data, by carefully
choosing the parameters of the duplicate detection algorithms.
Finding the right parameter settings can be hard, and in many cases,
perfect settings do not exist. Furthermore, replacing the input dirty
data with one possible clean instance may result in unrecoverable
errors, for example, identification and merging of possible duplicate
records in health care systems.
In this paper, we treat duplicate detection procedures as data processing
tasks with uncertain outcomes. We concentrate on a family
of duplicate detection algorithms that are based on parameterized
clustering. We propose a novel uncertainty model that compactly
encodes the space of possible repairs corresponding to different
parameter settings. We show how to efficiently support relational
queries under our model, and to allow new types of queries on the
set of possible repairs. We give an experimental study illustrating
the scalability and the efficiency of our techniques in different configurations.