Modeling and Querying Possible Repairs in Duplicate Detection

Guided search

Click a term to initiate a search.

Keyword search

Modeling and Querying Possible Repairs in Duplicate Detection

Thu, 09/03/2009 - 11:06 — koepcke

Authors:

Beskales, George; Soliman, Mohamed; Ilyas, Ihab; Ben-David, Shai

Author:

Ilyas, I

Beskales, G

Ben-David, S

Soliman, M

Year:

2009

Venue:

VLDB

Citations:

Citations range:

10 - 49

Attachment	Size
vldb09-370.pdf	492.21 KB

One of the most prominent data quality problems is the existence
of duplicate records. Current duplicate elimination procedures usually
produce one clean instance (repair) of the input data, by carefully
choosing the parameters of the duplicate detection algorithms.
Finding the right parameter settings can be hard, and in many cases,
perfect settings do not exist. Furthermore, replacing the input dirty
data with one possible clean instance may result in unrecoverable
errors, for example, identification and merging of possible duplicate
records in health care systems.
In this paper, we treat duplicate detection procedures as data processing
tasks with uncertain outcomes. We concentrate on a family
of duplicate detection algorithms that are based on parameterized
clustering. We propose a novel uncertainty model that compactly
encodes the space of possible repairs corresponding to different
parameter settings. We show how to efficiently support relational
queries under our model, and to allow new types of queries on the
set of possible repairs. We give an experimental study illustrating
the scalability and the efficiency of our techniques in different configurations.

cs.uwaterloo.ca

websearch

Data Cleaning publication categorizer

Guided search

Data Cleaning

Data sets

Data type

Paper type

Venue type

Author

Year

mailpart

Citations range

Keyword search

Modeling and Querying Possible Repairs in Duplicate Detection

Related categories

User login