Generic Entity Resolution with Data Confidences

Authors: 
Menestrina, D.; Benjelloun, O.; Garcia-Molina, H.
Year: 
2006
Venue: 
Clean DB, 2006
URL: 
http://pike.psu.edu/cleandb06/papers/CameraReady_121.pdf
Citations: 
44
Attachment: 
Menestrina2006GenericEntityResolutionwith.pdf (223.96 KB)

We consider the Entity Resolution (ER) problem (also known
as deduplication, or merge-purge), in which records determined
to represent the same real-world entity are successively
located and merged. Our approach to the ER problem
is generic, in the sense that the functions for comparing and
merging records are viewed as black boxes. In this context,
managing numerical confidences along with the data makes
the ER problem more challenging to define (e.g., how should
confidences of merged records be combined?), and more expensive
to compute. In this paper, we propose a sound and
flexible model for the ER problem with confidences, together
with efficient algorithms to solve it. We validate our
algorithms through experiments that show significant performance
improvements over naive schemes.
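
To make the "black box" framing concrete, the following is a minimal sketch of a generic resolution loop, not the algorithms proposed in the paper. Everything named here is a hypothetical illustration: the Record type, the resolve function, the example match rule (two records match if they share any attribute-value pair), and the example merge rule (union the attributes and keep the lower confidence). The sketch also assumes that a merged record can safely replace the two records it came from, which may not hold in general once confidences are attached to the data.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Record:
    # Hypothetical record shape: a set of (attribute, value) pairs
    # plus one numerical confidence in [0, 1].
    attrs: FrozenSet[Tuple[str, str]]
    conf: float


# The resolver only ever sees these two signatures (the "black boxes").
MatchFn = Callable[[Record, Record], bool]
MergeFn = Callable[[Record, Record], Record]


def resolve(records: Iterable[Record], match: MatchFn, merge: MergeFn) -> List[Record]:
    """Naive fixed-point loop: merge matching records until none match.

    Assumption: a merged record replaces both of its sources.
    """
    pending = list(records)
    resolved: List[Record] = []
    while pending:
        current = pending.pop()
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # Re-queue the merged record so it is compared against the rest.
            resolved.remove(partner)
            pending.append(merge(current, partner))
    return resolved


def share_any_attribute(a: Record, b: Record) -> bool:
    # Example match rule (an assumption, not from the paper).
    return not a.attrs.isdisjoint(b.attrs)


def union_with_min_confidence(a: Record, b: Record) -> Record:
    # Example merge rule (an assumption): one of many possible ways
    # to combine the confidences of merged records.
    return Record(a.attrs | b.attrs, min(a.conf, b.conf))


r1 = Record(frozenset({("name", "J. Doe"), ("phone", "555-1234")}), 0.9)
r2 = Record(frozenset({("name", "J. Doe"), ("email", "jdoe@example.com")}), 0.8)
r3 = Record(frozenset({("name", "A. Smith")}), 0.7)

result = resolve([r1, r2, r3], share_any_attribute, union_with_min_confidence)
```

Because resolve depends only on the two function signatures, a different comparison or confidence-combination rule can be swapped in without changing the loop. Each merge reduces the total number of records by one, so the loop terminates.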