cs.stanford.edu

Entity resolution with iterative blocking

Authors: 
Whang, Steven Euijong; Menestrina, David; Koutrika, Georgia; Theobald, Martin; Garcia-Molina, Hector
Year: 
2009
Venue: 
SIGMOD

Entity Resolution (ER) is the problem of identifying which records in a database refer to the same real-world entity. An exhaustive ER process involves computing the similarities between pairs of records, which can be very expensive for large datasets. Various blocking techniques can be used to enhance the performance of ER by dividing the records into blocks in multiple ways and only comparing records within the same block. However, most blocking techniques process blocks separately and do not exploit the results of other blocks.
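
As a rough illustration of the blocking idea described in this abstract, the sketch below groups records into blocks in two ways and compares only records that share a block. It shows plain blocking with each block processed independently, not the iterative technique the paper proposes. The record fields, the blocking keys (`zip`, `surname`), and the function names are all hypothetical and chosen only for the example.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, key_funcs):
    """Group record ids into blocks, once per blocking key function."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        for name, fn in key_funcs.items():
            # A block is identified by the key's name and the key's value.
            blocks[(name, fn(rec))].add(rid)
    return blocks


def candidate_pairs(blocks):
    """Collect each pair of record ids that shares at least one block."""
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical toy records; the field names are assumptions for this sketch.
records = {
    1: {"name": "John Smith", "zip": "94305"},
    2: {"name": "Jon Smith", "zip": "94305"},
    3: {"name": "Jane Doe", "zip": "10001"},
    4: {"name": "J. Doe", "zip": "94305"},
}

# Two ways of dividing the records into blocks (both are assumptions).
key_funcs = {
    "zip": lambda r: r["zip"],
    "surname": lambda r: r["name"].split()[-1].lower(),
}

blocks = block_records(records, key_funcs)
pairs = candidate_pairs(blocks)
all_pairs = len(records) * (len(records) - 1) // 2

print(f"comparing {len(pairs)} of {all_pairs} possible pairs")
```

In this toy data the pair (1, 2) appears in both the zip block and the surname block, so a scheme that treats blocks separately would consider it twice. That kind of redundancy across blocks is what the abstract refers to when it says most techniques do not exploit the results of other blocks.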

Generic Entity Resolution with Data Confidences

Authors: 
Menestrina, D.; Benjelloun, O.; Garcia-Molina, H.
Year: 
2006
Venue: 
Clean DB

We consider the Entity Resolution (ER) problem (also known as deduplication, or merge-purge), in which records determined to represent the same real-world entity are successively located and merged. Our approach to the ER problem is generic, in the sense that the functions for comparing and merging records are viewed as black-boxes. In this context, managing numerical confidences along with the data makes the ER problem more challenging to define (e.g., how should confidences of merged records be combined?), and more expensive to compute. In this paper, we propose a sound and

Robust Identification of Fuzzy Duplicates

Authors: 
Chaudhuri, Surajit; Ganti, Venkatesh; Motwani, Rajeev
Year: 
2005
Venue: 
ICDE

Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches.
