Data cleaning and integration is typically the most expensive step in the KDD process. A key part, known as record linkage or
de-duplication, is identifying which records in a database refer to the
same entities. This problem is traditionally solved separately for each
candidate record pair (followed by transitive closure). We propose
instead a multi-relational approach that performs simultaneous inference
for all candidate pairs and allows information to propagate from one
candidate match to another via the attributes they have in common. Our
formulation is based on conditional random fields, and allows an optimal
solution to be found in polynomial time using a graph cut algorithm.
Parameters are learned using a voted perceptron algorithm. Experiments
on real and synthetic databases show that multi-relational record linkage
outperforms the standard approach.
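
For orientation, the standard approach mentioned above (an independent decision for each candidate pair, followed by transitive closure) can be sketched as follows. This is only a minimal illustration of that baseline, not the multi-relational method proposed here: the Jaccard word-overlap `similarity`, the 0.5 `threshold`, and the function name `link_records` are all assumptions made for the example, standing in for whatever learned pairwise matcher is actually used.

```python
from itertools import combinations


def similarity(a, b):
    """Jaccard overlap of word sets (hypothetical stand-in for a learned pairwise matcher)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def find(parent, i):
    """Return the representative of i's cluster, compressing the path as we go."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def link_records(records, threshold=0.5):
    """Decide each candidate pair independently, then take the transitive closure."""
    parent = list(range(len(records)))
    for i, j in combinations(range(len(records)), 2):
        # Each pair is judged in isolation; no information flows between pairs.
        if similarity(records[i], records[j]) >= threshold:
            parent[find(parent, i)] = find(parent, j)  # union step = transitive closure
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


records = ["john smith seattle", "john smith", "j smith seattle", "mary jones"]
clusters = link_records(records)
print(clusters)
```

In this toy input, records 1 and 2 are not matched directly, yet they end up in the same cluster because each matches record 0. The closure step is the only point at which one pairwise decision affects another.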