Efficient record linkage in large data sets

Authors: 
Jin, L.; Li, C.; Mehrotra, S.
Year: 
2003
Venue: 
Eighth International Conference on Database Systems for Advanced Applications, 2003
URL: 
http://csdl.computer.org/dl/proceedings/dasfaa/2003/1895/00/18950137.pdf
Citations: 
154
Citations range: 
100 - 499
Attachment: 
Jin2003Efficientrecordlinkagein.pdf (360.89 KB)

This paper describes an efficient approach to record linkage.
Given two lists of records, the record-linkage problem
consists of determining all pairs that are similar to each
other, where the overall similarity between two records is
defined based on domain-specific similarities over individual
attributes constituting the record. The record-linkage
problem arises naturally in the context of data cleansing
that usually precedes data analysis and mining. We explore
a novel approach to this problem. For each attribute
of records, we first map values to a multidimensional
Euclidean space that preserves domain-specific similarity.
Many mapping algorithms can be applied, and we use the
FastMap approach as an example. Given the merging rule
that defines when two records are similar, a set of attributes
is chosen along which the merge will proceed. A multidimensional
similarity join over the chosen attributes is used
to determine similar pairs of records. Our extensive experiments
using real data sets show that our solution has very
good efficiency and accuracy.
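
The pipeline described in the abstract (map attribute values into a Euclidean space, then run a similarity join there) can be illustrated with a minimal sketch for a single string attribute. Everything below is an assumption made for illustration and is not taken from the paper: the function names (`edit_distance`, `fastmap`, `similarity_join`), the choice of edit distance as the domain-specific measure, the parameters `k`, `eps`, and `tau`, the nested-loop join (a real system would use an indexed multidimensional join), and the final verification of candidates against the original distance. The mapping follows the general FastMap idea of projecting objects onto lines through pairs of distant pivot objects.

```python
import math


def edit_distance(s, t):
    """Levenshtein distance, used here as the assumed domain-specific measure."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def fastmap(objects, dist, k):
    """Map objects to k-dimensional points, FastMap style (illustrative sketch)."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def resid_sq(i, j, dim):
        # Squared distance left over after removing the first `dim` coordinates.
        d2 = dist(objects[i], objects[j]) ** 2
        for c in range(dim):
            d2 -= (coords[i][c] - coords[j][c]) ** 2
        return max(d2, 0.0)  # clamp: non-Euclidean measures can go negative

    for dim in range(k):
        # Pivot heuristic: start anywhere, jump to the farthest object twice.
        b = 0
        a = max(range(n), key=lambda i: resid_sq(i, b, dim))
        b = max(range(n), key=lambda i: resid_sq(a, i, dim))
        dab2 = resid_sq(a, b, dim)
        if dab2 == 0.0:
            break  # all remaining residual distances are zero
        dab = math.sqrt(dab2)
        for i in range(n):
            # Projection of object i onto the line through pivots a and b.
            coords[i][dim] = (resid_sq(a, i, dim) + dab2
                              - resid_sq(b, i, dim)) / (2 * dab)
    return coords


def similarity_join(list_a, list_b, k=4, eps=3.0, tau=2):
    """Return pairs whose mapped points are within eps and whose
    edit distance is at most tau (verification step is an assumption)."""
    values = list_a + list_b              # map both lists with the same pivots
    coords = fastmap(values, edit_distance, k)
    pa, pb = coords[:len(list_a)], coords[len(list_a):]
    pairs = []
    for i, p in enumerate(pa):            # naive nested loop, for clarity only
        for j, q in enumerate(pb):
            if math.dist(p, q) <= eps and \
                    edit_distance(list_a[i], list_b[j]) <= tau:
                pairs.append((list_a[i], list_b[j]))
    return pairs
```

Because FastMap only approximates a non-Euclidean measure such as edit distance, the Euclidean threshold `eps` in this sketch acts as a filter that may miss some true matches; choosing it is a trade-off between candidate-set size and recall. A call such as `similarity_join(names_a, names_b, k=4, eps=3.0, tau=2)` would return the candidate pairs that survive both the filter and the verification.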