Efficient record linkage in large data sets

Mon, 10/09/2006 - 12:54 — thor

Authors:

Jin, L.; Li, C.; Mehrotra, S.

Year:

2003

Venue:

Eighth International Conference on Database Systems for Advanced Applications, 2003

URL:

http://csdl.computer.org/dl/proceedings/dasfaa/2003/1895/00/18950137.pdf

Citations:

154

Citations range:

100 - 499

This paper describes an efficient approach to record linkage.
Given two lists of records, the record-linkage problem
consists of determining all pairs that are similar to each
other, where the overall similarity between two records is
defined based on domain-specific similarities over individual
attributes constituting the record. The record-linkage
problem arises naturally in the context of data cleansing
that usually precedes data analysis and mining. We explore
a novel approach to this problem. For each attribute
of records, we first map values to a multidimensional
Euclidean space that preserves domain-specific similarity.
Many mapping algorithms can be applied, and we use the
FastMap approach as an example. Given the merging rule
that defines when two records are similar, a set of attributes
is chosen along which the merge will proceed. A multidimensional
similarity join over the chosen attributes is used
to determine similar pairs of records. Our extensive experiments
using real data sets show that our solution has very
good efficiency and accuracy.
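
The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single string attribute, edit distance as the domain-specific similarity, and a naive nested-loop join in place of the multidimensional similarity join the paper refers to. The names `edit_distance`, `fastmap`, `link_records`, and the `slack` parameter are hypothetical and chosen only for illustration.

```python
import math
import random


def edit_distance(a, b):
    """Levenshtein distance between two strings (assumed attribute similarity)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fastmap(items, dist, k, seed=0):
    """Map items to k-dimensional points with a FastMap-style projection."""
    rng = random.Random(seed)
    n = len(items)
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i, j, dim):
        # Squared distance left over after removing the first `dim` coordinates.
        d2 = dist(items[i], items[j]) ** 2
        for h in range(dim):
            d2 -= (coords[i][h] - coords[j][h]) ** 2
        return max(d2, 0.0)

    for dim in range(k):
        # Pivot heuristic: start anywhere, jump to the farthest item, repeat once.
        b = rng.randrange(n)
        a = max(range(n), key=lambda i: residual_sq(i, b, dim))
        b = max(range(n), key=lambda i: residual_sq(a, i, dim))
        dab2 = residual_sq(a, b, dim)
        if dab2 == 0.0:
            break  # no residual distance left; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Projection of item i onto the line through pivots a and b.
            coords[i][dim] = (
                residual_sq(a, i, dim) + dab2 - residual_sq(b, i, dim)
            ) / (2 * dab)
    return coords


def link_records(left, right, max_edits, k=3, slack=1.0):
    """Return pairs (l, r) with edit distance <= max_edits.

    Candidates are filtered in the mapped space first, then verified with
    the true distance. The mapping is approximate for non-Euclidean
    distances, so some true matches may be missed; `slack` widens the filter.
    """
    coords = fastmap(left + right, edit_distance, k)
    lc, rc = coords[: len(left)], coords[len(left):]
    pairs = []
    for i, u in enumerate(lc):
        for j, v in enumerate(rc):
            if math.dist(u, v) <= max_edits * slack:
                if edit_distance(left[i], right[j]) <= max_edits:
                    pairs.append((left[i], right[j]))
    return pairs
```

A call such as `link_records(["smith", "jones"], ["jones", "smyth"], max_edits=1)` returns only verified pairs, because every candidate is rechecked against the true edit distance. In a large-scale setting the nested loop would be replaced by an index-based similarity join over the mapped points.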
