A Fast Linkage Detection Scheme for Multi-Source Information Integration

Aizawa, A; Oyama, K
Web Information Retrieval and Integration
Record linkage refers to techniques for identifying
records associated with the same real-world entities.
Record linkage is not only crucial in integrating
multi-source databases that have been generated independently,
but is also considered to be one of the key
issues in integrating heterogeneous Web resources. However,
when targeting large-scale data, the cost of enumerating
all the possible linkages often becomes impracticably
high. To address this problem, this paper
proposes a fast and efficient method for linkage detection.
The proposed approach has two main features. First, it
exploits a suffix array structure that enables linkage detection
using variable-length n-grams. Second, it dynamically
generates blocks of possibly associated records
using ‘blocking keys’ extracted from already known reliable
linkages. We also report results from preliminary experiments
in which the proposed method was applied to the integration
of four bibliographic databases, which scale up to
more than 10 million records.
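
To make the first idea concrete, the following is a minimal sketch of how a suffix array can surface candidate linkages through shared substrings of variable length. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `min_len` threshold, the lowercase normalization, and the naive suffix sort are all hypothetical choices made for readability, and the second feature (blocking keys extracted from known reliable linkages) is not covered.

```python
from itertools import combinations


def suffix_array(text):
    # Naive construction by sorting all suffixes; adequate for a small
    # illustration, far too slow for millions of records.
    return sorted(range(len(text)), key=lambda i: text[i:])


def shared_prefix_len(text, i, j, sep):
    # Length of the common prefix of the suffixes starting at i and j,
    # cut off at the record separator so matches never span two records.
    n = 0
    while (i + n < len(text) and j + n < len(text)
           and text[i + n] == text[j + n] and text[i + n] != sep):
        n += 1
    return n


def candidate_pairs(records, min_len=8, sep="\x00"):
    # Assumption: records are plain strings and sep never occurs in them.
    if not records:
        return set()
    norm = [r.lower() for r in records]
    text = sep.join(norm) + sep
    # owner[k] is the index of the record that character k belongs to.
    owner = []
    for rid, rec in enumerate(norm):
        owner.extend([rid] * (len(rec) + 1))
    sa = suffix_array(text)
    pairs = set()
    # Suffixes sharing a prefix of at least min_len characters are adjacent
    # in the suffix array, so each such run forms one block of records.
    block = {owner[sa[0]]}
    for prev, cur in zip(sa, sa[1:]):
        if shared_prefix_len(text, prev, cur, sep) >= min_len:
            block.add(owner[cur])
        else:
            pairs.update(combinations(sorted(block), 2))
            block = {owner[cur]}
    pairs.update(combinations(sorted(block), 2))
    return pairs
```

Called on the three strings "Fast linkage detection scheme", "A fast linkage detection scheme for integration", and "Suffix arrays in practice" with `min_len=8`, the sketch pairs only the first two records, since they alone share a substring of eight or more characters. Only records inside the same block are compared, which is what avoids enumerating all possible pairs.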