Learning Blocking Schemes for Record Linkage

Authors: Michelson, Matthew; Knoblock, Craig A.
Year: 2006
Venue: AAAI
URL: www.isi.edu/~michelso/paps/aaai06.pdf
Citations: 58
Citations range: 50 - 99

Record linkage is the process of matching records across data
sets that refer to the same entity. One issue within record
linkage is determining which record pairs to consider, since
a detailed comparison between all of the records is impractical.
Blocking addresses this issue by generating candidate
matches as a preprocessing step for record linkage. For example,
in a person matching problem, blocking might return
all people with the same last name as candidate matches. Two
main problems in blocking are selecting which attributes to use
for generating the candidate matches and deciding which methods
to use to compare the selected attributes. These attribute
and method choices constitute a blocking scheme. Previous
approaches to record linkage address the blocking issue
in a largely ad-hoc fashion. This paper presents a machine
learning approach to automatically learn effective blocking
schemes. We validate our approach with experiments that
show our learned blocking schemes outperform the ad-hoc
blocking schemes of non-experts and perform comparably to
those manually built by a domain expert.
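
To make the idea of a blocking scheme concrete, here is a minimal sketch in Python. It is not the paper's learning algorithm; it only illustrates how a fixed scheme, written as a list of (attribute, method) pairs, generates candidate matches. The records, the field names, and the helper functions `exact`, `first_letter`, and `candidates` are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical toy records; field names and values are made up for illustration.
left = [
    {"id": "a1", "first": "Matt", "last": "Michelson"},
    {"id": "a2", "first": "Craig", "last": "Knoblock"},
    {"id": "a3", "first": "Ann", "last": "Smith"},
]
right = [
    {"id": "b1", "first": "Matthew", "last": "Michelson"},
    {"id": "b2", "first": "C.", "last": "Knoblock"},
    {"id": "b3", "first": "Anne", "last": "Smyth"},
]

# Hypothetical comparison methods: each maps an attribute value to a blocking key.
def exact(value):
    return value.lower()

def first_letter(value):
    return value[:1].lower()

def candidates(left, right, scheme):
    """Return id pairs whose keys agree on every (attribute, method) in scheme."""
    def key(record):
        return tuple(method(record[attr]) for attr, method in scheme)

    # Index the right-hand records by blocking key, then probe with the left.
    index = defaultdict(list)
    for r in right:
        index[key(r)].append(r["id"])
    return {(l["id"], rid) for l in left for rid in index.get(key(l), [])}

# The abstract's example: people with the same last name become candidates.
same_last_name = [("last", exact)]
# A looser scheme on the same attribute, using a different comparison method.
last_initial = [("last", first_letter)]

all_pairs = len(left) * len(right)  # size of the full cross product
strict = candidates(left, right, same_last_name)
loose = candidates(left, right, last_initial)
```

On this toy data the exact last-name scheme keeps 2 of the 9 possible pairs but drops the Smith/Smyth pair, while the first-letter scheme keeps 3 pairs including that one. The trade-off between fewer candidates and fewer missed matches is what makes the choice of attributes and methods matter.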