A comparison of fast blocking methods for record linkage

Authors: 
Baxter, R; Christen, P; Churches, T
Year: 
2003
Venue: 
ACM SIGKDD
URL: 
http://cuttlefish.anu.edu.au/publications/2003/kdd03-6pages.pdf
Citations: 
203
Citations range: 
100 - 499
Attachment: 
Baxter2003Acomparisonoffastblocking.pdf (138.44 KB)

Record linkage of millions of individual health records for ethically-approved research purposes is a computationally expensive task. Blocking methods are used in record linkage systems to reduce the number of candidate record comparison pairs to a feasible number whilst still maintaining linkage accuracy. New blocking methods have been implemented recently using high-dimensional indexing or clustering algorithms. We compare two new blocking methods, bigram indexing and canopy clustering with TFIDF (Term Frequency/Inverse Document Frequency), with two older methods: standard blocking and sorted neighbourhood blocking. The results show that recent blocking methods such as bigram indexing and canopy clustering scale well while maintaining or improving upon record linkage accuracy. These new blocking methods offer the potential for large speed-ups and better accuracy.
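
To make the idea of reducing candidate pairs concrete, the following is a minimal Python sketch of the two older methods named in the abstract, standard blocking and sorted neighbourhood blocking. It is an illustration under stated assumptions, not the authors' implementation: the toy records, the surname blocking key, the function names, and the window size are all made up for this example.

```python
from collections import defaultdict
from itertools import combinations


def standard_blocking(records, key):
    """Group records by an exact blocking key; pair records only within a block."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key(rec)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def sorted_neighbourhood(records, key, window=2):
    """Sort records by key, then pair each record with the next window-1 records."""
    ordered = sorted(records, key=lambda rec_id: (key(records[rec_id]), rec_id))
    pairs = set()
    for i, rec_id in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.add(tuple(sorted((rec_id, other))))
    return pairs


# Hypothetical toy data (not from the paper).
records = {
    1: {"surname": "smith", "given": "john"},
    2: {"surname": "smyth", "given": "jon"},
    3: {"surname": "smith", "given": "joan"},
    4: {"surname": "jones", "given": "mary"},
    5: {"surname": "jones", "given": "marie"},
}


def surname_key(rec):
    # Assumed blocking key: the surname field, used as-is.
    return rec["surname"]


# Without blocking, every record is compared with every other record.
all_pairs = len(records) * (len(records) - 1) // 2
block_pairs = standard_blocking(records, surname_key)
snm_pairs = sorted_neighbourhood(records, surname_key, window=2)

print(all_pairs, len(block_pairs), len(snm_pairs))
```

In this toy case both methods generate fewer candidate pairs than the full comparison. Standard blocking never pairs "smith" with "smyth" because their keys differ, whereas the sliding window does pair records 2 and 3 because they are adjacent in sorted order. Key sensitivity of this kind is what the newer methods in the abstract aim to tolerate better.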