A Primitive Operator for Similarity Joins in Data Cleaning

Authors: 
Chaudhuri, S.; Ganti, V.; Kaushik, R.
Year: 
2006
Venue: 
ICDE, 2006
URL: 
http://csdl.computer.org/dl/proceedings/icde/2006/2570/00/25700005.pdf
Citations: 
201
Citations range: 
100 - 499
Attachment:
Chaudhuri2006APrimitiveOperatorfor.pdf (397.89 KB)

Data cleaning based on similarities involves identification
of “close” tuples, where closeness is evaluated using a
variety of similarity functions chosen to suit the domain and
application. Current approaches for efficiently implementing
such similarity joins are tightly tied to the chosen similarity
function. In this paper, we propose a new primitive
operator which can be used as a foundation to implement
similarity joins according to a variety of popular string similarity
functions, and notions of similarity which go beyond
textual similarity. We then propose efficient implementations
for this operator. In an experimental evaluation using real
datasets, we show that the implementation of similarity joins
using our operator is comparable to, and often substantially
better than, previous customized implementations for particular
similarity functions.
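
The abstract does not spell out the operator itself, so the following is only an illustrative sketch, not the authors' implementation. It assumes the primitive is a join on set overlap: each string is turned into a set of q-grams, and two strings join when their sets share at least a given number of elements. A Jaccard-based join is then layered on top of it to show how one primitive could serve several similarity functions. The names `qgrams`, `overlap_join`, and `jaccard_join`, the padding character `#`, and the default q = 3 are all assumptions made for this example.

```python
from collections import defaultdict


def qgrams(s, q=3):
    """Return the set of q-grams of s, lowercased and padded with '#'."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def overlap_join(left, right, min_overlap, q=3):
    """Return (l, r, overlap) for pairs sharing at least min_overlap q-grams."""
    # Inverted index: q-gram -> positions of the right-side strings containing it.
    index = defaultdict(list)
    for j, r in enumerate(right):
        for g in qgrams(r, q):
            index[g].append(j)

    results = []
    for l in left:
        # Count shared q-grams per right-side candidate.
        counts = defaultdict(int)
        for g in qgrams(l, q):
            for j in index.get(g, ()):
                counts[j] += 1
        for j, c in sorted(counts.items()):
            if c >= min_overlap:
                results.append((l, right[j], c))
    return results


def jaccard_join(left, right, threshold, q=3):
    """Jaccard similarity join built on top of overlap_join (assumed layering)."""
    results = []
    # Any pair with nonzero Jaccard similarity shares at least one q-gram.
    for l, r, overlap in overlap_join(left, right, 1, q):
        union = len(qgrams(l, q)) + len(qgrams(r, q)) - overlap
        if overlap / union >= threshold:
            results.append((l, r))
    return results


if __name__ == "__main__":
    names_a = ["Microsoft Corp"]
    names_b = ["Microsoft Corporation", "Oracle"]
    print(overlap_join(names_a, names_b, min_overlap=5))
    print(jaccard_join(names_a, names_b, threshold=0.5))
```

The sketch probes an inverted index and counts shared q-grams for every candidate, which is the simplest way to evaluate an overlap predicate; it makes no attempt to reproduce the efficient implementations the abstract refers to.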