Flexible string matching against large databases in practice

Authors: 
Koudas, N.; Marathe, A.; Srivastava, D.
Year: 
2004
Venue: 
Proceedings of VLDB, 2004
URL: 
http://www.vldb.org/conf/2004/IND3P3.PDF
Citations: 
71

Data cleaning is an important process that has been at
the center of research interest in recent years. Poor data
quality is the result of a variety of reasons, including
data entry errors and multiple conventions for recording
database fields, and has a significant impact on a variety
of business issues. Hence, there is a pressing need
for technologies that enable flexible (fuzzy) matching
of string information in a database. Cosine similarity
with tf-idf is a well-established metric for comparing
text, and recent proposals have adapted this similarity
measure for flexibly matching a query string with values
in a single attribute of a relation.
In deploying tf-idf based flexible string matching
against real AT&T databases, we observed that this
technique needed to be enhanced in many ways. First,
along the functionality dimension, there was a
need to flexibly match along multiple string-valued attributes
and to take advantage of known semantic
equivalences. Second, we identified various performance
enhancements to speed up the matching process,
potentially trading off a small degree of accuracy for
substantial performance gains. In this paper, we report
on our techniques and experience in dealing with flexible
string matching against real AT&T databases.
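
As an illustration of the baseline the abstract refers to (cosine similarity with tf-idf between a query string and the values of a single attribute), the following is a minimal sketch and not the paper's implementation. It assumes character q-grams as tokens, a smoothed idf of log(1 + N/df), and hypothetical function names (qgrams, build_idf, tfidf_vector, cosine, match); none of these details come from the abstract.

```python
import math
from collections import Counter


def qgrams(s, q=3):
    """Split a string into overlapping, lowercased character q-grams.

    The string is padded with '#' so that prefixes and suffixes
    contribute q-grams too (an assumption of this sketch).
    """
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def build_idf(values, q=3):
    """Compute a smoothed idf weight for every q-gram in the attribute values."""
    n = len(values)
    df = Counter()
    for v in values:
        df.update(set(qgrams(v, q)))  # count each q-gram once per value
    return {g: math.log(1 + n / c) for g, c in df.items()}


def tfidf_vector(s, idf, q=3):
    """Return the unit-length tf-idf vector of s as a dict.

    Q-grams that never occur in the attribute values are ignored.
    """
    tf = Counter(qgrams(s, q))
    vec = {g: c * idf[g] for g, c in tf.items() if g in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


def match(query, values, q=3, top_k=3):
    """Rank attribute values by tf-idf cosine similarity to the query."""
    idf = build_idf(values, q)
    qv = tfidf_vector(query, idf, q)
    scored = [(cosine(qv, tfidf_vector(v, idf, q)), v) for v in values]
    return sorted(scored, reverse=True)[:top_k]


if __name__ == "__main__":
    names = ["AT&T Labs Research", "AT and T Laboratories", "Bell Labs"]
    for score, value in match("AT&T Labs Reserch", names):
        print(f"{score:.3f}  {value}")
```

This sketch rescans every value for each query, which is only workable for small inputs. The performance enhancements and the multi-attribute and semantic-equivalence extensions mentioned in the abstract are not reproduced here.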