Flexible string matching against large databases in practice


Mon, 10/09/2006 - 12:58 — thor

Authors:

Koudas, N.; Marathe, A.; Srivastava, D.

Year:

2004

Venue:

Proceedings of VLDB, 2004

URL:

http://www.vldb.org/conf/2004/IND3P3.PDF

Citations range:

50 - 99

Attachment	Size
Koudas2004Flexiblestringmatching.pdf	143.35 KB

Data cleaning is an important process that has been at the center of research interest in recent years. Poor data quality has many causes, including data entry errors and multiple conventions for recording database fields, and it has a significant impact on a variety of business issues. Hence, there is a pressing need for technologies that enable flexible (fuzzy) matching of string information in a database. Cosine similarity with tf-idf is a well-established metric for comparing text, and recent proposals have adapted this similarity measure for flexibly matching a query string with values in a single attribute of a relation.

In deploying tf-idf based flexible string matching against real AT&T databases, we observed that this technique needed to be enhanced in many ways. First, along the functionality dimension, there was a need to flexibly match along multiple string-valued attributes and to take advantage of known semantic equivalences. Second, we identified various performance enhancements to speed up the matching process, potentially trading off a small degree of accuracy for substantial performance gains. In this paper, we report on our techniques and experience in dealing with flexible string matching against real AT&T databases.
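
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of tf-idf weighted cosine similarity used to match a query string against the values of a single attribute. It is a generic illustration, not the paper's implementation: the whitespace tokenizer, the smoothed idf formula, the function names (`build_idf`, `tfidf_vector`, `best_matches`), and the sample values are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens. Q-grams are another common choice.
    return text.lower().split()


def build_idf(values):
    # Document frequency: number of attribute values containing each token.
    n = len(values)
    df = Counter()
    for value in values:
        df.update(set(tokenize(value)))
    # Smoothed idf (an assumed variant) so unseen query tokens get a finite weight.
    return lambda token: math.log((1 + n) / (1 + df.get(token, 0))) + 1.0


def tfidf_vector(text, idf):
    # Term frequency times idf, normalized to unit length.
    tf = Counter(tokenize(text))
    vec = {tok: count * idf(tok) for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


def cosine(u, v):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def best_matches(query, values, k=3):
    # Score every attribute value against the query and return the top k.
    idf = build_idf(values)
    q = tfidf_vector(query, idf)
    scored = [(cosine(q, tfidf_vector(v, idf)), v) for v in values]
    return sorted(scored, reverse=True)[:k]


if __name__ == "__main__":
    names = ["AT&T Labs Research", "AT&T Corporation", "Bell Laboratories"]
    for score, name in best_matches("AT&T Labs", names):
        print(f"{score:.3f}  {name}")
```

This brute-force version scores every value in the attribute, which is the kind of cost the abstract's performance enhancements aim to reduce. Rare tokens such as "labs" receive a higher weight than common ones such as "at&t", so a query that shares a rare token with a value ranks that value higher.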
