Approximate string joins in a database (almost) for free

Guided search

Click a term to initiate a search.

Keyword search

Approximate string joins in a database (almost) for free

Mon, 10/09/2006 - 12:43 — thor

Authors:

Gravano, L.; Ipeirotis, P.G.; Jagadish, H.V.; Koudas, N.; Muthukrishnan, S.; Srivastava, D.

Author:

Gravano, L

Ipeirotis, P

Jagadish, H

Koudas, N

Muthukrishnan, S

Srivastava, D

Year:

2001

Venue:

Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), 2001

URL:

http://pages.stern.nyu.edu/~panos/publications/vldb2001.pdf

Citations:

411

Citations range:

100 - 499

Attachment	Size
Gravano2001Approximatestringjoinsina.pdf	189.16 KB

String data is ubiquitous, and its management has
taken on particular importance in the past few
years. Approximate queries are very important on
string data especially for more complex queries
involving joins. This is due, for example, to the
prevalence of typographical errors in data, and
multiple conventions for recording attributes such
as name and address. Commercial databases do
not support approximate string joins directly, and
it is a challenge to implement this functionality efficiently
with user-defined functions (UDFs).
In this paper, we develop a technique for building
approximate string join capabilities on top of
commercial databases by exploiting facilities already
available in them. At the core, our technique
relies on matching short substrings of length
q, called q-grams, and taking into account both
positions of individual matches and the total number
of such matches. Our approach applies to both
approximate full string matching and approximate
substring matching, with a variety of possible edit
distance functions. The approximate string match
predicate, with a suitable edit distance threshold,
can be mapped into a vanilla relational expression
and optimized by conventional relational optimizers.
We demonstrate experimentally the benefits
of our technique over the direct use of UDFs, using
commercial database systems and real data.
To study the I/O and CPU behavior of approximate
string join algorithms with variations in edit
distance and q-gram length, we also describe detailed
experiments based on a prototype implementation.

websearch

Data Cleaning publication categorizer

Guided search

Data Cleaning

Data sets

Data type

Paper type

Venue type

Author

Year

mailpart

Citations range

Keyword search

Approximate string joins in a database (almost) for free

Related categories

User login