An efficient domain-independent algorithm for detecting approximately duplicate database records

Guided search

Click a term to initiate a search.

Keyword search

An efficient domain-independent algorithm for detecting approximately duplicate database records

Mon, 10/09/2006 - 13:11 — thor

Authors:

Monge, A.E.; Elkan, C.

Author:

Monge, A

Elkan, C

Year:

1997

Venue:

Research Issues on Data Mining and Knowledge Discovery, 1997

URL:

http://citeseer.ist.psu.edu/monge97efficient.html

Citations:

364

Citations range:

100 - 499

Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same realworld entity because of data entry errors, because of unstandardized abbreviations, or because of differences in the detailed schemas of records from multiple databases, among other reasons. In this paper, we present an efficient algorithm for recognizing clusters of approximately duplicate records. Three key ideas distinguish the algorithm presented. First, a version of the Smith-Waterman algrotihm for computing minimum edit-distance is used as a domain-independent method to recognize pairs of approcimately duplicate records. Second, the union/find algrotihm is used to keep track of clusters of duplicate records incrementally, as pairwse duplicate relationships are discovered. Third, the algorthm uses a priority queue of cluster subsets to respond adaptively to the size and homogeneity of the clusters discovered as the database is scanned. This typically reduces by over 75% the number of times that the expensive pairwise record matching (Smith-Waterman or other) is applied without impairing accuracy. Comprehensive experiments on synthetic databases and on a real database of bibliographic records confirm the effectiveness of the new algorithms.

websearch

Data Cleaning publication categorizer

Guided search

Data Cleaning

Data sets

Data type

Paper type

Venue type

Author

Year

mailpart

Citations range

Keyword search

An efficient domain-independent algorithm for detecting approximately duplicate database records

Related categories

User login