Eliminating fuzzy duplicates in data warehouses

Wed, 09/13/2006 - 15:05 — Anonymous

Authors:

Ananthakrishna, R; Chaudhuri, S; Ganti, V

Year:

2002

Venue:

VLDB 2002

URL:

http://www.cs.ust.hk/vldb2002/VLDB2002-proceedings/papers/S17P01.pdf

Citations:

334

Citations range:

100 - 499

Attachment	Size
Ananthakrishna2002Eliminatingfuzzyduplicatesin.pdf	193.32 KB

The duplicate elimination problem, detecting multiple tuples that describe the same real-world entity, is an important data cleaning problem. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches produce large numbers of false positives when they are also expected to identify domain-specific abbreviations and conventions. In this paper, we develop an algorithm for eliminating duplicates in dimensional tables in a data warehouse, which are usually associated with hierarchies. We exploit these hierarchies to develop a high-quality, scalable duplicate elimination algorithm, and evaluate it on real datasets from an operational data warehouse.
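The contrast drawn in the abstract can be illustrated with a small sketch. This is not the algorithm from the paper: the abstract does not say how the hierarchies are used, so the sketch assumes one plausible reading, namely that two parent values are likely duplicates when their child sets overlap. The Country/State data and the function names `text_sim` and `child_overlap` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def text_sim(a: str, b: str) -> float:
    """Edit distance normalized to a similarity in [0, 1] (case-insensitive)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


def child_overlap(children_a: set, children_b: set) -> float:
    """Jaccard overlap of two child sets (assumed hierarchy-based signal)."""
    union = children_a | children_b
    return len(children_a & children_b) / len(union) if union else 0.0


# Hypothetical Country -> State hierarchy taken from a dimension table.
children = {
    "United States": {"Washington", "Oregon", "Texas"},
    "USA": {"Washington", "Oregon", "California"},
    "UK": {"England", "Scotland", "Wales"},
}

for a, b in [("USA", "United States"), ("USA", "UK")]:
    print(a, "/", b,
          "text:", round(text_sim(a, b), 2),
          "children:", child_overlap(children[a], children[b]))
```

On this toy data, "USA" is textually closer to "UK" than to "United States", so any edit-distance threshold loose enough to catch the abbreviation also admits the false positive. The child-set overlap separates the two cases (0.5 versus 0.0).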
