A Duplicate Detection Benchmark for XML (and Relational) Data

Wed, 10/11/2006 - 10:59 — cat

Authors:

Weis, M.; Naumann, F.; Brosy, F.

Year:

2006

Venue:

Proc. Workshop on Information Quality for Information Systems (IQIS)

URL:

http://www.hpi.uni-potsdam.de/fileadmin/hpi/FG_Naumann/publications/benchmark_iqis06.pdf

Citations range:

10 - 49

Attachment	Size
Weis2006ADuplicateDetectionBenchmark.pdf	127.47 KB

Duplicate detection, an important subtask of data
cleaning, is the task of identifying multiple representations of the
same real-world object. Numerous approaches exist for both relational
and XML data. They aim either to improve the quality
of the detected duplicates (effectiveness) or to save computation
time (efficiency). For the first goal in particular, the “goodness”
of an approach is usually evaluated in experimental
studies. Although some methods and data sets have gained popularity,
it is still difficult to compare different approaches or to
assess the quality of one's own approach. This difficulty
is mainly due to a lack of documentation of the algorithms and of
the data, software, and hardware used, and/or to limited resources
that do not allow rebuilding systems described by others.
In this paper, we propose a benchmark for duplicate detection,
specialized to XML, which can be part of a broader duplicate detection
or even data cleansing benchmark. We discuss all elements needed
to make up a benchmark: data provisioning, clearly
defined operations (the benchmark workload), and metrics to evaluate
the quality. The proposed benchmark is a step toward
representative comparisons of duplicate detection algorithms. We
note that this benchmark is yet to be implemented and that this paper
is meant to be a starting point for discussion.
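
The abstract does not say which metrics the benchmark would use. As an illustration only, effectiveness in duplicate detection is commonly measured with precision and recall over pairs of records. The sketch below assumes that a gold standard and a detection result are both given as clusters of record identifiers; the function names and the sample identifiers are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def duplicate_pairs(clusters):
    """Expand clusters of record ids into a set of unordered duplicate pairs."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair one canonical orientation, e.g. ("r1", "r2").
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(gold_clusters, detected_clusters):
    """Return (precision, recall, F1) of detected duplicate pairs against gold pairs."""
    gold = duplicate_pairs(gold_clusters)
    found = duplicate_pairs(detected_clusters)
    true_positives = len(gold & found)
    # Convention assumed here: a score is 0.0 when its denominator is empty.
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: r1..r3 describe one object, r4 and r5 another.
gold = [{"r1", "r2", "r3"}, {"r4", "r5"}]
detected = [{"r1", "r2"}, {"r4", "r5", "r6"}]
print(pairwise_scores(gold, detected))
```

Comparing pairs rather than whole clusters gives partial credit to a result that groups some, but not all, representations of an object; other metric definitions are possible.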
