DogmatiX tracks down duplicates in XML

Thu, 09/14/2006 - 21:36 — Anonymous

Authors:

Weis, M; Naumann, F

Year:

2005

Venue:

Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data

URL:

http://portal.acm.org/citation.cfm?id=1066157.1066207

DOI:

10.1145/1066157.1066207

Attachment	Size
Weis2005DogmatiXtracksdownduplicatesinXML.pdf	736.82 KB

Duplicate detection is the problem of detecting different entries in a data source representing the same real-world entity. While research abounds in the realm of duplicate detection in relational data, there is yet little work for duplicates in other, more complex data models, such as XML. In this paper, we present a generalized framework for duplicate detection, dividing the problem into three components: candidate definition defining which objects are to be compared, duplicate definition defining when two duplicate candidates are in fact duplicates, and duplicate detection specifying how to efficiently find those duplicates. Using this framework, we propose an XML duplicate detection method, DogmatiX, which compares XML elements based not only on their direct data values, but also on the similarity of their parents, children, structure, etc. We propose heuristics to determine which of these to choose, as well as a similarity measure specifically geared towards the XML data model. An evaluation of our algorithm using several heuristics validates our approach.
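The three components named in the abstract can be illustrated with a toy sketch. This is not the DogmatiX algorithm: the sample data, the function names (`candidates`, `description`, `similarity`, `detect`), the use of `difflib.SequenceMatcher` as the string measure, and the threshold are all assumptions made for illustration. The detection step is a naive all-pairs comparison and makes no attempt at the efficiency the paper addresses.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample data, invented for this sketch.
SAMPLE = """
<catalog>
  <movie><title>The Matrix</title><year>1999</year></movie>
  <movie><title>Matrix, The</title><year>1999</year></movie>
  <movie><title>Signs</title><year>2002</year></movie>
</catalog>
"""

def candidates(root, tag):
    """Candidate definition: here, every element with the given tag."""
    return list(root.iter(tag))

def description(elem):
    """Describe an element by the (tag, text) pairs of itself and its descendants."""
    return [(e.tag, e.text.strip()) for e in elem.iter()
            if e.text and e.text.strip()]

def _directed(da, db):
    """For each value in da, take the best string match among same-tag values in db."""
    scores = []
    for tag, text in da:
        same_tag = [t for g, t in db if g == tag]
        best = max((SequenceMatcher(None, text.lower(), t.lower()).ratio()
                    for t in same_tag), default=0.0)
        scores.append(best)
    return sum(scores) / len(scores)

def similarity(a, b):
    """Symmetric similarity in [0, 1] between two elements (assumed measure)."""
    da, db = description(a), description(b)
    if not da or not db:
        return 0.0
    return (_directed(da, db) + _directed(db, da)) / 2

def detect(root, tag, threshold):
    """Duplicate definition plus detection: report index pairs whose
    similarity reaches the threshold, comparing all pairs naively."""
    cands = candidates(root, tag)
    return [(i, j) for (i, a), (j, b) in combinations(enumerate(cands), 2)
            if similarity(a, b) >= threshold]

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(detect(root, "movie", threshold=0.7))
```

In this sketch only descendant values contribute to the comparison. Taking parents or structural similarity into account, as the abstract describes, would require extending `description` and is left out here.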
