Industry-scale duplicate detection

Authors: 
Weis, M; Naumann, F; Jehle, U; Lufter, J; Schuster, H
Year: 
2008
Venue: 
VLDB
URL: 
http://portal.acm.org/citation.cfm?id=1454165
Citations: 
26
Citations range: 
10 - 49

Duplicate detection is the process of identifying multiple
representations of the same real-world object in a data source.
It is a problem of critical importance in many applications,
including customer relationship management, personal information
management, and data mining.
In this paper, we present how a research prototype, namely
DogmatiX, which was designed to detect duplicates in
hierarchical XML data, was successfully extended and applied
to a large-scale industrial relational database in cooperation
with Schufa Holding AG. Schufa's main business line is to store
and retrieve the credit histories of over 60 million individuals.
Here, correctly identifying duplicates is critical both for
individuals and for companies: On the one hand, an incorrectly
identified duplicate can result in a false negative credit
history for an individual, who will then no longer be granted
credit. On the other hand, it is essential for companies that
Schufa detect duplicates of a person who deliberately tries to
create a new identity in the database in order to obtain a clean
credit history.
Besides the quality of duplicate detection, i.e., its
effectiveness, scalability cannot be neglected because of the
considerable size of the database. We describe our solution to
both problems and present a comprehensive evaluation based on
large volumes of real-world data.
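To make the task concrete: duplicate detection typically compares pairs of records field by field and flags a pair as duplicates when an aggregate similarity exceeds a threshold. The sketch below is a minimal, generic illustration of that idea, not the DogmatiX algorithm described in the paper; the field names, records, and threshold are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_duplicate(rec1: dict, rec2: dict,
                 fields=("name", "address"), threshold=0.85) -> bool:
    """Flag a pair as duplicates if the average per-field
    similarity reaches the (hypothetical) threshold."""
    score = sum(similarity(rec1[f], rec2[f]) for f in fields) / len(fields)
    return score >= threshold

# Hypothetical records: r1 and r2 describe the same person with
# slightly different spellings; r3 is a distinct person.
r1 = {"name": "Jon Smith",   "address": "12 Main St, Berlin"}
r2 = {"name": "John Smith",  "address": "12 Main Street, Berlin"}
r3 = {"name": "Maria Lopez", "address": "8 Oak Ave, Hamburg"}

print(is_duplicate(r1, r2))  # True
print(is_duplicate(r1, r3))  # False
```

A naive all-pairs comparison like this is quadratic in the number of records, which is exactly why the scalability concerns raised in the abstract matter at Schufa's scale of 60 million individuals; production systems first partition or block the data so that only plausible candidate pairs are compared.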