Industry-scale duplicate detection

Authors: 
Weis, M; Naumann, F; Jehle, U; Lufter, J; Schuster, H
Year: 
2008
Venue: 
VLDB
URL: 
http://portal.acm.org/citation.cfm?id=1454165
Citations: 
26
Citations range: 
10 - 49

Duplicate detection is the process of identifying multiple
representations of the same real-world object in a data source.
It is a problem of critical importance in many applications,
including customer relationship management, personal information
management, and data mining.
In this paper, we present how a research prototype, namely
DogmatiX, which was designed to detect duplicates in
hierarchical XML data, was successfully extended and applied
to a large-scale industrial relational database in cooperation
with Schufa Holding AG. Schufa's main business line is to store
and retrieve the credit histories of over 60 million individuals.
Here, correctly identifying duplicates is critical both for
individuals and for companies: On the one hand, an incorrectly
identified duplicate can result in a false negative credit
history for an individual, who will then no longer be granted
credit. On the other hand, it is essential for companies that
Schufa detect duplicates of a person who deliberately tries to
create a new identity in the database in order to obtain a clean
credit history.
Besides the quality of duplicate detection, i.e., its
effectiveness, scalability cannot be neglected because of the
considerable size of the database. We describe our solution to
both problems and present a comprehensive evaluation based on
large volumes of real-world data.
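To make the task concrete: duplicate detection typically compares pairs of records field by field and flags a pair as duplicates when an aggregate similarity exceeds a threshold. The sketch below is a minimal, generic illustration of that idea, not the DogmatiX algorithm described in the paper; the field names, records, and threshold are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_duplicate(rec1: dict, rec2: dict,
                 fields=("name", "address"), threshold=0.85) -> bool:
    """Flag a pair as duplicates if the average per-field
    similarity reaches the (hypothetical) threshold."""
    score = sum(similarity(rec1[f], rec2[f]) for f in fields) / len(fields)
    return score >= threshold

# Hypothetical records: r1 and r2 describe the same person with
# slightly different spellings; r3 is a distinct person.
r1 = {"name": "Jon Smith",   "address": "12 Main St, Berlin"}
r2 = {"name": "John Smith",  "address": "12 Main Street, Berlin"}
r3 = {"name": "Maria Lopez", "address": "8 Oak Ave, Hamburg"}

print(is_duplicate(r1, r2))  # True
print(is_duplicate(r1, r3))  # False
```

A naive all-pairs comparison like this is quadratic in the number of records, which is exactly why the scalability concerns raised in the abstract matter at Schufa's scale of 60 million individuals; production systems first partition or block the data so that only plausible candidate pairs are compared.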