Industry-scale duplicate detection

Weis, M; Naumann, F; Jehle, U; Lufter, J; Schuster, H

Duplicate detection is the process of identifying multiple representations of the same real-world object in a data source. It is a problem of critical importance in many applications, including customer relationship management, personal information management, and data mining.
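To make the problem concrete, the following sketch compares two records field by field with a normalized string similarity and flags them as duplicates above a threshold. This is only an illustration of the general task, not the DogmatiX algorithm; the records, the equal field weighting, and the threshold of 0.85 are invented for the example.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1], via difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_duplicate(rec1: dict, rec2: dict, threshold: float = 0.85) -> bool:
    """Average the per-field similarity over the fields both records share."""
    fields = rec1.keys() & rec2.keys()
    score = sum(similarity(str(rec1[f]), str(rec2[f])) for f in fields) / len(fields)
    return score >= threshold

a = {"name": "Jonathan Smith", "city": "Berlin"}
b = {"name": "Jonathon Smith", "city": "Berlin"}
print(is_duplicate(a, b))  # True: one-letter typo in the name, same city
```

Real systems weight fields by how discriminating they are (a shared birth date says more than a shared city); the uniform average here is the simplest possible choice.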
In this paper, we present how DogmatiX, a research prototype designed to detect duplicates in hierarchical XML data, was successfully extended and applied to a large-scale industrial relational database in cooperation with Schufa Holding AG. Schufa's main business line is to store and retrieve the credit histories of over 60 million individuals. Here, correctly identifying duplicates is critical
both for individuals and companies: On the one hand, an incorrectly identified duplicate potentially results in a falsely negative credit history for an individual, who will then no longer be granted credit. On the other hand, it is essential for companies that Schufa detect duplicates of a person who deliberately tries to create a new identity in the database in order to obtain a clean credit history.
Besides the quality of duplicate detection, i.e., its effectiveness, scalability cannot be neglected because of the considerable size of the database. We describe our approach to coping with both problems and present a comprehensive evaluation based on large volumes of real-world data.
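The scalability concern stems from the quadratic number of record pairs in an exhaustive comparison. A standard way to tame this is blocking: group records by a cheap key and compare only within each group. The paper's own partitioning strategy may differ, so the sketch below is a generic illustration with an invented blocking key (first three letters of the surname) and invented records.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record: dict) -> str:
    # Cheap, invented heuristic: first three letters of the surname.
    return record["name"].split()[-1][:3].lower()

def candidate_pairs(records: list) -> list:
    """Group records by blocking key; emit comparison pairs only within blocks."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    pairs = []
    for group in blocks.values():
        pairs.extend(combinations(group, 2))
    return pairs

people = [
    {"name": "Jonathan Smith"},
    {"name": "Jonathon Smith"},
    {"name": "Maria Lopez"},
]
print(len(candidate_pairs(people)))  # 1 candidate pair instead of 3 exhaustive ones
```

The trade-off is that a duplicate pair landing in different blocks is missed entirely, which is why effectiveness and scalability must be evaluated together, as the abstract argues.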