Same, Same but Different: A Survey on Duplicate Detection Methods for Situation Awareness

Baumgartner, N.; Gottesheim, W.; Mitsch, S.; Retschitzegger, W.; Schwinger, W.
Proc. OTM 2009, LNCS 5871

Systems supporting situation awareness typically deal with a vast stream of information about a large number of real-world objects anchored in time and space, provided by multiple sources. These sources are often characterized by frequent updates, heterogeneous formats and, most crucially, identical, incomplete and often even contradictory information. Duplicate detection methods are therefore of paramount importance, since they make it possible to determine whether information with, e.g., different origins or different observation times concerns one and the same real-world object.

Tagging of name records for genealogical data browsing

Perrow, Mike; Barber, David
Proc. 6th ACM/IEEE-CS Joint Conference on Digital Libraries

In this paper we present a method of parsing unstructured textual records briefly describing a person and their direct relatives, which we use in the construction of a browsing tool for genealogical data. The records have been created by researchers who are currently digitising a collection of historical archives stored at the Abbaye de Saint-Maurice, Switzerland. The string 'Beatrix, daughter of Johannes Trona, of Saillon' is a typical example of a record. We wish to annotate every term (word and symbol) in our records with a label which describes whether the term is a name (e.g.

Automatic Training Example Selection for Scalable Unsupervised Record Linkage

Christen, Peter

Linking records from two or more databases is becoming increasingly important in the data preparation step of many data mining projects, as linked data can enable analysts to conduct studies that are not feasible otherwise, or that would require expensive and time-consuming collection of specific data. The aim of such linkages is to match all records that refer to the same entity. One of the main challenges in record linkage is the accurate classification of record pairs into matches and non-matches. With traditional techniques, classification thresholds

Learning Blocking Schemes for Record Linkage

Michelson, Matthew; Knoblock, Craig A.

Record linkage is the process of matching records across data sets that refer to the same entity. One issue within record linkage is determining which record pairs to consider, since a detailed comparison between all of the records is impractical. Blocking addresses this issue by generating candidate matches as a preprocessing step for record linkage. For example, in a person matching problem, blocking might return all people with the same last name as candidate matches. Two main problems in blocking are the selection of attributes for
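
As a rough illustration of the blocking idea described in this abstract (grouping people by last name so that only records in the same group are compared), a minimal Python sketch might look like the following. The function name `block_by_key`, the record layout and the sample names are hypothetical. The fixed, hand-picked key stands in for the learned blocking schemes the paper is about.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(records, key):
    """Return candidate pairs of record ids that share the same blocking key.

    `records` maps a record id to a dict of attributes (hypothetical layout).
    The key value is lower-cased and stripped so trivial variants still collide.
    """
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[rec[key].strip().lower()].append(rec_id)

    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are ever compared in detail.
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Illustrative data only, not taken from the paper.
records = {
    1: {"first": "Ann", "last": "Smith"},
    2: {"first": "Anne", "last": "smith"},
    3: {"first": "Bob", "last": "Jones"},
    4: {"first": "Robert", "last": "Jones"},
    5: {"first": "Carol", "last": "Lee"},
}

candidates = block_by_key(records, "last")
all_pairs = len(records) * (len(records) - 1) // 2
print(sorted(candidates), "out of", all_pairs, "possible pairs")
```

With five records there are ten possible pairs, but blocking on last name leaves only two candidates for detailed comparison. The cost is that a true match whose last names differ (for example through a typo) is never proposed, which is why the choice of blocking attributes matters.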

Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping

Bilenko, Mikhail; Basu, Sugato; Sahami, Mehran
Fifth IEEE International Conference on Data Mining (ICDM'05)

The problem of record linkage focuses on determining whether two object descriptions refer to the same underlying entity. Addressing this problem effectively has many practical applications, e.g., elimination of duplicate records in databases and citation matching for scholarly articles. In this paper, we consider a new domain where the record linkage problem is manifested: Internet comparison shopping. We address the resulting linkage setting that requires learning a similarity function between record pairs from streaming data.
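
To make the idea of learning a pairwise similarity function from streaming data more concrete, here is a minimal sketch that updates a linear scorer one labelled pair at a time. It uses a plain mistake-driven perceptron chosen for illustration; the abstract above does not specify the learner. The `features` function, the class name `OnlinePairClassifier` and the product strings are all hypothetical.

```python
def features(a, b):
    """Hypothetical pairwise features: token Jaccard overlap plus a bias term."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    jaccard = len(ta & tb) / len(union) if union else 0.0
    return [jaccard, 1.0]


class OnlinePairClassifier:
    """Linear similarity function trained one example at a time."""

    def __init__(self, n_features, lr=1.0):
        self.w = [0.0] * n_features
        self.lr = lr

    def score(self, x):
        # Positive score is read as "same entity", otherwise "different".
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, label):
        """Perceptron step: label is +1 (match) or -1 (non-match).

        Weights change only when the current prediction is wrong or undecided.
        """
        if label * self.score(x) <= 0:
            self.w = [wi + self.lr * label * xi for wi, xi in zip(self.w, x)]


# Illustrative stream of labelled offer pairs, processed in a single pass.
stream = [
    ("canon powershot a520", "canon powershot a520 digital camera", 1),
    ("canon powershot a520", "nikon coolpix 4600", -1),
    ("nikon coolpix 4600", "nikon coolpix 4600 4mp", 1),
]

model = OnlinePairClassifier(n_features=2)
for left, right, label in stream:
    model.update(features(left, right), label)

match_score = model.score(features("canon powershot a520", "canon powershot a520 digital camera"))
nonmatch_score = model.score(features("canon powershot a520", "nikon coolpix 4600"))
print(match_score > 0, nonmatch_score > 0)
```

Because each pair is seen once and then discarded, memory use stays constant as the stream grows, which is the property that makes this style of update suited to a streaming setting.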

