distributed (n>1)

Automatic Training Example Selection for Scalable Unsupervised Record Linkage

Authors: 
Christen, Peter
Year: 
2008
Venue: 
PAKDD

Linking records from two or more databases is becoming increasingly important in the data preparation step of many data mining projects, as linked data can enable analysts to conduct studies that are not feasible otherwise, or that would require expensive and time-consuming collection of specific data. The aim of such linkages is to match all records that refer to the same entity. One of the main challenges in record linkage is the accurate classification of record pairs into matches and non-matches. With traditional techniques, classification thresholds

Febrl - A freely available record linkage system with a graphical user interface

Authors: 
Christen, Peter
Year: 
2008
Venue: 
Australasian Workshop Health Data and Knowledge Management

Record or data linkage is an important enabling technology in the health sector, as linked data is a cost-effective resource that can help to improve research into health policies, detect adverse drug reactions, reduce costs, and uncover fraud within the health system. Significant advances, mostly originating from data mining and machine learning, have been made in recent years in many areas of record linkage techniques. Most of these new methods are not yet implemented in current record linkage systems, or are hidden within ‘black box’ commercial software. This

Learning object identification rules for information integration

Authors: 
Tejada, S; Knoblock, CA; Minton, S
Year: 
2001
Venue: 
Information Systems

When integrating information from multiple websites, the same data objects can exist in inconsistent text formats
across sites, making it difficult to identify matching objects using exact text match. We have developed an object
identification system called Active Atlas, which compares the objects’ shared attributes in order to identify matching
objects. Certain attributes are more important for deciding if a mapping should exist between two objects. Previous
methods of object identification have required manual construction of object identification rules or mapping rules for

Example-driven Design of Efficient Record Matching Queries

Authors: 
Chaudhuri, Surajit; Chen, Bee-Chung; Ganti, Venkatesh; Kaushik, Raghav
Year: 
2007
Venue: 
VLDB

Record matching is the task of identifying records that match the same real world entity. This is a problem of great significance for a variety of business intelligence applications. Implementations of record matching rely on exact as well as approximate string matching (e.g., edit distances) and use of external reference data sources. Record matching can be viewed as a query composed of a small set of primitive operators. However, formulating such record matching queries is difficult and depends on the specific application scenario.

Adaptive Blocking: Learning to Scale Up Record Linkage

Authors: 
Bilenko, Mikhail; Kamath, Beena; Mooney, Raymond J.
Year: 
2006
Venue: 
ICDM

Many data mining tasks require computing similarity between
pairs of objects. Pairwise similarity computations are
particularly important in record linkage systems, as well as
in clustering and schema mapping algorithms. Because the
number of object pairs grows quadratically with the size of
the dataset, computing similarity between all pairs is impractical
and becomes prohibitive for large datasets and
complex similarity functions. Blocking methods alleviate
this problem by efficiently selecting approximately similar
object pairs for subsequent distance computations, leaving

A method for similarity-based grouping of biological data

Authors: 
Jakoniene, V; Rundqvist, D; Lambrix, P
Year: 
2006
Venue: 
Proc. DILS06, LNCS 4075

Similarity-based grouping of data entries in one or more data sources is a task underlying many different data management tasks, such as structuring search results, removal of redundancy in databases and data integration. Similarity-based grouping of data entries is not a trivial task in the context of life science data sources as the stored data is complex, highly correlated and represented at different levels of granularity. The contribution of this paper is two-fold. 1) We propose a method for similarity-based grouping and 2) we show results from test cases.
