Framework

Automatic Training Example Selection for Scalable Unsupervised Record Linkage

Authors: 
Christen, Peter
Year: 
2008
Venue: 
PAKDD

Linking records from two or more databases is becoming
increasingly important in the data preparation step of many data mining
projects, as linked data can enable analysts to conduct studies that
are not feasible otherwise, or that would require expensive and time-consuming
collection of specific data. The aim of such linkages is to match
all records that refer to the same entity. One of the main challenges in
record linkage is the accurate classification of record pairs into matches
and non-matches. With traditional techniques, classification thresholds

Febrl - A freely available record linkage system with a graphical user interface

Authors: 
Christen, Peter
Year: 
2008
Venue: 
Australasian Workshop Health Data and Knowledge Management

Record or data linkage is an important enabling technology
in the health sector, as linked data is a cost-effective
resource that can help to improve research
into health policies, detect adverse drug reactions, reduce
costs, and uncover fraud within the health system.
Significant advances, mostly originating from
data mining and machine learning, have been made
in recent years in many areas of record linkage techniques.
Most of these new methods are not yet implemented
in current record linkage systems, or are
hidden within ‘black box’ commercial software. This

Learning object identification rules for information integration

Authors: 
Tejada, S; Knoblock, CA; Minton, S
Year: 
2001
Venue: 
Information Systems

When integrating information from multiple websites, the same data objects can exist in inconsistent text formats
across sites, making it difficult to identify matching objects using exact text match. We have developed an object
identification system called Active Atlas, which compares the objects’ shared attributes in order to identify matching
objects. Certain attributes are more important for deciding if a mapping should exist between two objects. Previous
methods of object identification have required manual construction of object identification rules or mapping rules for

Example-driven Design of Efficient Record Matching Queries

Authors: 
Chaudhuri, Surajit; Chen, Bee-Chung; Ganti, Venkatesh; Kaushik, Raghav
Year: 
2007
Venue: 
VLDB

Record matching is the task of identifying records that match the same real world entity. This is a problem of great significance for a variety of business intelligence applications. Implementations of record matching rely on exact as well as approximate string matching (e.g., edit distances) and use of external reference data sources. Record matching can be viewed as a query composed of a small set of primitive operators. However, formulating such record matching queries is difficult and depends on the specific application scenario.
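
As an illustration of the approximate string matching mentioned in this abstract, the following is a minimal sketch of edit-distance matching. It is not the operator set or query design from the paper; the function names and the `max_distance` threshold are assumptions chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def is_match(a: str, b: str, max_distance: int = 2) -> bool:
    # Hypothetical matching rule: values match when, ignoring case,
    # their edit distance does not exceed max_distance.
    return edit_distance(a.lower(), b.lower()) <= max_distance
```

For instance, `is_match("Jon Smith", "John Smith")` holds under this rule because the two strings differ by a single inserted character. A suitable threshold depends on the application scenario, which is the difficulty the abstract points to.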

Adaptive Blocking: Learning to Scale Up Record Linkage

Authors: 
Bilenko, Mikhail; Kamath, Beena; Mooney, Raymond J.
Year: 
2006
Venue: 
ICDM

Many data mining tasks require computing similarity between
pairs of objects. Pairwise similarity computations are
particularly important in record linkage systems, as well as
in clustering and schema mapping algorithms. Because the
number of object pairs grows quadratically with the size of
the dataset, computing similarity between all pairs is impractical
and becomes prohibitive for large datasets and
complex similarity functions. Blocking methods alleviate
this problem by efficiently selecting approximately similar
object pairs for subsequent distance computations, leaving

Improving Data Quality: Consistency and Accuracy

Authors: 
Cong, Gao; Fan, Wenfei; Geerts, Floris; Jia, Xibei; Ma, Shuai
Year: 
2007
Venue: 
VLDB

Two central criteria for data quality are consistency and accuracy. Inconsistencies and errors in a database often emerge as violations of integrity constraints. Given a dirty database D, one needs automated methods to make it consistent, i.e., find a repair D′ that satisfies the constraints and “minimally” differs from D. Equally important is to ensure that the automatically-generated repair D′ is accurate, or makes sense, i.e., D′ differs from the “correct” data within a predefined bound. This paper studies effective methods for improving both data consistency and accuracy.

Conditional Functional Dependencies for Data Cleaning

Authors: 
Bohannon, Philip; Fan, Wenfei; Geerts, Floris; Jia, Xibei; Kementsietsidis, Anastasios
Year: 
2007
Venue: 
ICDE

We propose a class of constraints, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by incorporating bindings of semantically related values. For CFDs we provide an inference system analogous to Armstrong's axioms for FDs, as well as consistency analysis.
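
To make the idea of a conditional dependency concrete, here is a small sketch that checks one simplified kind of CFD: a functional dependency that is required to hold only on rows matching a constant pattern. The relation, attribute names, and the `cfd_violations` helper are assumptions made for this example, and the sketch does not cover constants on the right-hand side or the paper's inference system.

```python
from collections import defaultdict


def cfd_violations(rows, condition, lhs, rhs):
    """Find lhs values that map to more than one rhs value
    among the rows that satisfy the constant `condition`."""
    seen = defaultdict(set)
    for row in rows:
        # The dependency only applies to rows matching the condition.
        if all(row[attr] == value for attr, value in condition.items()):
            key = tuple(row[attr] for attr in lhs)
            seen[key].add(tuple(row[attr] for attr in rhs))
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical data: zip is assumed to determine street only when country is "UK".
rows = [
    {"country": "UK", "zip": "EH4 1DT", "street": "Mayfield"},
    {"country": "UK", "zip": "EH4 1DT", "street": "Crichton"},
    {"country": "US", "zip": "07974", "street": "Mtn Ave"},
    {"country": "US", "zip": "07974", "street": "Main St"},
]
violations = cfd_violations(rows, {"country": "UK"}, ["zip"], ["street"])
```

Here the two UK rows are reported because they share a zip but disagree on street, while the two US rows are not, since the condition does not bind them. A plain FD `zip -> street` would have flagged both groups.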

Web Service Composition and Record Linking

Authors: 
Cameron, M.A.; Taylor, K.L.; Baxter, R.
Year: 
2004
Venue: 
Proceedings of the Workshop on Information Integration on the Web (IIWeb-2004), Toronto, Canada, 2004

We describe a prototype composition and
runtime environment which together generate
and execute service compositions from service
descriptions and user requirements. We describe
our designs for record linkage services
which have been drawn from existing freely
available software packages. We compare the
performance of a service composition generated
from a user query against a process abstraction
and services for record linking with
that of a standalone record linking application.

Completeness of Information Sources

Authors: 
Naumann, Felix; Freytag, Johann-Christoph; Leser, Ulf
Year: 
2004
Venue: 
Information Systems 29(7):583-615

Information quality plays a crucial role in every application that integrates data from autonomous sources. However, information quality is hard to measure and complex to consider for the tasks of information integration, even if the integrating sources cooperate. We present a systematic and formal approach to the measurement of information quality and the combination of such measurements for information integration.

Febrl - Freely extensible biomedical record linkage

Authors: 
Christen, Peter; Churches, Tim
Year: 
2002
Venue: 
ANU Computer Science Technical Reports

This manual describes prototype software called Febrl designed to undertake probabilistic data cleaning (or standardisation) and record linkage. Written in the Python programming language, this software aims to allow health, biomedical and other researchers to clean (standardise) and link data sets of all sizes faster, with less effort and with improved quality.

Declarative data cleaning: Language, model, and algorithms

Authors: 
Galhardas, H; Florescu, D; Shasha, D; Simon, E; Saita, C.
Year: 
2001
Venue: 
Proc. VLDB 2001

The problem of data cleaning, which consists of removing
inconsistencies and errors from original data sets, is well known in
the area of decision support systems and data warehouses. This holds
regardless of the application - relational database joining,
web-related, or scientific. In all cases, existing ETL (Extraction
Transformation Loading) and data cleaning tools for writing data
cleaning programs are insufficient. The main challenge is the design
and implementation of a dataflow graph that effectively and
efficiently generates clean data. Needed improvements to the current

AJAX: an extensible data cleaning tool

Authors: 
Galhardas, H; Florescu, D; Shasha, D; Simon, E
Year: 
2000
Venue: 
ACM SIGMOD Record

... groups together matching pairs with a high similarity value by applying a given grouping criteria (e.g. by transitive closure). Finally, merging collapses each individual cluster into a tuple of the resulting data source. AJAX provides @@@@ for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express these transformations. AJAX also @@@@. It allows the user to interact with an executing data cleaning program to handle exceptional cases and to inspect intermediate results.
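
The grouping by transitive closure mentioned in this excerpt can be sketched with a union-find structure. This is a generic illustration and not AJAX's own implementation or syntax; the function name and the record identifiers are assumptions for the example.

```python
def cluster_by_transitive_closure(pairs):
    """Group items so that any two items linked by a chain of
    matching pairs end up in the same cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for item in list(parent):
        clusters.setdefault(find(item), set()).add(item)
    return list(clusters.values())


# Hypothetical matching pairs produced by an earlier matching step.
clusters = cluster_by_transitive_closure([("r1", "r2"), ("r2", "r3"), ("r4", "r5")])
```

With these pairs, `r1` and `r3` land in the same cluster even though they were never compared directly, which is the effect of taking the transitive closure. A merging step would then collapse each cluster into a single tuple.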
