Framework

Automatic Training Example Selection for Scalable Unsupervised Record Linkage

Tue, 05/20/2008 - 09:56 — koepcke

Authors:

Christen, Peter

Year:

2008

Venue:

PAKDD

Linking records from two or more databases is becoming increasingly important in the data preparation step of many data mining projects, as linked data can enable analysts to conduct studies that are not feasible otherwise, or that would require expensive and time-consuming collection of specific data. The aim of such linkages is to match all records that refer to the same entity. One of the main challenges in record linkage is the accurate classification of record pairs into matches and non-matches. With traditional techniques, classification thresholds

Febrl - A freely available record linkage system with a graphical user interface

Tue, 05/20/2008 - 09:52 — koepcke

Authors:

Christen, Peter

Year:

2008

Venue:

Australasian Workshop on Health Data and Knowledge Management

Record or data linkage is an important enabling technology in the health sector, as linked data is a cost-effective resource that can help to improve research into health policies, detect adverse drug reactions, reduce costs, and uncover fraud within the health system. Significant advances, mostly originating from data mining and machine learning, have been made in recent years in many areas of record linkage techniques. Most of these new methods are not yet implemented in current record linkage systems, or are hidden within ‘black box’ commercial software. This

Learning object identification rules for information integration

Tue, 05/20/2008 - 09:11 — koepcke

Authors:

Tejada, S; Knoblock, CA; Minton, S

Year:

2001

Venue:

Information Systems

When integrating information from multiple websites, the same data objects can exist in inconsistent text formats
across sites, making it difficult to identify matching objects using exact text match. We have developed an object
identification system called Active Atlas, which compares the objects’ shared attributes in order to identify matching
objects. Certain attributes are more important for deciding if a mapping should exist between two objects. Previous
methods of object identification have required manual construction of object identification rules or mapping rules for

Example-driven Design of Efficient Record Matching Queries

Wed, 03/19/2008 - 15:10 — koepcke

Authors:

Chaudhuri, Surajit; Chen, Bee-Chung; Ganti, Venkatesh; Kaushik, Raghav

Year:

2007

Venue:

VLDB

Record matching is the task of identifying records that match the same real world entity. This is a problem of great significance for a variety of business intelligence applications. Implementations of record matching rely on exact as well as approximate string matching (e.g., edit distances) and use of external reference data sources. Record matching can be viewed as a query composed of a small set of primitive operators. However, formulating such record matching queries is difficult and depends on the specific application scenario.
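
The approximate string matching mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the operator set or query design from the paper: the `similarity` normalisation, the `is_match` rule, the field names `name` and `zip`, and the 0.8 threshold are all assumptions chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def is_match(r1: dict, r2: dict, threshold: float = 0.8) -> bool:
    """Hypothetical matching rule: names approximately equal, zip codes
    exactly equal. Field names and threshold are illustrative only."""
    names_close = similarity(r1["name"].lower(), r2["name"].lower()) >= threshold
    return names_close and r1["zip"] == r2["zip"]
```

A rule like `is_match` combines one approximate and one exact comparison; choosing which comparisons to combine, and with which thresholds, is the part that depends on the application scenario.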

Adaptive Blocking: Learning to Scale Up Record Linkage

Thu, 02/28/2008 - 03:46 — mbilenko

Authors:

Bilenko, Mikhail; Kamath, Beena; Mooney, Raymond J.

Year:

2006

Venue:

ICDM

Many data mining tasks require computing similarity between
pairs of objects. Pairwise similarity computations are
particularly important in record linkage systems, as well as
in clustering and schema mapping algorithms. Because the
number of object pairs grows quadratically with the size of
the dataset, computing similarity between all pairs is impractical
and becomes prohibitive for large datasets and
complex similarity functions. Blocking methods alleviate
this problem by efficiently selecting approximately similar
object pairs for subsequent distance computations, leaving
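
The general idea of blocking described in this abstract can be sketched in Python as follows. This sketch uses a fixed, hand-written blocking key rather than the learned blocking functions the paper is about; the `surname` field and the three-letter prefix key are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    # Hypothetical key: first three letters of the surname, lowercased.
    return record["surname"][:3].lower()


def candidate_pairs(records, key=blocking_key):
    """Group records by blocking key and return only the index pairs
    that share a block, instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs
```

Only the pairs returned by `candidate_pairs` are passed on to the more expensive similarity function. The trade-off is that true matches whose keys differ are never compared.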

Improving Data Quality: Consistency and Accuracy

Tue, 01/15/2008 - 16:59 — fgeerts

Authors:

Cong, Gao; Fan, Wenfei; Geerts, Floris; Jia, Xibei; Ma, Shuai

Year:

2007

Venue:

VLDB

Two central criteria for data quality are consistency and accuracy. Inconsistencies and errors in a database often emerge as violations of integrity constraints. Given a dirty database D, one needs automated methods to make it consistent, i.e., find a repair D′ that satisfies the constraints and “minimally” differs from D. Equally important is to ensure that the automatically-generated repair D′ is accurate, or makes sense, i.e., D′ differs from the “correct” data within a predefined bound. This paper studies effective methods for improving both data consistency and accuracy.

Conditional Functional Dependencies for Data Cleaning

Tue, 01/15/2008 - 16:51 — fgeerts

Authors:

Bohannon, Philip; Fan, Wenfei; Geerts, Floris; Jia, Xibei; Kementsietsidis, Anastasios

Year:

2007

Venue:

ICDE, 2007

We propose a class of constraints, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by incorporating bindings of semantically related values. For CFDs we provide an inference system analogous to Armstrong's axioms for FDs, as well as consistency analysis.
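
To make the idea of a dependency that holds only under a binding of values concrete, here is a minimal Python sketch of a violation check. It is an illustration under stated assumptions, not the detection technique of the paper: the function `cfd_violations`, the column names, and the sample rule (for rows with country code 44, the zip code determines the street) are hypothetical, and only constant conditions on the left-hand side are handled.

```python
from collections import defaultdict


def cfd_violations(rows, condition, lhs, rhs):
    """Return {lhs_values: [row indices]} for groups of rows that satisfy
    the constant condition, agree on the lhs attributes, and still
    disagree on the rhs attribute."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        # The dependency applies only to rows matching the condition.
        if all(row[attr] == value for attr, value in condition.items()):
            groups[tuple(row[attr] for attr in lhs)].append(i)
    return {
        key: members
        for key, members in groups.items()
        if len({rows[i][rhs] for i in members}) > 1
    }


rows = [
    {"cc": "44", "zip": "EH4 1DT", "street": "Mayfield"},
    {"cc": "44", "zip": "EH4 1DT", "street": "Crichton"},
    {"cc": "01", "zip": "07974", "street": "Tree Ave"},
    {"cc": "01", "zip": "07974", "street": "Elm Str"},
]
violations = cfd_violations(rows, {"cc": "44"}, ["cc", "zip"], "street")
```

Here the first two rows violate the rule, while the last two do not, because the condition on `cc` excludes them. A plain FD from zip to street would have flagged both groups.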

Web Service Composition and Record Linking

Wed, 04/18/2007 - 08:43 — thor

Authors:

Cameron, M.A.; Taylor, K.L.; Baxter, R.

Year:

2004

Venue:

Proceedings of the Workshop on Information Integration on the Web (IIWeb-2004), Toronto, Canada, 2004

We describe a prototype composition and
runtime environment which together generate
and execute service compositions from service
descriptions and user requirements. We describe
our designs for record linkage services
which have been drawn from existing freely
available software packages. We compare the
performance of a service composition generated
from a user query against a process abstraction
and services for record linking with
that of a standalone record linking application.

Completeness of Information Sources

Mon, 04/09/2007 - 12:55 — fnaumann

Authors:

Naumann, Felix; Freytag, Johann-Christoph; Leser, Ulf

Year:

2004

Venue:

Information Systems 29(7):583-615

Information quality plays a crucial role in every application that integrates data from autonomous sources. However, information quality is hard to measure and complex to consider for the tasks of information integration, even if the integrating sources cooperate. We present a systematic and formal approach to the measurement of information quality and the combination of such measurements for information integration.

Febrl - Freely extensible biomedical record linkage

Mon, 10/16/2006 - 14:23 — massmann

Authors:

Christen, Peter; Churches, Tim

Year:

2002

Venue:

ANU Computer Science Technical Reports

This manual describes prototype software called Febrl designed to undertake probabilistic data cleaning (or standardisation) and record linkage. Written in the Python programming language, this software aims to allow health, biomedical and other researchers to clean (standardise) and link data sets of all sizes faster, with less effort and with improved quality.

Declarative data cleaning: Language, model, and algorithms

Wed, 09/13/2006 - 15:44 — cat

Authors:

Galhardas, H; Florescu, D; Shasha, D; Simon, E; Saita, C

Year:

2001

Venue:

Proc. VLDB 2001

The problem of data cleaning, which consists of removing
inconsistencies and errors from original data sets, is well known in
the area of decision support systems and data warehouses. This holds
regardless of the application - relational database joining,
web-related, or scientific. In all cases, existing ETL (Extraction
Transformation Loading) and data cleaning tools for writing data
cleaning programs are insufficient. The main challenge is the design
and implementation of a dataflow graph that effectively and
efficiently generates clean data. Needed improvements to the current

AJAX: an extensible data cleaning tool

Tue, 09/12/2006 - 15:47 — Anonymous

Authors:

Galhardas, H; Florescu, D; Shasha, D; Simon, E

Year:

2000

Venue:

ACM SIGMOD Record

... groups together matching pairs with a high similarity value by applying a given grouping criteria (e.g. by transitive closure). Finally, merging collapses each individual cluster into a tuple of the resulting data source. AJAX provides [...] for specifying data cleaning programs, which consists of SQL statements enriched with a set of specific primitives to express these transformations. AJAX also [...]. It allows the user to interact with an executing data cleaning program to handle exceptional cases and to inspect intermediate results.
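
The clustering-by-transitive-closure and merging steps mentioned in this abstract can be sketched in Python with a union-find structure. This is an illustrative sketch, not AJAX itself, which expresses these steps through SQL statements with additional primitives. The function names and the merge rule (keep the longest value per field) are assumptions.

```python
def cluster_by_transitive_closure(n, matching_pairs):
    """Cluster record indices 0..n-1 so that any two records connected
    by a chain of matching pairs end up in the same cluster."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matching_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def merge_cluster(records, cluster):
    """Collapse one cluster into a single tuple. Hypothetical rule:
    for each field, keep the longest value seen in the cluster."""
    fields = records[cluster[0]].keys()
    return {f: max((records[i][f] for i in cluster), key=len) for f in fields}
```

With matching pairs `(0, 1)` and `(1, 2)`, records 0 and 2 land in the same cluster even though they were never matched directly. This is the defining property of transitive closure, and also the reason it can over-merge when similarity is not transitive.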
