
KD2R: a Key Discovery method for semantic Reference Reconciliation

Authors: 
Symeonidou, Danai; Pernelle, Nathalie; Saïs, Fatiha
Year: 
2011
Venue: 
The 7th International IFIP Workshop on Semantic Web & Web Semantics (SWWS 2011)

The reference reconciliation problem consists of deciding whether different identifiers refer to the same real-world entity. Some existing reference reconciliation approaches use key constraints to infer reconciliation decisions. In the context of Linked Open Data, this knowledge is not available. We propose KD2R, a method for the automatic discovery of key constraints associated with OWL2 classes. These keys are discovered from RDF data that may be incomplete. The proposed algorithm performs this discovery without having to scan all of the data.
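
As a rough illustration of key discovery (not KD2R itself, which additionally handles incomplete RDF data and avoids full scans), a brute-force Python sketch over complete tabular data might look like this; the sample records and attribute names are made up:

    from itertools import combinations

    def is_key(records, attrs):
        """True if the attribute combination uniquely identifies every record."""
        seen = set()
        for r in records:
            value = tuple(r[a] for a in attrs)
            if value in seen:
                return False
            seen.add(value)
        return True

    def discover_minimal_keys(records, attributes):
        """Enumerate attribute combinations by size; keep keys with no key subset."""
        keys = []
        for size in range(1, len(attributes) + 1):
            for attrs in combinations(attributes, size):
                if any(set(k) <= set(attrs) for k in keys):
                    continue  # a subset is already a key, so attrs is not minimal
                if is_key(records, attrs):
                    keys.append(attrs)
        return keys

    people = [
        {"name": "Ann", "city": "Oslo",   "email": "ann@a.org"},
        {"name": "Ann", "city": "Bergen", "email": "ann@b.org"},
        {"name": "Bob", "city": "Oslo",   "email": "bob@a.org"},
    ]
    print(discover_minimal_keys(people, ["name", "city", "email"]))
    # [('email',), ('name', 'city')]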

On active learning of record matching packages

Authors: 
Arasu, A; Götz, M; Kaushik, R.
Year: 
2010
Venue: 
Proc. ACM SIGMOD Conf.

We consider the problem of learning a record matching package (classifier) in an active learning setting. In active learning, the learning algorithm picks the set of examples to be labeled, unlike the more traditional passive learning setting where a user selects the labeled examples. Active learning is important for record matching, since manually identifying a suitable set of labeled examples is difficult.
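
A minimal sketch of the active learning loop for pair classification, assuming a pool of per-pair similarity feature vectors and a human labeling oracle; plain uncertainty sampling stands in here for the paper's specific example-selection strategy:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def active_learn(pool, label, rounds=10):
        """pool: (n, d) array of similarity features for candidate record pairs.
        label: oracle mapping a row index to 0 (non-match) or 1 (match).
        Each round queries the pair the current model is least certain about."""
        labeled = [0, len(pool) - 1]              # seed with one example of each class
        targets = [label(0), label(len(pool) - 1)]
        clf = LogisticRegression()
        for _ in range(rounds):
            clf.fit(pool[labeled], targets)
            margin = np.abs(clf.predict_proba(pool)[:, 1] - 0.5)
            margin[labeled] = np.inf              # never re-query a labeled pair
            query = int(np.argmin(margin))        # most uncertain pair
            labeled.append(query)
            targets.append(label(query))
        clf.fit(pool[labeled], targets)
        return clf

    rng = np.random.default_rng(0)
    pool = np.vstack([rng.uniform(0.0, 0.5, (50, 2)),   # low-similarity pairs
                      rng.uniform(0.5, 1.0, (50, 2))])  # high-similarity pairs
    model = active_learn(pool, label=lambda i: int(pool[i].mean() > 0.5))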

Automatic Training Example Selection for Scalable Unsupervised Record Linkage

Authors: 
Christen, Peter
Year: 
2008
Venue: 
PAKDD

Linking records from two or more databases is becoming increasingly important in the data preparation step of many data mining projects, as linked data can enable analysts to conduct studies that are not feasible otherwise, or that would require expensive and time-consuming collection of specific data. The aim of such linkages is to match all records that refer to the same entity. One of the main challenges in record linkage is the accurate classification of record pairs into matches and non-matches. With traditional techniques, classification thresholds …

Febrl - A freely available record linkage system with a graphical user interface

Authors: 
Christen, Peter
Year: 
2008
Venue: 
Australasian Workshop Health Data and Knowledge Management

Record or data linkage is an important enabling technology in the health sector, as linked data is a cost-effective resource that can help to improve research into health policies, detect adverse drug reactions, reduce costs, and uncover fraud within the health system. Significant advances, mostly originating from data mining and machine learning, have been made in recent years in many areas of record linkage techniques. Most of these new methods are not yet implemented in current record linkage systems, or are hidden within ‘black box’ commercial software. …

Learning object identification rules for information integration

Authors: 
Tejada, S; Knoblock, CA; Minton, S
Year: 
2001
Venue: 
Information Systems

When integrating information from multiple websites, the same data objects can exist in inconsistent text formats across sites, making it difficult to identify matching objects using exact text match. We have developed an object identification system called Active Atlas, which compares the objects’ shared attributes in order to identify matching objects. Certain attributes are more important for deciding if a mapping should exist between two objects. Previous methods of object identification have required manual construction of object identification rules or mapping rules for …
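
A toy weighted-attribute matcher in the spirit of the above, assuming per-attribute string similarities and hand-set weights where Active Atlas learns both the rules and the weights; the restaurant fields and weight values are illustrative:

    import difflib

    def attr_sim(a, b):
        """Character-based similarity in [0, 1] via difflib's ratio."""
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def match_score(obj1, obj2, weights):
        """Weighted combination of per-attribute similarities."""
        total = sum(weights.values())
        return sum(w * attr_sim(obj1[a], obj2[a]) for a, w in weights.items()) / total

    # Name is treated as more decisive than cuisine for restaurant matching.
    weights = {"name": 0.7, "street": 0.2, "cuisine": 0.1}
    a = {"name": "Art's Deli", "street": "12224 Ventura Blvd.", "cuisine": "Delis"}
    b = {"name": "Art's Delicatessen", "street": "12224 Ventura Blvd", "cuisine": "American"}
    print(round(match_score(a, b, weights), 3))  # higher scores indicate a likely match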

Example-driven Design of Efficient Record Matching Queries

Authors: 
Chaudhuri, Surajit; Chen, Bee-Chung; Ganti, Venkatesh; Kaushik, Raghav
Year: 
2007
Venue: 
VLDB

Record matching is the task of identifying records that match the same real world entity. This is a problem of great significance for a variety of business intelligence applications. Implementations of record matching rely on exact as well as approximate string matching (e.g., edit distances) and use of external reference data sources. Record matching can be viewed as a query composed of a small set of primitive operators. However, formulating such record matching queries is difficult and depends on the specific application scenario.
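
The paper views a matching query as a composition of a few primitive operators. A minimal sketch with two such primitives (exact equality on one field, edit-distance similarity on another) composed under a threshold; the fields and the 0.8 cutoff are assumptions for illustration:

    def edit_distance(s, t):
        """Classic dynamic-programming Levenshtein distance."""
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            curr = [i]
            for j, ct in enumerate(t, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (cs != ct)))  # substitution
            prev = curr
        return prev[-1]

    def edit_sim(s, t):
        return 1.0 - edit_distance(s, t) / max(len(s), len(t), 1)

    def matches(r1, r2):
        """One possible matching query: exact zip equality AND fuzzy name match."""
        return r1["zip"] == r2["zip"] and edit_sim(r1["name"], r2["name"]) >= 0.8

    print(matches({"name": "Jon Smith", "zip": "98052"},
                  {"name": "John Smith", "zip": "98052"}))  # True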

Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping

Authors: 
Bilenko, Mikhail; Basu, Sugato; Sahami, Mehran
Year: 
2005
Venue: 
Fifth IEEE International Conference on Data Mining (ICDM'05)

The problem of record linkage focuses on determining whether two object descriptions refer to the same underlying entity. Addressing this problem effectively has many practical applications, e.g., elimination of duplicate records in databases and citation matching for scholarly articles. In this paper, we consider a new domain where the record linkage problem is manifested: Internet comparison shopping. We address the resulting linkage setting that requires learning a similarity function between record pairs from streaming data.
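
A sketch of learning a pair-similarity function online from streaming labeled pairs; a plain perceptron update stands in for the paper's more refined online learner, and the feature values are made up:

    import numpy as np

    class OnlinePairScorer:
        """Perceptron-style weights over per-field similarity features."""
        def __init__(self, n_features, lr=0.1):
            self.w = np.zeros(n_features)
            self.b = 0.0
            self.lr = lr

        def score(self, features):
            return float(self.w @ features + self.b)

        def update(self, features, is_match):
            """One streaming example: nudge weights when the sign is wrong."""
            y = 1.0 if is_match else -1.0
            if y * self.score(features) <= 0:
                self.w += self.lr * y * features
                self.b += self.lr * y

    scorer = OnlinePairScorer(n_features=3)
    stream = [((0.9, 0.8, 0.7), True), ((0.1, 0.3, 0.2), False)]
    for feats, label in stream:
        scorer.update(np.asarray(feats), label)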

Improving Data Quality: Consistency and Accuracy

Authors: 
Cong, Gao; Fan, Wenfei; Geerts, Floris; Jia, Xibei; Ma, Shuai
Year: 
2007
Venue: 
VLDB

Two central criteria for data quality are consistency and accuracy. Inconsistencies and errors in a database often emerge as violations of integrity constraints. Given a dirty database D, one needs automated methods to make it consistent, i.e., find a repair D′ that satisfies the constraints and “minimally” differs from D. Equally important is to ensure that the automatically generated repair D′ is accurate, or makes sense, i.e., D′ differs from the “correct” data within a predefined bound. This paper studies effective methods for improving both data consistency and accuracy.
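
A toy repair heuristic for a single functional dependency (zip determines city): within each group sharing the left-hand side, overwrite the right-hand side with the majority value, so only violating cells change. The paper's method handles conditional dependencies, accuracy bounds, and minimality far more carefully; this only conveys the flavor:

    from collections import Counter

    def repair_fd(rows, lhs, rhs):
        """Enforce the FD lhs -> rhs by majority vote inside each lhs group."""
        groups = {}
        for row in rows:
            groups.setdefault(row[lhs], []).append(row)
        for same_lhs in groups.values():
            majority, _ = Counter(r[rhs] for r in same_lhs).most_common(1)[0]
            for r in same_lhs:
                r[rhs] = majority
        return rows

    dirty = [{"zip": "10001", "city": "New York"},
             {"zip": "10001", "city": "New York"},
             {"zip": "10001", "city": "Newyork"}]   # violates zip -> city
    print(repair_fd(dirty, "zip", "city"))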

Conditional Functional Dependencies for Data Cleaning

Authors: 
Bohannon, Philip; Fan, Wenfei; Geerts, Floris; Jia, Xibei; Kementsietsidis, Anastasios
Year: 
2007
Venue: 
ICDE

We propose a class of constraints, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs) that were developed mainly for schema design, CFDs aim at capturing the consistency of data by incorporating bindings of semantically related values. For CFDs we provide an inference system analogous to Armstrong's axioms for FDs, as well as consistency analysis.
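
A minimal sketch of CFD violation detection: a CFD is an FD plus a pattern tableau that says when, and to which constants, the FD applies ("_" below is the wildcard). This covers only the single-tuple check against constant patterns; wildcard right-hand sides additionally require the usual pairwise FD check, which the sketch omits. The example data echoes the paper's UK-address setting but is otherwise made up:

    WILD = "_"

    def matches_pattern(value, pattern):
        return pattern == WILD or value == pattern

    def cfd_violations(rows, lhs_attrs, rhs_attr, tableau):
        """If a row matches a pattern's LHS constants, its RHS value
        must match the pattern's RHS (constant or wildcard)."""
        bad = []
        for row in rows:
            for pat in tableau:
                if (all(matches_pattern(row[a], pat[a]) for a in lhs_attrs)
                        and not matches_pattern(row[rhs_attr], pat[rhs_attr])):
                    bad.append(row)
        return bad

    # CFD: in the UK, this particular zip must map to Edinburgh.
    tableau = [{"country": "UK", "zip": "EH4 1DT", "city": "Edinburgh"}]
    rows = [{"country": "UK", "zip": "EH4 1DT", "city": "Edinburgh"},
            {"country": "UK", "zip": "EH4 1DT", "city": "London"}]     # violation
    print(cfd_violations(rows, ["country", "zip"], "city", tableau))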

Completeness of Information Sources

Authors: 
Naumann, Felix; Freytag, Johann-Christoph; Leser, Ulf
Year: 
2004
Venue: 
Information Systems 29(7):583-615

Information quality plays a crucial role in every application that integrates data from autonomous sources. However, information quality is hard to measure and complex to consider for the tasks of information integration, even if the integrating sources cooperate. We present a systematic and formal approach to the measurement of information quality and the combination of such measurements for information integration.
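
Under one simple reading of this line of work, a source's completeness combines coverage (how many real-world objects it represents) with density (how many attribute values it actually fills in), multiplied together. A sketch with made-up data, and with the paper's more nuanced per-attribute weighting left out:

    def coverage(source_ids, world_ids):
        """Fraction of the real-world objects that appear in the source."""
        return len(set(source_ids) & set(world_ids)) / len(world_ids)

    def density(records, attributes):
        """Average fraction of non-null attribute values per record."""
        filled = sum(r.get(a) is not None for r in records for a in attributes)
        return filled / (len(records) * len(attributes))

    world = ["p1", "p2", "p3", "p4"]
    records = [{"id": "p1", "phone": "555", "email": None},
               {"id": "p2", "phone": None,  "email": "x@y"}]
    cov = coverage([r["id"] for r in records], world)      # 0.5
    den = density(records, ["phone", "email"])             # 0.5
    print("completeness:", cov * den)                      # 0.25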

Relational clustering for multi-type entity resolution

Authors: 
Bhattacharya, Indrajit; Getoor, Lise
Year: 
2005
Venue: 
Conference on Knowledge Discovery in Data

In many applications, there are a variety of ways of referring to the same underlying entity. Given a collection of references to entities, we would like to determine the set of true underlying entities and map the references to these entities. The references may be to entities of different types and more than one type of entity may need to be resolved at the same time. We propose similarity measures for clustering references taking into account the different relations that are observed among the typed references.
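
A sketch of the kind of combined similarity used in relational entity resolution: attribute similarity blended with similarity of the references' current relational neighborhoods (here, Jaccard overlap of co-author cluster labels). The blending weight, the base similarity, and the field names are all illustrative:

    import difflib

    def attribute_sim(a, b):
        return difflib.SequenceMatcher(None, a, b).ratio()

    def relational_sim(neighbors_a, neighbors_b):
        """Jaccard similarity of the cluster labels of related references."""
        if not neighbors_a and not neighbors_b:
            return 0.0
        return len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)

    def combined_sim(ref_a, ref_b, alpha=0.5):
        return (alpha * attribute_sim(ref_a["name"], ref_b["name"])
                + (1 - alpha) * relational_sim(ref_a["coauthor_clusters"],
                                               ref_b["coauthor_clusters"]))

    a = {"name": "J. Smith", "coauthor_clusters": {"c1", "c2"}}
    b = {"name": "John Smith", "coauthor_clusters": {"c1", "c3"}}
    print(combined_sim(a, b))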

The field matching problem: Algorithms and applications

Authors: 
Monge, A.; Elkan, C.
Year: 
1996
Venue: 
Proceedings of the Second International Conference on Knowledge Discovery and Data Mining

To combine information from heterogeneous sources, equivalent data in the multiple sources must be identified. This task is the field matching problem. Specifically, the task is to determine whether or not two syntactic values are alternative designations of the same semantic entity. For example, the addresses Dept. of Comput. Sci. and Eng., University of California, San Diego, 9500 Gilman Dr. Dept. 0111, La Jolla, CA 92093 and UCSD, Computer Science and Engineering Department, CA 92093-0111 do designate the same department. This paper describes three field matching …
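
The address example above suggests token-level matching. A sketch in the spirit of the Monge-Elkan field matching measure: each token of one field is scored against its best-matching token in the other, and the best scores are averaged. The base token similarity (difflib's ratio) is a stand-in, not the paper's algorithm:

    import difflib

    def token_sim(a, b):
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def field_match(field_a, field_b):
        """Average best-match similarity of field_a's tokens within field_b."""
        tokens_a, tokens_b = field_a.split(), field_b.split()
        if not tokens_a or not tokens_b:
            return 0.0
        return sum(max(token_sim(ta, tb) for tb in tokens_b)
                   for ta in tokens_a) / len(tokens_a)

    print(field_match("Dept. of Comput. Sci. and Eng., University of California, San Diego",
                      "UCSD, Computer Science and Engineering Department"))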

Duplicate record elimination in large data files

Authors: 
Bitton, D.; DeWitt, D.J.
Year: 
1983
Venue: 
ACM Transactions on Database Systems (TODS), 8

The issue of duplicate elimination for large data files in which many occurrences of the same record may appear is addressed. A comprehensive cost analysis of the duplicate elimination operation is presented. This analysis is based on a combinatorial model developed for estimating the size of intermediate runs produced by a modified merge-sort procedure. The performance of this modified merge-sort procedure is demonstrated to be significantly superior to the standard duplicate elimination technique of sorting followed by a sequential pass to locate duplicate records.
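
A sketch of the core idea, under the assumption of simple in-memory runs: drop duplicates inside the merge step of merge-sort rather than sorting first and scanning afterwards, so intermediate runs shrink as the sort proceeds. The paper's cost analysis concerns external, disk-based runs, which this toy version does not model:

    def merge_unique(left, right):
        """Merge two sorted runs, discarding duplicates as they meet."""
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] < right[j]:
                item = left[i]; i += 1
            elif right[j] < left[i]:
                item = right[j]; j += 1
            else:                       # equal keys: keep one copy
                item = left[i]; i += 1; j += 1
            if not out or out[-1] != item:
                out.append(item)
        for rest in (left[i:], right[j:]):
            for item in rest:
                if not out or out[-1] != item:
                    out.append(item)
        return out

    def dedup_sort(records):
        """Bottom-up merge-sort whose merges eliminate duplicates early."""
        runs = [[r] for r in records]
        while len(runs) > 1:
            merged = []
            for k in range(0, len(runs), 2):
                if k + 1 < len(runs):
                    merged.append(merge_unique(runs[k], runs[k + 1]))
                else:
                    merged.append(runs[k])
            runs = merged
        return runs[0] if runs else []

    print(dedup_sort([3, 1, 3, 2, 1, 3]))  # [1, 2, 3]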

The merge/purge problem for large databases

Authors: 
Hernandez, M.A.; Stolfo, S.J.
Year: 
1995
Venue: 
Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data

Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from different databases by identifying distinct individuals that appear in a number of different databases, typically in an inconsistent and often incorrect fashion. The problem we study here is the task of merging data from multiple sources in as efficient a manner as possible, while maximizing the accuracy of the result. We call this the merge/purge problem.
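
This line of work is closely associated with the sorted-neighborhood method. A minimal sketch of that idea, with an illustrative key and window size: sort by a discriminating key, then compare only records that fall within a sliding window, trading some completeness for efficiency (real systems run several passes with different keys):

    def sorted_neighborhood(records, key, window=3):
        """Sort by a blocking key, then compare each record only with the
        window-1 records that precede it in sorted order."""
        ordered = sorted(records, key=key)
        candidate_pairs = []
        for i, rec in enumerate(ordered):
            for j in range(max(0, i - window + 1), i):
                candidate_pairs.append((ordered[j], rec))
        return candidate_pairs

    people = [{"last": "Smith", "first": "John"},
              {"last": "Smyth", "first": "Jon"},
              {"last": "Adams", "first": "Ann"}]
    pairs = sorted_neighborhood(people, key=lambda r: r["last"][:3] + r["first"][:1])
    print(len(pairs))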

Integration of heterogeneous databases without common domains using queries based on textual similarity

Authors: 
Cohen, WW
Year: 
1998
Venue: 
Proc. ACM SIGMOD

Most databases contain “name constants” like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both.
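
A sketch of the similarity-join idea this implies: join rows whose name constants are textually similar under TF-IDF cosine similarity instead of exactly equal. scikit-learn stands in for the paper's own retrieval machinery, and the strings and the 0.5 threshold are illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    left  = ["Univ. of California San Diego", "Stanford University"]
    right = ["University of California, San Diego", "MIT"]

    vec = TfidfVectorizer().fit(left + right)
    sims = cosine_similarity(vec.transform(left), vec.transform(right))

    # Report pairs of name constants that are textually similar.
    for i, row in enumerate(sims):
        for j, s in enumerate(row):
            if s > 0.5:
                print(left[i], "~", right[j], round(float(s), 2))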
