Duplicate/matching

On active learning of record matching packages

Authors: 
Arasu, A; Götz, M; Kaushik, R.
Year: 
2010
Venue: 
Proc. ACM SIGMOD Conf.

We consider the problem of learning a record matching package (classifier) in an active learning setting. In active learning, the learning algorithm picks the set of examples to be labeled, unlike the more traditional passive learning setting, where the user selects the examples to label. Active learning is important for record matching since manually identifying a suitable set of labeled examples is difficult.
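
To make the setting concrete, here is a minimal uncertainty-sampling sketch for learning a pair classifier, in which the learner queries labels for the pairs it is least sure about. This is an illustrative sketch, not Arasu et al.'s algorithm; the feature matrix `pairs_X`, the `oracle` labeling function, and the logistic model are all assumptions.

```python
# Uncertainty-sampling active learner for a record-pair classifier.
# Illustrative sketch only -- not the algorithm of Arasu et al. (2010).
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learn(pairs_X, oracle, n_seed=10, n_rounds=20):
    """pairs_X: (n_pairs, n_features) similarity features per candidate pair;
    oracle(i) -> 0/1 match label from a human reviewer (assumed here)."""
    rng = np.random.default_rng(0)
    seed = rng.choice(len(pairs_X), n_seed, replace=False)
    y = {int(i): oracle(int(i)) for i in seed}
    clf = LogisticRegression()            # assumes the seed hits both classes
    for _ in range(n_rounds):
        clf.fit(pairs_X[list(y)], [y[i] for i in y])
        proba = clf.predict_proba(pairs_X)[:, 1]
        unlabeled = [i for i in range(len(pairs_X)) if i not in y]
        if not unlabeled:
            break
        # Query the pair whose predicted match probability is closest to 0.5.
        i = min(unlabeled, key=lambda j: abs(proba[j] - 0.5))
        y[i] = oracle(i)
    return clf
```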

Record Matching over Query Results from Multiple Web Databases

Authors: 
Su, W; Wang, J; Lochovsky, F.H.
Year: 
2010
Venue: 
IEEE Transactions on Knowledge and Data Engineering

Record matching, which identifies the records that represent the same real-world entity, is an important step for data integration. Most state-of-the-art record matching methods are supervised, which requires the user to provide training data. These methods are not applicable for the Web database scenario, where the records to match are query results dynamically generated on-the-fly. Such records are query-dependent and a prelearned method using training examples from previous query results may fail on …

Some methods for blindfolded record linkage

Authors: 
Churches, T; Christen, P
Year: 
2004
Venue: 
BMC Medical Informatics and Decision Making

The linkage of records which refer to the same entity in separate data collections is a common requirement in public health and biomedical research. Traditionally, record linkage techniques have required that all the identifying data in which links are sought be revealed to at least one party, often a third party. This necessarily invades personal privacy and requires complete trust in the intentions of that party and their ability to maintain security and confidentiality.
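
A common building block for such "blindfolded" linkage is to share only keyed hashes of identifying fields, so equality can be tested without revealing values. A minimal sketch under that assumption (HMAC-SHA256 over crudely normalized names; real protocols, including those in this paper, also address approximate matching):

```python
# Equality matching on keyed hashes: each party applies HMAC with a
# shared secret key to normalized identifiers before sending them out.
# Illustrative sketch; the key-exchange step is assumed to have happened.
import hmac, hashlib

KEY = b"shared-secret-agreed-by-data-holders"  # assumption

def blind(value: str) -> str:
    norm = " ".join(value.lower().split())     # crude normalization
    return hmac.new(KEY, norm.encode(), hashlib.sha256).hexdigest()

party_a = {blind("John  Smith"): "rec-17"}
party_b = {blind("john smith"): "rec-93"}
# The linkage party sees only digests, yet finds the common entity:
links = [(party_a[h], party_b[h]) for h in party_a.keys() & party_b.keys()]
print(links)  # [('rec-17', 'rec-93')]
```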

HARRA: fast iterative hashed record linkage for large-scale data collections

Authors: 
Kim, H; Lee, D
Year: 
2010
Venue: 
Proc. 13th Int. Conf. EDBT

We study the performance issue of the "iterative" record linkage (RL) problem, where match and merge operations may occur together in iterations until convergence emerges. We first propose Iterative Locality-Sensitive Hashing (I-LSH), which dynamically merges LSH-based hash tables for quick and accurate blocking. Then, by exploiting inherent characteristics within/across data sets, we develop a suite of I-LSH-based RL algorithms, named HARRA. The superiority of HARRA in speed over competing RL solutions is thoroughly validated using various real data sets.
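
The LSH-based blocking that HARRA builds on can be illustrated with a toy MinHash scheme: records whose banded signatures collide become candidate pairs, avoiding all-pairs comparison. The band and hash parameters below are arbitrary assumptions, and this is not HARRA's iterative I-LSH merging itself.

```python
# Toy MinHash-LSH blocking: records colliding on any signature band
# become candidate pairs. Not HARRA's I-LSH algorithm.
import hashlib
from collections import defaultdict

def minhash_sig(tokens, n_hashes=8):
    return tuple(min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16)
                     for t in tokens) for i in range(n_hashes))

def lsh_candidate_pairs(records, band_size=2):
    buckets = defaultdict(list)
    for rid, text in records.items():
        sig = minhash_sig(text.lower().split())
        for b in range(0, len(sig), band_size):
            buckets[(b, sig[b:b + band_size])].append(rid)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((min(ids[i], ids[j]), max(ids[i], ids[j])))
    return pairs

recs = {1: "harra fast iterative record linkage",
        2: "fast iterative hashed record linkage",
        3: "latent dirichlet allocation topic model"}
print(lsh_candidate_pairs(recs))  # likely {(1, 2)}: similar token sets collide
```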

Same, Same but Different: A Survey on Duplicate Detection Methods for Situation Awareness

Authors: 
Baumgartner, N.; Gottesheim, W.; Mitsch, S.; Retschitzegger, W.; Schwinger, W.
Year: 
2009
Venue: 
Proc. OTM 2009, LNCS 5871

Systems supporting situation awareness typically deal with a vast stream of information about a large number of real-world objects anchored in time and space provided by multiple sources. These sources are often characterized by frequent updates, heterogeneous formats and, most crucially, identical, incomplete and often even contradictory information. In this respect, duplicate detection methods are of paramount importance, allowing one to explore whether or not information having, e.g., different origins or different observation times concerns one and the same real-world object.

idMesh: graph-based disambiguation of linked data

Authors: 
Cudré-Mauroux, P; Jost, M; de Meer, H
Year: 
2009
Venue: 
Proceedings 18th WWW conf.

We tackle the problem of disambiguating entities on the Web. We propose a user-driven scheme where graphs of entities – represented by globally identifiable declarative artifacts – self-organize in a dynamic and probabilistic manner. Our solution has the following two desirable properties: i) it lets end-users freely define associations between arbitrary entities and ii) it probabilistically infers entity relationships based on uncertain links using constraint-satisfaction mechanisms. We outline the interface between our …

Combining a Logical and a Numerical Method for Data Reconciliation

Authors: 
Saïs, Fatiha; Pernelle, Nathalie; Rousset, Marie-Christine
Year: 
2009
Venue: 
JoDS - Journal of Data Semantics (LNCS subline, Springer)

The reference reconciliation problem consists in deciding whether different identifiers refer to the same data, i.e., correspond to the same real-world entity. In this article we present a reference reconciliation approach which combines a logical method for reference reconciliation called L2R and a numerical one called N2R. This approach exploits the schema and data semantics, which is translated into a set of Horn FOL rules of reconciliation. These rules are used in L2R to infer exact decisions both of reconciliation and non-reconciliation.
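
The logical (L2R) side can be caricatured with a single rule: if the schema declares an attribute functional (one value per entity, e.g. an ISBN), references sharing that value must be reconciled. A toy sketch under that one assumption, not the paper's full rule set:

```python
# Toy logical reconciliation: a functional attribute (one value per
# entity) forces reconciliation of references sharing its value.
# A one-rule caricature of the L2R idea.
from collections import defaultdict

FUNCTIONAL = ["isbn"]  # assumption: the schema declares isbn functional

def reconcile(refs):
    same = set()
    for attr in FUNCTIONAL:
        by_value = defaultdict(list)
        for rid, rec in refs.items():
            if rec.get(attr):
                by_value[rec[attr]].append(rid)
        for ids in by_value.values():
            same |= {(a, b) for a in ids for b in ids if a < b}
    return same

refs = {"r1": {"isbn": "978-0131873254", "title": "Compilers"},
        "r2": {"isbn": "978-0131873254", "title": "The Dragon Book"},
        "r3": {"isbn": "978-0262033848", "title": "CLRS"}}
print(reconcile(refs))  # {('r1', 'r2')}
```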

Data Quality: Concepts, Methodologies and Techniques

Authors: 
Batini, C.; Scannapieca, M.
Year: 
2006
Venue: 
Springer

An Entity Resolution Framework for Deduplicating Proteins

Authors: 
Lochovsky, L; Topaloglou, T
Year: 
2008
Venue: 
Lecture Notes in Computer Science

An important prerequisite to successfully integrating protein data is detecting duplicate records spread across different databases.

Learning Blocking Schemes for Record Linkage

Authors: 
Michelson, Matthew; Knoblock, Craig A.
Year: 
2006
Venue: 
AAAI

Record linkage is the process of matching records across data sets that refer to the same entity. One issue within record linkage is determining which record pairs to consider, since a detailed comparison between all of the records is impractical. Blocking addresses this issue by generating candidate matches as a preprocessing step for record linkage. For example, in a person matching problem, blocking might return all people with the same last name as candidate matches. Two main problems in blocking are the selection of attributes for …
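
The last-name example in the abstract is a one-line, hand-written blocking scheme; the paper learns which such predicates to combine. For contrast, a minimal fixed blocking key might look like this:

```python
# Hand-written blocking: group records by a key (here, last name) and
# only compare pairs within a block. The paper learns such schemes
# automatically; this fixed key is the kind of baseline it improves on.
from collections import defaultdict
from itertools import combinations

people = [("Ann", "Smith"), ("Anne", "Smith"), ("Bob", "Jones")]

blocks = defaultdict(list)
for rec in people:
    blocks[rec[1].lower()].append(rec)      # blocking key: last name

candidates = [p for block in blocks.values()
              for p in combinations(block, 2)]
print(candidates)  # only the two Smiths are compared, not all 3 pairs
```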

Example-driven Design of Efficient Record Matching Queries

Authors: 
Chaudhuri, Surajit; Chen, Bee-Chung; Ganti, Venkatesh; Kaushik, Raghav
Year: 
2007
Venue: 
VLDB

Record matching is the task of identifying records that match the same real-world entity. This is a problem of great significance for a variety of business intelligence applications. Implementations of record matching rely on exact as well as approximate string matching (e.g., edit distances) and use of external reference data sources. Record matching can be viewed as a query composed of a small set of primitive operators. However, formulating such record matching queries is difficult and depends on the specific application scenario.
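
One such primitive operator is an approximate string predicate; a small Levenshtein-based match test sketches the idea (illustrative of the primitive, not of the paper's operator trees; the threshold is an arbitrary assumption):

```python
# Edit-distance predicate: one primitive that record matching queries
# compose. Standard dynamic-programming Levenshtein distance.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def matches(a: str, b: str, max_dist: int = 2) -> bool:
    return levenshtein(a.lower(), b.lower()) <= max_dist

print(matches("Microsoft Corp", "Microsoft Corp."))  # True (distance 1)
```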

Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping

Authors: 
Bilenko, Mikhail; Basu, Sugato; Sahami, Mehran
Year: 
2005
Venue: 
Fifth IEEE International Conference on Data Mining (ICDM'05)

The problem of record linkage focuses on determining whether two object descriptions refer to the same underlying entity. Addressing this problem effectively has many practical applications, e.g., elimination of duplicate records in databases and citation matching for scholarly articles. In this paper, we consider a new domain where the record linkage problem is manifested: Internet comparison shopping. We address the resulting linkage setting that requires learning a similarity function between record pairs from streaming data.
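
Learning a similarity function from streaming labeled pairs can be sketched as a mistake-driven linear update over pair features. This generic perceptron-style sketch is not Bilenko et al.'s exact learner, and the feature names are invented for illustration:

```python
# Online learning of a linear similarity function over pair features.
# Generic perceptron update; feature names are illustrative assumptions.
def online_update(w, features, label, lr=0.1):
    """features: dict of pair-similarity features in [0, 1];
    label: +1 if the two offers are the same product, else -1."""
    score = sum(w.get(f, 0.0) * v for f, v in features.items())
    if label * score <= 0:                   # update only on mistakes
        for f, v in features.items():
            w[f] = w.get(f, 0.0) + lr * label * v
    return w

w = {}
stream = [({"title_jaccard": 0.9, "price_diff": 0.1}, +1),
          ({"title_jaccard": 0.2, "price_diff": 0.8}, -1)]
for feats, y in stream:
    w = online_update(w, feats, y)
print(w)  # weights favor title similarity, penalize price difference
```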

Adaptive Blocking: Learning to Scale Up Record Linkage

Authors: 
Bilenko, Mikhail; Kamath, Beena; Mooney, Raymond J.
Year: 
2006
Venue: 
ICDM

Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving …
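
The quadratic growth is easy to make concrete: n records yield n(n-1)/2 pairs, so a million records already imply on the order of 5 * 10^11 comparisons, which is what blocking avoids:

```python
# Why all-pairs comparison is prohibitive: n records -> n(n-1)/2 pairs.
n = 1_000_000
print(f"all pairs: {n * (n - 1) // 2:,}")          # about 5 * 10^11

# If blocking splits the data into 1,000 equal blocks and only
# within-block pairs are scored (idealized even split, an assumption):
m = n // 1_000
print(f"blocked:   {1_000 * m * (m - 1) // 2:,}")  # about 5 * 10^8
```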

A Fast Linkage Detection Scheme for Multi-Source Information Integration

Authors: 
Aizawa, A; Oyama, K
Year: 
2005
Venue: 
Web Information Retrieval and Integration

Record linkage refers to techniques for identifying records associated with the same real-world entities. Record linkage is not only crucial in integrating multi-source databases that have been generated independently, but is also considered to be one of the key issues in integrating heterogeneous Web resources. However, when targeting large-scale data, the cost of enumerating all the possible linkages often becomes impracticably high. Against this background, this paper proposes a fast and efficient method for linkage detection.

Self-tuning in graph-based reference disambiguation

Authors: 
Nuray-Turan, R; Kalashnikov, DV; Mehrotra, S
Year: 
2007
Venue: 
Proc. DASFAA 2007

Nowadays many data mining/analysis applications use graph analysis techniques for decision making. Many of these techniques are based on the importance of relationships among the interacting units. A number of models and measures that analyze the relationship importance (link structure) have been proposed (e.g., centrality, importance and page rank) and they are generally based on intuition, where the analyst intuitively decides on a reasonable model that fits the underlying data. In this paper, we address the problem of learning such models directly …
D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution

Authors: 
Benjelloun, O.; Garcia-Molina, H.; Gong, H.; Kawai, H.; Larson, T.E.; Menestrina, D.; Thavisomboon, S.
Year: 
2007
Venue: 
Proc. ICDCS, 2007

Entity Resolution (ER) matches and merges records that refer to the same real-world entities, and is typically a compute-intensive process due to complex matching functions and high data volumes. We present a family of algorithms, D-Swoosh, for distributing the ER workload across multiple processors. The algorithms use generic match and merge functions, and ensure that new merged records are distributed to processors that may have matching records. We perform a detailed performance evaluation on a testbed of 15 processors.
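
The generic match/merge model that D-Swoosh distributes can be sketched on a single node as a Swoosh-style loop: match a record against the current result set, merge on success, and re-process the merged record. A simplified single-processor caricature with toy match and merge functions, not the distributed algorithm:

```python
# Single-node sketch of generic match/merge ER (the model D-Swoosh
# distributes). Records are dicts of value sets; merge unions them.
def match(r1, r2):
    return bool(r1["name"] & r2["name"])     # toy match function

def merge(r1, r2):
    return {k: r1[k] | r2[k] for k in r1}    # union-style merge

def er(records):
    done = []
    todo = list(records)
    while todo:
        r = todo.pop()
        partner = next((d for d in done if match(r, d)), None)
        if partner:
            done.remove(partner)
            todo.append(merge(r, partner))   # merged record re-enters
        else:
            done.append(r)
    return done

recs = [{"name": {"IBM"}, "city": {"NY"}},
        {"name": {"IBM", "Intl Business Machines"}, "city": {"Armonk"}},
        {"name": {"Acme"}, "city": {"LA"}}]
print(er(recs))  # the two IBM records merge; Acme stays separate
```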

Domain-independent data cleaning via analysis of entity-relationship graph

Authors: 
Kalashnikov, DV; Mehrotra, S
Year: 
2006
Venue: 
ACM Transactions on Database Systems (TODS)

In this article, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RelDC) and the traditional techniques is that RelDC analyzes not only object features but also inter-object relationships to improve the disambiguation quality.

A knowledge-based approach for duplicate elimination in data cleaning

Authors: 
Low, WL; Lee, ML; Ling, TW
Year: 
2001
Venue: 
Information Systems

Existing duplicate elimination methods for data cleaning work on the basis of computing the degree of similarity between nearby records in a sorted database. High recall can be achieved by accepting records with low degrees of similarity as duplicates, at the cost of lower precision. High precision can be achieved analogously at the cost of lower recall. This is the recall-precision dilemma. We develop a generic knowledge-based framework for effective data cleaning that can implement any existing data cleaning strategies and more.
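
The "nearby records in a sorted database" baseline the abstract refers to is the sorted-neighborhood idea: sort on a key and compare each record only with the records inside a sliding window. A compact sketch (key and window size are arbitrary choices):

```python
# Sorted-neighborhood: sort on a key, then compare each record only
# with its w-1 predecessors in sort order. Key and w are assumptions.
def sorted_neighborhood_pairs(records, key, w=3):
    srt = sorted(records, key=key)
    for i in range(len(srt)):
        for j in range(max(0, i - w + 1), i):
            yield srt[j], srt[i]

names = ["smith john", "smyth john", "jones ann", "smith jon"]
for a, b in sorted_neighborhood_pairs(names, key=lambda s: s, w=3):
    print(a, "<->", b)
# only neighbors in sort order are compared; e.g. "jones ann" and
# "smyth john" end up too far apart and are never paired
```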

Efficient similarity-based operations for data integration

Authors: 
Schallehn, E; Sattler, KU; Saake, G
Year: 
2004
Venue: 
Data & Knowledge Engineering

Dealing with discrepancies in data is still a big challenge in data integration systems. The problem occurs both when eliminating duplicates from semantically overlapping sources and when combining complementary data from different sources. Though using SQL operations like grouping and join seems to be a viable way, they fail if the attribute values of the potential duplicates or related tuples are not equal but only similar by certain criteria. As a solution to this problem, we present in this paper similarity-based variants of grouping and join operators.
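
A similarity join relaxes the equality predicate of an ordinary join to a similarity threshold. A minimal nested-loop version, using trigram Jaccard similarity as one possible criterion (the 0.5 threshold is an arbitrary assumption):

```python
# Nested-loop similarity join: pair tuples whose join attributes are
# similar by a threshold, not strictly equal.
def trigrams(s):
    s = f"  {s.lower()} "                    # pad so edges form trigrams
    return {s[i:i + 3] for i in range(len(s) - 2)}

def sim(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

left = [("l1", "Data & Knowledge Engineering")]
right = [("r1", "Data and Knowledge Engineering"), ("r2", "TODS")]

joined = [(l, r) for l in left for r in right if sim(l[1], r[1]) >= 0.5]
print(joined)  # l1 joins r1 despite the '&' vs 'and' discrepancy
```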

Matching Algorithms within a Duplicate Detection System

Authors: 
Monge, AE
Year: 
2000
Venue: 
IEEE Data Engineering Bulletin

Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, unstandardized abbreviations, or differences in the detailed schemas of records from multiple databases – such as what happens in data warehousing where records from multiple data sources are integrated into a single source of information – among other reasons. In this paper we review a system …

Efficient clustering of high-dimensional data sets with application to reference matching

Authors: 
McCallum, A; Nigam, K; Ungar, LH
Year: 
2000
Venue: 
Proc. 6th ACM SIGKDD conf.

Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once, for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters.
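
The technique this abstract describes is canopy clustering: a cheap metric first carves the data into overlapping canopies, and the expensive metric is applied only within canopies. A small sketch with token overlap as the cheap metric (thresholds are illustrative assumptions):

```python
# Canopy clustering sketch: a cheap metric builds overlapping canopies;
# expensive comparisons are later restricted to within-canopy pairs.
def cheap_sim(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def canopies(points, t_loose=0.3, t_tight=0.6):
    remaining = set(range(len(points)))
    result = []
    while remaining:
        c = remaining.pop()                  # arbitrary canopy center
        members = {c}
        for p in list(remaining):
            s = cheap_sim(points[c], points[p])
            if s >= t_loose:
                members.add(p)               # inside this canopy
                if s >= t_tight:
                    remaining.discard(p)     # too close to seed another
        result.append(members)
    return result

refs = ["efficient clustering of high dimensional data",
        "clustering high dimensional data sets efficiently",
        "record matching over query results"]
print(canopies(refs))  # the two clustering titles share a canopy
```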

Source-aware entity matching: A compositional approach

Authors: 
Shen, W.; DeRose, P.; Vu, L.; Doan, A.; Ramakrishnan, R.
Year: 
2007
Venue: 
Proceedings of ICDE 2007

Entity matching (a.k.a. record linkage) plays a crucial role in integrating multiple data sources, and numerous matching solutions have been developed. However, the solutions have largely exploited only information available in the mentions and employed a single matching technique. We show how to exploit information about data sources to significantly improve matching accuracy. In particular, we observe that different sources often vary substantially in their level of semantic ambiguity, thus requiring different matching techniques. In addition, it is often beneficial …
Web Service Composition and Record Linking

Authors: 
Cameron, M.A.; Taylor, K.L.; Baxter, R.
Year: 
2004
Venue: 
Proceedings of the Workshop on Information Integration on the Web (IIWeb-2004), Toronto, Canada, 2004

We describe a prototype composition and runtime environment which together generate and execute service compositions from service descriptions and user requirements. We describe our designs for record linkage services which have been drawn from existing freely available software packages. We compare the performance of a service composition generated from a user query against a process abstraction and services for record linking with that of a standalone record linking application.

Exploiting secondary sources for automatic object consolidation

Authors: 
Michalowski, M; Thakkar, S; Knoblock, CA
Year: 
2003
Venue: 
Proc. 2003 ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation

Information sources on the web are controlled by different organizations or people, utilize different text formats, and have varying inconsistencies. Therefore, any system that integrates information from different data sources must consolidate data from these sources. Data from many data sources on the web may not contain enough information to accurately consolidate the data, even using state-of-the-art object consolidation systems. We present an approach to accurately and automatically consolidate data from various data sources by utilizing a state-of-the-art object consolidation …
A Latent Dirichlet Model for Unsupervised Entity Resolution

Authors: 
Bhattacharya, I.; Getoor, L.
Year: 
2006
Venue: 
The SIAM International Conference on Data Mining (SIAM-SDM), 2006

Entity resolution has received considerable attention in recent years. Given many references to underlying entities, the goal is to predict which references correspond to the same entity. We show how to extend the Latent Dirichlet Allocation model for this task and propose a probabilistic model for collective entity resolution for relational domains where references are connected to each other.
