
Conditional Functional Dependencies for Data Cleaning

Authors: 
Bohannon, Philip; Fan, Wenfei; Geerts, Floris; Jia, Xibei; Kementsietsidis, Anastasios
Year: 
2007
Venue: 
ICDE, 2007

We propose a class of constraints, referred to as conditional functional dependencies (CFDs), and study their applications in data cleaning. In contrast to traditional functional dependencies (FDs), which were developed mainly for schema design, CFDs aim at capturing the consistency of data by incorporating bindings of semantically related values. For CFDs we provide an inference system analogous to Armstrong's axioms for FDs, as well as consistency analysis.
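
As an illustration of the idea only (not the paper's notation or algorithms), the following minimal Python sketch checks one CFD against a few made-up rows: within tuples whose country is 'UK', zip is required to determine street.

    # Minimal sketch of checking a single CFD (lhs -> rhs, pattern) against rows.
    # The CFD and the sample data are illustrative, not taken from the paper.
    from collections import defaultdict

    def violates_cfd(rows, lhs, rhs, pattern):
        """Return groups of rows that violate the CFD.

        pattern maps attributes to required constants; attributes absent
        from the pattern act as wildcards.
        """
        # Keep only rows matching the pattern's constant bindings.
        matching = [r for r in rows
                    if all(r.get(a) == v for a, v in pattern.items())]
        # Group by the LHS attributes; within each group the RHS must agree.
        groups = defaultdict(list)
        for r in matching:
            groups[tuple(r[a] for a in lhs)].append(r)
        return [g for g in groups.values()
                if len({tuple(r[a] for a in rhs) for r in g}) > 1]

    rows = [
        {"country": "UK", "zip": "EH8 9LE", "street": "Mayfield"},
        {"country": "UK", "zip": "EH8 9LE", "street": "Crichton"},   # conflict
        {"country": "US", "zip": "EH8 9LE", "street": "Main St"},    # pattern does not apply
    ]
    # CFD: for tuples with country = 'UK', zip determines street.
    print(violates_cfd(rows, lhs=["zip"], rhs=["street"], pattern={"country": "UK"}))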

A Fast Linkage Detection Scheme for Multi-Source Information Integration

Authors: 
Aizawa, A; Oyama, K
Year: 
2005
Venue: 
Web Information Retrieval and Integration

Record linkage refers to techniques for identifying records associated with the same real-world entities. Record linkage is not only crucial in integrating multi-source databases that have been generated independently, but is also considered to be one of the key issues in integrating heterogeneous Web resources. However, when targeting large-scale data, the cost of enumerating all the possible linkages often becomes impracticably high. Against this background, this paper proposes a fast and efficient method for linkage detection.

Adaptive name matching in information integration

Authors: 
Bilenko, M; Mooney, R; Cohen, W; Ravikumar, P; Fienberg, S
Year: 
2003
Venue: 
Intelligent Systems

Identifying approximately duplicate database records that refer to the same entity is essential for information integration. The authors compare and describe methods for combining and learning textual similarity measures for name matching.

Efficient topic-based unsupervised name disambiguation

Authors: 
Song, Y; Huang, J; Councill, IG; Li, J; Giles, CL
Year: 
2007
Venue: 
Proc. 2007 Conf. on Digital libraries

Name ambiguity is a special case of identity uncertainty where one person can be referenced by multiple name variations in different situations or even share the same name with other people. In this paper, we focus on the problem of disambiguating person names within web pages and scientific documents. We present an efficient and effective two-stage approach to disambiguate names. In the first stage, two novel topic-based models are proposed by extending two hierarchical Bayesian text models, namely Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA).
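
As a rough sketch of the first, topic-based stage only, the following uses scikit-learn's off-the-shelf LDA as a stand-in for the extended PLSA/LDA models the paper proposes; the documents mentioning the ambiguous name are invented, and grouping by dominant topic is a crude substitute for the paper's second stage.

    # Hedged sketch: fit a topic model over documents that mention an ambiguous
    # name and group documents by dominant topic. Standard LDA is a stand-in for
    # the paper's extended models; the documents are made up.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "J. Smith studies record linkage and data cleaning in databases",
        "database cleaning and duplicate detection work by J. Smith",
        "J. Smith publishes on protein folding and molecular biology",
        "molecular dynamics simulations of proteins, J. Smith et al.",
    ]

    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    doc_topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)
    dominant = doc_topics.argmax(axis=1)
    print(dominant)  # documents sharing a dominant topic are grouped as one person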

Self-tuning in graph-based reference disambiguation

Authors: 
Nuray-Turan, R; Kalashnikov, DV; Mehrotra, S
Year: 
2007
Venue: 
Proc. DASFAA 2007

Nowadays many data mining and analysis applications use graph analysis techniques for decision making. Many of these techniques are based on the importance of relationships among the interacting units. A number of models and measures that analyze relationship importance (link structure) have been proposed (e.g., centrality, importance, and PageRank), and they are generally based on intuition, where the analyst intuitively decides on a reasonable model that fits the underlying data. In this paper, we address the problem of learning such models directly

Personal Name Matching: New Test Collections and a Social Network based Approach.

Authors: 
Reuther, P
Year: 
2006
Venue: 
Tech. Report, Univ. Trier

This paper gives an overview of Personal Name Matching. Personal name matching is of great importance for all applications that deal with personal names. The problem with personal names is that they are not unique, and sometimes many variations exist even for one name. As a result, databases may on the one hand have several entries for one and the same person, and on the other hand have one entry for many different persons. For the evaluation of Personal Name Matching algorithms, test collections are of great

D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution

Authors: 
Benjelloun, O.; Garcia-Molina, H.; Gong, H.; Kawai, H; Larson, T.E.; Menestrina, D.; Thavisomboon, S.
Year: 
2007
Venue: 
Proc. ICDCS, 2007

Entity Resolution (ER) matches and merges records that refer to the same real-world entities, and is typically a compute-intensive process due to complex matching functions and high data volumes. We present a family of algorithms, D-Swoosh, for distributing the ER workload across multiple processors. The algorithms use generic match and merge functions, and ensure that new merged records are distributed to processors that may have matching records. We perform a detailed performance evaluation on a testbed of 15 processors.
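
The following is a simplified, single-processor sketch of generic match-and-merge entity resolution in the spirit described above; the match and merge rules are invented, and the distribution of work across processors that D-Swoosh actually contributes is not shown.

    # Simplified single-node sketch of ER with generic match() and merge().
    def match(r1, r2):
        # Illustrative rule: records match if they share an e-mail address.
        return bool(r1["emails"] & r2["emails"])

    def merge(r1, r2):
        # Illustrative merge: union the attribute values of both records.
        return {"names": r1["names"] | r2["names"],
                "emails": r1["emails"] | r2["emails"]}

    def resolve(records):
        """Repeatedly match and merge until no two records match."""
        pending, resolved = list(records), []
        while pending:
            r = pending.pop()
            partner = next((s for s in resolved if match(r, s)), None)
            if partner is None:
                resolved.append(r)
            else:
                resolved.remove(partner)
                pending.append(merge(r, partner))  # merged record may match others
        return resolved

    records = [
        {"names": {"J. Smith"}, "emails": {"js@example.org"}},
        {"names": {"John Smith"}, "emails": {"js@example.org", "john@example.com"}},
        {"names": {"Jane Doe"}, "emails": {"jd@example.org"}},
    ]
    print(resolve(records))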

Biological data cleaning: a case study

Authors: 
Herbert, KG; Wang, JTL
Year: 
2007
Venue: 
International Journal of Information Quality

As databases become more pervasive through the biological sciences, various data quality concerns are emerging. Biological databases tend to develop data quality issues regarding data legacy, data uniformity and data duplication. Due to the nature of this data, each of these problems is non-trivial and can cause many problems for the database. For biological data to be corrected and standardised, methods and frameworks must be developed to handle both structural and traditional data. This paper discusses issues concerning biological data quality with respect to data cleaning.

Approximate string-matching with q-grams and maximal matches

Authors: 
Ukkonen, E
Year: 
1992
Venue: 
Theoretical Computer Science

We study approximate string matching in connection with two string distance functions that are computable in linear time. The first function is based on the so-called q-grams. An algorithm is given for the associated string-matching problem that finds the locally best approximate occurrences of pattern P, |P| = m, in text T, |T| = n, in time O(n log(m - q)). The occurrences with distance
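
A minimal sketch of a q-gram distance between two strings (the sum of absolute differences of their q-gram occurrence counts, computable in linear time); this illustrates the flavor of the distance functions studied, not the paper's matching algorithm.

    # q-gram distance: compare the multisets of length-q substrings of two strings.
    from collections import Counter

    def qgrams(s, q):
        return Counter(s[i:i + q] for i in range(len(s) - q + 1))

    def qgram_distance(x, y, q=2):
        gx, gy = qgrams(x, q), qgrams(y, q)
        return sum(abs(gx[g] - gy[g]) for g in set(gx) | set(gy))

    print(qgram_distance("conditional", "condifional"))  # small distance (near duplicate)
    print(qgram_distance("conditional", "zebra"))        # large distance (unrelated)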

A guided tour to approximate string matching

Authors: 
Navarro, G
Year: 
2001
Venue: 
ACM Computing Surveys

We survey the current techniques to cope with the problem of string matching that allows errors. This is becoming a more and more relevant issue for many fast growing areas such as information retrieval and computational biology. We focus on online searching and mostly on edit distance, explaining the problem and its relevance, its statistical behavior, its history and current developments, and the central ideas of the algorithms and their complexities. We present a number of experiments to compare the performance of the different algorithms and show which are the best choices.
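
Since the survey centers on edit distance, here is the textbook O(mn) dynamic-programming computation; the survey's faster filtering and bit-parallel algorithms are not shown.

    # Standard Levenshtein (edit) distance via two-row dynamic programming.
    def edit_distance(a, b):
        m, n = len(a), len(b)
        prev = list(range(n + 1))
        for i in range(1, m + 1):
            cur = [i] + [0] * n
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                cur[j] = min(prev[j] + 1,         # deletion
                             cur[j - 1] + 1,      # insertion
                             prev[j - 1] + cost)  # substitution
            prev = cur
        return prev[n]

    print(edit_distance("surgery", "survey"))  # 2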

Data unification in personal information management

Authors: 
Karger, DR; Jones, W
Year: 
2006
Venue: 
Communications of the ACM

Users need ways to unify, simplify, and consolidate information too often fragmented by location, device, and software application.

Domain-independent data cleaning via analysis of entity-relationship graph

Authors: 
Kalashnikov, DV; Mehrotra, S
Year: 
2006
Venue: 
ACM Transactions on Database Systems (TODS)

In this article, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RelDC) and the traditional techniques is that RelDC analyzes not only object features but also inter-object relationships to improve the disambiguation quality.

A knowledge-based approach for duplicate elimination in data cleaning

Authors: 
Low, WL; Lee, ML; Ling, TW
Year: 
2001
Venue: 
Information Systems

Existing duplicate elimination methods for data cleaning work on the basis of computing the degree of similarity between nearby records in a sorted database. High recall can be achieved by accepting records with low degrees of similarity as duplicates, at the cost of lower precision. High precision can be achieved analogously at the cost of lower recall. This is the recall-precision dilemma. We develop a generic knowledge-based framework for effective data cleaning that can implement any existing data cleaning strategies and more.
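
A small worked example of the recall-precision dilemma described above, with invented similarity scores and labels: lowering the acceptance threshold raises recall and lowers precision.

    # (similarity score, is_true_duplicate) for candidate record pairs; made-up data.
    pairs = [(0.95, True), (0.90, True), (0.80, False), (0.75, True),
             (0.60, False), (0.55, True), (0.40, False)]

    true_dups = sum(dup for _, dup in pairs)
    for threshold in (0.85, 0.70, 0.50):
        accepted = [dup for score, dup in pairs if score >= threshold]
        precision = sum(accepted) / len(accepted)
        recall = sum(accepted) / true_dups
        print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")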

Efficient similarity-based operations for data integration

Authors: 
Schallehn, E; Sattler, KU; Saake, G
Year: 
2004
Venue: 
Data & Knowledge Engineering

Dealing with discrepancies in data is still a big challenge in data integration systems. The problem occurs both when eliminating duplicates from semantically overlapping sources and when combining complementary data from different sources. Though using SQL operations like grouping and join seems to be a viable way, they fail if the attribute values of the potential duplicates or related tuples are not equal but only similar by certain criteria. As a solution to this problem, we present in this paper similarity-based variants of grouping and join operators.
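
A minimal sketch of the predicate behind a similarity join: tuples from two sources are paired when their name attributes are similar under a trigram-overlap criterion rather than exactly equal. The paper's operators are SQL-level; this only illustrates the idea, with an invented threshold.

    # Similarity join on a name attribute using trigram Jaccard overlap.
    def trigrams(s):
        s = s.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)}

    def similar(a, b, threshold=0.5):
        ta, tb = trigrams(a), trigrams(b)
        return len(ta & tb) / max(len(ta | tb), 1) >= threshold

    source1 = [{"id": 1, "name": "Jon Smith"}, {"id": 2, "name": "Ann Lee"}]
    source2 = [{"id": "a", "name": "John Smith"}, {"id": "b", "name": "Bob Ray"}]

    joined = [(r1, r2) for r1 in source1 for r2 in source2
              if similar(r1["name"], r2["name"])]
    print(joined)  # pairs the two Smith records despite the spelling difference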

Duplicate record identification in bibliographic databases

Authors: 
Goyal, P
Year: 
1987
Venue: 
Information Systems

This study presents the applicability of an automatically generated code for use in duplicate detection in bibliographic databases. It is shown that the methods generate a large percentage of unique codes, and that the code is short enough to be useful. The code would prove to be particularly useful in identifying duplicates when records are added to the database.

On The Accuracy and Completeness of The Record Matching Process

Authors: 
Verykios, VS; Elfeky, MG; Elmagarmid, AK
Year: 
2000
Venue: 
Proc.2000 Conf. on Information Quality

The role of data resources in today's business environment is multi-faceted. Primarily, they support the operational needs of an organization or a company. Secondarily, they can be used for decision support and management. The quality of the data, used to support the operational needs, is usually below the quality required for decision support and management.

Matching Algorithms within a Duplicate Detection System

Authors: 
Monge, AE
Year: 
2000
Venue: 
IEEE Data Engineering Bulletin

Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, unstandardized abbreviations, or differences in the detailed schemas of records from multiple databases (such as what happens in data warehousing, where records from multiple data sources are integrated into a single source of information), among other reasons. In this paper we review a system

Efficient clustering of high-dimensional data sets with application to reference matching

Authors: 
McCallum, A; Nigam, K; Ungar, LH
Year: 
2000
Venue: 
Proc. 6th ACM SIGKDD conf.

Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once, for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters.
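
As a generic illustration of the scaling problem described above (not the specific technique proposed in the paper), the following blocking sketch uses a cheap key to form candidate groups so that expensive pairwise comparisons run only within groups.

    # Blocking: a cheap key (first three letters of the normalized surname) limits
    # which record pairs are compared at all. Example records are invented.
    from collections import defaultdict
    from itertools import combinations

    records = ["A. McCallum", "Andrew McCallum", "K. Nigam", "Kamal Nigam",
               "L. Ungar", "Lyle Ungar"]

    def block_key(name):
        return name.split()[-1].lower()[:3]

    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)

    candidate_pairs = [pair for block in blocks.values()
                       for pair in combinations(block, 2)]
    print(candidate_pairs)  # 3 candidate pairs instead of all 15 possible pairs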

Source-aware entity matching: A compositional approach

Authors: 
Shen, W.; DeRose, P.; Vu, L.; Doan, A.; Ramakrishnan, R.
Year: 
2007
Venue: 
Proceedings of ICDE 2007

Entity matching (a.k.a. record linkage) plays a crucial role in integrating multiple data sources, and numerous matching solutions have been developed. However, the solutions have largely exploited only information available in the mentions and employed a single matching technique. We show how to exploit information about data sources to significantly improve matching accuracy. In particular, we observe that different sources often vary substantially in their level of semantic ambiguity, thus requiring different matching techniques. In addition, it is often beneficial
Web Service Composition and Record Linking

Authors: 
Cameron, M.A.; Taylor, K.L.; Baxter, R.
Year: 
2004
Venue: 
Proceedings of the Workshop on Information Integration on the Web (IIWeb-2004), Toronto, Canada, 2004

We describe a prototype composition and runtime environment which together generate and execute service compositions from service descriptions and user requirements. We describe our designs for record linkage services, which have been drawn from existing freely available software packages. We compare the performance of a service composition generated from a user query against a process abstraction and services for record linking with that of a standalone record linking application.

Getty's Synoname and its cousins: A survey of applications of personal name-matching algorithms

Authors: 
Borgman, CL; Siegfried, SL
Year: 
1992
Venue: 
Journal of the American Society for Information Science

The study reported in this article was commissioned by the Getty Art History Information Program (AHIP) as a background investigation of personal name-matching programs in fields other than art history, for purposes of comparing them and their approaches with AHIP's Synoname™ project. We review techniques employed in a variety of applications, including art history, bibliography, genealogy, commerce, and government, providing a framework of personal name characteristics, factors in selecting matching techniques, and types of applications.

Re-identification of Familial Database Records.

Authors: 
Malin, B
Year: 
2006
Venue: 
Proc. AMIA Annual Symp

Many genome-based research projects include familial relationships, such as pedigrees, with genomic data records. To protect anonymity when sharing family information, data holders remove, or encode, explicit identifiers (e.g., personal name). In this paper, however, we introduce IdentiFamily, a software program that can link de-identified family relations to named people. The program extracts genealogical knowledge from publicly available records and ascertains the re-identification risk for specific family relations. We find robust genealogies on current
An interface for mining genealogical nominal data using the concept of linkage and a hybrid name matching algorithm

Authors: 
Snae, C; Diaz, BM
Year: 
2002
Venue: 
Journal of 3D-Forum Society

This paper describes hybrid name matching algorithms developed to provide nominal data linkage within English parish register data. LIG2 has been shown to perform as well as conventional matching algorithms found in the literature, while its probability version LIG3 provides sufficient flexibility to be included in a Nominal Data Linkage Workbench, which allows other dimensions (e.g., geographical space and historical time) to be included in the linkage/matching process. The paper reports some initial findings on implementing such a Workbench.

Regelbasierte Ausreißersuche zur Datenqualitätsanalyse (Rule-Based Outlier Detection for Data Quality Analysis)

Authors: 
Kübart, J.; Grimmer, Udo; Hipp, Jochen
Year: 
2005
Venue: 
Datenbankspektrum, Vol. 14, 2005

The quality of the underlying data is critical for data analyses and data migrations. Analyzing data quality, however, is a non-trivial task, especially for large data sets. We present a method for rule-based outlier detection in large databases that can be used both with validity rules specified by experts ("business rules") and with rules generated automatically from the data.

A Survey of Data Quality Tools

Authors: 
Barateiro, José; Galhardas, Helena
Year: 
2005
Venue: 
Datenbankspektrum, Vol. 14, 2005

Data quality tools aim at detecting and correcting data problems that affect the accuracy and efficiency of data analysis applications. We propose a classification of the most relevant commercial and research data quality tools that can be used as a framework for comparing tools and understanding their functionalities.
