
Incorporating string transformations in record matching

Authors: 
Arasu, A; Chaudhuri, S; Ganjam, K; Kaushik, R
Year: 
2008
Venue: 
SIGMOD

Today's record matching infrastructure does not provide a flexible way to account for synonyms, such as "Robert" and "Bob", which refer to the same name, or for more general string transformations such as abbreviations. We extend the record matching problem to take such user-defined string transformations as input. These transformations, coupled with an underlying similarity function, are used to define the similarity between two strings.
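
As a rough illustration only (no code appears in this listing), the sketch below layers a hypothetical synonym table over Jaccard token similarity; both the table and the choice of Jaccard are assumptions, not the paper's exact formulation.

```python
# Sketch: user-defined transformations layered over a token-based
# similarity function. The transformation table and the use of
# Jaccard similarity are illustrative assumptions.

TRANSFORMATIONS = {"bob": "robert", "jr": "junior"}  # hypothetical rules

def normalize(s: str) -> set[str]:
    """Tokenize and apply each applicable transformation."""
    return {TRANSFORMATIONS.get(tok, tok) for tok in s.lower().split()}

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def similarity(s1: str, s2: str) -> float:
    return jaccard(normalize(s1), normalize(s2))

print(similarity("Bob Smith", "Robert Smith"))  # 1.0 after transformation
```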

Leveraging aggregate constraints for deduplication

Authors: 
Chaudhuri, S; Sarma, AD; Ganti, V; Kaushik, R
Year: 
2007
Venue: 
SIGMOD

We show that aggregate constraints (as opposed to pairwise constraints) that often arise when integrating multiple sources of data can be leveraged to enhance the quality of deduplication. However, despite its appeal, we show that the problem is challenging, both semantically and computationally. We define a restricted search space for deduplication that is intuitive in our context, and we solve the problem optimally for the restricted space. Our experiments on real data show that incorporating aggregate constraints significantly enhances the accuracy of deduplication.
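
The abstract leaves the constraints abstract; purely to fix ideas, the sketch below uses a hypothetical SUM-equality constraint to accept or reject a candidate duplicate group. All names and data are made up.

```python
# Sketch: accept a candidate duplicate group only if an aggregate
# constraint holds. The SUM-equality constraint and the data below
# are hypothetical; the paper's formulation is more general.

def satisfies_aggregate(group_rows, reference_total, tol=0.01):
    """Check SUM(amount) over the candidate group against a reference."""
    return abs(sum(r["amount"] for r in group_rows) - reference_total) <= tol

candidate_group = [
    {"name": "ACME Corp", "amount": 120.0},
    {"name": "ACME Corporation", "amount": 80.0},
]
print(satisfies_aggregate(candidate_group, reference_total=200.0))  # True
```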

Similarity Group-By

Authors: 
Silva, Yasin N.; Aref, Walid G.; Ali, Mohamed H.
Year: 
2009
Venue: 
ICDE

Group-by is a core database operation that is used extensively in OLTP, OLAP, and decision support systems. In many application scenarios, it is required to group similar but not necessarily equal values. In this paper we propose a new SQL construct that supports similarity-based Group-by (SGB). SGB is not a new clustering algorithm, but rather is a practical and fast similarity grouping query operator that is compatible with other SQL operators and can be combined with them to answer similarity-based queries efficiently.
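
SGB is defined as a SQL construct, which this listing does not reproduce; the Python sketch below shows one plausible grouping semantics (each value joins a group while it stays within a threshold of the group's first member) and is not the paper's syntax or algorithm.

```python
# Sketch: similarity-based grouping of numeric values. Grouping
# values within a fixed distance of the group's seed is only one of
# several possible SGB semantics.

def similarity_group_by(values, threshold):
    groups, current = [], []
    for v in sorted(values):
        if current and v - current[0] > threshold:
            groups.append(current)
            current = []
        current.append(v)
    if current:
        groups.append(current)
    return groups

print(similarity_group_by([1, 2, 8, 9, 10, 25], threshold=3))
# [[1, 2], [8, 9, 10], [25]]
```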

Large-Scale Deduplication with Constraints Using Dedupalog

Authors: 
Arasu, Arvind; Ré, Christopher; Suciu, Dan
Year: 
2009
Venue: 
ICDE

We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is "each paper has a unique publication venue"; if two paper references are duplicates, then their associated conference references must be duplicates as well. Our framework supports collective deduplication, meaning that we can dedupe both paper references and conference references collectively in the example above.
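
Dedupalog itself is a declarative, Datalog-style language; the imperative sketch below only illustrates the constraint quoted above, propagating paper-level duplicate decisions to venues through a toy union-find.

```python
# Sketch: if two paper references are duplicates, force their venue
# references to be duplicates too, using a toy union-find. Dedupalog
# expresses this declaratively; this imperative version is only an
# illustration, with made-up data.

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

venue_of = {"p1": "Proc. SIGMOD", "p2": "SIGMOD Conference"}  # toy data

def declare_paper_duplicates(p, q):
    union(p, q)
    union(venue_of[p], venue_of[q])  # constraint: venues must co-refer

declare_paper_duplicates("p1", "p2")
print(find("Proc. SIGMOD") == find("SIGMOD Conference"))  # True
```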

A grammar-based entity representation framework for data cleaning

Authors: 
Arasu, Arvind; Kaushik, Raghav
Year: 
2009
Venue: 
SIGMOD

Fundamental to data cleaning is the need to account for multiple data representations. We propose a formal framework that can be used to reason about and manipulate data representations. The framework is declarative and combines elements of a generative grammar with database querying. It also incorporates actions in the spirit of programming language compilers. This framework has multiple applications, such as parsing and data normalization. Data normalization is useful in its own right, both in preparing data for analysis and in pre-processing data for further cleansing.
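
As a loose analogy only, since the paper's grammar formalism is far richer than regular expressions, the sketch below pairs a parse rule with a normalization action.

```python
# Sketch: a parse rule paired with an action that normalizes the
# matched representation. A regular expression stands in for the
# paper's grammar formalism purely for illustration.

import re

# Rule: "Last, First" -> action: rewrite as "First Last"
RULE = re.compile(r"^(?P<last>[A-Za-z]+),\s*(?P<first>[A-Za-z]+)$")

def normalize_name(s: str) -> str:
    m = RULE.match(s)
    return f"{m.group('first')} {m.group('last')}" if m else s

print(normalize_name("Smith, Robert"))  # "Robert Smith"
```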

Exploiting context analysis for combining multiple entity resolution systems

Authors: 
Chen, Zhaoqi; Kalashnikov, Dmitri V.; Mehrotra, Sharad
Year: 
2009
Venue: 
SIGMOD

Entity Resolution (ER) is an important real-world problem that has attracted significant research interest over the past few years. It deals with determining which object descriptions co-refer in a dataset. Due to its practical significance for data mining and data analysis tasks, many different ER approaches have been developed to address the ER challenge. This paper proposes a new ER Ensemble framework. The task of ER Ensemble is to combine the results of multiple base-level ER systems into a single solution with the goal of increasing the quality of ER.
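
The abstract does not describe how results are combined; the simplest conceivable baseline, shown purely to fix ideas, is majority voting over the base systems' pairwise match decisions.

```python
# Sketch: combine base-level ER systems by majority vote on each
# candidate pair. The actual framework in the paper is context-aware
# and more sophisticated; this baseline and its toy base systems are
# illustrative only.

def ensemble_match(pair, base_systems):
    """base_systems: callables returning True if the pair co-refers."""
    votes = sum(1 for system in base_systems if system(pair))
    return votes * 2 > len(base_systems)

# Toy base systems (hypothetical):
exact = lambda p: p[0] == p[1]
prefix = lambda p: p[0][:3] == p[1][:3]
length = lambda p: abs(len(p[0]) - len(p[1])) <= 2

print(ensemble_match(("Robert", "Roberto"), [exact, prefix, length]))  # True
```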

Mining Document Collections to Facilitate Accurate Approximate Entity Matching

Authors: 
Chaudhuri, Surajit; Ganti, Venkatesh; Xin, Dong
Year: 
2009
Venue: 
VLDB

Many entity extraction techniques leverage large reference entity tables to identify entities in documents. Often, an entity is referenced in document collections differently from that in the reference entity tables. Therefore, we study the problem of determining whether or not a substring "approximately" matches with a reference entity. Similarity measures which exploit the correlation between candidate substrings and reference entities across a large number of documents are known to be more robust than traditional stand…
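
The snippet cuts off before the measure is defined; the sketch below scores a candidate by a simple co-occurrence fraction across documents, one plausible correlation signal but not necessarily the paper's.

```python
# Sketch: score a candidate substring against a reference entity by
# their co-occurrence across a document collection. The exact measure
# used in the paper may differ; this is an illustrative proxy over
# made-up documents.

def cooccurrence_score(candidate, entity, documents):
    with_candidate = [d for d in documents if candidate in d]
    if not with_candidate:
        return 0.0
    return sum(entity in d for d in with_candidate) / len(with_candidate)

docs = [  # toy collection
    "the t-shirt from tommy hilfiger was on sale",
    "tommy hilfiger opened a new store",
    "tommy said hello",
]
print(cooccurrence_score("tommy", "tommy hilfiger", docs))  # ~0.67
```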

Name Disambiguation Using Web Connection

Authors: 
Lu, Y; Nie, Z; Cheng, T; Gao, Y; Wen, JR
Year: 
2007
Venue: 
Proceedings of AAAI 2007 Workshop on Information Integration ...

Name disambiguation is an important challenge in data cleaning. If two name references refer to the same person, it is very likely that they share some coauthors, references, or are indirectly related by a chain of…
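
The snippet only hints at the idea; the toy sketch below measures connection as chain length in a coauthor graph via breadth-first search, which is illustrative rather than the paper's actual connection measure.

```python
# Sketch: treat two name references as likely co-referent if they are
# connected through a short chain of coauthors/references. The graph
# and the plain BFS distance are illustrative only.

from collections import deque

graph = {  # toy coauthor/citation links
    "Y. Lu (paper A)": ["Z. Nie"],
    "Z. Nie": ["Y. Lu (paper A)", "Y. Lu (paper B)"],
    "Y. Lu (paper B)": ["Z. Nie"],
}

def chain_length(src, dst):
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None

print(chain_length("Y. Lu (paper A)", "Y. Lu (paper B)"))  # 2
```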

Improving Data Quality: Consistency and Accuracy

Authors: 
Cong, Gao; Fan, Wenfei; Geerts, Floris; Jia, Xibei; Ma, Shuai
Year: 
2007
Venue: 
VLDB

Two central criteria for data quality are consistency and accuracy. Inconsistencies and errors in a database often emerge as violations of integrity constraints. Given a dirty database D, one needs automated methods to make it consistent, i.e., find a repair D′ that satisfies the constraints and “minimally” differs from D. Equally important is to ensure that the automatically-generated repair D′ is accurate, or makes sense, i.e., D′ differs from the “correct” data within a predefined bound. This paper studies effective methods for improving both data consistency and accuracy.
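
As a minimal illustration of repairing a database toward constraint satisfaction with few changes (the paper's machinery is far more general, with a cost model and accuracy bounds), the sketch below repairs violations of a functional dependency zip -> city by majority vote.

```python
# Sketch: repair violations of a functional dependency zip -> city by
# setting each zip's city to the majority value, a crude stand-in for
# the paper's cost-based repair. Data is made up.

from collections import Counter

rows = [  # toy dirty relation
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newyork"},   # violation
]

def repair_fd(rows):
    majority = {}
    for zip_code in {r["zip"] for r in rows}:
        cities = Counter(r["city"] for r in rows if r["zip"] == zip_code)
        majority[zip_code] = cities.most_common(1)[0][0]
    return [{**r, "city": majority[r["zip"]]} for r in rows]

print(repair_fd(rows))  # all three rows now agree on "New York"
```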

A Primitive Operator for Similarity Joins in Data Cleaning

Authors: 
Chaudhuri, S.; Ganti, V.; Kaushik, R.
Year: 
2006
Venue: 
ICDE

Data cleaning based on similarities involves identification of “close” tuples, where closeness is evaluated using a variety of similarity functions chosen to suit the domain and application. Current approaches for efficiently implementing such similarity joins are tightly tied to the chosen similarity function. In this paper, we propose a new primitive operator which can be used as a foundation to implement similarity joins according to a variety of popular string similarity functions, and notions of similarity which go beyond…
Data cleaning in microsoft SQL server 2005

Authors: 
Chaudhuri, S.; Ganjam, K.; Ganti, V.; Kapoor, R.; Narasayya, V.; Vassilakis, T.
Year: 
2005
Venue: 
SIGMOD (Demo)

When collecting and combining data from various sources into a data warehouse, ensuring high data quality and consistency becomes a significant, often expensive, challenge. Common data quality problems include inconsistent data conventions amongst sources such as different abbreviations or synonyms; data entry errors such as spelling mistakes; missing, incomplete, outdated or otherwise incorrect attribute values.

Eliminating fuzzy duplicates in data warehouses

Authors: 
Ananthakrishna, R; Chaudhuri, S; Ganti, V
Year: 
2002
Venue: 
VLDB

The duplicate elimination problem of detecting multiple tuples, which describe the same real world entity, is an important data cleaning problem. Previous domain independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions.

An overview of data warehousing and OLAP technology

Authors: 
Chaudhuri, S; Dayal, U
Year: 
1997
Venue: 
ACM SIGMOD Record

Robust Identification of Fuzzy Duplicates

Authors: 
Chaudhuri, Surajit; Ganti, Venkatesh; Motwani, Rajeev
Year: 
2005
Venue: 
ICDE

Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches.
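
The snippet does not name the two criteria; the mutual-nearest-neighbor check below is an assumption about what one such criterion could look like, not the paper's definition.

```python
# Sketch: a mutual-nearest-neighbor check as one plausible duplicate
# criterion. Treat this as an assumption about the paper's criteria,
# not their exact definition; the distance function is a crude toy.

def crude_dist(s, t):
    """Hamming-style mismatch count plus length gap; NOT edit distance."""
    return sum(c1 != c2 for c1, c2 in zip(s, t)) + abs(len(s) - len(t))

def nearest(x, points, dist):
    return min((p for p in points if p != x), key=lambda p: dist(x, p))

def mutually_nearest(a, b, points, dist):
    return nearest(a, points, dist) == b and nearest(b, points, dist) == a

points = ["robert", "robert.", "bobby", "xavier"]
print(mutually_nearest("robert", "robert.", points, crude_dist))  # True
```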
