cs.washington.edu

Large-Scale Deduplication with Constraints Using Dedupalog

Authors: 
Arasu, Arvind; Ré, Christopher; Suciu, Dan
Year: 
2009
Venue: 
ICDE

We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is "each paper has a unique publication venue"; if two paper references are duplicates, then their associated conference references must be duplicates as well. Our framework supports collective deduplication, meaning that we can dedupe both paper references and conference references collectively in the example above.
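
The venue constraint in the abstract above can be illustrated with a small sketch. This is not Dedupalog syntax and not the authors' algorithm. It is a minimal Python illustration, assuming a plain union-find structure, of how merging two paper references can force their venue references to merge too. All names here (`UnionFind`, `dedupe_with_venue_constraint`, `venue_of`, and the sample references) are hypothetical.

```python
class UnionFind:
    """Disjoint-set structure that tracks clusters of duplicate references."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen references as their own singleton cluster.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def dedupe_with_venue_constraint(paper_matches, venue_of):
    """Merge matched papers and, per the constraint, their venues as well.

    paper_matches: pairs of paper references judged to be duplicates
                   (assumed to come from some upstream matcher).
    venue_of:      maps each paper reference to its venue reference.
    """
    papers, venues = UnionFind(), UnionFind()
    for p1, p2 in paper_matches:
        papers.union(p1, p2)
        # "Each paper has a unique publication venue": duplicate papers
        # imply that their venue references are duplicates too.
        venues.union(venue_of[p1], venue_of[p2])
    return papers, venues


# Hypothetical sample data, for illustration only.
venue_of = {
    "p1": "ICDE",
    "p2": "Intl. Conf. on Data Engineering",
    "p3": "VLDB",
}
papers, venues = dedupe_with_venue_constraint([("p1", "p2")], venue_of)
same_venue = venues.find("ICDE") == venues.find("Intl. Conf. on Data Engineering")
```

Here the single paper match between `p1` and `p2` also places the two venue strings in one cluster, while `VLDB` stays separate. The sketch propagates the constraint in one direction only and treats every match as certain, which is a simplification of the collective setting the abstract describes.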

Object identification with attribute-mediated dependences

Authors: 
Singla, P; Domingos, P
Year: 
2005
Venue: 
Proceedings of PKDD-2005

Object identification is the problem of determining whether
different observations correspond to the same object. It occurs in a wide
variety of fields, including vision, natural language, citation matching,
and information integration. Traditionally, the problem is solved separately
for each pair of observations, followed by transitive closure. We
propose solving it collectively, performing simultaneous inference for all
candidate match pairs, and allowing information to propagate from one
candidate match to another via the attributes they have in common. Our

Multi-relational record linkage

Authors: 
Singla, P.; Domingos, P.
Year: 
2004
Venue: 
KDD Workshop on Multi-Relational Data Mining, Seattle, WA, August, 2004

Data cleaning and integration is typically the most expensive step in the KDD process. A key part, known as record linkage or
de-duplication, is identifying which records in a database refer to the
same entities. This problem is traditionally solved separately for each
candidate record pair (followed by transitive closure). We propose to use
instead a multi-relational approach, performing simultaneous inference
for all candidate pairs, and allowing information to propagate from one
candidate match to another via the attributes they have in common. Our

Reference reconciliation in complex information spaces

Authors: 
Dong, X.; Halevy, A.; Madhavan, J.
Year: 
2005
Venue: 
Proceedings of the 2005 ACM SIGMOD international conference on Management of data, 2005

Reference reconciliation is the problem of identifying when different references (i.e., sets of attribute values) in a dataset correspond to the same real-world entity. Most previous literature assumed references to a single class that had a fair number of attributes (e.g., research publications). We consider complex information spaces: our references belong to multiple related classes and each reference may have very few attribute values.
