Evaluation/benchmark

Combining a Logical and a Numerical Method for Data Reconciliation

Fri, 01/15/2010 - 18:04 — sais

Authors:

Saïs, Fatiha; Pernelle, Nathalie; Rousset, Marie-Christine

Year:

2009

Venue:

JoDS - Journal of Data Semantics (LNCS subline, Springer)

The reference reconciliation problem consists in deciding whether different identifiers refer to the same data, i.e. correspond to the same real world entity. In this article we present a reference reconciliation approach which combines a logical method for reference reconciliation called L2R and a numerical one called N2R. This approach exploits the schema and data semantics, which is translated into a set of Horn FOL rules of reconciliation. These rules are used in L2R to infer exact decisions both of reconciliation and non-reconciliation.


Personal Name Matching: New Test Collections and a Social Network based Approach.

Thu, 10/18/2007 - 13:07 — cat

Authors:

Reuther, P

Year:

2006

Venue:

Tech. Report, Univ. Trier

This paper gives an overview of Personal Name Matching. Personal name matching is of great importance for all applications that deal with personal names. The problem with personal names is that they are not unique and sometimes even for one name many variations exist. This leads to the fact that databases on the one hand may have several entries for one and the same person and on the other hand have one entry for many different persons. For the evaluation of Personal Name Matching algorithms test collections are of great

D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution

Thu, 08/30/2007 - 10:09 — cat

Authors:

Benjelloun, O.; Garcia-Molina, H.; Gong, H.; Kawai, H.; Larson, T.E.; Menestrina, D.; Thavisomboon, S.

Year:

2007

Venue:

Proc. ICDCS, 2007

Entity Resolution (ER) matches and merges records that refer to the same real-world entities, and is typically a compute-intensive process due to complex matching functions and high data volumes. We present a family of algorithms, D-Swoosh, for distributing the ER workload across multiple processors. The algorithms use generic match and merge functions, and ensure that new merged records are distributed to processors that may have matching records. We perform a detailed performance evaluation on a testbed of 15 processors.

Domain-independent data cleaning via analysis of entity-relationship graph

Mon, 04/30/2007 - 07:37 — cat

Authors:

Kalashnikov, DV; Mehrotra, S

Year:

2006

Venue:

ACM Transactions on Database Systems (TODS)

In this article, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RelDC) and the traditional techniques is that RelDC analyzes not only object features but also inter-object relationships to improve the disambiguation quality.

On The Accuracy and Completeness of The Record Matching Process

Fri, 04/20/2007 - 13:40 — cat

Authors:

Verykios, VS; Elfeky, MG; Elmagarmid, AK

Year:

2000

Venue:

Proc. 2000 Conf. on Information Quality

The role of data resources in today's business environment is multi-faceted. Primarily, they support the operational needs of an organization or a company. Secondarily, they can be used for decision support and management. The quality of the data, used to support the operational needs, is usually below the quality required for decision support and management.

A method for similarity-based grouping of biological data

Tue, 03/20/2007 - 12:28 — cat

Authors:

Jakoniene, V; Rundqvist, D; Lambrix, P

Year:

2006

Venue:

Proc. DILS06, LNCS 4075

Similarity-based grouping of data entries in one or more data sources is a task underlying many different data management tasks, such as structuring search results, removing redundancy in databases, and data integration. Similarity-based grouping of data entries is not a trivial task in the context of life science data sources, as the stored data is complex, highly correlated and represented at different levels of granularity. The contribution of this paper is two-fold: 1) we propose a method for similarity-based grouping and 2) we show results from test cases.
