Group Linkage

Guided search

Click a term to initiate a search.

Keyword search

Tue, 02/06/2007 - 16:16 — Anonymous

Authors:

On, Byung-Won; Koudas, Nick; Lee, Dongwon; Srivastava, Divesh

Author:

On, B

Koudas, N

Lee, D

Srivastava, D

Year:

2007

Venue:

ICDE

URL:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.7107&rep=rep1&type=pdf

Citations:

Citations range:

10 - 49

Attachment	Size
On2007GroupLinkage.pdf	402.98 KB

Poor quality data is prevalent in databases due to a variety
of reasons, including transcription errors, lack of standards
for recording database fields, etc. To be able to query
and integrate such data, considerable recent work has focused
on the record linkage problem, i.e., determine if two
entities represented as relational records are approximately
the same. Often entities are represented as groups of relational
records, rather than individual relational records,
e.g., households in a census survey consist of a group of persons.
We refer to the problem of determining if two entities
represented as groups are approximately the same as group
linkage.
Intuitively, two groups can be linked to each other if
(i) there is high enough similarity between “matching”
pairs of individual records that constitute the two groups,
and (ii) there is a large fraction of such matching record
pairs. In this paper, we formalize this intuition and propose
a group linkage measure based on bipartite graph matching.
Given a data set consisting of a large number of groups,
efficiently finding groups with a high group linkage similarity
to an input query group requires quickly eliminating the
many groups that are unlikely to be desired matches. To
enable this task, we present simpler group similarity measures
that can be used either during fast pre-processing
steps or as approximations to our proposed group linkage
measure. These measures can be easily instantiated using
SQL, permitting our techniques to be implemented inside
the database system itself. We experimentally validate the
utility of our measures and techniques using a variety of
real and synthetic data sets.

websearch

Data Cleaning publication categorizer

Guided search

Data Cleaning

Data sets

Data type

Paper type

Venue type

Author

Year

mailpart

Citations range

Keyword search

Group Linkage

Related categories

User login