Constraint-Based Entity Matching

Authors: 
Shen, W; Li, X; Doan, AH
Year: 
2005
Venue: 
Proc. National Conf. on Artificial Intelligence (AAAI)
URL: 
http://anhai.cs.uiuc.edu/home/papers/cmediate.pdf
Citations: 
58
Citations range: 
50 - 99
Attachment: 
cmediate[1].pdf (180.49 KB)

Entity matching is the problem of deciding if two given mentions
in the data, such as "Helen Hunt" and "H. M. Hunt",
refer to the same real-world entity. Numerous solutions have
been developed, but they have not considered in depth the
problem of exploiting integrity constraints that frequently exist
in the domains. Examples of such constraints include "a
mention with age two cannot match a mention with salary
200K" and "if two paper citations match, then their authors
are likely to match in the same order". In this paper we describe
a probabilistic solution to entity matching that exploits
such constraints to improve matching accuracy. At the heart
of the solution is a generative model that takes into account
the constraints during the generation process, and provides
well-defined interpretations of the constraints. We describe a
novel combination of EM and relaxation labeling algorithms
that efficiently learns the model, thereby matching mentions
in an unsupervised way, without the need for annotated training
data. Experiments on several real-world domains show
that our solution can exploit constraints to significantly improve
matching accuracy, by 3-12% F-1, and that the solution
scales up to large data sets.
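
To make the idea of a constraint concrete, here is a minimal sketch of how a hard constraint such as the age/salary example above could veto a candidate match. It is not the paper's generative model and does not use EM or relaxation labeling; it only shows a similarity score being overridden by a constraint. The record fields, the thresholds, and the function names (name_similarity, violates_age_salary, match_score) are all assumptions made up for this illustration.

```python
from difflib import SequenceMatcher

# Hypothetical mention records; the field names are illustrative only.
mentions = [
    {"id": 1, "name": "Helen Hunt", "age": 2, "salary": None},
    {"id": 2, "name": "H. M. Hunt", "age": None, "salary": 200_000},
    {"id": 3, "name": "Helen Hunt", "age": None, "salary": None},
]


def name_similarity(a, b):
    """Return a string similarity in [0, 1] between two mention names."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def violates_age_salary(a, b, max_child_age=2, high_salary=200_000):
    """Hard constraint (assumed thresholds): a very young mention cannot
    match a mention with a very high salary."""
    def is_young(m):
        return m["age"] is not None and m["age"] <= max_child_age

    def is_high_earner(m):
        return m["salary"] is not None and m["salary"] >= high_salary

    return (is_young(a) and is_high_earner(b)) or (is_young(b) and is_high_earner(a))


def match_score(a, b):
    """Name similarity, forced to 0.0 when the hard constraint is violated."""
    if violates_age_salary(a, b):
        return 0.0
    return name_similarity(a, b)


if __name__ == "__main__":
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            a, b = mentions[i], mentions[j]
            print(a["id"], b["id"], round(match_score(a, b), 2))
```

In this sketch the pair (1, 2) is vetoed even though the names are similar, while the pair (1, 3) keeps its full similarity score. A soft constraint, such as the author-order example, would instead adjust the score rather than force it to zero.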