An Entity Resolution Framework for Deduplicating Proteins

Authors: 
Lochovsky, L; Topaloglou, T
Year: 
2008
Venue: 
Lecture Notes in Computer Science
URL: 
http://www.springerlink.com/index/g8w144u643570581.pdf
Citations: 
0
Citations range: 
n/a
Attachment: 
Lochovsky2008AnEntityResolutionFrameworkforDeduplicatingProteins.pdf
Size: 
1.13 MB

An important prerequisite to successfully integrating protein data is detecting duplicate records spread across different databases. In this paper, we describe a new framework for protein entity resolution, called PERF, which deduplicates protein mentions using a wide range of protein attributes. A mention refers to any recorded information about a protein, whether it is derived from a database, a high-throughput study, or literature text mining, among others. PERF can be easily extended to deduplicate protein-protein interactions (PPIs) as well. This framework translates mentions into instances of a reference schema to facilitate mention comparisons. PERF also uses "virtual attribute dependencies" to "enhance" mentions with additional attribute values. PERF computes a likelihood measure based upon the textual value similarity of mention attributes. A prototype implementation of the framework was tested, and these tests indicate that PERF can clearly separate duplicate mentions from non-duplicate mentions.
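The abstract describes two steps: translating mentions into a reference schema, then scoring mention pairs by the textual similarity of their attribute values. These steps can be illustrated with a minimal sketch. It is not PERF's implementation. The field mapping, the attribute names, the use of Python's difflib ratio as the similarity function, and the unweighted average are all assumptions made for illustration, and virtual attribute dependencies are not modeled.

```python
from difflib import SequenceMatcher

# Hypothetical mapping from source-specific field names to a shared
# reference schema. These names are illustrative, not taken from the paper.
FIELD_MAP = {
    "protein_name": "name",
    "desc": "name",
    "gene": "gene_symbol",
    "gene_symbol": "gene_symbol",
    "species": "organism",
    "organism": "organism",
}


def to_reference_schema(record):
    """Rename known fields to reference-schema attributes and normalize values."""
    mention = {}
    for field, value in record.items():
        attr = FIELD_MAP.get(field)
        if attr and value:
            mention[attr] = str(value).strip().lower()
    return mention


def attribute_similarity(a, b):
    """Textual similarity in [0, 1]; difflib stands in for the paper's measure."""
    return SequenceMatcher(None, a, b).ratio()


def duplicate_score(m1, m2):
    """Average similarity over the attributes both mentions share (assumed scoring)."""
    shared = sorted(set(m1) & set(m2))
    if not shared:
        return 0.0
    return sum(attribute_similarity(m1[k], m2[k]) for k in shared) / len(shared)


# Two records from hypothetical sources describing the same protein,
# and a third describing a different one.
a = to_reference_schema(
    {"protein_name": "Cellular tumor antigen p53", "gene": "TP53", "species": "Homo sapiens"}
)
b = to_reference_schema(
    {"desc": "cellular tumor antigen p53", "gene_symbol": "tp53", "organism": "Homo sapiens"}
)
c = to_reference_schema(
    {"desc": "Insulin", "gene_symbol": "INS", "organism": "Mus musculus"}
)

print(duplicate_score(a, b))
print(duplicate_score(a, c))
```

In this sketch, a pair would be flagged as a duplicate when its score exceeds a chosen threshold. The threshold, like the rest of the scoring, would need to be tuned on labeled data.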