Learning metadata from the evidence in an on-line citation matching scheme

Councill, Isaac G.; Li, Huajing; Zhuang, Ziming; Debnath, Sandip; Bolelli, Levent; Lee, Wang-Chien; Sivasubramaniam, Anand; Giles, C. Lee
Joint Conference on Digital Libraries 2006 (JCDL 2006): 276-285, 2006
Citations range: 10 - 49

Citation matching, or the automatic grouping of bibliographic
references that refer to the same document, is a data management
problem faced by automatic digital libraries for scientific
literature such as CiteSeer and Google Scholar. Although several
solutions have been offered for citation matching in large
bibliographic databases, these solutions typically require
expensive batch clustering operations that must be run offline.
Large digital libraries containing citation information can reduce
maintenance costs and provide new services through efficient
online processing of citation data, resolving document citation
relationships as new records become available. Additionally,
information found in citations can be used to supplement
document metadata. Doing so requires generating a canonical
citation record by merging variant citation subfields into a
unified “best guess” from which to draw information. Citation
information must also be merged with other information sources
to provide a complete document record. This paper outlines
a system and algorithms for online citation matching and
canonical metadata generation. A Bayesian framework is
employed to build the ideal citation record for a document; the
framework carries the added advantages of fusing information
from disparate sources and increasing system resilience to
erroneous data.
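
The two ideas in the abstract, matching citations as they arrive and merging variant fields into a canonical record, can be illustrated with a minimal sketch. This is not the paper's algorithm, which the abstract does not specify. The matching key (a normalized title), the per-source "reliability" value, the uniform error model, and all function and class names below are assumptions made for illustration only.

```python
import math
import re
from collections import defaultdict


def match_key(citation):
    """Hypothetical matching key: lowercased title with punctuation removed."""
    title = re.sub(r"[^a-z0-9 ]", "", citation["title"].lower())
    return " ".join(title.split())


class OnlineMatcher:
    """Assigns each citation to a cluster when it arrives, with no batch pass."""

    def __init__(self):
        self.clusters = defaultdict(list)

    def add(self, citation):
        key = match_key(citation)
        self.clusters[key].append(citation)
        return key


def canonical_value(observations):
    """Return the most probable value for one field.

    observations is a list of (value, reliability) pairs. Reliability is the
    assumed probability, strictly between 0 and 1, that the source reports the
    true value. Errors are assumed to fall uniformly on the other candidates.
    """
    candidates = sorted({value for value, _ in observations})
    if len(candidates) == 1:
        return candidates[0]
    wrong = len(candidates) - 1

    def log_posterior(candidate):
        # Uniform prior over candidates, so only the likelihood terms matter.
        return sum(
            math.log(r if value == candidate else (1.0 - r) / wrong)
            for value, r in observations
        )

    return max(candidates, key=log_posterior)


def canonical_record(cluster, fields=("title", "year", "venue")):
    """Build a 'best guess' record field by field from a cluster of variants."""
    record = {}
    for field in fields:
        obs = [(c[field], c.get("reliability", 0.7)) for c in cluster if c.get(field)]
        if obs:
            record[field] = canonical_value(obs)
    return record


matcher = OnlineMatcher()
variants = [
    {"title": "Learning metadata from the evidence in an on-line citation matching scheme", "year": 2006},
    {"title": "Learning Metadata From the Evidence in an On-line Citation Matching Scheme.", "year": 2005},
    {"title": "Learning metadata from the evidence in an on-line citation matching scheme", "year": 2006},
]
keys = [matcher.add(v) for v in variants]
best = canonical_record(matcher.clusters[keys[0]])
```

In this sketch all three variants land in the same cluster because their normalized titles agree, and the year reported by two of the three equally reliable sources is chosen. A single highly reliable source can still outweigh a majority of less reliable ones, which is one simple way such a framework can resist erroneous data.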