Citation Matching

Duplicate record identification in bibliographic databases

Goyal, P
Information Systems

This study presents the applicability of an automatically generated code for use in duplicate detection in bibliographic databases. It is shown that the methods generate a large percentage of unique codes, and that the code is short enough to be useful. The code would prove to be particularly useful in identifying duplicates when records are added to the database.
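To make the idea of a short, automatically generated match code concrete, here is a minimal sketch in Python. It is not the coding scheme from the study, which this abstract does not describe; the function name `match_key`, the choice of fields (surname, year, title initials), and the example record are all assumptions made for illustration.

```python
import re
import unicodedata


def match_key(author_surname, year, title, n_words=4):
    """Build a short, hypothetical match key for a bibliographic record.

    Assumed scheme (not from the cited study): first four letters of the
    normalized surname + year + initials of the first n_words title words.
    """
    def norm(text):
        # Strip accents, lowercase, and drop everything except
        # ASCII letters, digits, and spaces.
        text = unicodedata.normalize("NFKD", text)
        text = text.encode("ascii", "ignore").decode("ascii").lower()
        return re.sub(r"[^a-z0-9 ]", "", text)

    words = norm(title).split()
    initials = "".join(w[0] for w in words[:n_words])
    return f"{norm(author_surname)[:4]}{year}{initials}"


# Two made-up spellings of the same made-up record yield the same key,
# so an incoming record can be flagged as a likely duplicate.
key_a = match_key("Müller", 2001, "An Example Title: Part One")
key_b = match_key("MULLER", 2001, "An example title - part one")
print(key_a, key_a == key_b)
```

A key like this is cheap to compute when a record is added, so it can be looked up in an index before insertion. Records that share a key would still need a closer comparison, since distinct works can collide on such a short code.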

Efficient clustering of high-dimensional data sets with application to reference matching

McCallum, A; Nigam, K; Ungar, LH
Proc. 6th ACM SIGKDD conf.

Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once, for example, having millions of data points that exist in many thousands of dimensions representing many thousands of clusters.

Search engine driven author disambiguation

Tan, YF; Kan, MY; Lee, D
Proc. 6th ACM/IEEE-CS joint conf. on Digital Libraries

In scholarly digital libraries, author disambiguation is an important task that attributes a scholarly work to specific authors. This is critical when individuals share the same name. We present an approach to this task that analyzes the results of automatically crafted web searches. A key observation is that pages from rare web sites are a stronger source of evidence than pages from common web sites, which we model as Inverse Host Frequency (IHF). Our system is able to achieve an average accuracy of 0.836.
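The observation about rare versus common web sites can be sketched as a weighting scheme. The abstract does not give the formula for IHF, so the version below simply assumes it mirrors inverse document frequency, with hosts in place of terms and search queries in place of documents. The function name and the example URLs are hypothetical.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def inverse_host_frequency(result_urls_per_query):
    """Return an IDF-style weight per host (assumed form, not the paper's).

    result_urls_per_query: list of lists of URLs, one inner list per query.
    A host returned for every query gets weight 0; rarer hosts score higher.
    """
    n_queries = len(result_urls_per_query)
    host_counts = Counter()
    for urls in result_urls_per_query:
        # Count each host at most once per query, as IDF counts documents.
        hosts = {urlparse(u).netloc.lower() for u in urls}
        host_counts.update(hosts)
    return {h: math.log(n_queries / c) for h, c in host_counts.items()}


# Made-up search results for three queries.
results = [
    ["http://aggregator.example/a", "http://lab.example/people/x"],
    ["http://aggregator.example/b"],
    ["http://aggregator.example/c", "http://dept.example/staff/y"],
]
weights = inverse_host_frequency(results)
print(weights)
```

Under this assumed weighting, two citations that both lead to a page on a rarely seen host, such as a personal or lab site, count as stronger evidence of a shared author than two citations that both lead to a large aggregator.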

Large-Scale Citation Matching of Scientific Digital Libraries

Lee, D.; Kang, J.; Mitra, P.; Giles, C. Lee; On, B.-W.

In scientific Digital Libraries, citations play an important role such as locating relevant research or estimating the impact of an article. Therefore, to avoid the so-called “garbage-in garbage-out” problem, the quality of citations must be maintained to its utmost degree. Despite the advancement in DLs, however, the maintenance of citations faces new challenges. In this paper, we present four new scenarios where matching, linking, and integrating citations becomes a challenge. Then, we discuss a few proposals to cope with the challenges. Although

Are Your Citations Clean? New Scenarios and Challenges in Maintaining Digital Libraries

Lee, D; Kang, J; Mitra, P; Giles, CL; On, BW

In many scientific-publication digital libraries (DLs) such as CiteSeer, arXiv e-Print, DBLP, or Google Scholar,

Effective and scalable solutions for mixed and split citation problems in digital libraries

Lee, Dongwon; On, Byung-Won; Kang, Jaewoo; Park, Sanghyun
Information Quality in Informational Systems

In this paper, we consider two important problems that commonly occur in bibliographic digital libraries and seriously degrade their data quality: the Mixed Citation (MC) problem (i.e., citations of different scholars whose names are homonyms are mixed together) and the Split Citation (SC) problem (i.e., citations of the same author appear under different name variants). In particular, we investigate an effective yet scalable solution, since citations in such digital libraries tend to be large-scale.
