Hardening soft information sources

Authors: 
Cohen, W. W.; Kautz, H.; McAllester, D.
Year: 
2000
Venue: 
Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '00)
URL: 
http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=347141
Citations: 
91

Abstract: 
The web contains a large quantity of unstructured information. In many cases it is possible to heuristically extract structured information, but the resulting databases are "soft": they contain inconsistencies and duplication, and lack unique, consistently used object identifiers. Examples include large bibliographic databases harvested from raw scientific papers or databases constructed by merging heterogeneous "hard" databases. Here we formally model a soft database as a noisy version of some unknown hard database. We then consider the hardening problem, i.e., the problem of inferring the most likely underlying hard database given a particular soft database. A key feature of our approach is that hardening is global: many sources of evidence for a given hard fact are taken into account. We formulate hardening as an optimization problem and give a nontrivial, nearly linear-time algorithm for finding a local optimum.
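The abstract does not spell out the objective, so the following Python toy is a rough, assumption-laden sketch rather than the authors' formulation: it treats hardening as greedy local search over identifier merges, where each distinct hard fact costs one unit and each merge costs a made-up similarity-based price. All weights, the token-overlap similarity, and the merge-only neighborhood are hypothetical choices for illustration.

```python
# Toy sketch of "hardening" as greedy local search (illustration only).
# Assumed cost = (one unit per distinct hard fact) + (a similarity-based
# price per identifier merge). Nothing here is the paper's actual model.

from itertools import combinations

def canon(mapping, x):
    """Follow merge links to the canonical identifier for x."""
    while x in mapping:
        x = mapping[x]
    return x

def harden(facts, mapping):
    """The hard database induced by a co-reference mapping: map each
    soft fact through the mapping and deduplicate."""
    return {tuple(canon(mapping, x) for x in f) for f in facts}

def merge_cost(a, b):
    """Hypothetical co-reference prior: merging token-similar strings
    is cheap, merging dissimilar ones is expensive."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    overlap = len(ta & tb) / max(min(len(ta), len(tb)), 1)
    return 0.1 + 1.5 * (1.0 - overlap)

def greedy_hardening(soft_facts):
    """Repeatedly apply the single best cost-reducing merge; stop at a
    local optimum (the paper also settles for a local optimum, but via
    a far more efficient algorithm than this quadratic pair scan)."""
    mapping, paid = {}, 0.0
    ids = sorted({x for f in soft_facts for x in f})
    while True:
        best, best_cost = None, len(harden(soft_facts, mapping)) + paid
        for a, b in combinations(ids, 2):
            ca, cb = canon(mapping, a), canon(mapping, b)
            if ca == cb:
                continue  # already co-referent
            trial = dict(mapping)
            trial[cb] = ca
            c = len(harden(soft_facts, trial)) + paid + merge_cost(a, b)
            if c < best_cost:
                best, best_cost = (cb, ca, merge_cost(a, b)), c
        if best is None:
            return harden(soft_facts, mapping), mapping
        cb, ca, mc = best
        mapping[cb] = ca
        paid += mc

# Example: two noisy spellings of one author, plus a distinct coauthor.
soft = [
    ("wrote", "W. Cohen", "hardening paper"),
    ("wrote", "William Cohen", "hardening paper"),
    ("wrote", "H. Kautz", "hardening paper"),
]
hard, mapping = greedy_hardening(soft)
print(hard)  # the two Cohen spellings collapse; Kautz stays distinct
```

With these invented weights, merging the two spellings of "Cohen" pays off because it removes a duplicated fact, while merging dissimilar names never does; this mirrors the global flavor of the approach, in which a merge is justified by all the soft evidence it explains rather than by string similarity alone.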