The web contains a large quantity of unstructured information. In many cases, it is possible to heuristically extract structured information, but the resulting databases
are "soft": they contain inconsistencies and duplication, and
lack unique, consistently used object identifiers. Examples include large bibliographic databases harvested from raw scientific papers, or databases constructed by merging heterogeneous "hard" databases. Here we formally model a soft database as a noisy version of some unknown hard database. We then consider the hardening problem, i.e., the problem of inferring the most likely underlying hard database given a particular soft database. A key feature of our approach is that hardening is global: many sources of evidence for
a given hard fact are taken into account. We formulate hardening as an optimization problem and give a nontrivial nearly linear time algorithm for finding a local optimum.
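To make the setup concrete, the following is a minimal toy sketch, not the paper's algorithm: it poses hardening as a cost-minimization over assignments of soft records to candidate hard facts (here, the records themselves), where each distinct hard fact incurs a fixed model cost and each soft record incurs a fit cost to its assigned fact. The token-overlap distance, unit fact cost, and greedy local search are all illustrative assumptions; the paper's actual cost model and nearly linear time algorithm differ.

```python
def distance(a, b):
    """Crude token-overlap distance between two record strings (an
    illustrative assumption, not the paper's similarity measure)."""
    norm = lambda s: set(s.lower().replace(",", " ").replace(".", " ").split())
    ta, tb = norm(a), norm(b)
    return 1.0 - len(ta & tb) / max(len(ta | tb), 1)

def harden(soft):
    """Toy local search: each soft record i is explained by hard fact
    assign[i] (an index into soft); cost = (# distinct hard facts used)
    + (sum of fit distances). Greedily reassign records while the total
    cost decreases, returning a local optimum."""
    n = len(soft)
    assign = list(range(n))  # start: every record is its own hard fact

    def total_cost(a):
        return len(set(a)) + sum(distance(soft[i], soft[a[i]])
                                 for i in range(n))

    best = total_cost(assign)
    improved = True
    while improved:
        improved = False
        for i in range(n):
            for j in range(n):
                if assign[i] == j:
                    continue
                trial = assign[:]
                trial[i] = j  # try explaining record i by fact j
                c = total_cost(trial)
                if c < best:
                    assign, best = trial, c
                    improved = True
    return assign, best

records = [
    "J Smith, Data Mining, 2000",
    "John Smith. Data Mining. 2000",
    "A Jones, Networks, 1999",
]
assignment, cost = harden(records)
# The two near-duplicate Smith records collapse onto one hard fact,
# because dropping a hard fact saves more than the added fit cost.
```

Note the global flavor even in this toy: merging two records is accepted only if the saving from one fewer hard fact outweighs the total fit penalty, so every soft record bearing on a fact influences the decision.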