inf.unibz.it

Cleansing Databases of Misspelled Proper Nouns

Authors: Mazeika, A.; Bohlen, M.H.
Year: 2006
Venue: Clean DB, 2006

The paper presents a data cleansing technique for
string databases. We propose and evaluate an
algorithm that identifies a group of strings that
consists of (multiple) occurrences of a correctly
spelled string plus nearby misspelled strings. All
strings in a group are replaced by the most frequent
string of this group. Our method targets
proper noun databases, including names and addresses,
which are not handled by dictionaries.
At the technical level we give an efficient solution
for computing the center of a group of strings
and determining the border of the group. We use inverse
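
The replacement step described in the abstract (every string in a group is replaced by the group's most frequent string) can be illustrated with a minimal sketch. The grouping rule used here is an assumption made for illustration only: strings are visited from most to least frequent, and each unassigned string within a fixed Levenshtein distance of a group's center joins that group. This is not the paper's center and border computation, which the truncated abstract does not spell out. The names `edit_distance`, `cleanse`, and `max_dist` are hypothetical.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cleanse(strings, max_dist=1):
    """Replace each string by the most frequent string of its group.

    Assumption (not from the paper): a group is a frequent string plus
    all not-yet-assigned strings within max_dist edits of it.
    """
    counts = Counter(strings)
    # Most frequent first; ties broken alphabetically for determinism.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    replacement = {}
    for center in ordered:
        if center in replacement:
            continue  # already absorbed by a more frequent string
        replacement[center] = center
        for other in ordered:
            if other not in replacement and edit_distance(center, other) <= max_dist:
                replacement[other] = center
    return [replacement[s] for s in strings]


names = ["Bolzano"] * 3 + ["Bolzamo", "Bozano", "Merano", "Merano", "Meramo"]
cleaned = cleanse(names)
```

With this input, the misspellings "Bolzamo" and "Bozano" are mapped to "Bolzano" and "Meramo" to "Merano". A fixed distance threshold is the simplest possible stand-in for a group border; it will merge distinct but similar proper nouns if `max_dist` is set too high.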
