Cleansing Databases of Misspelled Proper Nouns

Authors: 
Mazeika, A.; Bohlen, M.H.
Year: 
2006
Venue: 
Clean DB, 2006
URL: 
http://pike.psu.edu/cleandb06/papers/CameraReady_120.pdf
Citations: 
0
Citations range: 
n/a
Attachment:
Mazeika2006CleansingDatabasesof.pdf (140.56 KB)

The paper presents a data cleansing technique for
string databases. We propose and evaluate an
algorithm that identifies a group of strings that
consists of (multiple) occurrences of a correctly
spelled string plus nearby misspelled strings. All
strings in a group are replaced by the most frequent
string of this group. Our method targets
proper noun databases, including names and addresses,
which are not handled by dictionaries.
At the technical level, we give an efficient solution
for computing the center of a group of strings
and determining the border of the group. We use inverse
strings together with sampling to efficiently
identify groups and cleanse a database. The experimental
evaluation shows that, for proper nouns, the center
calculation and border detection algorithms
are robust, and even very small sample sizes yield
good results.
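
The core idea of the abstract, replacing every string in a group by the
most frequent string of that group, can be illustrated with a minimal
sketch. This is not the authors' algorithm: it does not use inverse
strings, sampling, or their center and border computations. It assumes
plain Levenshtein distance and a fixed, hypothetical `radius` parameter
in place of the automatically detected border, and the names
`edit_distance` and `cleanse` are invented for illustration.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cleanse(strings, radius=2):
    """Replace each string by the most frequent string of its group.

    Assumption (not from the paper): groups are formed greedily around
    the most frequent strings, and `radius` is a fixed stand-in for the
    group border.
    """
    counts = Counter(strings)
    centers = []   # strings chosen as group representatives
    mapping = {}   # distinct string -> representative
    # Visit distinct strings from most to least frequent; ties alphabetical.
    for s, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        nearest = min(centers, key=lambda c: edit_distance(s, c), default=None)
        if nearest is not None and edit_distance(s, nearest) <= radius:
            mapping[s] = nearest   # s falls inside an existing group
        else:
            centers.append(s)      # s starts a new group
            mapping[s] = s
    return [mapping[s] for s in strings]


names = ["Bolzano"] * 3 + ["Bolzno", "Bolzanno"] + ["Zurich"] * 2 + ["Zurih"]
cleaned = cleanse(names)
```

With this input, the two misspellings of "Bolzano" and the one of
"Zurich" are each within edit distance 1 of their frequent neighbor, so
`cleaned` contains only the two correctly spelled names. A fixed radius
is the main weakness of the sketch: too small and misspellings are
missed, too large and distinct proper nouns are merged, which is the
problem the paper's border detection is meant to address.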