A comparison of string distance metrics for name-matching tasks

Authors: 
Cohen, WW; Ravikumar, P; Fienberg, SE
Year: 
2003
Venue: 
Proceedings of the IJCAI-2003 Workshop on Information
URL: 
http://www.isi.edu/info-agents/workshops/ijcai03/papers/Cohen-p.pdf
Citations: 
1091
Citations range: 
1000s
Attachment: 
Cohen2003Acomparisonofstringdistance.pdf (393.29 KB)

Using an open-source Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of metrics proposed by different research communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall, the best-performing method is a hybrid scheme that combines TFIDF weighting, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme, which was developed in the probabilistic record linkage community.
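
The hybrid idea described in the abstract can be sketched roughly as follows. This is a minimal Python illustration, not the authors' Java toolkit: the function names, the smoothed IDF term log(1 + N/df), the threshold theta = 0.9, and the choice to weight each match by the closest token in the second name are all assumptions made for this example. Each token of the first name is paired with its most similar token in the second name under Jaro-Winkler, and pairs above the threshold contribute the product of their TFIDF weights and their similarity.

```python
import math
from collections import Counter


def jaro(s, t):
    """Jaro similarity in [0, 1] between two strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    # Half the number of matched characters that appear out of order.
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, p=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix chars."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def tfidf_vector(tokens, doc_freq, n_docs):
    """Unit-length TFIDF weights; unseen tokens are treated as df = 1 (assumption)."""
    counts = Counter(tokens)
    raw = {w: math.log(c + 1) * math.log(1 + n_docs / doc_freq.get(w, 1))
           for w, c in counts.items()}
    norm = math.sqrt(sum(x * x for x in raw.values()))
    return {w: x / norm for w, x in raw.items()} if norm else {}


def soft_tfidf(s_tokens, t_tokens, doc_freq, n_docs, theta=0.9):
    """Hybrid score: TFIDF weights combined with Jaro-Winkler token similarity."""
    vs = tfidf_vector(s_tokens, doc_freq, n_docs)
    vt = tfidf_vector(t_tokens, doc_freq, n_docs)
    if not vs or not vt:
        return 0.0
    score = 0.0
    for w, weight_s in vs.items():
        # Closest token in the second name and its similarity to w.
        best, sim = max(((v, jaro_winkler(w, v)) for v in vt),
                        key=lambda pair: pair[1])
        if sim > theta:
            score += weight_s * vt[best] * sim
    return score


if __name__ == "__main__":
    # Hypothetical toy corpus of tokenized names, used only for document frequencies.
    names = [["william", "cohen"], ["wiliam", "cohen"], ["pradeep", "ravikumar"]]
    df = Counter(w for name in names for w in set(name))
    print(soft_tfidf(names[0], names[1], df, len(names)))
    print(soft_tfidf(names[0], names[2], df, len(names)))
```

In this sketch a misspelled token such as "wiliam" still contributes to the score because its Jaro-Winkler similarity to "william" exceeds the threshold, whereas a plain TFIDF cosine would treat the two tokens as unrelated.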