A grammar-based entity representation framework for data cleaning

Authors: 
Arasu, Arvind; Kaushik, Raghav
Year: 
2009
Venue: 
SIGMOD
DOI: 
http://doi.acm.org/10.1145/1559845.1559871
Citations: 
12
Citations range: 
10 - 49
Attachment: 
A grammar-based entity representation framework for data cleaning.pdf (1.3 MB)

Fundamental to data cleaning is the need to account for multiple data representations. We propose a formal framework for reasoning about and manipulating data representations. The framework is declarative and combines elements of a generative grammar with database querying. It also incorporates actions in the spirit of programming language compilers. The framework has multiple applications, such as parsing and data normalization. Data normalization is interesting in its own right, both for preparing data for analysis and for pre-processing data before further cleansing. We empirically study the utility of the framework over several real-world data cleaning scenarios and find that, with the right normalization, the need for further cleansing is often minimized.
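
To make the idea of combining a grammar rule, a lookup, and an action more concrete, here is a toy sketch in Python. It is not the framework from the paper and does not use its notation. The production `Address -> Number Name+ StreetType`, the `STREET_TYPES` table (standing in for a database relation of known variants), and the `normalize` function are all hypothetical names chosen for illustration.

```python
from typing import Dict, Optional

# Hypothetical table of known variants, standing in for a database relation
# that a rule could query. Keys are lowercase variants, values canonical forms.
STREET_TYPES: Dict[str, str] = {
    "st": "Street", "st.": "Street", "street": "Street",
    "ave": "Avenue", "ave.": "Avenue", "avenue": "Avenue",
}


def normalize(record: str) -> Optional[str]:
    """Apply one illustrative production: Address -> Number Name+ StreetType.

    Returns the normalized string, or None when the record does not
    match the production.
    """
    tokens = record.split()
    if len(tokens) < 3:
        return None

    number, *names, street_type = tokens

    # Number: a run of digits.
    if not number.isdigit():
        return None

    # StreetType: resolved by lookup (the "query" part of the rule).
    canonical_type = STREET_TYPES.get(street_type.lower())
    if canonical_type is None:
        return None

    # Name+: one or more alphabetic tokens.
    if not all(name.isalpha() for name in names):
        return None

    # Action: emit the canonical representation.
    return " ".join([number, *(name.capitalize() for name in names), canonical_type])
```

Under these assumptions, `normalize("12 main st.")` and `normalize("12 MAIN Street")` both yield the same canonical string, so a later duplicate-detection step could compare the two records by plain equality. A record that does not fit the production returns `None`.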