Std.-/normalization

Techniques for automatically correcting words in text

Authors: 
Kukich, K.
Year: 
1992
Venue: 
ACM Computing Surveys (CSUR), 24, 1992

Research aimed at correcting words in text has focused on three progressively more difficult problems: (1) nonword error detection; (2) isolated-word error correction; and (3) context-dependent word correction. In response to the first problem, efficient pattern-matching and n-gram analysis techniques have been developed for detecting strings that do not appear in a given word list. In response to the second problem, a variety of general and application-specific spelling correction techniques have been developed. Some of them were based on detailed studies of spelling error patterns.
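
As a rough illustration of the first of these problems, the sketch below flags tokens that are missing from a word list or that contain character bigrams never seen in that list. The tiny word list and the combination of the two checks are invented for the example; they are not taken from the survey.

```python
# Minimal sketch of nonword error detection via dictionary lookup and
# character-bigram (n-gram) analysis; the tiny word list is illustrative only.
WORD_LIST = {"the", "cat", "sat", "on", "mat", "words", "text"}

# Every character bigram that occurs somewhere in the word list.
KNOWN_BIGRAMS = {w[i:i + 2] for w in WORD_LIST for i in range(len(w) - 1)}

def flag_nonwords(tokens):
    """Return tokens that fail the dictionary check or contain an unseen bigram."""
    flagged = []
    for tok in tokens:
        t = tok.lower()
        bigrams = [t[i:i + 2] for i in range(len(t) - 1)]
        unseen_bigram = any(bg not in KNOWN_BIGRAMS for bg in bigrams)
        if t not in WORD_LIST or unseen_bigram:
            flagged.append(tok)
    return flagged

print(flag_nonwords("teh cat sat on the matx".split()))  # ['teh', 'matx']
```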

Data Transformation for Warehousing Web Data

Authors: 
Zhu, Yan; Bornhovd, Christof; Buchmann, Alejandro P.
Year: 
2001
Venue: 
Third International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems (WECWIS '01), 2001

In order to analyze market trends and make reasonable business plans, a company's local data is not sufficient. Decision making must also be based on information from suppliers, partners and competitors. This external data can be obtained from the Web in many cases, but must be integrated with the company's own data, for example, in a data warehouse. To this end, Web data has to be mapped to the star schema of the warehouse. In this paper we propose a semi-automatic approach to support this transformation process.
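
To make the mapping concrete, here is a small, hypothetical illustration of loading one Web-extracted record into a star schema with one fact table and two dimension tables. The schema, field names, and mapping rules are invented for illustration and are not the paper's semi-automatic approach.

```python
# Hypothetical example: mapping a record extracted from the Web into a
# star schema (fact table plus two dimension tables). Schema and field
# names are invented, not taken from the paper.
web_record = {"competitor": "ACME Corp", "product": "Widget A",
              "price": "19.99", "date": "2001-03-15"}

dim_company = {}   # company name -> surrogate key
dim_product = {}   # product name -> surrogate key
fact_prices = []   # rows of the fact table

def surrogate_key(dim, value):
    """Assign (or reuse) a surrogate key for a dimension value."""
    return dim.setdefault(value, len(dim) + 1)

fact_prices.append({
    "company_key": surrogate_key(dim_company, web_record["competitor"]),
    "product_key": surrogate_key(dim_product, web_record["product"]),
    "date": web_record["date"],
    "price": float(web_record["price"]),   # cast the scraped string to a number
})

print(dim_company, dim_product, fact_prices)
```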

Febrl - Freely extensible biomedical record linkage

Authors: 
Christen, Peter; Churches, Tim
Year: 
2002
Venue: 
ANU Computer Science Technical Reports

This manual describes prototype software called Febrl designed to undertake probabilistic data cleaning (or standardisation) and record linkage. Written in the Python programming language, this software aims to allow health, biomedical and other researchers to clean (standardise) and link data sets of all sizes faster, with less effort and with improved quality.
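
As a rough sketch of what probabilistic record linkage involves (and not Febrl's actual API), the following compares two records field by field and sums Fellegi-Sunter-style log-likelihood weights; the m/u probabilities and the example records are invented for illustration.

```python
import math

# Minimal sketch of probabilistic (Fellegi-Sunter style) record linkage
# scoring. This is NOT Febrl's API; the m/u probabilities are invented.
M_U = {  # field -> (P(agree | true match), P(agree | non-match))
    "surname":   (0.95, 0.01),
    "suburb":    (0.90, 0.05),
    "birthyear": (0.98, 0.02),
}

def match_weight(rec_a, rec_b):
    """Sum log-likelihood ratios over the compared fields."""
    weight = 0.0
    for field, (m, u) in M_U.items():
        if rec_a.get(field) == rec_b.get(field):
            weight += math.log2(m / u)              # agreement weight
        else:
            weight += math.log2((1 - m) / (1 - u))  # disagreement weight
    return weight

a = {"surname": "smith", "suburb": "dickson", "birthyear": "1971"}
b = {"surname": "smith", "suburb": "dikson",  "birthyear": "1971"}
print(match_weight(a, b))  # compare against a chosen threshold to decide a match
```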

Probabilistic Name and Address Cleaning and Standardization

Authors: 
Christen, P.; Churches, T.; Zhu, J.
Year: 
2002
Venue: 
Proceedings of the Australasian Data Mining Workshop, 2002

In the absence of a shared unique key, an ensemble of nonunique personal attributes such as names and addresses is often used to link data from disparate sources. Such data matching is widely used when assembling data warehouses and business mailing lists, and is a foundation of many longitudinal epidemiological and other health related studies. Unfortunately, names and addresses are often captured in non-standard and varying formats, usually with some degree of spelling and typographical errors. It is therefore important that such data is transformed into a clean and …
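
A toy illustration of such standardisation is sketched below: lowercase the string, strip punctuation, and expand a few abbreviations. The correction table is invented and far smaller than a real one, and this simple rule-based pass does not reflect the probabilistic method the title refers to.

```python
import re

# Toy standardisation of an address string: lowercase, strip punctuation,
# and expand a few common abbreviations. The table is illustrative only.
ABBREV = {"rd": "road", "st": "street", "ave": "avenue", "apt": "apartment"}

def standardise(address: str) -> str:
    address = address.lower()
    address = re.sub(r"[^\w\s]", " ", address)      # drop punctuation
    tokens = address.split()
    tokens = [ABBREV.get(t, t) for t in tokens]     # expand known abbreviations
    return " ".join(tokens)

print(standardise("42 Main St., Apt 7"))   # -> "42 main street apartment 7"
```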

Data integration using similarity joins and a word-based information representation language

Authors: 
Cohen, W.W.
Year: 
2000
Venue: 
ACM Transactions on Information Systems (TOIS), 18, 2000

The integration of distributed, heterogeneous databases, such as those available on the World Wide Web, poses many problems. Here we consider the problem of integrating data from sources that lack common object identifiers. A solution to this problem is proposed for databases that contain informal, natural-language “names” for objects; most Web-based databases satisfy this requirement, since they usually present their information to the end-user through a veneer of text.
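
As a minimal sketch of a word-based similarity join (a simplification, not the paper's exact formulation), the snippet below represents each name as a TF-IDF weighted bag of words and pairs each left-hand entry with its most similar right-hand entry by cosine similarity; the example strings are invented.

```python
import math
from collections import Counter

# Word-based similarity join sketch: names become TF-IDF weighted bags of
# words, and each left entry is paired with its closest right entry.
left  = ["ACME Corporation", "Widget Works Inc"]
right = ["The ACME Corp.", "Widget Works Incorporated"]

def tokenize(s):
    return s.lower().replace(".", "").split()

docs = [tokenize(s) for s in left + right]
df = Counter(t for d in docs for t in set(d))       # document frequency
N = len(docs)

def tfidf(tokens):
    tf = Counter(tokens)
    vec = {t: tf[t] * math.log(N / df[t]) for t in tf}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

for a in left:
    scores = [(cosine(tfidf(tokenize(a)), tfidf(tokenize(b))), b) for b in right]
    sim, best = max(scores)
    # a real similarity join would keep every pair whose score exceeds a threshold
    print(f"{a!r} -> {best!r}  (cosine similarity {sim:.2f})")
```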

Declarative data cleaning: Language, model, and algorithms

Authors: 
Galhardas, H; Florescu, D; Shasha, D; Simon, E; Saita, C.
Year: 
2001
Venue: 
Proc. VLDB 2001

The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. This holds regardless of the application - relational database joining, web-related, or scientific. In all cases, existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. The main challenge is the design and implementation of a dataflow graph that effectively and efficiently generates clean data. Needed improvements to the current …
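
For a sense of what such a dataflow can look like, the sketch below chains a few invented cleaning operators over toy records and finishes with duplicate elimination; it is only an illustration, not the declarative language or operators the paper proposes.

```python
# Illustrative data-cleaning dataflow: a pipeline of small transformations
# applied in order to each record, followed by duplicate elimination.
# Operators and records are invented.
records = [
    {"name": "  Ada LOVELACE ", "city": "london", "age": "36"},
    {"name": "Ada Lovelace",    "city": "London", "age": "thirty-six"},
]

def normalise_name(r):
    r["name"] = " ".join(r["name"].split()).title()
    return r

def normalise_city(r):
    r["city"] = r["city"].title()
    return r

def validate_age(r):
    r["age"] = int(r["age"]) if r["age"].isdigit() else None  # reject bad values
    return r

PIPELINE = [normalise_name, normalise_city, validate_age]

clean = []
for rec in records:
    for step in PIPELINE:       # each record flows through every operator
        rec = step(rec)
    clean.append(rec)

# duplicate elimination on the cleaned key completes the flow
seen, deduped = set(), []
for r in clean:
    key = (r["name"], r["city"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)
print(deduped)
```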
