(Almost) Hands-Off Information Integration for the Life Sciences

Leser, Ulf; Naumann, Felix
Conference on Innovative Data Systems Research (CIDR) 2005

Data integration in complex domains, such as the life sciences, involves either manual data curation, offering highest information quality at highest price, or follows a schema integration and mapping approach, leading to moderate information quality at a moderate price. We suggest a radically different integration approach, called ALADIN, for the life sciences application domain. The predominant feature of the ALADIN system is an architecture that allows almost automatic integration of new data sources into the system, i.e., it offers data integration at almost no cost.

Declarative Data Fusion - Syntax, Semantics, and Implementation

Bleiholder, Jens; Naumann, Felix
Advances in Databases and Information Systems (ADBIS) 2005

In today’s integrating information systems, data fusion, i.e., the merging of multiple tuples about the same real-world object into a single tuple, is left to ETL tools and other specialized software. While much attention has been paid to architecture, query languages, and query execution, the final step of actually fusing data from multiple sources into a consistent and homogeneous set is often ignored.
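
The notion of data fusion defined in this abstract (merging several tuples about the same real-world object into one) can be illustrated with a minimal sketch. This is not the syntax or implementation proposed in the paper. The function names (fuse, coalesce, newest), the per-attribute resolver mapping, and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def coalesce(values):
    """Default conflict resolution: first non-missing value, else None."""
    for value in values:
        if value is not None:
            return value
    return None


def newest(values):
    """Example resolver: largest non-missing value, else None."""
    present = [value for value in values if value is not None]
    return max(present) if present else None


def fuse(tuples, key, resolvers):
    """Merge all tuples that share a key value into a single tuple.

    `resolvers` maps an attribute name to a function that picks one
    value from the conflicting candidates (assumption: one resolver
    per attribute, falling back to `coalesce`).
    """
    groups = defaultdict(list)
    for record in tuples:
        groups[record[key]].append(record)

    fused = []
    for key_value, group in groups.items():
        merged = {key: key_value}
        attributes = {attr for record in group for attr in record if attr != key}
        for attr in sorted(attributes):
            candidates = [record.get(attr) for record in group]
            merged[attr] = resolvers.get(attr, coalesce)(candidates)
        fused.append(merged)
    return fused


# Hypothetical input: two sources describe object 1 with gaps and a conflict.
records = [
    {"id": 1, "name": "A. Smith", "year": 2003, "city": None},
    {"id": 1, "name": None, "year": 2005, "city": "Berlin"},
    {"id": 2, "name": "B. Jones", "year": 2004, "city": "Potsdam"},
]
result = fuse(records, "id", {"year": newest})
```

Here the missing values for object 1 are filled from whichever tuple has them, and the conflicting year is resolved by the resolver assigned to that attribute. A declarative approach would express the same choices in the query rather than in application code.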

Data Fusion in Three Steps: Resolving Schema, Tuple, and Value Inconsistencies

Naumann, Felix; Bilke, Alexander; Bleiholder, Jens; Weis, Melanie
IEEE Data Engineering Bulletin 29(2):21-31

Heterogeneous and dirty data is abundant. It is stored under different, often opaque schemata, it represents identical real-world objects multiple times, causing duplicates, and it has missing values and conflicting values. Without suitable techniques for integrating and fusing such data, the data quality of an integrated system remains low. We present a suite of methods, combined in a single tool, that allows ad-hoc, declarative fusion of such data by employing schema matching, duplicate detection and data

A Duplicate Detection Benchmark for XML (and Relational) Data

Weis, M.; Naumann, F.; Brosy, F.
Proc. Workshop on Information Quality for Information Systems (IQIS)

Duplicate detection, which is an important subtask of data cleaning, is the task of identifying multiple representations of the same real-world object. Numerous approaches exist for both relational and XML data. They aim either to improve the quality of the detected duplicates (effectiveness) or to save computation time (efficiency). In particular for the first goal, the “goodness” of an approach is usually evaluated based on experimental studies. Although some methods and data sets have gained popularity, it is still difficult to compare different approaches or to
