Bioinformatics

Some methods for blindfolded record linkage

Authors: 
Churches, T; Christen, P
Year: 
2004
Venue: 
BMC Medical Informatics and Decision Making

The linkage of records which refer to the same entity in separate data collections is a common requirement in public health and biomedical research. Traditionally, record linkage techniques have required that all the identifying data in which links are sought be revealed to at least one party, often a third party. This necessarily invades personal privacy and requires complete trust in the intentions of that party and their ability to maintain security and confidentiality.

An Entity Resolution Framework for Deduplicating Proteins

Authors: 
Lochovsky, L; Topaloglou, T
Year: 
2008
Venue: 
Lecture Notes in Computer Science

An important prerequisite to successfully integrating protein data is detecting duplicate records spread across different databases.

Biological data cleaning: a case study

Authors: 
Herbert, KG; Wang, JTL
Year: 
2007
Venue: 
International Journal of Information Quality

As databases become more pervasive through the biological sciences, various data quality concerns are emerging. Biological databases tend to develop data quality issues regarding data legacy, data uniformity and data duplication. Due to the nature of this data, each of these problems is non-trivial and can cause many problems for the database. For biological data to be corrected and standardised, methods and frameworks must be developed to handle both structural and traditional data. This paper discusses issues concerning biological data quality with respect to data cleaning.

BIO-AJAX: an extensible framework for biological data cleaning

Authors: 
Herbert, KG; Gehani, NH; Piel, WH; Wang, JTL; Wu, CH
Year: 
2004
Venue: 
ACM SIGMOD Record

As databases become more pervasive through the biological sciences, various data quality issues regarding data legacy, data uniformity and data duplication arise. Due to the nature of this data, each of these problems is non-trivial. For biological data to be corrected and standardized, new methods and frameworks must be developed. This paper proposes one such framework, called BIO-AJAX, which uses principles from data cleaning to improve data quality in biological information systems, specifically in TreeBASE.

A method for similarity-based grouping of biological data

Authors: 
Jakoniene, V; Rundqvist, D; Lambrix, P
Year: 
2006
Venue: 
Proc. DILS06, LNCS 4075

Similarity-based grouping of data entries in one or more data sources is a task underlying many different data management tasks, such as structuring search results, removal of redundancy in databases and data integration. Similarity-based grouping of data entries is not a trivial task in the context of life science data sources as the stored data is complex, highly correlated and represented at different levels of granularity. The contribution of this paper is two-fold. 1) We propose a method for similarity-based grouping and 2) we show results from test cases.

Erkennen und Bereinigen von Datenfehlern in naturwissenschaftlichen Daten

Authors: 
Müller, H; Weis, M; Bleiholder, J; Leser, U
Year: 
2005
Venue: 
Datenbankspektrum, Vol. 15

Because of the way they are produced, scientific data often carry a high degree of uncertainty. When data from different sources are integrated, these uncertainties, together with the manifold syntactic and semantic heterogeneity in how the data are represented, lead to conflicts that result in reduced quality of the integrated data set. Although conflicts can often only be resolved definitively by domain experts, the work of these experts can and must be supported by suitable tools.

Febrl - Freely extensible biomedical record linkage

Authors: 
Christen, Peter; Churches, Tim
Year: 
2002
Venue: 
ANU Computer Science Technical Reports

This manual describes prototype software called Febrl designed to undertake probabilistic data cleaning (or standardisation) and record linkage. Written in the Python programming language, this software aims to allow health, biomedical and other researchers to clean (standardise) and link data sets of all sizes faster, with less effort and with improved quality.
