Data quality is a serious concern in every data management application,
and a variety of quality measures have been proposed, including
accuracy, freshness and completeness, to capture the common
sources of data quality degradation. We identify and focus
attention on a novel measure, column heterogeneity, that seeks to
quantify the data quality problems that can arise when merging data
from different sources. We identify desiderata that a column heterogeneity
measure should intuitively satisfy, and discuss a promising
direction of research for quantifying database column heterogeneity
based on a novel combination of cluster entropy and soft clustering.
Finally, we present a few preliminary experimental results,
using diverse data sets of semantically different types, to demonstrate
that this approach appears to provide a robust mechanism for
identifying and quantifying database column heterogeneity.
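
As a rough illustration of how cluster entropy might be computed over a soft clustering, consider the minimal sketch below. It is not the method developed in this work. It assumes that each column value has already been assigned a probability distribution over clusters by some soft clustering step, which is not shown, and it takes the entropy of the resulting average cluster distribution as a heterogeneity score. The function name `cluster_entropy` and the `memberships` input format are hypothetical.

```python
import math


def cluster_entropy(memberships):
    """Entropy (in bits) of the average cluster distribution.

    memberships: a list of rows, one per column value. Each row is a
    probability distribution over the same k clusters (assumed to be
    the output of a soft clustering step that is not shown here).
    """
    n = len(memberships)
    k = len(memberships[0])
    # Marginal probability of each cluster: the mean membership over all values.
    marginal = [sum(row[c] for row in memberships) / n for c in range(k)]
    # Shannon entropy of the marginal, skipping zero-probability clusters.
    return -sum(p * math.log2(p) for p in marginal if p > 0)


# Hypothetical columns, described by soft memberships over two clusters.
homogeneous = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
mixed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

print(cluster_entropy(homogeneous))
print(cluster_entropy(mixed))
```

Under these assumptions, a column whose values all fall in one cluster scores zero, and a column split evenly between two clusters scores one bit, so a higher score indicates a more heterogeneous column.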