Duplicate record elimination in large data files

Authors: 
Bitton, D.; DeWitt, D.J.
Year: 
1983
Venue: 
ACM Transactions on Database Systems (TODS), 8, 1983
URL: 
http://portal.acm.org/citation.cfm?id=319987&dl=
Citations: 
208
Citations range: 
100 - 499
Attachment: 
Bitton1983Duplicaterecordeliminationin.pdf (727.81 KB)

This paper addresses duplicate elimination in large data files that may contain many occurrences of the same record. It presents a comprehensive cost analysis of the duplicate elimination operation, based on a combinatorial model for estimating the size of the intermediate runs produced by a modified merge-sort procedure. This modified merge-sort, which discards duplicates as sorted runs are generated and merged, is shown to significantly outperform the standard technique of sorting the file and then making a sequential pass to locate duplicate records. The results can also provide critical input to the query optimizer of a relational database system.
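
A minimal in-memory sketch of the core idea, not the authors' implementation: duplicates are dropped both while sorted runs are formed and while the runs are merged, so later phases process shorter runs and no separate post-sort pass is needed. The Python code, the run size, and the sample records are illustrative; the paper concerns external merge-sort of large files on disk.

import heapq

def make_runs(records, run_size):
    """Phase 1: sort fixed-size chunks into runs, dropping
    duplicates within each run as it is written out."""
    runs = []
    for i in range(0, len(records), run_size):
        chunk = sorted(records[i:i + run_size])
        run = [chunk[0]]
        for r in chunk[1:]:
            if r != run[-1]:        # adjacent equal keys are duplicates
                run.append(r)
        runs.append(run)
    return runs

def merge_runs(runs):
    """Phase 2: merge the sorted runs, emitting each distinct
    record once, instead of sorting fully and then scanning."""
    out = []
    for r in heapq.merge(*runs):
        if not out or r != out[-1]:
            out.append(r)
    return out

records = [3, 1, 3, 2, 5, 1, 4, 2, 5, 3]
print(merge_runs(make_runs(records, 4)))   # [1, 2, 3, 4, 5]

Because duplicates vanish during run generation, the intermediate runs shrink; estimating how much they shrink is what the paper's combinatorial model quantifies, and that estimate drives the cost analysis.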