Top-k Set Similarity Joins

Thu, 09/03/2009 - 14:14 — koepcke

Authors:

Xiao, Chuan; Wang, Wei; Lin, Xuemin; Shang, Haichuan

Year:

2009

Venue:

ICDE

Citations range:

50 - 99

Attachment	Size
Top-k Set Similarity Joins.pdf	338.49 KB

Similarity join is a useful primitive operation underlying many applications, such as near-duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require a user to specify a similarity threshold. In this paper, we study a variant of the similarity join, termed top-k set similarity join. It returns the top-k pairs of records ranked by their similarities, thus eliminating the guesswork users have to perform when the similarity threshold is unknown beforehand. An algorithm, topk-join, is proposed to answer top-k similarity joins efficiently. It is based on the prefix filtering principle and employs tight upper bounding of similarity values of unseen pairs. Experimental results demonstrate the efficiency of the proposed algorithm on large-scale real datasets.
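To make the problem statement concrete, the following is a minimal brute-force sketch of what a top-k set similarity join returns. It is not the paper's topk-join algorithm: it uses no prefix filtering and no upper bounding, and simply scores every pair. The choice of Jaccard similarity and the names `jaccard` and `topk_pairs_bruteforce` are assumptions made for illustration only.

```python
import heapq
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (assumed measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def topk_pairs_bruteforce(records, k):
    """Return the k most similar record pairs as (similarity, i, j), best first.

    Naive baseline: scores all n*(n-1)/2 pairs, so it is only practical
    for small inputs. No similarity threshold is required from the caller.
    """
    sets = [set(r) for r in records]
    scored = (
        (jaccard(sets[i], sets[j]), i, j)
        for i, j in combinations(range(len(sets)), 2)
    )
    return heapq.nlargest(k, scored)


# Hypothetical toy records, each a set of tokens.
records = [
    ["a", "b", "c", "d"],
    ["a", "b", "c", "e"],
    ["a", "b", "c", "d", "f"],
    ["p", "q"],
]
result = topk_pairs_bruteforce(records, 2)
for sim, i, j in result:
    print(f"records {i} and {j}: similarity {sim:.2f}")
```

The caller supplies only `k`, never a threshold, which is the usability point the abstract makes. The quadratic pair enumeration in this baseline is the cost that a filtering-based approach such as the one described above would aim to avoid.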

cse.unsw.edu.au
