- Track:
- Data Mining
- Paper Title:
- Scaling Up All Pairs Similarity Search
- Authors:
- Roberto Bayardo (Google)
- Yiming Ma (U. California Irvine)
- Ramakrishnan Srikant (Google)

- Abstract:
- Given a large collection of sparse vector data in a high dimensional space, we investigate the problem of finding all pairs of vectors whose similarity score (as determined by a function such as cosine distance) is above a given threshold. We propose novel optimization and indexing techniques for this problem, resulting in an algorithm that is both faster and simpler than the previous state-of-the-art approaches. We demonstrate the effectiveness of our algorithm on the public DBLP dataset, and on two real-world web applications: generating recommendations for the Orkut social network, and computing pairs of similar queries from search snippet data among the 5 million most frequently issued Google queries. Our algorithm is between 5 times to 20 times faster than previous algorithms on these datasets.
- Slot:
- Alberta, Friday, May 11, 2007, 10:30am to 12 noon.