Fastest Code for Minwise Hashing (can be thousand times faster)
The Idea of Densification
Traditional Minwise Hashing is slow. Applications require computing thousands of minhashes for dimensionality reduction, compression, etc. Computing MinHashes needs thousands of passes over the data which is unnecessarily expensive. We show single pass and constant time minwiseh hashing algorithm which can be thousands of time faster than the fastest MinHash code. We use densification for fast minwise hashing over sparse data.
Red-Green Sampling
For weighted minwise hashing it is possible to construct a novel red-green map which can mimic the minwise sampling distribution requiring only 1/sparsity number of random numbers. We obtain a tremendous (60000x) speedup on many real tasks.
Related Papers
Simple and Efficient Weighted Minwise Hashing [pdf]
Anshumali Shrivastava. (NIPS 2016)
Optimal Densification for Fast and Accurate Minwise Hashing [pdf]
Anshumali Shrivastava. (ICML 2017)
[Fastest unweighted Minwise hashing. Has better variance and same speed as ICML 14 and UAI 14]
Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search [pdf]
Anshumali Shrivastava and Ping Li. (ICML 2014)
Improved Densification of One Permutation Hashing [pdf]
Anshumali Shrivastava and Ping Li. (UAI 2014)
Download Code
Website by: Pallavi
Contact
Anshumali at rice dot edu
3118 Duncan Hall
Rice University