Fastest Code for Minwise Hashing (can be thousand times faster)

The Idea of Densification

 

Traditional Minwise Hashing is slow. Applications require computing thousands of minhashes for dimensionality reduction, compression, etc. Computing MinHashes needs thousands of passes over the data which is unnecessarily expensive. We show single pass and constant time minwiseh hashing algorithm which can be thousands of time faster than the fastest MinHash code. We use densification for fast minwise hashing over sparse data.

Red-Green Sampling

 

For weighted minwise hashing it is possible to construct a novel red-green map which can mimic the minwise sampling distribution requiring only 1/sparsity number of random numbers. We obtain a tremendous (60000x) speedup on many real tasks.

Related Papers

 

Simple and Efficient Weighted Minwise Hashing [pdf]

Anshumali Shrivastava. (NIPS 2016)

 

Optimal Densification for Fast and Accurate Minwise Hashing [pdf]

Anshumali Shrivastava. (ICML 2017)

[Fastest unweighted Minwise hashing. Has better variance and same speed as ICML 14 and UAI 14]

 

Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search [pdf]

Anshumali Shrivastava and Ping Li. (ICML 2014)

 

Improved Densification of One Permutation Hashing [pdf]

Anshumali Shrivastava and Ping Li. (UAI 2014)

Download Code

Website by: Pallavi

Contact

Anshumali at rice dot edu

3118 Duncan Hall

Rice University