'PySpark MLLib APproximate nearest neighbour search for multiple keys

I want to use ANN from PySpark. I have a DataFrame of 100K keys for which I want to perform top-10 ANN searches on an already transformed Spark DataFrame. But it seems that API of BucketedRandomProjectionLSH expects only one key at a time. I also want to avoid using approxSimilarityJoin, because it only allows you to set a threshold, but this would lead to a variable k for each key (it also fails in my case saying that for some records there are no NNs for a given threshold).

Currently, the best thing I came up with is .collect() the keys and call approxNearestNeighbors in a for loop on the driver, but it is terribly inefficient.

Does anyone know how I can get top-10 ANN searches for my 100K keys in parallel?

Thank you.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source