'Efficient nearest neighbour search on streaming data

I'm looking to harness the magic of Python to perform a nearest neighbour (NN) search on a dataset that will grow over time, for example because new entries arrive through streaming. On static data, a NN search is doable as shown below. To start off the example, e and f are two sets of static data points. For each entry in e we need to know which one in f is nearest to it. It's simple enough to do:

e=pd.DataFrame({'lbl':['a','b','c'],'x':[1.5,2,2.5],'y':[1.5,3,2]})
f=pd.DataFrame({'x':[2,2],'y':[2,1.5]})
from sklearn.neighbors import BallTree as bt
def runtree():
    tree=bt(f[['x','y']],2)
    dy,e['nearest']=tree.query(e[['x','y']],1)
    return e

runtree() returns e with the index of the nearest data point in the final column:

    lbl     x   y   nearest
0   a   1.5     1.5     1
1   b   2.0     3.0     0
2   c   2.5     2.0     0

Now, let's treat f as a dataframe that will grow over time, and add a new record to it: f.loc[2]=[2.5]+[1.75]

When running runtree() again, the record with lbl=c is closer to the new entry (the bottom right entry shows index=2 now). Before the new entry was added, the same record was closest to index 0 (see above):

    lbl     x   y   nearest
0   a   1.5     1.5     1
1   b   2.0     3.0     0
2   c   2.5     2.0     2

The question is, is there a way to get this final result without rerunning runtree() for all the records in e, but instead refreshing only the ones that are relatively close to the new entry we've added in f? In the example it would be great to have a way to know that only the final row needs to be refreshed, without running all the rows to find out.

To put this into context: in the example for e above, we have three records in two dimensions. A real world example could have millions of records in more than two dimensions. It seems inefficient to rerun all the calculations every time that a new record arrives in f. A more efficient method might factor in that some of the entries in e are nowhere near the new one in f so they should not need updating.

It might be possible to delve into Euclidean distance maths, but my sense is that all the heavy lifting has already been done in packages like BallTree.

Does a package exist that can do what is needed here, on growing rather than static data, without lifting the bonnet on some serious math?

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source

'Efficient nearest neighbour search on streaming data

Sources

Related Questions