How to exclude the point itself in Sklearn NearestNeighbors?
I have data for 400,000 customers, each with 40 attributes. The DataFrame looks like:
A1 A2 ... A40
0 xx xx ... xx
1 xx xx ... xx
2 xx xx ... xx
... ...
399,999 xx xx ... xx
I first standardize the data with sklearn's StandardScaler, which gives the processed data X_data.
So now we have 400,000 customers (points/vectors), each with 40 dimensions.
I then used NearestNeighbors to find the 5 nearest points for each point. So far so good.
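For reference, the setup described above can be sketched roughly as follows. This is a minimal reconstruction, not the original code: the random data, the array name `raw`, the reduced size of 200 rows, and the deliberately duplicated rows are all assumptions made for illustration. Only `X_data`, `StandardScaler`, and `NearestNeighbors` come from the description.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the customer table: 200 rows, 40 attributes.
rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 40))

# Make rows 0, 1 and 2 exact duplicates so that zero-distance ties occur.
raw[1] = raw[0]
raw[2] = raw[0]

# Standardize, then ask for 6 neighbours (the point itself plus 5 others).
X_data = StandardScaler().fit_transform(raw)
nn = NearestNeighbors(n_neighbors=6).fit(X_data)

# Passing X_data explicitly treats every row as a new query point, so each
# point can show up in its own neighbour list at distance 0.
distances, indices = nn.kneighbors(X_data)

print(distances[:3])
print(indices[:3])
```

With duplicated rows like these, the order among the zero-distance entries is not guaranteed, so a point's own index need not land in the first column.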
But there is a little problem with the results.
The results contain the point itself, and it appears in an unpredictable position, not always the first one.
The result looks like:
(
[[0,0.04,0.06,0.09,0.1,0.12], ---case a
[0,0.01,0.05,0.07,0.08,0.09], ---case b
[0,0,0,0.04,0.05,0.06,0.08], ---case c
...
[0,0,0,0,0,0], ---case d
[0,0.06,0.07,0.09,0.1,0.12], ---case e
[0,0.01,0.03,0.05,0.07,0.08]], ---case f
[[0,2143,14134,54253,242425,3423], ---case a
[1,43242,132,34324,31234,44355], ---case b
[343245,32113,2,32435,23451,54131], ---case c
...
[231413,21597,74958,7923,13988,98137], ---case d
[399998,13145,54361,48831,94813,41873], ---case e
[399999,88213,43431,31414,42313,87481]] ---case f
)
The first item of the tuple is the distance array; the second is the index array of the 6 nearest points. Each row has 6 elements because I originally assumed that removing the first column (the point itself) would leave the 5 nearest neighbors.
As you can see, cases a, b, e and f are fine: the first element is the point itself, and the corresponding distance is 0.
But in case c there are three points at distance 0, so index 2 does not appear in the first position but in the third.
And in case d there are so many points at distance 0 that index 399997 does not even appear among the 6 nearest points.
So how can I remove the point itself from the 6 nearest points? If every case were like a, b, e and f, I could simply drop the first column of the index array. But the point itself appears at an unpredictable position, and sometimes it does not show up at all. Any ideas?
Solution 1:[1]
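Judging from the second array of the tuple, the rows are in the same order as the rows of the original DataFrame, so row i holds the neighbors of point i. For each row of the index array, remove the element that equals the row's own index, then remove the element at the same position from the matching row of the distance array.
# tuple_with_items is the (distances, indices) tuple returned by kneighbors.
# Copy the rows into plain lists, which support .index() and del.
distances = [list(row) for row in tuple_with_items[0]]
indices = [list(row) for row in tuple_with_items[1]]
for idx_example, example in enumerate(indices):
    try:
        # Position of the point itself within its own neighbour list.
        idx_element = example.index(idx_example)
    except ValueError:
        # The point itself is missing, so every listed neighbour is a
        # zero-distance tie; drop the first one.
        idx_element = 0
    del example[idx_element]
    del distances[idx_example][idx_element]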
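As an alternative to post-processing, scikit-learn's `kneighbors` can be called without a query array. In that case it returns the neighbors of each fitted point and, per its documentation, does not count a point as its own neighbor. The sketch below assumes a small random `X_data` with ten identical rows as a stand-in for the real data; the helper names `own_index` and `self_hits` are illustrative only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for the standardized data: 200 points, 40 dimensions.
rng = np.random.default_rng(0)
X_data = rng.normal(size=(200, 40))

# Rows 0..9 become identical, so there are more zero-distance ties than
# requested neighbours (similar to "case d" in the question).
X_data[1:10] = X_data[0]

# Ask for exactly 5 neighbours; no extra column is needed.
nn = NearestNeighbors(n_neighbors=5).fit(X_data)

# No argument: the fitted points are the queries, and each point is
# excluded from its own neighbour list.
distances, indices = nn.kneighbors()

# Sanity check: count rows whose neighbour list contains the row's own index.
own_index = np.arange(len(X_data))[:, None]
self_hits = int((indices == own_index).sum())
print(indices.shape, self_hits)
```

This avoids requesting a sixth neighbor and deleting a column afterwards. Duplicated points still appear as each other's neighbors at distance 0, which is usually what is wanted, since they are distinct customers.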
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
