Comparing each value against all other values in a variable to locate the nearest neighbor in pandas
I'm trying to figure out how to find the nearest neighbor in pandas by comparing every observation to every other observation. I don't want to do this with a for loop, since with large datasets it seems better to work with vectorized operations than with loops.
Below, let's say that the cholesterol values are the final squared distances based off of other variables. How would I compute the Euclidean distance from each person's cholesterol values to every other person's values (that is, the sum of (xi - yi)^2), and then find the minimum distance, so that a fifth column returns each person's nearest neighbor and a sixth column shows whether or not that neighbor has heart disease? For instance, the nearest neighbor of Braund should be Allen, since (7.25 - 8.05)^2 + (9.45 - 9.23)^2 ... is the smallest value.
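For reference, the distance described above can be written out as follows. The notation is introduced here for illustration and is not from the original post: x_{ik} stands for the value of person i in the k-th cholesterol column, and nn(i) for the nearest neighbor of person i.

```latex
d(i, j) = \sqrt{\sum_{k=1}^{3} \left(x_{ik} - x_{jk}\right)^2},
\qquad
\mathrm{nn}(i) = \arg\min_{j \neq i} \, d(i, j)

% Worked example, Braund vs. Allen (squared distance):
d^2 = (7.25 - 8.05)^2 + (9.45 - 9.23)^2 + (8 - 9.35)^2
    = 0.64 + 0.0484 + 1.8225
    = 2.5109
```

Because the square root is monotonic, minimizing the sum of squared differences picks the same neighbor as minimizing the Euclidean distance itself.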
The answer to the question "Nearest neighbor matching in Pandas" is close, but it deals with merging data frames rather than comparing values within the vectors of a single data frame.
import pandas as pd

d = {'name': ['Braund', 'Cummings', 'Heikkinen', 'Allen'],
     'cholesterol_1': [7.25, 71.83, 0, 8.05],
     'cholesterol_2': [9.45, 28.23, 1, 9.23],
     'cholesterol_3': [8, 37.83, 1, 9.35],
     'heart_disease': ['Y', 'Y', 'N', 'Y']}
df = pd.DataFrame(d)
Solution 1:[1]
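Here's one way, using scipy to calculate the pairwise Euclidean distances and numpy to work on the resulting symmetric matrix:
import numpy as np
from scipy.spatial.distance import cdist

# Pairwise Euclidean distances between rows, using only the cholesterol columns
arr = df.filter(like='cholesterol')
euclidean_dist = cdist(arr, arr, metric='euclidean')
# Set the diagonal to infinity so that no row is matched with itself
np.fill_diagonal(euclidean_dist, np.inf)
# argmin returns row positions; they match the labels because df has the default RangeIndex
df['nearest_neighbor'] = df.loc[euclidean_dist.argmin(axis=1), 'name'].to_numpy()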
Output (shown with name set as the index):
           cholesterol_1  cholesterol_2  cholesterol_3 heart_disease nearest_neighbor
name
Braund              7.25           9.45           8.00             Y            Allen
Cummings           71.83          28.23          37.83             Y            Allen
Heikkinen           0.00           1.00           1.00             N           Braund
Allen               8.05           9.23           9.35             Y           Braund
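The question also asks for a column with the neighbor's heart disease status, which the code above does not add. One possible extension is sketched below. It rebuilds the sample frame so that it runs on its own; the column names neighbor_heart_disease and neighbor_distance are made up for illustration, and .iloc is used as an assumption-free way to look up rows by position even if the index is not the default one.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

df = pd.DataFrame({
    'name': ['Braund', 'Cummings', 'Heikkinen', 'Allen'],
    'cholesterol_1': [7.25, 71.83, 0, 8.05],
    'cholesterol_2': [9.45, 28.23, 1, 9.23],
    'cholesterol_3': [8, 37.83, 1, 9.35],
    'heart_disease': ['Y', 'Y', 'N', 'Y'],
})

# Pairwise Euclidean distances over the cholesterol columns only
features = df.filter(like='cholesterol').to_numpy()
dist = cdist(features, features, metric='euclidean')
np.fill_diagonal(dist, np.inf)  # exclude each row's match with itself

# Row position of each observation's nearest neighbor
nearest_pos = dist.argmin(axis=1)

# Look up the neighbor's attributes by position (hypothetical column names)
df['nearest_neighbor'] = df['name'].iloc[nearest_pos].to_numpy()
df['neighbor_heart_disease'] = df['heart_disease'].iloc[nearest_pos].to_numpy()
df['neighbor_distance'] = dist.min(axis=1)
```

Note that cdist builds the full n-by-n distance matrix, so memory grows quadratically with the number of rows; for very large frames a tree-based neighbor search may be a better fit.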
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
