How to update a NumPy array based on a pandas DataFrame
I have a numpy array with thousands of rows and columns, and I'm wondering how to update each value based on the values in a pandas DataFrame.
For example, let's say my array contains a list of years (here's an incredibly small sample just to give you the basic idea):
[[2020, 2015, 2017],
[2015, 2016, 2016],
[2019, 2018, 2020]]
I want to replace each value in the array with the "Lat" value for the matching "Year". So if my pandas DataFrame looks like this:
| Year | Lat | Lon |
|---|---|---|
| 2020 | 37.2 | 103.45 |
| 2019 | 46.1 | 107.82 |
| 2018 | 35.2 | 101.45 |
| 2017 | 38.6 | 110.62 |
| 2016 | 29.1 | 112.73 |
| 2015 | 33.8 | 120.92 |
Then the output array should look like:
[[37.2, 33.8, 38.6],
[33.8, 29.1, 29.1],
[46.1, 35.2, 37.2]]
If my dataset were truly this small, it wouldn't be a problem, but considering I have millions of values in the array and thousands of values in the DataFrame, I'm a little overwhelmed on how to go about this efficiently.
Update:
Perhaps my question might be a bit more complicated than I anticipated. Rather than matching up the years, I'm matching up GPS time, so the numbers don't match up as nicely. Is there a way to take a number in the array and match it up to the closest value in the DataFrame column? In reality, my array would look more like this:
[[2019.99, 2015.2, 2017.1],
[2015.33, 2016.01, 2015.87],
[2019.2, 2018.3, 2020.00]]
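For the closest-value case described in the update, one possible approach is to sort the DataFrame by its key column and use np.searchsorted to find the two neighbouring keys for each array value, then keep whichever is closer. This is only a sketch under assumptions: the names times, keys, vals and nearest are illustrative, the key column is assumed to have at least two distinct values, and ties are arbitrarily resolved toward the lower key.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year': [2020, 2019, 2018, 2017, 2016, 2015],
    'Lat': [37.2, 46.1, 35.2, 38.6, 29.1, 33.8],
})
times = np.array([[2019.99, 2015.2, 2017.1],
                  [2015.33, 2016.01, 2015.87],
                  [2019.2, 2018.3, 2020.00]])

# Sort the lookup table by key so it can be binary-searched.
lookup = df.sort_values('Year')
keys = lookup['Year'].to_numpy(dtype=float)
vals = lookup['Lat'].to_numpy()

flat = times.ravel()
# Position of the first key >= each value, clipped so that both a left
# and a right neighbour always exist.
right = np.clip(np.searchsorted(keys, flat), 1, len(keys) - 1)
left = right - 1
# Keep whichever neighbour is closer (ties go to the lower key).
nearest = np.where(flat - keys[left] <= keys[right] - flat, left, right)

result = vals[nearest].reshape(times.shape)
print(result)
```

With the sample data this reproduces the output array shown earlier, because each fractional value snaps to the year it is closest to. The work is one sort of the DataFrame keys plus one binary search per array element, with no Python-level loop over the array.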
Solution 1:[1]
np.unique finds the unique values in the years array; with return_inverse=True it also returns the indices needed to rebuild the input array from those unique values.
Combine this with set_index and reindex to build a Series of Lat values ordered like the unique years, and convert it with to_numpy. Indexing that array with the inverse indices from np.unique then selects the Lat value for every element, and a final reshape restores the original array shape.
u, inv = np.unique(years, return_inverse=True)
result = (
    df.set_index('Year')['Lat'].reindex(u).to_numpy()[inv].reshape(years.shape)
)
result:
[[37.2 33.8 38.6]
[33.8 29.1 29.1]
[46.1 35.2 37.2]]
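Results from np.unique (inv is shown flattened; newer NumPy versions may return it with the same shape as years, which does not affect the final result):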
u, inv = np.unique(years, return_inverse=True)
u=array([2015, 2016, 2017, 2018, 2019, 2020])
inv=array([5, 0, 2, 0, 1, 1, 4, 3, 5])
The Lat column with the Year as the index:
df.set_index('Year')['Lat']
Year
2020 37.2
2019 46.1
2018 35.2
2017 38.6
2016 29.1
2015 33.8
Name: Lat, dtype: float64
reindexed to match the order from np.unique:
df.set_index('Year')['Lat'].reindex(u)
Year
2015 33.8
2016 29.1
2017 38.6
2018 35.2
2019 46.1
2020 37.2
Name: Lat, dtype: float64
NumPy indexing to select from this new Series:
df.set_index('Year')['Lat'].reindex(u).to_numpy()[inv]
array([37.2, 33.8, 38.6, 33.8, 29.1, 29.1, 46.1, 35.2, 37.2])
The final reshape to match the initial input years array dimensions:
df.set_index('Year')['Lat'].reindex(u).to_numpy()[inv].reshape(years.shape)
array([[37.2, 33.8, 38.6],
[33.8, 29.1, 29.1],
[46.1, 35.2, 37.2]])
Setup:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'Year': [2020, 2019, 2018, 2017, 2016, 2015],
    'Lat': [37.2, 46.1, 35.2, 38.6, 29.1, 33.8],
    'Lon': [103.45, 107.82, 101.45, 110.62, 112.73, 120.92]
})
years = np.array([[2020, 2015, 2017],
                  [2015, 2016, 2016],
                  [2019, 2018, 2020]])
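One caveat with the reindex approach is worth checking on real data: reindex fills in NaN for any year that appears in the array but has no row in the DataFrame, so missing keys show up as NaN in the result rather than raising an error. The small example below is a sketch with made-up data (2014 is deliberately absent from df). It also flattens inv with ravel() so that it behaves the same whether np.unique returns the inverse as a flat array or shaped like the input.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Year': [2020, 2019, 2018], 'Lat': [37.2, 46.1, 35.2]})
years = np.array([[2020, 2014],
                  [2018, 2019]])  # 2014 has no row in df

u, inv = np.unique(years, return_inverse=True)
lat_by_year = df.set_index('Year')['Lat'].reindex(u).to_numpy()
# ravel() makes the indexing independent of the shape np.unique gives inv.
result = lat_by_year[inv.ravel()].reshape(years.shape)

# True wherever the array held a year that the DataFrame does not contain.
missing = np.isnan(result)
print(result)
print(missing)
```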
Solution 2:[2]
You're essentially mapping each key in one column to its value in another. One idea is to use boolean indexing to locate all the elements equal to a given key, then replace them all at once. This takes one pass over the array for each key-value pair in the DataFrame.
Example:
import numpy as np
import pandas as pd
a = np.array([
    [2020, 2015, 2017],
    [2015, 2016, 2016],
    [2019, 2018, 2020],
])
b = np.zeros(a.shape, dtype=float)
df = pd.DataFrame({
    'Year': [2020, 2019, 2018, 2017, 2016, 2015],
    'Lat': [37.2, 46.1, 35.2, 38.6, 29.1, 33.8],
})
# Replace every element equal to key k with its Lat value v.
for k, v in df.set_index('Year')['Lat'].to_dict().items():
    b[a == k] = v
print(b)
# output:
# [[37.2 33.8 38.6]
# [33.8 29.1 29.1]
# [46.1 35.2 37.2]]
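Because b starts out as zeros, any array value that has no matching key in the DataFrame silently stays 0.0. A possible variation, shown here only as a sketch with made-up data (2014 is deliberately missing from df), is to start from NaN so unmatched elements are easy to spot afterwards:

```python
import numpy as np
import pandas as pd

a = np.array([[2020, 2015],
              [2014, 2019]])  # 2014 has no row in df
df = pd.DataFrame({'Year': [2020, 2019, 2015], 'Lat': [37.2, 46.1, 33.8]})

# Start from NaN instead of zeros so untouched elements stand out.
b = np.full(a.shape, np.nan)
for k, v in df.set_index('Year')['Lat'].items():
    b[a == k] = v

# True wherever no key in the DataFrame matched the array value.
unmatched = np.isnan(b)
print(b)
print(unmatched)
```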
Solution 3:[3]
In one line:
df.set_index('Year').Lat.loc[arr.flatten()].to_numpy().reshape(arr.shape)
If you're going to do multiple operations like this you should call set_index() just once, perhaps with inplace=True if you want to modify the existing DataFrame rather than create a new one.
After that it's just a matter of giving loc a 1D array which it can use for efficient lookup of the Lat values, then reshaping the result to match the original arr.
This is similar to d.b's answer, but massively more efficient because it does not use Python for loops.
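A self-contained version of the above might look like the following sketch. It assumes arr is the array of years from the question and uses the illustrative names indexed, lat and lon; set_index is called once and the indexed frame is reused for both columns. Note that, in recent pandas versions, loc raises a KeyError if any value in arr is missing from the index, so this form assumes every year in the array has a row in the DataFrame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year': [2020, 2019, 2018, 2017, 2016, 2015],
    'Lat': [37.2, 46.1, 35.2, 38.6, 29.1, 33.8],
    'Lon': [103.45, 107.82, 101.45, 110.62, 112.73, 120.92],
})
arr = np.array([[2020, 2015, 2017],
                [2015, 2016, 2016],
                [2019, 2018, 2020]])

# Build the Year index once and reuse it for every lookup.
indexed = df.set_index('Year')
flat = arr.ravel()

# loc accepts a 1D array of labels; reshape restores the 2D layout.
lat = indexed['Lat'].loc[flat].to_numpy().reshape(arr.shape)
lon = indexed['Lon'].loc[flat].to_numpy().reshape(arr.shape)
print(lat)
print(lon)
```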
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Henry Ecker |
| Solution 2 | jfaccioni |
| Solution 3 | John Zwinck |
