How to update numpy array based on pandas DataFrame

I have a numpy array with thousands of rows and columns, and I'm wondering how to update each value based on the values in a pandas DataFrame.

For example, let's say my array contains a list of years (here's an incredibly small sample just to give you the basic idea):

[[2020, 2015, 2017],
 [2015, 2016, 2016],
 [2019, 2018, 2020]]

I want to replace each value in the array with the "Lat" value for the matching "Year". So if my pandas DataFrame looks like this:

Year	Lat	Lon
2020	37.2	103.45
2019	46.1	107.82
2018	35.2	101.45
2017	38.6	110.62
2016	29.1	112.73
2015	33.8	120.92

Then the output array should look like:

[[37.2, 33.8, 38.6],
 [33.8, 29.1, 29.1],
 [46.1, 35.2, 37.2]]

If my dataset were truly this small, it wouldn't be a problem, but considering I have millions of values in the array and thousands of values in the DataFrame, I'm a little overwhelmed on how to go about this efficiently.

Update:

Perhaps my question might be a bit more complicated than I anticipated. Rather than matching up the years, I'm matching up GPS time, so the numbers don't match up as nicely. Is there a way to take a number in the array and match it up to the closest value in the DataFrame column? In reality, my array would look more like this:

[[2019.99, 2015.2, 2017.1],
 [2015.33, 2016.01, 2015.87],
 [2019.2, 2018.3, 2020.00]]

Solution 1:^[1]

np.unique can be used to find the unique values in the years array. With return_inverse=True it also returns the indices needed to reconstruct the input array from those unique values.

We can use this in conjunction with set_index and reindex to create a Series of values that can be converted to_numpy. Then the results of the indices from np.unique can be used with this array of latitude values to select the necessary values. A final reshape can be used to get the array in the correct form.

u, inv = np.unique(years, return_inverse=True)
result = (
    df.set_index('Year')['Lat'].reindex(u).to_numpy()[inv].reshape(years.shape)
)

result:

[[37.2 33.8 38.6]
 [33.8 29.1 29.1]
 [46.1 35.2 37.2]]

Results from np.unique:

u, inv = np.unique(years, return_inverse=True)

u=array([2015, 2016, 2017, 2018, 2019, 2020])
inv=array([5, 0, 2, 0, 1, 1, 4, 3, 5])

The Lat column with the Year as the index:

df.set_index('Year')['Lat']

Year
2020    37.2
2019    46.1
2018    35.2
2017    38.6
2016    29.1
2015    33.8
Name: Lat, dtype: float64

reindexed to match the order from np.unique:

df.set_index('Year')['Lat'].reindex(u)

Year
2015    33.8
2016    29.1
2017    38.6
2018    35.2
2019    46.1
2020    37.2
Name: Lat, dtype: float64

NumPy indexing to select from this new Series:

df.set_index('Year')['Lat'].reindex(u).to_numpy()[inv]

array([37.2, 33.8, 38.6, 33.8, 29.1, 29.1, 46.1, 35.2, 37.2])

The final reshape to match the initial input years array dimensions:

df.set_index('Year')['Lat'].reindex(u).to_numpy()[inv].reshape(years.shape)

array([[37.2, 33.8, 38.6],
       [33.8, 29.1, 29.1],
       [46.1, 35.2, 37.2]])

Setup:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year': [2020, 2019, 2018, 2017, 2016, 2015],
    'Lat': [37.2, 46.1, 35.2, 38.6, 29.1, 33.8],
    'Lon': [103.45, 107.82, 101.45, 110.62, 112.73, 120.92]
})

years = np.array([[2020, 2015, 2017],
                  [2015, 2016, 2016],
                  [2019, 2018, 2020]])
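
The chained expression can also be wrapped in a small helper. The sketch below is an illustration rather than part of the original answer: the function name lookup_by_key and its parameters are hypothetical. It also shows a property of reindex that matters here: keys absent from the DataFrame come back as NaN instead of raising an error.

```python
import numpy as np
import pandas as pd

def lookup_by_key(arr, df, key_col, value_col):
    """Map each element of arr to df[value_col] via df[key_col] (hypothetical helper)."""
    # Unique keys plus the positions needed to rebuild the input from them.
    u, inv = np.unique(arr, return_inverse=True)
    # One lookup per unique key; keys missing from df become NaN.
    values = df.set_index(key_col)[value_col].reindex(u).to_numpy()
    # Expand back to one value per element and restore the input shape.
    return values[inv].reshape(arr.shape)

df = pd.DataFrame({
    'Year': [2020, 2019, 2018, 2017, 2016, 2015],
    'Lat': [37.2, 46.1, 35.2, 38.6, 29.1, 33.8],
})
years = np.array([[2020, 2015, 2017],
                  [2015, 2016, 2016],
                  [2019, 2018, 2020]])

result = lookup_by_key(years, df, 'Year', 'Lat')

# 1999 is not a Year in df, so its slot is NaN.
with_missing = lookup_by_key(np.array([[2020, 1999]]), df, 'Year', 'Lat')
```

Because the lookup runs once per unique key, the pandas work scales with the number of distinct years rather than with the size of the array.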

Solution 2:^[2]

You're essentially mapping each Year key to its Lat value. One idea is to use boolean indexing to locate all the elements equal to a given key, then replace them all at once. This takes one pass over the array for each key-value pair in the DataFrame.

Example:

import numpy as np
import pandas as pd

a = np.array([
    [2020, 2015, 2017],
    [2015, 2016, 2016],
    [2019, 2018, 2020],
])
b = np.zeros(a.shape, dtype=float)

df = pd.DataFrame({
    'Year': [2020, 2019, 2018, 2017, 2016, 2015],
    'Lat': [37.2, 46.1, 35.2, 38.6, 29.1, 33.8],
})

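# For each Year/Lat pair, a boolean mask selects every matching cell in a
# and assigns the latitude in one step. Cells with no matching Year keep 0.0.
for k, v in df.set_index('Year')['Lat'].to_dict().items():
    b[a == k] = v
print(b)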

# output:
# [[37.2 33.8 38.6]
#  [33.8 29.1 29.1]
#  [46.1 35.2 37.2]]

Solution 3:^[3]

In one line:

df.set_index('Year').Lat.loc[arr.flatten()].to_numpy().reshape(arr.shape)

If you're going to do multiple operations like this you should call set_index() just once, perhaps with inplace=True if you want to modify the existing DataFrame rather than create a new one.

After that it's just a matter of giving loc a 1D array which it can use for efficient lookup of the Lat values, then reshaping the result to match the original arr.
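
For reference, a self-contained version of this approach, assuming arr is the years array from the question. The names indexed and keys are illustrative, and the Lon lookup is included only to show the indexed frame being reused.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year': [2020, 2019, 2018, 2017, 2016, 2015],
    'Lat': [37.2, 46.1, 35.2, 38.6, 29.1, 33.8],
    'Lon': [103.45, 107.82, 101.45, 110.62, 112.73, 120.92],
})
arr = np.array([[2020, 2015, 2017],
                [2015, 2016, 2016],
                [2019, 2018, 2020]])

# Build the Year index once and reuse it for every column lookup.
indexed = df.set_index('Year')
keys = arr.ravel()  # 1D sequence of keys for loc

# loc returns one row per requested key, duplicates included.
lat = indexed['Lat'].loc[keys].to_numpy().reshape(arr.shape)
lon = indexed['Lon'].loc[keys].to_numpy().reshape(arr.shape)
```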

This is similar to d.b's answer, but massively more efficient because it does not use Python for loops.
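
The solutions above assume exact matches. For the closest-value case raised in the question's update, one possible approach (a sketch, not taken from any of the answers) is a binary search with np.searchsorted over the sorted Year column, followed by a comparison of the two neighbouring keys. The names times, keys, lats and pos are illustrative, and ties are resolved toward the smaller key, which is an arbitrary choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year': [2020, 2019, 2018, 2017, 2016, 2015],
    'Lat': [37.2, 46.1, 35.2, 38.6, 29.1, 33.8],
})
times = np.array([[2019.99, 2015.2, 2017.1],
                  [2015.33, 2016.01, 2015.87],
                  [2019.2, 2018.3, 2020.00]])

# Sort the lookup table by key so that a binary search is valid.
table = df.sort_values('Year')
keys = table['Year'].to_numpy(dtype=float)
lats = table['Lat'].to_numpy()

# Insertion point of each value, clipped so both neighbours exist.
pos = np.clip(np.searchsorted(keys, times), 1, len(keys) - 1)
left, right = keys[pos - 1], keys[pos]

# Pick the left neighbour when it is at least as close as the right one.
nearest = np.where(times - left <= right - times, pos - 1, pos)
nearest_lat = lats[nearest]
```

np.searchsorted accepts a multi-dimensional array of query values and returns indices of the same shape, so no flatten/reshape step is needed here.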

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Henry Ecker
Solution 2	jfaccioni
Solution 3	John Zwinck
