How to speed up iterating over rows and applying value-wise functions?
I have a DataFrame with three columns: time, lon, and lat.
The goal is to predict the location of each point at end_time, using a lookup Dataset (ds, with three dimensions [time, lon, lat]) that holds two DataArrays: wdir and wspd.
Here's the prediction process:
- iterate over each row of the DataFrame
- interpolate ds to the location determined by the row values
- predict the new lon and lat using the interpolated wdir, wspd, and the time step (delta_wind) between time and end_time
- repeat until end_time is reached, then save the predicted lon and lat
To make it easy to understand, I wrote this simple example.
The core part is the iteration over each row after the # --- Sample Data end --- line.
import pandas as pd
import xarray as xr
import numpy as np
# !!!! edit len_time for testing the speed !!!!
len_time = 2000
# --- Sample Data ---
# Two functions for creating sample data
def random_dates(start, end, n=10):
    # generate n random timestamps between start and end
    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

def predict_loc(lon, lat, wdir, wspd, delta):
    # some calculation depending on the inputs; simplified here
    lon2 = lon/2; lat2 = lat/2
    return lon2, lat2
# create 3d DataArray as the look up array
da = xr.DataArray(np.abs(np.random.randn(600).reshape(6, 10, 10)),
                  [("time", pd.date_range("20130101", periods=6, freq="1H")),
                   ("lon", range(10)),
                   ("lat", range(10))],
                  )
# create two DataArrays (wind direction and wind speed) and merge them into one Dataset
ds = xr.merge([da.rename('wdir')/2, da.rename('wspd')])
# the end time of prediction
end_time = pd.Timestamp('2013-01-01 03:10')
# create times
times = random_dates(pd.to_datetime('2013-01-01 00:00'),
                     pd.to_datetime('2013-01-01 02:00'),
                     n=len_time)
# create DataFrame
df = pd.DataFrame(times, columns=['time'])
df['lon'] = np.random.randint(low=0, high=9, size=(len_time))
df['lat'] = np.random.randint(low=0, high=9, size=(len_time))
# --- Sample Data end ---
# create empty lists for saving results
lons, lats = [], []
# iterate each row
for _, row in df.iterrows():
    # because ds is hourly data, we need to create the hourly time steps
    # e.g. a row starting at 00:37 gives [00:37, 01:00, 02:00, 03:00, 03:10]
    times = np.concatenate(([row.time.to_pydatetime()],
                            pd.date_range(row.time.ceil('h'),
                                          end_time.floor('h'), freq='H').to_pydatetime(),
                            [end_time.to_pydatetime()]
                            ))
    # calculate the delta seconds of each step
    delta_wind = [t.total_seconds() for t in np.diff(times)]
    # get the beginning location (lon/lat) of the row
    lat, lon = row.lat, row.lon
    # predict the location for each time step
    for t_index, time in enumerate(times[:-1]):
        # interpolate to the location at each time
        data = ds.interp(time=time, lon=lon, lat=lat)
        lon, lat = predict_loc(lon, lat, data['wdir'], data['wspd'], delta_wind[t_index])
    # save the new location at the end time
    lons.append(lon)
    lats.append(lat)
# add prediction results to the DataFrame
df['lon_pred'] = lons
df['lat_pred'] = lats
When len_time is increased to 1000 or more, it's really slow.
Any idea how to improve it?
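One direction I'm wondering about (not tested at scale): xarray's interp accepts DataArray coordinates that share a new dimension, which performs vectorized (pointwise) interpolation, so all rows could in principle be interpolated in a single call per time step instead of one call per row per step. A minimal sketch of that single-call interpolation, assuming predict_loc could be rewritten to operate on whole arrays (ds and df are the objects from the sample-data section above; the time value is just an arbitrary example):

```python
import numpy as np
import xarray as xr

# Wrap all row locations in DataArrays that share a new "points" dimension,
# so ds.interp interpolates every point at once (vectorized interpolation).
lon_pts = xr.DataArray(df['lon'].to_numpy(), dims='points')
lat_pts = xr.DataArray(df['lat'].to_numpy(), dims='points')

# One interpolation call returns wdir/wspd for all rows at this time.
data = ds.interp(time=np.datetime64('2013-01-01T01:00'),
                 lon=lon_pts, lat=lat_pts)

wdir_all = data['wdir'].values   # shape (len(df),)
wspd_all = data['wspd'].values
```

The remaining loop would then only be over the common hourly steps, but because each row starts at a different time I would still need to handle the first (partial) step per row somehow, and predict_loc would have to accept NumPy arrays instead of scalars. I'm not sure this is the right approach, so other ideas are welcome.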
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
