Assign values from NumPy array to pandas DataFrame based on almost matching unix timestamp
I am given a 2D numpy array and a huge pandas DataFrame. A dummy example of them would look somewhat like this:
```python
arr = np.array([[1648137283, 0],
                [1648137284, 1],
                [1648137285, 2],
                [1648137286, 3],
                .....
                [1658137287, 4],
                [1658137288, 5],
                [1658137289, 6]])
```
```
df.head(-6)
                unix  ...  value_a
0         1643137283  ...       23
1         1643137284  ...       54
2         1643137285  ...       25
...              ...  ...      ...
10036787  1653174068  ...       75
10036788  1653174069  ...       65
10036789  1653174070  ...       23
```
The first column of `arr` holds a unix timestamp and the second an id-value. The DataFrame also has a column with unix timestamps. My goal is to map the id-values from `arr` to the rows of `df` with matching timestamps, written into a separate new column called `'index'`.
Now, these are probably important notes:
- `df` contains only a portion of all timestamps from `arr`
- `df` and `arr` have different lengths along `axis=0`
- the timestamps in `df` are ordered in sequences and repeat themselves
- `arr` contains all unix timestamps from `df`, but not the other way around
- about 1% of the unix values do not match perfectly. My unix is in `unit='ms'`; some timestamps are off by +/-1 or +/-2, however, in my use case they can be seen as identical
I could do this within a loop or with np.where(). However, as arr and df are quite large, I was hoping for a fast solution.
Solution 1:
The idea is to convert the numpy array into a mapping of key-value pairs, where the key is the unix timestamp and the value is the corresponding id. You can then use `Series.map` to substitute the values in the given dataframe:
```python
df['index'] = df['unix'].map(dict(arr))
```
Sample output
```
                unix  ...  value_a  index
0         1643137283  ...       23      0
1         1643137284  ...       54      1
2         1643137285  ...       25      2
10036787  1653174068  ...       75      3
10036788  1653174069  ...       65      5
10036789  1653174070  ...       23      6
```
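Note that `Series.map` only handles exact key matches: the roughly 1% of timestamps that are off by +/-1 or +/-2 would come out as `NaN`. A minimal sketch of one way to absorb those small offsets, using `pd.merge_asof` with `direction='nearest'` and a tolerance (toy data below stands in for the real `arr` and `df`; the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real arrays.
arr = np.array([[1648137283, 0],
                [1648137284, 1],
                [1648137287, 2]])

# 1648137286 has no exact match in arr; it is off by -1 from 1648137287.
df = pd.DataFrame({'unix': [1648137283, 1648137284, 1648137286],
                   'value_a': [23, 54, 25]})

# merge_asof requires both frames to be sorted on the join key.
lookup = pd.DataFrame(arr, columns=['unix', 'index']).sort_values('unix')
df = pd.merge_asof(df.sort_values('unix'), lookup,
                   on='unix', direction='nearest', tolerance=2)

# df now carries an 'index' column: [0, 1, 2] — the near-miss
# timestamp was matched to its closest neighbour within +/-2.
print(df)
```

Exact matches behave the same as the dict-based `map`; timestamps farther than the tolerance from any entry in `arr` still yield `NaN`, so nothing is silently mis-assigned.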
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Shubham Sharma |
