How to vectorize a for loop with if conditions over 10 million rows in Python?
I have one column that is full of Unix timestamps. I need to create a new column and label a row as 'occured' if its timestamp falls into a certain range, but I have 10 million rows to iterate over.
My code:
`
start_time = 1.583860e+12  # example timestamp
end_time = 1.583862e+12
for i in range(0, len(df['time'])):
    if (start_time < df['time'][i] < end_time).all():
        df.loc[i, 'event'] = 'occured'
    else:
        pass
`
The above code runs seemingly forever on 10 million rows. I tried vectorizing it but could not, as I don't have much experience. Can someone help me with this?
Solution 1:[1]
Here is a way to use vectorized operations in pandas to flag values in the range:
df.loc[(start_time < df.time) & (df.time < end_time), 'event'] = 'occurred'
We use vectorized pandas comparison and boolean operations (< and &) to build a mask and populate only the matching rows of the new event column. Rows outside the range are left as NaN.
Full test code is here:
start_time = 1.583860e+12 #example timestamp
end_time = 1.583862e+12
import pandas as pd
import random
df = pd.DataFrame({'time':random.choices(range(10), k=10000)})
df.time = df.time *1.0e6 + 1.583855e+12
df.loc[(start_time < df.time) & (df.time < end_time), 'event'] = 'occurred'
print(f"count of 'occurred': {sum(df['event'] == 'occurred')}")
print(df.head())
Output:
count of 'occurred': 975
           time     event
0  1.583863e+12       NaN
1  1.583861e+12  occurred
2  1.583862e+12       NaN
3  1.583860e+12       NaN
4  1.583861e+12  occurred
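To see what the boolean mask inside df.loc[...] looks like, here is a minimal sketch on a tiny, made-up frame (the four time values are illustrative only, not taken from the question). It also shows Series.between as a possible equivalent; the string form inclusive="neither" assumes pandas 1.3 or newer.

```python
import pandas as pd

start_time = 1.583860e+12
end_time = 1.583862e+12

# Hypothetical data: one value below the range, one exactly on the lower
# bound, one strictly inside, and one above the range.
df = pd.DataFrame({"time": [1.583859e+12, 1.583860e+12, 1.583861e+12, 1.583863e+12]})

# Each comparison returns a boolean Series; & combines them element-wise.
mask = (start_time < df["time"]) & (df["time"] < end_time)

# Same strict-bounds test written with between (assumes pandas >= 1.3).
mask_between = df["time"].between(start_time, end_time, inclusive="neither")

# Only rows where the mask is True receive the label.
df.loc[mask, "event"] = "occurred"

print(mask.tolist())  # [False, False, True, False]
print(df)
```

Because both bounds are strict, the value that sits exactly on start_time is not labeled, which matches rows 2 and 3 in the output above.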
If you prefer to have an empty string in rows without 'occurred', you can just initialize the event column in advance:
df['event'] = ''
df.loc[(start_time < df.time) & (df.time < end_time), 'event'] = 'occurred'
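As an alternative sketch (not part of the original answer), numpy.where can build the whole column in a single step, choosing a label for matching rows and a default for everything else. The sample values below are made up for illustration.

```python
import numpy as np
import pandas as pd

start_time = 1.583860e+12
end_time = 1.583862e+12

# Hypothetical data: the two middle values fall strictly inside the range.
df = pd.DataFrame({"time": [1.583859e+12, 1.583861e+12, 1.5838615e+12, 1.583863e+12]})

# Boolean mask of rows whose timestamp lies strictly between the bounds.
in_range = (start_time < df["time"]) & (df["time"] < end_time)

# Pick 'occurred' where the mask is True and an empty string elsewhere.
df["event"] = np.where(in_range, "occurred", "")

print(df)
```

This avoids initializing the column separately, at the cost of always writing a value into every row.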
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | constantstranger |
