How to vectorize a for loop with if conditions over 10 million rows in Python?

I have one column that is full of Unix timestamps. I need to create a new column and label each row as 'occured' if its timestamp falls within a certain range, but I have 10 million rows to iterate over.

My code:

`
start_time = 1.583860e+12 #example timestamp
end_time = 1.583862e+12
for i in range(0, len(df['time'])):
    if (start_time < df['time'][i] < end_time).all():
        df.loc[i, 'event'] = 'occured'
    else:
        pass
`

The code above takes forever to process 10 million rows. I tried vectorizing it but could not, as I don't have much experience. Can someone help me with this?



Solution 1:[1]

Here is a way to use vectorized operations in pandas to flag values in the range:

df.loc[(start_time < df.time) & (df.time < end_time), 'event'] = 'occurred'

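This uses vectorized pandas comparisons (<) and the element-wise boolean AND (&) to select the matching rows and set the new event column for them. The parentheses around each comparison are required because & binds more tightly than < in Python. Rows that do not match are left as NaN.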
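To make the mechanics concrete, here is a minimal sketch on a three-row frame. The sample timestamps and the intermediate name mask are made up for illustration; only start_time, end_time, and the time and event columns come from the question.

```python
import pandas as pd

start_time = 1.583860e+12
end_time = 1.583862e+12

# Illustrative data: one value below the window, one inside, one on the upper edge.
df = pd.DataFrame({"time": [1.583859e+12, 1.583861e+12, 1.583862e+12]})

# Each comparison returns a boolean Series; & combines them element by element.
mask = (start_time < df["time"]) & (df["time"] < end_time)
print(mask.tolist())

# Assign only where the mask is True; the other rows of the new column stay NaN.
df.loc[mask, "event"] = "occurred"
print(df)
```

Because both comparisons are strict, a timestamp exactly equal to end_time (the last row here) is not flagged.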

Full test code is here:

import pandas as pd
import random

start_time = 1.583860e+12 #example timestamp
end_time = 1.583862e+12

# 10,000 random timestamps drawn from ten values spaced 1.0e6 apart
df = pd.DataFrame({'time':random.choices(range(10), k=10000)})
df.time = df.time *1.0e6 + 1.583855e+12
# flag the rows strictly between start_time and end_time
df.loc[(start_time < df.time) & (df.time < end_time), 'event'] = 'occurred'
print(f"count of 'occurred': {(df['event'] == 'occurred').sum()}")
print(df.head())

Output (the count and rows vary between runs because the data is random):

count of 'occurred': 975
           time     event
0  1.583863e+12       NaN
1  1.583861e+12  occurred
2  1.583862e+12       NaN
3  1.583860e+12       NaN
4  1.583861e+12  occurred
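For a rough sense of the difference, the sketch below times a plain Python loop against the vectorized mask on 100,000 rows and checks that both flag the same rows. It is an illustration under assumptions: the sample size, the seed, and the names loop_events, loop_seconds, and vector_seconds are invented here, and the loop is a simplified stand-in for the one in the question (which also pays for a df.loc write on every iteration). Actual timings depend on the machine.

```python
import time

import numpy as np
import pandas as pd

start_time = 1.583860e+12
end_time = 1.583862e+12

# Illustrative sample: 100,000 timestamps taken from ten values around the window.
rng = np.random.default_rng(0)
df = pd.DataFrame({"time": rng.integers(0, 10, size=100_000) * 1.0e6 + 1.583855e+12})

# Row-by-row version: one Python-level comparison per value.
t0 = time.perf_counter()
loop_events = []
for value in df["time"]:
    loop_events.append(start_time < value < end_time)
loop_seconds = time.perf_counter() - t0

# Vectorized version: two whole-column comparisons combined with &.
t0 = time.perf_counter()
mask = (start_time < df["time"]) & (df["time"] < end_time)
vector_seconds = time.perf_counter() - t0

# Both approaches should select exactly the same rows.
same_rows = [bool(flag) for flag in loop_events] == mask.tolist()
print(f"same rows flagged: {same_rows}")
print(f"loop: {loop_seconds:.4f}s, vectorized: {vector_seconds:.4f}s")
```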

If you prefer to have an empty string in rows without 'occurred', you can just initialize the event column in advance:

df['event'] = ''
df.loc[(start_time < df.time) & (df.time < end_time), 'event'] = 'occurred'
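An alternative that fills the whole column in a single step is numpy.where, which picks between two values based on a boolean mask. The sketch below also builds the mask with Series.between instead of two comparisons; it assumes pandas 1.3 or later, where the inclusive argument accepts the string "neither" to exclude both endpoints. The sample timestamps are made up for illustration.

```python
import numpy as np
import pandas as pd

start_time = 1.583860e+12
end_time = 1.583862e+12

# Illustrative data: below the window, inside it, and above it.
df = pd.DataFrame({"time": [1.583859e+12, 1.583861e+12, 1.583863e+12]})

# Same strict range test as (start_time < time) & (time < end_time).
mask = df["time"].between(start_time, end_time, inclusive="neither")

# 'occurred' where the mask is True, empty string everywhere else.
df["event"] = np.where(mask, "occurred", "")
print(df)
```

This avoids the separate initialization step, at the cost of always writing every row of the event column.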

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 constantstranger