Fastest way to count event occurrences in a Pandas dataframe?
I have a Pandas dataframe with ~100,000,000 rows and 3 columns (Names str, Time int, and Values float), which I compiled from ~500 CSV files using glob.glob(path + '/*.csv').
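For reference, here is a minimal sketch of that compilation step (simplified: each globbed file is read with pd.read_csv and the pieces are stacked with pd.concat; the column names are the ones listed above):

import glob
import pandas as pd

# path is the directory holding the ~500 CSV files (defined elsewhere).
files = glob.glob(path + '/*.csv')

# Read each file and stack the results into a single ~100,000,000-row dataframe.
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Plain NumPy arrays used by the loop below; the column names
# "Names" and "Values" follow the description above.
names = df["Names"].values
values = df["Values"].values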
Two different names alternate in the data. The job is to go through it and count the number of times a value associated with a specific name, ABC, deviates from its preceding value by ±100, given that the previous 50 values for that name did not deviate by more than ±10.
I initially solved this with a for loop that iterates through each row, as shown below. It checks for the correct name, then checks the stability of that name's previous values, and finally adds one to the count if there is a large enough deviation.
import numpy as np

count = 0
stabilityTime = 0
i = 0

# Initialise the previous value (j) and the rolling window of the last 50
# values from the first row that belongs to ABC (the two names alternate).
if names[0] == "ABC":
    j = values[0]
    stability = np.full(50, values[0])
else:
    j = values[1]
    stability = np.full(50, values[1])

for name in names:
    value = values[i]
    if name == "ABC":
        # Track how long the signal has stayed within +/-10 of the previous value.
        if j - 10 < value < j + 10:
            stabilityTime += 1
        # After 50 stable values, a jump of more than +/-100 counts as an event.
        if stabilityTime >= 50 and np.std(stability) < 10:
            if value > j + 100 or value < j - 100:
                stabilityTime = 0
                count += 1
        # Shift the rolling window and remember the current value.
        stability = np.roll(stability, -1)
        stability[-1] = value
        j = value
    i += 1
Naturally, this process takes a very long time to run. I have looked at NumPy vectorization, but I do not see how to apply it in this case. Is there some way I can optimize this?
Thank you in advance for any advice!
Bonus points if you can give me a way to concatenate all the data from every CSV file in the directory that is faster than glob.glob(path + '/*.csv').