Looping through groupby object by index
I have a huge dataset that I need to pass to a fuzzy matching function in small chunks. Because I'm testing the dataset against itself, I need to group the batches by city to reduce the likelihood of duplicates within a batch.
I've made a start with this logic (the fuzzy_match function takes a DataFrame):
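import time

# Unique, non-null city names, so each group is fetched only once
cities = insertion_cleaned[city_name].dropna().unique().tolist()
batches = insertion_cleaned.groupby(insertion_cleaned[city_name])
for c in cities:
    group = batches.get_group(c)
    t1 = time.time()
    final_df = fuzzy_match(group)
    t2 = time.time()
    # Report the size of this group, not the number of cities
    print(f"{round(t2-t1,2)} seconds to run fuzzy match for {len(group)} leads.")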
I need to access each group by index instead of by c, because I want to wrap the call in a try/except block that handles errors without stopping the loop. In other words, if fuzzy_match breaks on one group, the loop should keep track of that index and increment it by one to move on to the next group. Here is an example of something similar I did with data that was a little easier to loop over by index:
import time
import numpy as np

def create_batches(df, n):
    # Split df into n roughly equal chunks
    return np.array_split(df, n)

batches = create_batches(df, 6)
index = 0
while index < len(batches):
    p = batches[index]
    try:
        t1 = time.time()
        final_df = fuzzy_match(p)
        t2 = time.time()
        print(f"{round(t2-t1,2)} seconds to run fuzzy match for {len(p)} leads.")
    except Exception:
        print("skipping to next")
    finally:
        # Advance exactly once per batch, whether or not it failed
        index += 1
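One way the same skip-on-error pattern might carry over to the groupby object is sketched below. Iterating over a pandas groupby yields (name, sub-DataFrame) pairs, so wrapping it in enumerate gives a running index without calling get_group. The helper run_by_group, the column name "city", the sample data, and the stand-in fuzzy_match (which fails on purpose for one city) are all assumptions made for illustration, not part of the original code.

```python
import time
import pandas as pd

def fuzzy_match(df):
    # Hypothetical stand-in for the real fuzzy_match; it fails for one city
    # so that the error-handling path is exercised.
    if (df["city"] == "Lyon").any():
        raise ValueError("simulated failure")
    return df

def run_by_group(df, key, func):
    """Apply func to each group in turn, recording failures instead of stopping."""
    results, failed = {}, []
    # Iterating a groupby yields (group_name, sub-DataFrame) pairs;
    # enumerate adds the positional index of each group.
    for index, (name, group) in enumerate(df.groupby(key)):
        try:
            t1 = time.time()
            results[name] = func(group)
            t2 = time.time()
            print(f"{round(t2 - t1, 2)} seconds to run fuzzy match for {len(group)} leads in {name}.")
        except Exception as exc:
            # Remember which group failed, then carry on with the next one
            failed.append((index, name))
            print(f"skipping group {index} ({name}): {exc}")
    return results, failed

# Made-up sample data for illustration only
insertion_cleaned = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon"],
    "lead": ["a", "b", "c", "d", "e"],
})
results, failed = run_by_group(insertion_cleaned, "city", fuzzy_match)
```

Because the try/except sits inside a for loop, a failing group is logged and the loop moves on by itself, so no manual index bookkeeping is needed. The failed list keeps the index and name of each group that raised, which makes it possible to retry those groups later.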
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
