Is it possible to increase the speed of appending files in pandas based on generated combinations?
I need some help figuring out a way to improve the performance of the Python code below.
I am calculating the number of duplicates across a group of Excel files based on certain combinations. To achieve this, I have written a Python script that generates all possible combinations in which 10 files can be grouped into 5 groups. Based on the generated combinations, the script then appends the files in each group and calculates the number of duplicates per group.
Example: sample output of the entire script (the combinations shown for 10 files combined into 5 groups are only a sample).
The first part of my script generates all possible combinations in which the 10 Excel files can be grouped into 5 groups, for example:
Combination 1: (File 1) (File 2) (File 3) (File 4) (File 5 File 6 File 7 File 8 File 9 File 10)
Combination 2: (File 1 File 2) (File 3) (File 4) (File 5) (File 6 File 7 File 8 File 9 File 10)
Combination 3: etc...
```python
def sorted_k_partitions(seq, k):
    """Yield all partitions of seq into exactly k non-empty groups."""
    n = len(seq)
    groups = []  # a list of lists, currently empty

    def generate_partitions(i):
        if i >= n:
            yield list(map(tuple, groups))
        else:
            # Only add seq[i] to an existing group if enough elements remain
            # to still end up with k groups.
            if n - i > k - len(groups):
                for group in groups:
                    group.append(seq[i])
                    yield from generate_partitions(i + 1)
                    group.pop()
            if len(groups) < k:
                groups.append([seq[i]])
                yield from generate_partitions(i + 1)
                groups.pop()

    result = generate_partitions(0)
    return result  # a generator; each partition is a list of tuple groups
```
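For illustration, here is a minimal sketch of how the function can be called; the file names are only placeholders standing in for my actual Excel paths:

```python
seq = [f"file{i}.xlsx" for i in range(1, 11)]  # placeholder names for the 10 Excel files

# Each partition is a list of tuples; each tuple is one group of file names.
first = next(sorted_k_partitions(seq, 5))
print(first)

# Counting all partitions of 10 items into exactly 5 groups should give 42525
# (the Stirling number of the second kind S(10, 5)), which is why the second
# part of the script below has so much work to do.
print(sum(1 for _ in sorted_k_partitions(seq, 5)))
```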
The second part of my code appends the files and calculates the total number of duplicates for each combination:
```python
import pandas as pd

# grouplist collects one row per combination; seq is the list of the 10 Excel file paths
grouplist = pd.DataFrame(columns=["k", "groupname", "groupnumberofduplicates"])

for k in (1, 5):  # k is the number of groups I want to distribute the files into
    for groups in sorted_k_partitions(seq, k):  # sorted_k_partitions yields the combinations
        groupnumberofduplicates = 0
        for group in groups:  # groups is one combination (e.g. Combination 1); group is one group within it (e.g. Group 1)
            ids = pd.DataFrame()
            for file in group:
                ids = ids.append(pd.read_excel(file))
            numberofduplicates = ids.duplicated().sum()
            groupnumberofduplicates = groupnumberofduplicates + numberofduplicates
        grouplist = grouplist.append(
            {"k": k, "groupname": groups, "groupnumberofduplicates": groupnumberofduplicates},
            ignore_index=True,
        )
```
Taking into consideration that each file has 1 column with 100,000 rows, the script is really slow, but it works. The slowness is in the second part of the code shown above, mainly because there are many combinations and the script goes through each one to get the number of duplicates. The second reason is that the number of rows in each file is huge. Is there a way to make this faster?
Thank you, and I would really appreciate anyone who could help.
Update: the number of rows is no longer an issue for the speed; I found a way to fix that. The loop is now the only reason for the long processing time.
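For anyone looking at this, here is a rough sketch of the kind of caching that might avoid the repeated reads and the repeated duplicate counts. This is not my current code; the names `frames`, `group_cache`, and `duplicates_in_group` are made up for illustration, and it assumes all 10 files fit in memory:

```python
import pandas as pd

# Read every file once up front instead of re-reading it inside the loops.
frames = {file: pd.read_excel(file) for file in seq}

# Cache the duplicate count per group, since the same group of files
# reappears in many different combinations.
group_cache = {}

def duplicates_in_group(group):
    if group not in group_cache:
        combined = pd.concat(frames[file] for file in group)
        group_cache[group] = combined.duplicated().sum()
    return group_cache[group]

rows = []
for k in (1, 5):
    for groups in sorted_k_partitions(seq, k):
        total = sum(duplicates_in_group(group) for group in groups)
        rows.append({"k": k, "groupname": groups, "groupnumberofduplicates": total})

grouplist = pd.DataFrame(rows)
```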
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
