Automate clustering using Python and Excel

Looking for some bright ideas to implement in my tool. Background: I have created a Python tool that takes an Excel file as input. The Excel file contains data in 4 columns (A contains part numbers for products, B contains product names, which are strings rather than single words, and C and D contain product descriptions, also strings/large text). The tool contains multiple functions the end user can pick from, and based on that it groups and outputs cell values from A (part numbers) matching given keywords. Before doing this, the program looks through all the text in B and outputs keyword frequencies, from which the end user can pick a keyword.

Example of a function:

1. The tool looks in B and shows a list of common keywords.
2. The user chooses a keyword - FIRSTWORD.
3. The tool filters all cells in B, looks inside the ones that contain FIRSTWORD, and shows a list of common keywords.
4. The user chooses a keyword - SECONDWORD.
5. The tool filters all cells in B, looks inside the ones that contain FIRSTWORD and SECONDWORD, and shows a list of common keywords.
6. The user chooses a keyword - THIRDWORD.
7. The tool finally outputs all cell values from A whose B cell contains FIRSTWORD, SECONDWORD, and THIRDWORD, and also outputs FIRSTWORD, SECONDWORD, THIRDWORD.

There are functions for 2 words, 3 words, 4 words, 2 words in B plus a word in C/D, etc. (they all follow the same pattern, as sketched below). Clustered cell values also get removed from the Excel file and saved elsewhere. The end user keeps doing this, on and on, until all cell values from A are clustered under combinations of words.
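Since the per-count functions only differ in how many words they check, they could in principle collapse into one variadic helper. This is just a rough sketch, not my actual code; output_asins_nwords is a made-up name, and it uses the same module-level sheet as the sample further down:

def output_asins_nwords(*words):
    # Collect column-A part numbers whose column-B text contains every given word.
    found = []
    for row in range(1, sheet.max_row + 1):
        text = sheet.cell(row=row, column=2).value
        if text is not None and all(w.lower() in text.lower() for w in words):
            found.append(sheet.cell(row=row, column=1).value)
    print('found cells: ', len(found))
    return found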

My problem: as the tool works through the Excel file, the last cells get harder and harder to cluster under combinations of words. If at the beginning a single output can cover, say, 100 values out of a total of 1000 cells, towards the end each output covers fewer and fewer (5-10 cells per output). Of course this can be repeated until all cell values from A are clustered, but the time spent doing it keeps growing: you can spend 2 hours on the first 700 cells out of 1000 (the easy ones to cluster) and 4 hours on the last 300 (the hard ones). I'm looking for ideas to speed up this final part. How, or with what, can I cluster these more "complex" cells?
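To make the kind of automation I'm after concrete: the same drill-down could pick the top keyword itself instead of prompting the user, cluster the matching rows, drop them, and repeat until only the sparse tail is left. A minimal, untested sketch of that greedy loop (greedy_clusters and min_size are made-up names; it assumes the rows are already read into (part_number, text) pairs):

import string
from collections import Counter

def greedy_clusters(rows, stop_words, min_size=5):
    """rows: list of (part_number, text) pairs -> dict of keyword -> part numbers."""
    table = str.maketrans('', '', string.punctuation)
    remaining = [(pn, set(text.lower().translate(table).split()) - set(stop_words))
                 for pn, text in rows]
    clusters = {}
    while remaining:
        counts = Counter(w for _, words in remaining for w in words)
        if not counts:
            break
        keyword, freq = counts.most_common(1)[0]
        if freq < min_size:
            break  # the remaining tail is too sparse for single keywords
        clusters[keyword] = [pn for pn, words in remaining if keyword in words]
        remaining = [(pn, words) for pn, words in remaining if keyword not in words]
    return clusters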

I will post here a sample of a function, but I'm not really looking for help with my code; I need ideas on how to speed up/automate the clustering:

import string
from collections import Counter

import openpyxl

data = openpyxl.load_workbook('test.xlsx')
sheet = data['ONE1']

stop_words = ['de', 'para', 'los', 'con', 'la', 'el', 'del', 'por', 'and', 'for', 'y', 'en', 'a']

def output_asins_2words(word_for_find1, word_for_find2):
    # Collect column-A part numbers whose column-B text contains both words.
    asin_list = []
    for row in range(1, sheet.max_row + 1):
        s = sheet.cell(row=row, column=2).value
        if s is not None:
            text = s.lower()
            if word_for_find1.lower() in text and word_for_find2.lower() in text:
                asin_list.append(sheet.cell(row=row, column=1).value)

    print('found cells: ', len(asin_list))
    print(*asin_list, sep='\n')

def find_freq1():
    # Count keyword frequency across all of column B, ignoring stop words.
    words = []
    for row in range(1, sheet.max_row + 1):  # was hard-coded to 3000 rows
        data = sheet.cell(row=row, column=2).value
        if data is not None:
            data = data.lower().translate(str.maketrans('', '', string.punctuation))
            words.extend(w for w in data.split() if w not in stop_words)
    print(Counter(words).most_common(100))

def find_freq2(var_first_word):
    # Same as find_freq1, but only over column-B cells containing the chosen word.
    words = []
    for row in range(1, sheet.max_row + 1):
        data = sheet.cell(row=row, column=2).value
        if data is not None:
            data = data.lower()
            if var_first_word in data:
                data = data.translate(str.maketrans('', '', string.punctuation))
                words.extend(w for w in data.split() if w not in stop_words)
    print(Counter(words).most_common(100))


def two_x_words():
    find_freq1()
    firstword = input('Please type a word from the list: ')
    print('FirstWord is: ' + firstword)
    find_freq2(firstword)
    secondword = input('Please type a word from the list: ')
    print('SecondWord is: ' + secondword)
    print('Words are:', firstword, 'and', secondword)
    print('\nPart Numbers are:')
    output_asins_2words(firstword, secondword)



two_x_words()
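For the stubborn tail, where no keyword combination covers more than a handful of rows, I also wondered whether a generic text-clustering step could take over from the keyword drill-down. A rough, untested sketch using scikit-learn (an extra dependency; auto_cluster is a made-up name and n_clusters=20 is a guess you would tune):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def auto_cluster(texts, n_clusters=20):
    # Vectorize the leftover description strings and group them by TF-IDF similarity.
    matrix = TfidfVectorizer(stop_words=stop_words).fit_transform(texts)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(matrix)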

