'How to include all columns that contain a suffix?

Imagine you have the following dfs: CTA

| CTA  | sheet_name_left | tabel_id_links      | sheet_name_right | tabel_id_rechts     | Method | file_name_left                                        | file_name_right               |
|------|-----------------|---------------------|------------------|---------------------|--------|-------------------------------------------------------|-------------------------------|
| CTA1 | SoD Analysis Q2 | User ID             | EKKO             | Created By          | Left   | Input_controls/SoD Analysis Q2 2021.xlsx | Input_controls/SOD-16 Q2.xlsb |
| CTA1 | EKKO            | Purchasing Document | EKPO             | Purchasing Document | Left   | Input_controls/SOD-16 Q2.xlsb                         | Input_controls/SOD-16 Q2.xlsb |
| CTA1 | EKPO            | Article             | A996             | Article             | Left   | Input_controls/SOD-16 Q2.xlsb                         | Input_controls/SOD-16 Q2.xlsb |
| CTA2 | SoD Analysis Q2 | User ID             | KONH             | Created By          | Left   | Input_controls/SoD Analysis Q2 2021.xlsx | Input_controls/SOD-16 Q2.xlsb |

SODs

| SOD naam | CTA Links | tabel_id_links       | CTA Rechts | tabel_id_rechts      |
|----------|-----------|----------------------|------------|----------------------|
| SOD16    | CTA1      | Condition record no. | CTA2       | Condition record no. |

These values are pointing to files and respective sheet names. Furthermore, there is a tabel_id_right/tabel_id_left that defines the values on which the tables should be joined.

Furthermore I have a listkeep where I specify the columns I would like to keep (there are a lot of columns in these sheets so merging them will give a 60gb file):

listkeep = ["Access Risk ID", 'User Group', 'User Name', 'Condition record no.', 'Created By', 'User ID',
            'Purchasing Document', 'Article', 'Created By', 'Condition record no.']

Now what I would like to do is merge the columns above based on their tabel_id above. First we are joining CTA dfs and then comparing CTA1 and CTA2 files to determine whether users have made conflicts.

import pandas as pd

SODs = pd.read_excel(r"input_control.xlsx", sheet_name="SODs")
CTAs = pd.read_excel(r"input_control.xlsx", sheet_name="CTA")
userAnalysis = pd.read_excel(r"flat_user_data.xlsx", sheet_name='GRC User Data Clean Up')
listkeep = ["Access Risk ID", 'User Group', 'User Name', 'Condition record no.', 'Created By', 'User ID',
            'Purchasing Document', 'Article', 'Created By', 'Condition record no.']

for index, row in SODs.iterrows():
    [SOD, CTALinks, CTARechts] = [row['SOD naam'], row['CTA Links'], row['CTA Rechts']]


def handleCTA(CTAName, possibleExceptionsThisSOD):
    CTA = CTAs[CTAs['CTA'] == CTAName]
    for i, row in CTA.iterrows():
        if (CTA.loc[i, "file_name_left"] == "Input_controls/SoD Analysis Q2 2021.xlsx"):
            file_left = possibleExceptionsThisSOD
        else:
            file_left = pd.read_excel(CTA['file_name_left'][i],
                                      sheet_name=CTA['sheet_name_left'][i])

        file_right = pd.read_excel(CTA['file_name_right'][i],
                                   sheet_name=CTA['sheet_name_right'][i])
        print(locals())
        if 'df' in locals():
            print(df)
            print(True)
            df = pd.merge(df, file_right, how=str(CTA['Method'][i]).lower(), left_on=CTA['tabel_id_links'][i],
                          right_on=CTA['tabel_id_rechts'][i],
                          suffixes=("_LEFT_SUFFIX", "_RIGHT_SUFFIX"))
            listkeep.extend(CTA['tabel_id_links'][i])
            listkeep.extend(CTA['tabel_id_rechts'][i])
            listkeep.extend([t+"_LEFT_SUFFIX" for t in listkeep])
            listkeep.extend([t+"_RIGHT_SUFFIX" for t in listkeep])
            print(listkeep)
            df.drop(columns=df.columns.difference(listkeep), inplace=True)
            query = str(CTA['tabel_id_links'][i])
            # search column names
            l1 = [c for c in df.columns if query in c]
            listkeep.extend(l1)
            query = str(CTA['tabel_id_rechts'][i])
            l1 = [c for c in df.columns if query in c]
            listkeep.extend(l1)
        else:
            print(False)
            df = pd.merge(file_left, file_right, how=str(CTA['Method'][i]).lower(), left_on=CTA['tabel_id_links'][i],
                          right_on=CTA['tabel_id_rechts'][i],
                          suffixes=("_LEFT_SUFFIX", "_RIGHT_SUFFIX"))
            listkeep.extend([t+"_LEFT_SUFFIX" for t in listkeep])
            listkeep.extend([t+"_RIGHT_SUFFIX" for t in listkeep])
            df.drop(columns=df.columns.difference(listkeep), inplace=True)
    df.to_csv(str(i) + 'output.csv')
    print(listkeep)
    return df


def handleSOD(SOD, CTALinks, CTARechts):
    possibleExceptionsThisSOD = userAnalysis[userAnalysis["Access Risk ID"] == SOD]
    CTAExceptionsLinks = handleCTA(CTALinks, possibleExceptionsThisSOD)
    CTAExceptionsRechts = handleCTA(CTARechts, possibleExceptionsThisSOD)
    df = pd.merge(CTAExceptionsLinks, CTAExceptionsRechts, how='inner',
                  left_on=SODs[SODs["SOD naam"] == SOD]["tabel_id_links"][0],
                  right_on=SODs[SODs["SOD naam"] == SOD]["tabel_id_rechts"][0],
                  suffixes=(str(CTALinks), str(CTARechts)))
    listkeep.extend([t+str(CTALinks) for t in listkeep])
    listkeep.extend([t+str(CTARechts) for t in listkeep])
    import pickle
    with open('filef.txt', 'wb') as fp:
        pickle.dump(listkeep, fp)
    return df


handleSOD(SOD, CTALinks, CTARechts)

Now I want to keep all columns in the merge that contain a suffix. However the above code does not do this. I've tried it with this code

listkeep.extend([t+str(CTALinks) for t in listkeep])
listkeep.extend([t+str(CTARechts) for t in listkeep])

However my output is the following:

User IDCTA1 User GroupCTA1 Access Risk IDCTA1 Purchasing Document Created ByCTA1 Article Condition record no. User IDCTA2 User GroupCTA2 Access Risk IDCTA2 Created ByCTA2
0 ADAMET EU PROD&SOUR SOD16 ADAMET EU PROD&SOUR SOD16
1 ADAMET EU PROD&SOUR SOD16 ADAMET EU PROD&SOUR SOD16
2 ADAMET EU PROD&SOUR SOD16 AMAHAJ EU PROD&SOUR SOD16
3 ADAMET EU PROD&SOUR SOD16 AMAHAJ EU PROD&SOUR SOD16
4 ADAMET EU PROD&SOUR SOD16 AMCHUG EU PROD&SOUR SOD16
5 ADAMET EU PROD&SOUR SOD16 AMCHUG EU PROD&SOUR SOD16
6 ADAMET EU PROD&SOUR SOD16 ANKING EU PROD&SOUR SOD16
7 ADAMET EU PROD&SOUR SOD16 ANKING EU PROD&SOUR SOD16
8 ADAMET EU PROD&SOUR SOD16 ANKLEM INT MARKETS SOD16
9 ADAMET EU PROD&SOUR SOD16 ANKLEM INT MARKETS SOD16
10 ADAMET EU PROD&SOUR SOD16 ANMAGA EU PROD&SOUR SOD16
11 ADAMET EU PROD&SOUR SOD16 ANMAGA EU PROD&SOUR SOD16
12 ADAMET EU PROD&SOUR SOD16 ANNTOR EU PROD&SOUR SOD16
13 ADAMET EU PROD&SOUR SOD16 ANNTOR EU PROD&SOUR SOD16
14 ADAMET EU PROD&SOUR SOD16 ASTEYN EU PROD&SOUR SOD16
15 ADAMET EU PROD&SOUR SOD16 ASTEYN EU PROD&SOUR SOD16
16 ADAMET EU PROD&SOUR SOD16 ATDOESSCHATE EU PROD&SOUR SOD16
17 ADAMET EU PROD&SOUR SOD16 ATDOESSCHATE EU PROD&SOUR SOD16
18 ADAMET EU PROD&SOUR SOD16 ATELEG EU MERCHANDI SOD16

User IDs are not matching.. and I am losing columns..

Desired output is all columns that contain a suffix OR reside in the table_id should be included in the final merge

Please help



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source