'How to include all columns that contain a suffix?
Imagine you have the following dfs: CTA
| CTA | sheet_name_left | tabel_id_links | sheet_name_right | tabel_id_rechts | Method | file_name_left | file_name_right |
|------|-----------------|---------------------|------------------|---------------------|--------|-------------------------------------------------------|-------------------------------|
| CTA1 | SoD Analysis Q2 | User ID | EKKO | Created By | Left | Input_controls/SoD Analysis Q2 2021.xlsx | Input_controls/SOD-16 Q2.xlsb |
| CTA1 | EKKO | Purchasing Document | EKPO | Purchasing Document | Left | Input_controls/SOD-16 Q2.xlsb | Input_controls/SOD-16 Q2.xlsb |
| CTA1 | EKPO | Article | A996 | Article | Left | Input_controls/SOD-16 Q2.xlsb | Input_controls/SOD-16 Q2.xlsb |
| CTA2 | SoD Analysis Q2 | User ID | KONH | Created By | Left | Input_controls/SoD Analysis Q2 2021.xlsx | Input_controls/SOD-16 Q2.xlsb |
SODs
| SOD naam | CTA Links | tabel_id_links | CTA Rechts | tabel_id_rechts |
|----------|-----------|----------------------|------------|----------------------|
| SOD16 | CTA1 | Condition record no. | CTA2 | Condition record no. |
These values are pointing to files and respective sheet names. Furthermore, there is a tabel_id_right/tabel_id_left that defines the values on which the tables should be joined.
Furthermore I have a listkeep where I specify the columns I would like to keep (there are a lot of columns in these sheets so merging them will give a 60gb file):
listkeep = ["Access Risk ID", 'User Group', 'User Name', 'Condition record no.', 'Created By', 'User ID',
'Purchasing Document', 'Article', 'Created By', 'Condition record no.']
Now what I would like to do is merge the columns above based on their tabel_id above. First we are joining CTA dfs and then comparing CTA1 and CTA2 files to determine whether users have made conflicts.
import pandas as pd
SODs = pd.read_excel(r"input_control.xlsx", sheet_name="SODs")
CTAs = pd.read_excel(r"input_control.xlsx", sheet_name="CTA")
userAnalysis = pd.read_excel(r"flat_user_data.xlsx", sheet_name='GRC User Data Clean Up')
listkeep = ["Access Risk ID", 'User Group', 'User Name', 'Condition record no.', 'Created By', 'User ID',
'Purchasing Document', 'Article', 'Created By', 'Condition record no.']
for index, row in SODs.iterrows():
[SOD, CTALinks, CTARechts] = [row['SOD naam'], row['CTA Links'], row['CTA Rechts']]
def handleCTA(CTAName, possibleExceptionsThisSOD):
CTA = CTAs[CTAs['CTA'] == CTAName]
for i, row in CTA.iterrows():
if (CTA.loc[i, "file_name_left"] == "Input_controls/SoD Analysis Q2 2021.xlsx"):
file_left = possibleExceptionsThisSOD
else:
file_left = pd.read_excel(CTA['file_name_left'][i],
sheet_name=CTA['sheet_name_left'][i])
file_right = pd.read_excel(CTA['file_name_right'][i],
sheet_name=CTA['sheet_name_right'][i])
print(locals())
if 'df' in locals():
print(df)
print(True)
df = pd.merge(df, file_right, how=str(CTA['Method'][i]).lower(), left_on=CTA['tabel_id_links'][i],
right_on=CTA['tabel_id_rechts'][i],
suffixes=("_LEFT_SUFFIX", "_RIGHT_SUFFIX"))
listkeep.extend(CTA['tabel_id_links'][i])
listkeep.extend(CTA['tabel_id_rechts'][i])
listkeep.extend([t+"_LEFT_SUFFIX" for t in listkeep])
listkeep.extend([t+"_RIGHT_SUFFIX" for t in listkeep])
print(listkeep)
df.drop(columns=df.columns.difference(listkeep), inplace=True)
query = str(CTA['tabel_id_links'][i])
# search column names
l1 = [c for c in df.columns if query in c]
listkeep.extend(l1)
query = str(CTA['tabel_id_rechts'][i])
l1 = [c for c in df.columns if query in c]
listkeep.extend(l1)
else:
print(False)
df = pd.merge(file_left, file_right, how=str(CTA['Method'][i]).lower(), left_on=CTA['tabel_id_links'][i],
right_on=CTA['tabel_id_rechts'][i],
suffixes=("_LEFT_SUFFIX", "_RIGHT_SUFFIX"))
listkeep.extend([t+"_LEFT_SUFFIX" for t in listkeep])
listkeep.extend([t+"_RIGHT_SUFFIX" for t in listkeep])
df.drop(columns=df.columns.difference(listkeep), inplace=True)
df.to_csv(str(i) + 'output.csv')
print(listkeep)
return df
def handleSOD(SOD, CTALinks, CTARechts):
possibleExceptionsThisSOD = userAnalysis[userAnalysis["Access Risk ID"] == SOD]
CTAExceptionsLinks = handleCTA(CTALinks, possibleExceptionsThisSOD)
CTAExceptionsRechts = handleCTA(CTARechts, possibleExceptionsThisSOD)
df = pd.merge(CTAExceptionsLinks, CTAExceptionsRechts, how='inner',
left_on=SODs[SODs["SOD naam"] == SOD]["tabel_id_links"][0],
right_on=SODs[SODs["SOD naam"] == SOD]["tabel_id_rechts"][0],
suffixes=(str(CTALinks), str(CTARechts)))
listkeep.extend([t+str(CTALinks) for t in listkeep])
listkeep.extend([t+str(CTARechts) for t in listkeep])
import pickle
with open('filef.txt', 'wb') as fp:
pickle.dump(listkeep, fp)
return df
handleSOD(SOD, CTALinks, CTARechts)
Now I want to keep all columns in the merge that contain a suffix. However the above code does not do this. I've tried it with this code
listkeep.extend([t+str(CTALinks) for t in listkeep])
listkeep.extend([t+str(CTARechts) for t in listkeep])
However my output is the following:
| User IDCTA1 | User GroupCTA1 | Access Risk IDCTA1 | Purchasing Document | Created ByCTA1 | Article | Condition record no. | User IDCTA2 | User GroupCTA2 | Access Risk IDCTA2 | Created ByCTA2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ADAMET | EU PROD&SOUR | SOD16 | ADAMET | EU PROD&SOUR | SOD16 | |||||
| 1 | ADAMET | EU PROD&SOUR | SOD16 | ADAMET | EU PROD&SOUR | SOD16 | |||||
| 2 | ADAMET | EU PROD&SOUR | SOD16 | AMAHAJ | EU PROD&SOUR | SOD16 | |||||
| 3 | ADAMET | EU PROD&SOUR | SOD16 | AMAHAJ | EU PROD&SOUR | SOD16 | |||||
| 4 | ADAMET | EU PROD&SOUR | SOD16 | AMCHUG | EU PROD&SOUR | SOD16 | |||||
| 5 | ADAMET | EU PROD&SOUR | SOD16 | AMCHUG | EU PROD&SOUR | SOD16 | |||||
| 6 | ADAMET | EU PROD&SOUR | SOD16 | ANKING | EU PROD&SOUR | SOD16 | |||||
| 7 | ADAMET | EU PROD&SOUR | SOD16 | ANKING | EU PROD&SOUR | SOD16 | |||||
| 8 | ADAMET | EU PROD&SOUR | SOD16 | ANKLEM | INT MARKETS | SOD16 | |||||
| 9 | ADAMET | EU PROD&SOUR | SOD16 | ANKLEM | INT MARKETS | SOD16 | |||||
| 10 | ADAMET | EU PROD&SOUR | SOD16 | ANMAGA | EU PROD&SOUR | SOD16 | |||||
| 11 | ADAMET | EU PROD&SOUR | SOD16 | ANMAGA | EU PROD&SOUR | SOD16 | |||||
| 12 | ADAMET | EU PROD&SOUR | SOD16 | ANNTOR | EU PROD&SOUR | SOD16 | |||||
| 13 | ADAMET | EU PROD&SOUR | SOD16 | ANNTOR | EU PROD&SOUR | SOD16 | |||||
| 14 | ADAMET | EU PROD&SOUR | SOD16 | ASTEYN | EU PROD&SOUR | SOD16 | |||||
| 15 | ADAMET | EU PROD&SOUR | SOD16 | ASTEYN | EU PROD&SOUR | SOD16 | |||||
| 16 | ADAMET | EU PROD&SOUR | SOD16 | ATDOESSCHATE | EU PROD&SOUR | SOD16 | |||||
| 17 | ADAMET | EU PROD&SOUR | SOD16 | ATDOESSCHATE | EU PROD&SOUR | SOD16 | |||||
| 18 | ADAMET | EU PROD&SOUR | SOD16 | ATELEG | EU MERCHANDI | SOD16 |
User IDs are not matching.. and I am losing columns..
Desired output is all columns that contain a suffix OR reside in the table_id should be included in the final merge
Please help
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
