How to alter values in multiple columns based on values in other columns within a loop

I am attempting to alter values in multiple columns based on corresponding values in other columns. I have been able to do this by hard-coding, but I would appreciate help automating the code below so that it works for any number of samples. Below, I share a minimal example input, the ideal output and the working code. Note: I am still a bit green in Python, so comments go a long way.

Input-

01_s_IDX_type   01_s_IDX                   02_s_IDY_type   02_s_IDY
HET             0/1:10,9:19:99:202,0,244   HET             0/1:18,1:19:99:202,0,244
HOM             0/1:20,0:20:99:202,0,244   HOM             0/1:50,0:50:99:202,0,244

Here, values from the IDX column are used to re-value the IDX_type columns. The values of interest are the 3rd and 4th integers in each IDX column: 10,9 and 18,1. For sample 01, the ratio 10:9 (about 1.11) is between 0.7 and 1.3, so its type can stay as HET. For sample 02, the ratio 18:1 is not between 0.7 and 1.3, so its type is changed to REF.
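To make the rule concrete, here is a minimal sketch for a single value. The helper names `allele_depth_ratio` and `is_balanced` are made up for illustration, and treating a zero second depth as an infinite ratio is an assumption rather than part of the original rule.

```python
def allele_depth_ratio(field):
    """Return depth1 / depth2 from a string like '0/1:10,9:19:99:202,0,244'."""
    # The second colon-separated field holds the two depths, e.g. '10,9'
    depths = field.split(":")[1]
    first, second = (int(x) for x in depths.split(","))
    # Assumption: a zero second depth gives an infinite ratio instead of an error
    return first / second if second else float("inf")


def is_balanced(field, low=0.7, high=1.3):
    """True when the depth ratio lies within [low, high]."""
    return low <= allele_depth_ratio(field) <= high


print(is_balanced("0/1:10,9:19:99:202,0,244"))  # True  (10/9 is about 1.11)
print(is_balanced("0/1:18,1:19:99:202,0,244"))  # False (18/1 = 18)
```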

Output-

01_s_IDX_type   01_s_IDX                   02_s_IDY_type   02_s_IDY
HET             0/1:10,9:19:99:202,0,244   REF             0/1:18,1:19:99:202,0,244
HOM             0/1:20,0:20:99:202,0,244   HOM             0/1:50,0:50:99:202,0,244

Here is the code that achieved this.

import pandas as pd

#Create toy example
df = {'01_s_IDX_type':  ['HET', 'HOM'],
    '01_s_IDX': ['0/1:10,9:19:99:202,0,244', '0/1:20,0:20:99:202,0,244'],
    '02_s_IDX_type': ['HET', 'HOM'],
    '02_s_IDX': ['0/1:18,1:19:99:202,0,244', '0/1:0,50:50:99:202,0,244']
    }
df = pd.DataFrame(df)
print(df)

#create new dfs for each sample (.copy() so later column assignments do not trigger SettingWithCopyWarning)
df_01, df_02 = df.filter(regex=r'^01').copy(), df.filter(regex=r'^02').copy()

#make copy of the info column
df_01_copy = df_01['01_s_IDX']
df_02_copy = df_02['02_s_IDX']

#remove unneeded parts of the column (first four characters)
df_01_copy = df_01_copy.str[4:]
df_02_copy = df_02_copy.str[4:]

#replace all commas with colons
df_01_copy = df_01_copy.replace(to_replace =',', value = ':', regex = True)
df_02_copy = df_02_copy.replace(to_replace =',', value = ':', regex = True)

#split into new columns by :
df_01_copy = df_01_copy.str.split(pat=':',expand=True)
df_02_copy = df_02_copy.str.split(pat=':',expand=True)

#keep first two columns
df_01_copy = df_01_copy.iloc[:,:2]
df_02_copy = df_02_copy.iloc[:,:2]

#rename columns
df_01_copy.columns = ['DP1', 'DP2']
df_02_copy.columns = ['DP1', 'DP2']

#convert to numeric, calculate ratios and add the ratios to OG dfs
df_01_copy = df_01_copy.apply(pd.to_numeric)
df_01['ratio'] = df_01_copy.DP1.div(df_01_copy.DP2)
df_02_copy = df_02_copy.apply(pd.to_numeric)
df_02['ratio'] = df_02_copy.DP1.div(df_02_copy.DP2)

#Keep HET if the ratio is between 0.7 and 1.3, otherwise set REF; if the ratio is 0, set HOM
df_01.loc[(df_01['ratio'] > 1.3), '01_s_IDX_type'] = 'REF'
df_01.loc[(df_01['ratio'] < 0.7), '01_s_IDX_type'] = 'REF'
df_01.loc[(df_01['ratio'] == 0), '01_s_IDX_type'] = 'HOM'
df_02.loc[(df_02['ratio'] > 1.3), '02_s_IDX_type'] = 'REF'
df_02.loc[(df_02['ratio'] < 0.7), '02_s_IDX_type'] = 'REF'
df_02.loc[(df_02['ratio'] == 0 ), '02_s_IDX_type'] = 'HOM'

#Rejoin
df_het = pd.concat([df_01, df_02], axis=1, join="outer")
df_out = df_het.drop('ratio', axis=1)

I have datasets that may consist of n samples, so turning this code into a pipeline/function would be ideal. Thanks in advance for any help on this.
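A minimal sketch of how the per-sample steps could be wrapped in one function that loops over every `*_type` column. The function name `reclassify_het` and its parameters are hypothetical. It assumes each type column is named after its info column plus `_type`, and that only rows currently labelled HET are re-evaluated (as the ideal output suggests), so the ratio == 0 to HOM rule from the hard-coded version is not reproduced here.

```python
import pandas as pd


def reclassify_het(df, type_suffix="_type", low=0.7, high=1.3):
    """Set HET to REF wherever the depth ratio falls outside [low, high]."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for type_col in [c for c in out.columns if c.endswith(type_suffix)]:
        # Assumption: '01_s_IDX_type' pairs with the info column '01_s_IDX'
        info_col = type_col[: -len(type_suffix)]
        if info_col not in out.columns:
            continue
        # '0/1:10,9:19:...' -> second ':' field '10,9' -> two depth columns
        depths = out[info_col].str.split(":", expand=True)[1]
        depths = depths.str.split(",", expand=True)
        ratio = pd.to_numeric(depths[0]) / pd.to_numeric(depths[1])
        unbalanced = (ratio < low) | (ratio > high)
        # Assumption: only rows currently typed HET are changed
        is_het = out[type_col] == "HET"
        out.loc[is_het & unbalanced, type_col] = "REF"
    return out


df = pd.DataFrame({
    "01_s_IDX_type": ["HET", "HOM"],
    "01_s_IDX": ["0/1:10,9:19:99:202,0,244", "0/1:20,0:20:99:202,0,244"],
    "02_s_IDY_type": ["HET", "HOM"],
    "02_s_IDY": ["0/1:18,1:19:99:202,0,244", "0/1:50,0:50:99:202,0,244"],
})
df_out = reclassify_het(df)
print(df_out)
```

Because the loop discovers samples from the column names, no per-sample variables or intermediate `ratio` columns are needed, and the same call works for any number of samples.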



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
