From a string with repeated underscores (e.g. 1_2_3_4_5_6), split and select 3_4

The header of my data frame looks like this:

header = list(data_no_control.columns.values)
header

['MLID_D_08_NGS_34_H08.fsa',
 'MLID_D_25_NGS_38_A11.fsa',
 'MLID_D_36_NGS_41_D12.fsa',
 'MLID_D_37_NGS_42_E12.fsa']

I want to change my header to look like this:

['NGS_34',
 'NGS_38',
 'NGS_41',
 'NGS_42']

How can I do this?



Solution 1:[1]

header = ['MLID_D_08_NGS_34_H08.fsa',
 'MLID_D_25_NGS_38_A11.fsa',
 'MLID_D_36_NGS_41_D12.fsa',
 'MLID_D_37_NGS_42_E12.fsa']

new_header = []

for item in header:
    # split on underscores; fields 3 and 4 hold the 'NGS' label and its number
    parts = item.split('_')
    new_header.append(parts[3] + '_' + parts[4])

print(new_header)
# output: ['NGS_34', 'NGS_38', 'NGS_41', 'NGS_42']
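
The same split-and-select idea can also be written with a slice and str.join. This is only a sketch: the helper name shorten and its start/stop parameters are made up for illustration, and it assumes every name has the same number of underscore-separated fields.

```python
def shorten(name, start=3, stop=5, sep="_"):
    """Keep fields start..stop-1 of a separator-delimited name (hypothetical helper)."""
    return sep.join(name.split(sep)[start:stop])


header = ['MLID_D_08_NGS_34_H08.fsa',
          'MLID_D_25_NGS_38_A11.fsa',
          'MLID_D_36_NGS_41_D12.fsa',
          'MLID_D_37_NGS_42_E12.fsa']

new_header = [shorten(name) for name in header]
print(new_header)
# output: ['NGS_34', 'NGS_38', 'NGS_41', 'NGS_42']

# The example from the title: fields 2 and 3 of '1_2_3_4_5_6'
print(shorten('1_2_3_4_5_6', start=2, stop=4))
# output: 3_4

# To rename the columns of the data frame from the question:
# data_no_control.columns = new_header
```

Assigning the new list to data_no_control.columns replaces the header in place, provided the list has one entry per column.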

Solution 2:[2]

Using str.extract:

df["col"] = df["col"].str.extract(r'_([^_]+_[^_]+)_[^_]+\.\w+$')

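The pattern captures the two underscore-separated fields that precede the last field and the file extension, so 'MLID_D_08_NGS_34_H08.fsa' yields 'NGS_34'.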
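
The line in this solution works on the values of a column, while the question asks about the header. A possible adaptation is to call str.extract on df.columns instead. This is a sketch under assumptions: the one-row frame below is invented stand-in data, and expand=False is passed so that an Index comes back rather than a DataFrame.

```python
import pandas as pd

# Invented one-row frame that only reproduces the header from the question
df = pd.DataFrame(
    [[1, 2, 3, 4]],
    columns=['MLID_D_08_NGS_34_H08.fsa',
             'MLID_D_25_NGS_38_A11.fsa',
             'MLID_D_36_NGS_41_D12.fsa',
             'MLID_D_37_NGS_42_E12.fsa'],
)

pattern = r'_([^_]+_[^_]+)_[^_]+\.\w+$'

# With a single capture group and expand=False, the result is an Index
df.columns = df.columns.str.extract(pattern, expand=False)

print(list(df.columns))
# output: ['NGS_34', 'NGS_38', 'NGS_41', 'NGS_42']
```

A column name that does not match the pattern would come back as NaN, so it may be worth checking df.columns.isna() before relying on the result.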

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Captain Caveman
Solution 2 Tim Biegeleisen