How do I extract words containing specific key letters from a file with Python?

Sorry, I'm fairly new to Python and haven't had much training.

How do I extract words containing certain key letters from the file './models/asm/Draft_km.modelspec' in Python? For example, these lines can be found inside the .modelspec file:

m_BSORx_kcat : 10
m_ENTERH_kcat : 10
m_TRPTRS_kcat : 10
m_EX_remnant1_e_kcat : 10
m_SCYSSL_kcat : 10
m_RNMK_kcat : 10
m_TAGtex_kcat : 10
m_URIDK2r_kcat : 10
m_TRPt2rpp_kcat : 10
m_GLUSy_kcat : 10
m_VPAMTr_copy2_kcat : 10
m_EX_galctn__L_e_km : 0.001
m_EX_galt_e_km : 0.001
m_EX_dgmp_e_km : 0.001
m_EX_galur_e_km : 0.001
m_EX_gam_e_km : 0.001
m_EX_gam6p_e_km : 0.001
m_EX_gbbtn_e_km : 0.001

I want to extract these entries from a large '.modelspec' file by filtering on "_kcat : 10" and obtain them as m_BSORx_kcat : 10, m_ENTERH_kcat : 10, m_TRPTRS_kcat : 10, m_EX_remnant1_e_kcat : 10, m_SCYSSL_kcat : 10, m_RNMK_kcat : 10, m_TAGtex_kcat : 10, m_URIDK2r_kcat : 10, m_TRPt2rpp_kcat : 10, m_GLUSy_kcat : 10, m_VPAMTr_copy2_kcat : 10

My end goal is to randomly reassign 10% of the values (-1, 1) for use in a genetic algorithm.

Any help is appreciated.



Solution 1:[1]

Since you seem to be planning to modify the data, it might be useful to first split the lines into a list and then process each line individually.

with open("./models/asm/Draft_km.modelspec") as f:
    # read lines, skipping empty lines and remove trailing whitespace
    lines = [line.rstrip() for line in f if line.strip()]

If all you need to do is check for a substring, you can check each line like so:

for line in lines:
    if "_kcat : 10" in line:
        print(line) # or do whatever you want
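To get the comma-separated output asked for in the question, the matching lines can be collected and joined. The following is only a minimal sketch: the helper name filter_lines and the sample list are made up for illustration and stand in for the lines read from the file.

```python
def filter_lines(lines, needle="_kcat : 10"):
    """Return the stripped lines that contain `needle` as a substring."""
    return [line.strip() for line in lines if needle in line]


# Hypothetical sample standing in for the lines read from the .modelspec file
sample = [
    "m_BSORx_kcat : 10",
    "m_ENTERH_kcat : 10",
    "",
    "m_EX_galt_e_km : 0.001",
]

kcat_lines = filter_lines(sample)
joined = ", ".join(kcat_lines)
print(joined)
```

Note that a plain substring test for "_kcat : 10" would also match a line such as "m_X_kcat : 100" or "m_X_kcat : 10.5". If that matters for your file, compare the parsed value instead of the raw text.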

If you need to match more complex patterns, regular expressions as in Tim Biegeleisen's answer are the way to go.

Solution 2:[2]

Using re.findall we can try:

import re

# use this to read the whole file into a string
with open('./models/asm/Draft_km.modelspec', 'r') as file:
    inp = file.read()

# alternatively, hard code the data you showed in your question (this overwrites inp from above)
inp = """m_BSORx_kcat : 10
m_ENTERH_kcat : 10
m_TRPTRS_kcat : 10
m_EX_remnant1_e_kcat : 10
m_SCYSSL_kcat : 10
m_RNMK_kcat : 10
m_TAGtex_kcat : 10
m_URIDK2r_kcat : 10
m_TRPt2rpp_kcat : 10
m_GLUSy_kcat : 10
m_VPAMTr_copy2_kcat : 10
m_EX_galctn__L_e_km : 0.001
m_EX_galt_e_km : 0.001
m_EX_dgmp_e_km : 0.001
m_EX_galur_e_km : 0.001
m_EX_gam_e_km : 0.001
m_EX_gam6p_e_km : 0.001
m_EX_gbbtn_e_km : 0.001"""

matches = re.findall(r'\b\w+_kcat : \d+(?:\.\d+)?', inp)
output = ', '.join(matches)
print(output)

This prints:

m_BSORx_kcat : 10, m_ENTERH_kcat : 10, m_TRPTRS_kcat : 10, m_EX_remnant1_e_kcat : 10, m_SCYSSL_kcat : 10, m_RNMK_kcat : 10, m_TAGtex_kcat : 10, m_URIDK2r_kcat : 10, m_TRPt2rpp_kcat : 10, m_GLUSy_kcat : 10, m_VPAMTr_copy2_kcat : 10
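Since the stated end goal is to modify the values, it may help to capture the name and the number separately instead of the whole line. The sketch below rests on assumptions: the function names parse_kcat and mutate are hypothetical, and "reassign 10% of the value (-1,1)" is interpreted here as "pick about 10% of the parameters and add a random offset between -1 and 1 to each", which may not be what the genetic algorithm actually needs.

```python
import random
import re

# One "name : value" pair per line; only names ending in _kcat are captured.
PATTERN = re.compile(r'^(\w+_kcat)[ \t]*:[ \t]*(\d+(?:\.\d+)?)[ \t]*$', re.MULTILINE)


def parse_kcat(text):
    """Map each *_kcat parameter name to its value as a float."""
    return {name: float(value) for name, value in PATTERN.findall(text)}


def mutate(params, fraction=0.1, low=-1.0, high=1.0, rng=None):
    """Return a copy in which about `fraction` of the values get a random offset.

    Assumption: the offset is drawn uniformly from [low, high] and added
    to the existing value; at least one parameter is changed if any exist.
    """
    rng = rng or random.Random()
    mutated = dict(params)
    count = min(len(params), max(1, round(len(params) * fraction)))
    for name in rng.sample(sorted(params), count):
        mutated[name] += rng.uniform(low, high)
    return mutated


# Hypothetical sample standing in for the file contents
text = """m_BSORx_kcat : 10
m_ENTERH_kcat : 10
m_EX_galt_e_km : 0.001"""

params = parse_kcat(text)
mutated = mutate(params, rng=random.Random(0))  # seeded only to make the run repeatable
print(params)
print(mutated)
```

Keeping the values in a dict makes it straightforward to write them back out in the same "name : value" form once the mutation step is done.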

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 fsimonjetz
Solution 2