How to select a row by name and also the previous row in bash or python?

Imagine that we have this data:

##sequence-region P51451 1 505
##sequence-region P22223 1 829
P22223  UniProtKB   Transmembrane   655 677 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255 
##sequence-region Q01518 1 475
##sequence-region Q96MP8 1 289
##sequence-region Q9HCJ2 1 640
Q9HCJ2  UniProtKB   Transmembrane   528 548 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255 
##sequence-region P48059 1 325
##sequence-region Q9UHB6 1 759
##sequence-region P16581 1 610
P16581  UniProtKB   Transmembrane   557 578 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255

The final output should contain only the rows that include the word 'Transmembrane', each together with the row directly above it:

##sequence-region P22223 1 829
P22223  UniProtKB   Transmembrane   655 677 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255 
##sequence-region Q9HCJ2 1 640
Q9HCJ2  UniProtKB   Transmembrane   528 548 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255 
##sequence-region P16581 1 610
P16581  UniProtKB   Transmembrane   557 578 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255

I have been trying with grep, but I am a little bit stuck.

Thanks!



Solution 1:[1]

You can use Python for this task in the following way. Let the content of file.txt be

##sequence-region P51451 1 505
##sequence-region P22223 1 829
P22223  UniProtKB   Transmembrane   655 677 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255 
##sequence-region Q01518 1 475
##sequence-region Q96MP8 1 289
##sequence-region Q9HCJ2 1 640
Q9HCJ2  UniProtKB   Transmembrane   528 548 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255 
##sequence-region P48059 1 325
##sequence-region Q9UHB6 1 759
##sequence-region P16581 1 610
P16581  UniProtKB   Transmembrane   557 578 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255

then create a file gettransmembrane.py containing

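import fileinput

prevline = ""  # nothing precedes the first line, so start empty (avoids a NameError)
for line in fileinput.input():
    if "Transmembrane" in line:
        # lines keep their trailing newline, so suppress the one print would add
        print(prevline, end="")
        print(line, end="")
    prevline = line  # remember this line for the next iteration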

then

python gettransmembrane.py file.txt

output

##sequence-region P22223 1 829
P22223  UniProtKB   Transmembrane   655 677 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region Q9HCJ2 1 640
Q9HCJ2  UniProtKB   Transmembrane   528 548 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region P16581 1 610
P16581  UniProtKB   Transmembrane   557 578 .   .   .   Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255

Explanation: fileinput is a module from the Python standard library (1). For each line that contains the substring Transmembrane, the script prints the previous line and then the line itself. Note that prevline = line runs after the printing, so prevline still holds the preceding line when the check is made. The prints use end="" because each line already ends with a newline.
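
The same idea can be wrapped in a small generator, which makes it easier to reuse and to try on in-memory data. This is only a sketch: the name with_previous and the ignore_case parameter are made up for illustration and are not part of the original script. It also skips the "previous line" when the very first line matches, since nothing precedes it.

```python
def with_previous(lines, needle="Transmembrane", ignore_case=False):
    """Yield each line containing `needle`, preceded by the line before it."""
    prev = None  # no previous line exists before the first one
    target = needle.lower() if ignore_case else needle
    for line in lines:
        haystack = line.lower() if ignore_case else line
        if target in haystack:
            if prev is not None:
                yield prev
            yield line
        prev = line  # updated after the check, as in the script above


# Shortened sample in the same layout as file.txt (fields separated by tabs here).
sample = [
    "##sequence-region P51451 1 505\n",
    "##sequence-region P22223 1 829\n",
    "P22223\tUniProtKB\tTransmembrane\t655\t677\n",
    "##sequence-region Q01518 1 475\n",
]

result = list(with_previous(sample))
print("".join(result), end="")
```

Passing ignore_case=True would also match a lowercase 'transmembrane'. As with the script, two consecutive matching lines cause the first of them to be emitted twice, once as a match and once as the previous line of the next match.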

(1) If you only need to process one file whose name you know in advance, you can read it with a plain open instead. fileinput lets you pass more than one file (akin to the cat command) or read from stdin, so if the data above is the output of another command, you do not need a temporary file: you can pipe that command's output into python gettransmembrane.py
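
For the single-file case mentioned in the footnote, a version based on open might look roughly like the following. The function name select_from_file is hypothetical, and the demo writes a throwaway temporary file only so that the sketch runs without file.txt being present.

```python
import os
import tempfile


def select_from_file(path, needle="Transmembrane"):
    """Return matching lines from `path`, each preceded by the line before it."""
    selected = []
    prev = None  # nothing precedes the first line
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if needle in line:
                if prev is not None:
                    selected.append(prev)
                selected.append(line)
            prev = line
    return selected


# Demo: write a few sample lines to a temporary file, read it back, clean up.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(
        "##sequence-region P51451 1 505\n"
        "##sequence-region P22223 1 829\n"
        "P22223\tUniProtKB\tTransmembrane\t655\t677\n"
    )
demo_path = tmp.name
demo_result = select_from_file(demo_path)
os.remove(demo_path)
print("".join(demo_result), end="")
```

With a real file this reduces to select_from_file("file.txt"). The trade-off is the one the footnote describes: open is simpler, but it does not read from stdin or from several files the way fileinput does.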

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Daweo