How to select a row by name and also the previous row in bash or python?
Imagine that we have this data:
##sequence-region P51451 1 505
##sequence-region P22223 1 829
P22223 UniProtKB Transmembrane 655 677 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region Q01518 1 475
##sequence-region Q96MP8 1 289
##sequence-region Q9HCJ2 1 640
Q9HCJ2 UniProtKB Transmembrane 528 548 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region P48059 1 325
##sequence-region Q9UHB6 1 759
##sequence-region P16581 1 610
P16581 UniProtKB Transmembrane 557 578 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
The desired output is the rows that contain the word 'Transmembrane', each together with the row directly above it:
##sequence-region P22223 1 829
P22223 UniProtKB Transmembrane 655 677 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region Q9HCJ2 1 640
Q9HCJ2 UniProtKB Transmembrane 528 548 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region P16581 1 610
P16581 UniProtKB Transmembrane 557 578 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
I am trying with grep but I am a little bit stuck
Thanks!
Solution 1:[1]
You might use Python for this task in the following way. Let the content of file.txt be
##sequence-region P51451 1 505
##sequence-region P22223 1 829
P22223 UniProtKB Transmembrane 655 677 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region Q01518 1 475
##sequence-region Q96MP8 1 289
##sequence-region Q9HCJ2 1 640
Q9HCJ2 UniProtKB Transmembrane 528 548 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region P48059 1 325
##sequence-region Q9UHB6 1 759
##sequence-region P16581 1 610
P16581 UniProtKB Transmembrane 557 578 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
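then create a file gettransmembrane.py containing
import fileinput

prevline = ""  # nothing to print before the first line
for line in fileinput.input():
    if "Transmembrane" in line:
        # lines keep their trailing newline, so suppress print's own
        print(prevline, end="")
        print(line, end="")
    prevline = line  # remember this line for the next iteration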
then
python gettransmembrane.py file.txt
output
##sequence-region P22223 1 829
P22223 UniProtKB Transmembrane 655 677 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region Q9HCJ2 1 640
Q9HCJ2 UniProtKB Transmembrane 528 548 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
##sequence-region P16581 1 610
P16581 UniProtKB Transmembrane 557 578 . . . Note=Helical;Ontology_term=ECO:0000255;evidence=ECO:0000255
Explanation: fileinput is a module from the Python standard library (1). For each line that contains the substring Transmembrane, I print the previous line and then the line itself. Note that prevline = line is executed after printing, so at print time prevline still holds the preceding line. I pass an empty string as end because the lines already end with newlines.
(1) If you only need to process one file whose name you know in advance, you might elect to use plain file reading with open. Using fileinput lets you pass more than one file (akin to the cat command) or read from stdin, so if the data above is the output of another command, you do not need a temporary file; you can pipe that command's output into python gettransmembrane.py
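The single-file alternative mentioned in the footnote could look roughly like the sketch below. It is not part of the original answer: the names matches_with_previous and filter_file are made up for illustration, and the keyword default simply mirrors the example data. Like the script above, it emits whatever line immediately precedes a match, so two consecutive Transmembrane rows would cause the first one to appear twice.

```python
def matches_with_previous(lines, keyword="Transmembrane"):
    """Yield each line containing keyword, preceded by the line before it.

    `lines` can be any iterable of strings, e.g. an open file object.
    (Hypothetical helper, named here only for illustration.)
    """
    prevline = None  # no previous line exists before the first one
    for line in lines:
        if keyword in line:
            if prevline is not None:
                yield prevline
            yield line
        prevline = line  # updated only after the check, as in the script


def filter_file(path, keyword="Transmembrane"):
    """Read one known file with open() and return the selected lines."""
    with open(path) as handle:
        return list(matches_with_previous(handle, keyword))
```

Assuming the data is saved as file.txt, print("".join(filter_file("file.txt"))) would write the selected rows to standard output. Keeping the selection logic in a generator means the same function works on a file object, a list of strings, or sys.stdin.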
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Daweo |
