Is there a way to simplify my GFF file? Converting a GFF file to MCScanX input
I have the following GFF file:
> sl1 FUN_000001 15679 15897
> sl1 FUN_000001 15952 17031
> sl1 FUN_000001 17086 17316
> sl1 FUN_000001 17371 17454
> sl1 FUN_000001 17508 17702
> sl1 FUN_000001 15679 15897
> sl1 FUN_000001 15952 17031
> sl1 FUN_000001 17086 17316
> sl1 FUN_000001 17371 17454
> sl1 FUN_000001 17508 17702
> sl1 FUN_000002 26991 27390
> sl1 FUN_000002 26991 27390
> sl1 FUN_000002 26991 27051
> sl1 FUN_000002 27104 27390
> sl1 FUN_000002 26991 27051
> sl1 FUN_000002 27104 27390
> sl1 FUN_000003 31856 32689
> sl1 FUN_000003 31856 32689
> sl1 FUN_000003 32432 32689
> sl1 FUN_000003 31856 32365
> sl1 FUN_000003 32432 32689
> sl1 FUN_000003 31856 32365
> sl1 FUN_000004 34247 35148
> sl1 FUN_000004 34247 35148
> sl1 FUN_000004 34856 35148
> sl1 FUN_000004 34247 34802
> sl1 FUN_000004 34856 35148
> sl1 FUN_000004 34247 34802
> sl1 FUN_000005 38975 39306
> sl1 FUN_000005 38975 39306
> sl1 FUN_000005 38975 39001
> sl1 FUN_000005 39064 39306
> sl1 FUN_000005 38975 39001
> sl1 FUN_000005 39064 39306
I need to keep only one row per gene (FUN_*****), spanning from its smallest start coordinate to its largest end coordinate. For example, for gene FUN_000001:
sl1 FUN_000001 15679 15897
sl1 FUN_000001 15952 17031
sl1 FUN_000001 17086 17316
sl1 FUN_000001 17371 17454
sl1 FUN_000001 17508 17702
sl1 FUN_000001 15679 15897
sl1 FUN_000001 15952 17031
sl1 FUN_000001 17086 17316
sl1 FUN_000001 17371 17454
sl1 FUN_000001 17508 17702
my output must be:
sl1 FUN_000001 15679 17702
I tried to use drop_duplicates in Python (pandas), but it only lets me keep the first or the last row of each gene.
Could anyone help me?
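One possible approach, as a minimal sketch rather than a tested pipeline: track the smallest start and the largest end seen for each (chromosome, gene) pair, then write one row per gene. The function name collapse_genes and the assumption that every row has exactly four whitespace-separated fields (chromosome, gene, start, end) are illustrative and not part of the original file description.

```python
def collapse_genes(lines):
    """Collapse 'chrom gene start end' rows into one row per gene.

    Hypothetical helper: assumes four whitespace-separated fields per row.
    """
    spans = {}  # (chrom, gene) -> [min_start, max_end]; dict keeps first-seen order
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        chrom, gene = fields[0], fields[1]
        start, end = int(fields[2]), int(fields[3])
        key = (chrom, gene)
        if key in spans:
            spans[key][0] = min(spans[key][0], start)
            spans[key][1] = max(spans[key][1], end)
        else:
            spans[key] = [start, end]
    return [(c, g, s, e) for (c, g), (s, e) in spans.items()]


# Rows taken from the question (a subset, for brevity).
rows = [
    "sl1 FUN_000001 15679 15897",
    "sl1 FUN_000001 15952 17031",
    "sl1 FUN_000001 17086 17316",
    "sl1 FUN_000001 17371 17454",
    "sl1 FUN_000001 17508 17702",
    "sl1 FUN_000002 26991 27390",
    "sl1 FUN_000002 26991 27051",
    "sl1 FUN_000002 27104 27390",
]

# Print one tab-separated row per gene.
for chrom, gene, start, end in collapse_genes(rows):
    print(chrom, gene, start, end, sep="\t")
```

If the data is already in a pandas DataFrame, the same idea can be expressed by grouping on the chromosome and gene columns and aggregating the start column with min and the end column with max, which is what drop_duplicates cannot do because it only selects existing rows.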
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
